Creating Books Vol 5: Word Lists

Most books available on the WordCruncher Bookstore are divided into several word lists. This separates the components of the text so that you can focus on the most important sections. For example, you're less likely to search the text of footnotes unless that part is particularly interesting to you. You can switch word lists in the upper-right corner of the search window. When you switch word lists, the WordWheel updates to show the words in the selected word list.

The TED Talk Corpus has these word lists:

  1. Text (only the text of the talks)
  2. Headings (titles and authors)
  3. Lemmas (the dictionary form and part of speech)
  4. Morphological features (such as tense and pronoun type)

The top word list, Text, is designated as the most important because it is generally the most used. Separating your text into different word lists is useful when you know that some of the text is less useful for searching (or useful for different reasons).

Note: Sometimes a user may want to search all of the word lists at once, which is why there's an “All Text” option.

Lexicon Attributes

An individual word list is defined with a <LEX/> tag within the <sifx> element. The order in which you place the <LEX/> tags within <sifx> determines their order in the word list menu of WordCruncher's search window. We strongly encourage placing the most important lexicon (like “Text”) as the first tag.
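
As a rough sketch of that ordering, the four TED Talk Corpus word lists might be declared as shown here. The st values are assumed from the word list names listed earlier, and all other attributes (including the required language code) are left out for brevity, so treat this as an illustrative fragment rather than the corpus's actual markup.

```xml
<!-- Fragment only: shows the ordering of word lists, not a complete ETAX file. -->
<sifx>
  <!-- Listed first, so it would appear at the top of the word list menu. -->
  <LEX st="Text"/>
  <LEX st="Headings"/>
  <LEX st="Lemmas"/>
  <LEX st="Morphological features"/>
</sifx>
```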

The lexicon has many attributes; the most important ones are described below.

Style Name (Required)

<LEX st="Main Text"/>

The unique name for the text style.

Language Code (Required)

<LEX id="lang"/>

The two-letter code for the language you're using. View the list of supported languages below.

Note: For most purposes, the two-letter code is sufficient. If needed, extended region codes can also be used. For example, the code en sets the language to English; to specify English (United Kingdom), use the code en-GB.
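
The two forms from the note might look like this in a <LEX/> tag. This sketch assumes that "lang" in the id="lang" example is a placeholder for the actual code; that reading is an assumption, not something this article states outright.

```xml
<!-- Assumption: the id attribute takes the language code in place of "lang". -->

<!-- Generic English, using the two-letter code -->
<LEX id="en"/>

<!-- English (United Kingdom), using an extended region code -->
<LEX id="en-GB"/>
```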

Language | Code
Afrikaans | af-ZA
Albanian | sq-AL
Alsatian | gsw-FR
Amharic | am-ET
Arabic | ar
Arabic (Algeria) | ar-DZ
Arabic (Bahrain) | ar-BH
Arabic (Egypt) | ar-EG
Arabic (Iraq) | ar-IQ
Arabic (Jordan) | ar-JO
Arabic (Kuwait) | ar-KW
Arabic (Lebanon) | ar-LB
Arabic (Libya) | ar-LY
Arabic (Morocco) | ar-MA
Arabic (Oman) | ar-OM
Arabic (Qatar) | ar-QA
Arabic (Saudi Arabia) | ar-SA
Arabic (Syria) | ar-SY
Arabic (Tunisia) | ar-TN
Arabic (U.A.E.) | ar-AE
Arabic (Yemen) | ar-YE
Aramaic | arc-IL
Aramaic (Samaria) | arc-SY
Armenian | hy-AM
Assamese | as-IN
Azeri | az
Azeri-Latn | az-Latn-AZ
Azeri-Cyrl | az-Cyrl-AZ
Bashkir | ba-RU
Basque | eu-ES
Belarusian | be-BY
Bengali | bn
Bengali (Bangladesh) | bn-BD
Bengali (India) | bn-IN
Bosnian | bs
Bosnian-Latn | bs-BA
Bosnian-Cyrl | bs-Cyrl-BA
Breton | br-FR
Bulgarian | bg-BG
Catalan | ca-ES
Cebuano | ceb-PH
Cherokee | chr-US
Chinese-Hans | zh-Hans
Chinese-Hans (PRChina) | zh-CN
Chinese-Hans (Singapore) | zh-SG
Chinese-Hant | zh-Hant
Chinese-Hant (Hong Kong) | zh-HK
Chinese-Hant (Macau) | zh-MO
Chinese-Hant (Taiwan) | zh-TW
Corsican | co-FR
Croatian | hr
Croatian (Bosnia) | hr-BA
Croatian (Croatia) | hr-HR
Czech | cs-CZ
Danish | da-DK
Dari | fa-AF
Divehi | dv-MV
Dutch | nl
Dutch (Belgium) | nl-BE
Dutch (Netherlands) | nl-NL
Eblaite | x-eb-SY
English | en
English (Australia) | en-AU
English (Belize) | en-BZ
English (Canada) | en-CA
English (Caribbean) | en-VI
English (India) | en-IN
English (Ireland) | en-IE
English (Jamaica) | en-JM
English (Malaysia) | en-MY
English (New Zealand) | en-NZ
English (Philippines) | en-PH
English (Singapore) | en-SG
English (South Africa) | en-ZA
English (Trinidad) | en-TT
English (United Kingdom) | en-GB
English (United States) | en-US
English (Zimbabwe) | en-ZW
Estonian | et-EE
Faeroese | fo-FO
Fijian | fj-FJ
Filipino | fil-PH
Finnish | fi-FI
French | fr
French (Belgium) | fr-BE
French (Canada) | fr-CA
French (France) | fr-FR
French (Luxembourg) | fr-LU
French (Monaco) | fr-MC
French (Switzerland) | fr-CH
Frisian | fy-NL
Fulah (Senegal) | ff-SN
Galician | gl-ES
Georgian | ka-GE
German | de
German (Austria) | de-AT
German (Germany) | de-DE
German (Liechtenstein) | de-LI
German (Luxembourg) | de-LU
German (Switzerland) | de-CH
Greek | el-GR
Greek-Transliterated | el-x-Trns
Greenlandic | kl-GL
Gujarati | gu-IN
Haitian Creole | ht-HT
Hausa | ha-Latn-NG
Hawaiian | haw-US
Hebrew | he-IL
Hebrew-Transliterated | he-x-Trns
Hindi | hi-IN
Hungarian | hu-HU
Icelandic | is-IS
Igbo | ig-NG
Ilokano | ilo-PH
Inari Sami | smn-FI
Indonesian | id-ID
Inuktitut | iu
Inuktitut-Latn | iu-Latn-CA
Inuktitut-Cans | iu-Cans-CA
Irish | ga-IE
Italian | it
Italian (Italy) | it-IT
Italian (Switzerland) | it-CH
Japanese | ja-JP
K'iche | qut-GT
Kannada | kn-IN
Kashmiri | ks-IN
Kazak | kk-KZ
Kekchi | kek-GT
Khmer | km-KH
Kinyarwanda | rw-RW
Konkani | kok-IN
Korean | ko-KR
Kurdish | ckb
Kurdish (Iraq) | ku-IQ
Kyrgyz | ky-KG
Lao | lo-LA
Latin | la
Latvian | lv-LV
Lithuanian | lt-LT
Lower Sorbian | dsb-DE
Lule Sami | smj-NO
Lule Sami (Norway) | smj-NO
Lule Sami (Sweden) | smj-SE
Luxembourgish | lb-LU
Macedonian | mk-MK
Malagasy | mg-MG
Malay | ms
Malay (Brunei Darussalam) | ms-BN
Malay (Malaysia) | ms-MY
Malayalam | ml-IN
Maltese | mt-IN
Manipuri | mni
Maori | mi-NZ
Mapudungun | arn-CL
Marathi | mr-IN
Mohawk | moh-CA
Mongolian | mn
Mongolian-Cyrl | mn-MN
Mongolian-Mong | mn-Mong-CN
Nepali | ne
Nepali (India) | ne-IN
Nepali (Nepal) | ne-NP
Neutral | root
Northern Sami | se
Northern Sami (Finland) | se-FI
Northern Sami (Norway) | se-NO
Northern Sami (Sweden) | se-SE
Norwegian | no
Norwegian Bokmal | nb-NO
Norwegian Nynorsk | nn-NO
Occitan | oc-FR
Oriya | or-IN
Pashto | ps-AF
Persian | fa-IR
Polish | pl-PL
Portuguese | pt
Portuguese (Brazil) | pt-BR
Portuguese (Portugal) | pt-PT
Punjabi | pa-IN
Punjabi (India) | pa-IN
Punjabi (Pakistan) | pa-PK
Quechua | qu
Quechua (Bolivia) | qu-BO
Quechua (Ecuador) | qu-EC
Quechua (Peru) | qu-PE
Romanian | ro-RO
Romansh | rm-CH
Russian | ru-RU
Sakha | sah-RU
Samoan | sm
Samoan (American Samoa) | sm-AS
Samoan (Samoa) | sm-WS
Sanskrit | sa-IN
Scottish Gaelic | gd-GB
Serbian | sr
Serbian (Bosnia) | sr-Latn-BA
Serbian (Serbia) | sr-Latn-RS
Serbian-Cyrl (Bosnia) | sr-Cyrl-BA
Serbian-Cyrl (Serbia) | sr-Cyrl-RS
Setswana | tn
Setswana (Botswana) | tn-BW
Setswana (South Africa) | tn-ZA
Sindhi | sd
Sindhi (India) | sd-IN
Sindhi (Afghanistan) | sd-AF
Sindhi (Pakistan) | sd-PK
Sinhala | si-LK
Skolt Sami | sms-FI
Slovak | sk-SK
Slovenian | sl-SI
Sotho sa Leboa | nso-ZA
Southern Sami | sma-NO
Southern Sami (Norway) | sma-NO
Southern Sami (Sweden) | sma-SE
Spanish | es
Spanish (Argentina) | es-AR
Spanish (Bolivia) | es-BO
Spanish (Chile) | es-CL
Spanish (Colombia) | es-CO
Spanish (Costa Rica) | es-CR
Spanish (Dominican Republic) | es-DO
Spanish (Ecuador) | es-EC
Spanish (El Salvador) | es-SV
Spanish (Guatemala) | es-GT
Spanish (Honduras) | es-HN
Spanish (International) | es-ES
Spanish (Mexico) | es-MX
Spanish (Nicaragua) | es-NI
Spanish (Panama) | es-PA
Spanish (Paraguay) | es-PY
Spanish (Peru) | es-PE
Spanish (Puerto Rico) | es-PR
Spanish (Traditional) | es-x-Trad
Spanish (United States) | es-US
Spanish (Uruguay) | es-UY
Spanish (Venezuela) | es-VE
Swahili | sw-KE
Swedish | sv
Swedish (Finland) | sv-FI
Swedish (Sweden) | sv-SE
Syriac | syr-SY
Tagalog | tl-PH
Tajik | tg-Cyrl-TJ
Tamazight | tzm
Tamazight (Algeria) | tzm-Latn-DZ
Tamazight (Morocco) | tzm-Tfng-MA
Tamil | ta-IN
Tatar | tt-RU
Telugu | te-IN
Thai | th-TH
Tibetan | bo-CN
Tigrigna | ti
Tigrigna (Eritrea) | ti-ER
Tigrigna (Ethiopia) | ti-ET
Tongan | ton-TO
Turkish | tr-TR
Turkmen | tk-TM
Ukrainian | uk-UA
Upper Sorbian | wen-DE
Urdu | ur-PK
Uzbek | uz
Uzbek-Latn | uz-Latn-UZ
Uzbek-Cyrl | uz-Cyrl-UZ
Uyghur | ug-CN
Vietnamese | vi-VN
Welsh | cy-GB
Wolof | wo-SN
Xhosa | xh-ZA
Yi | ii-CN
Yoruba | yo-NG
Zulu | zu-ZA

Break Characters (Recommended)

<LEX chrBrk="—…–"/>

A list of break characters: characters that split a word when your ETAX is indexed. This is particularly useful for characters like en dashes or em dashes.

No-break Characters (Recommended)

<LEX chrNobrk=":-"/>

A list of no-break characters: characters that do not split a word when your ETAX is indexed. This is particularly useful for characters like colons or hyphens. For example, the text “Genesis 1:1” keeps “1:1” as one word if the colon is included in this attribute.
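
Putting the attributes from this article together, a single lexicon definition might look like the sketch below. It only combines the example values shown in this article; id="en" assumes that the id attribute takes the language code directly, which is an assumption rather than a confirmed detail.

```xml
<!-- Hypothetical lexicon combining the attributes described in this article. -->
<!-- Assumption: id takes the language code (here, English). -->
<LEX st="Main Text" id="en" chrBrk="—…–" chrNobrk=":-"/>
```

With these values, and going by the attribute descriptions here, “Genesis 1:1” would keep “1:1” as one word, while two words joined by an em dash would be split at the dash.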
