Part of Speech Documentation
Updated as of 24 August 2020.
WordCruncher can attach a part of speech code to each word and include it with the word information in the index (Word Wheel). These part of speech codes are built from predefined tags (see the charts of tags below), which can be organized hierarchically. For instance, the part of speech code for a singular, common noun can be written with the three tags: “n.comm.sing”. While WordCruncher provides a rich set of predefined part of speech tags, there are occasions where they may be insufficient to fully describe a part of speech code. In these cases, user-defined tags can be added.
The Indexer does not currently tag texts for part of speech. We recommend using an external part of speech tagger like Stanford NLP or TreeTagger to tag your texts. Python resources like NLTK, Stanza, or spaCy are also useful for tagging texts. These part of speech taggers produce abbreviated markup codes like NN1 for singular nouns and NN2 for plural nouns. WordCruncher can accept these markup codes for parts of speech, but an XML file called the EPOSX is needed to define how these markups are translated to WordCruncher part of speech codes.
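
As a rough illustration of the tagging step described above, the sketch below joins (word, markup) pairs into the word_MARKUP form that WordCruncher reads. It assumes your tagger's output has already been reduced to simple pairs; the function name join_tagged_tokens is hypothetical, and attaching punctuation to the text is not handled here.

```python
def join_tagged_tokens(pairs, word_separator="_"):
    """Join (word, markup) pairs into 'word_MARKUP' tokens separated by spaces."""
    tokens = []
    for word, markup in pairs:
        # A markup containing a space or the word separator would be ambiguous.
        if " " in markup or word_separator in markup:
            raise ValueError(f"markup {markup!r} contains a space or separator")
        tokens.append(f"{word}{word_separator}{markup}")
    return " ".join(tokens)


# Pairs as they might come from an external tagger (illustrative values).
tagged = [("The", "AT"), ("quick", "JJ"), ("fox", "NN1")]
print(join_tagged_tokens(tagged))  # prints: The_AT quick_JJ fox_NN1
```

The same function works with a different separator, for example join_tagged_tokens(tagged, "/"), provided the EPOSX declares that separator.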
Below are a few example EPOSX files for download. They can be used or modified to fit your needs.
ETAX Addition
To include an EPOSX file in an ETAX file, add the eposx attribute to the <etax> element and set its value to the filename of the EPOSX file.
Example:
<etax id="…" sifx="…" eposx="C5.eposx">
EPOSX Template
We encourage you to start from one of the EPOSX files on our GitHub when creating your custom EPOSX. However, here is the basic template needed for the file.
<eposx title="C5" wordSeparator="_" ambiguitySeparator="-">
<userTags>
<tag code="wh" name="Wh- Words" />
</userTags>
<markups>
<markup text="AJC" code="adj.comp" />
</markups>
</eposx>
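
If you would rather generate an EPOSX than edit one by hand, the following is a minimal sketch using Python's standard xml.etree.ElementTree module. The function build_eposx and its parameters are hypothetical helpers, not part of WordCruncher; the element and attribute names follow the template and the attribute table in this document (title, wordSeparator, ambiguitySeparator).

```python
import xml.etree.ElementTree as ET


def build_eposx(title, markups, user_tags=None,
                word_separator="_", ambiguity_separator="-"):
    """Build an <eposx> element from plain dictionaries.

    markups:   {tagger markup: WordCruncher part of speech code}
    user_tags: {tag code: readable tag name}
    """
    root = ET.Element("eposx", {
        "title": title,
        "wordSeparator": word_separator,
        "ambiguitySeparator": ambiguity_separator,
    })
    user_tags_el = ET.SubElement(root, "userTags")
    for code, name in (user_tags or {}).items():
        ET.SubElement(user_tags_el, "tag", {"code": code, "name": name})
    markups_el = ET.SubElement(root, "markups")
    for text, code in markups.items():
        ET.SubElement(markups_el, "markup", {"text": text, "code": code})
    return root


root = build_eposx("C5", {"AJC": "adj.comp"}, {"wh": "Wh- Words"})
# Serialize to a string; write it to a file such as C5.eposx as needed.
print(ET.tostring(root, encoding="unicode"))
```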
EPOSX Attributes
Attribute | Example Values | Description |
---|---|---|
title | C5, C8, Stanza-English | Give a name to the type of part of speech, preferably in reference to the name of the tagger used. |
wordSeparator | _ (default), / | The character that is used to mark a word with the part of speech. If an underscore is the word separator, then your text should look like “word_NN0”. |
ambiguitySeparator | - (default), ? | This character is used within part of speech codes. Taggers may provide two parts of speech if there is ambiguity. If a hyphen is used for the ambiguity separator, then your text should look like “swimming_NN0-VBG”. |
WordCruncher books are prepared with an underscore as the word separator and a hyphen as the ambiguity separator, as shown below:
<eposx title="C5" wordSeparator="_" ambiguitySeparator="-">
<eposx title="C8" wordSeparator="_" ambiguitySeparator="-">
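
To make the two separators concrete, here is a small sketch that splits a tagged token into its word and its candidate markups. It is an illustration only, not WordCruncher's parser; it assumes the first word separator in a token starts the part of speech, which matches the my_function_NN0 note later in this document. The name split_tagged_token is hypothetical.

```python
def split_tagged_token(token, word_separator="_", ambiguity_separator="-"):
    """Split 'word_TAG1-TAG2' into ('word', ['TAG1', 'TAG2'])."""
    # partition() splits at the FIRST word separator in the token.
    word, sep, markups = token.partition(word_separator)
    if not sep or not markups:
        return word, []  # no part of speech attached
    return word, markups.split(ambiguity_separator)


print(split_tagged_token("word_NN0"))          # ('word', ['NN0'])
print(split_tagged_token("swimming_NN0-VBG"))  # ('swimming', ['NN0', 'VBG'])
```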
Primary Part of Speech Tags
WordCruncher has predefined tags that must be the first tag listed in each part of speech code. We call these “Primary” tags. Primary tags can also be used as “Secondary” tags if needed. For example, “n.adv” can be used to mark an adverbial noun. Below are listed the Primary tags. Use the “Tag Code” when defining a part of speech code. The “Tag Name” is given merely for convenience and readability.
Tag Name | Tag Code |
---|---|
adjective | adj |
adverb | adv |
alphabet | alph |
article | art |
circumposition | circ |
classifier | clf |
clitic | clitic |
conjunction | conj |
determiner | det |
existential | exist |
interjection | interj |
noun | n |
null | null |
numeral | num |
other | oth |
particle | ptcl |
postposition | postp |
preposition | prep |
pronoun | pron |
punctuation | punct |
unclassified | uncl |
verb | v |
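
The rule that every part of speech code begins with a Primary tag can be checked mechanically before indexing. The sketch below copies the tag codes from the table above; the helper name starts_with_primary_tag is hypothetical, and the check says nothing about whether the remaining tags are valid.

```python
# Primary tag codes, copied from the table above.
PRIMARY_TAGS = {
    "adj", "adv", "alph", "art", "circ", "clf", "clitic", "conj", "det",
    "exist", "interj", "n", "null", "num", "oth", "ptcl", "postp", "prep",
    "pron", "punct", "uncl", "v",
}


def starts_with_primary_tag(code):
    """Return True if the first dot-separated tag is a Primary tag."""
    return code.split(".")[0] in PRIMARY_TAGS


print(starts_with_primary_tag("n.comm.sing"))  # True
print(starts_with_primary_tag("comm.n"))       # False
```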
Secondary Part of Speech Tags
Secondary tags are grouped in this table by the grammatical category that is usually associated with them. However, they can be used in any category, depending on your part of speech schema. It is recommended that only one tag per category be used when defining a part of speech code. User tags can be added to this list of secondary tags if further definition is required.
Tag Name | Tag Code |
---|---|
Verb Type | |
lexical | lex |
auxiliary | aux |
modal | mod |
semiauxiliary | semiaux |
Noun Type | |
common | comm |
proper | prop |
Common Noun Type | |
unit | unit |
direction | dir |
temporal | temp |
Action Noun Type | |
subject | sbj |
object | obj |
Adjective/Adverb Type | |
comparative | comp |
superlative | superl |
evaluative | eval |
positive | pos |
negative | neg |
attributive | attr |
predicative | pred |
degree | deg |
Numeral Type | |
cardinal | card |
ordinal | ord |
fraction | frac |
Conjunction Type | |
coordinating | coord |
subordinating | subord |
correlative | corr |
Pronoun/Determiner Type | |
definite | def |
indefinite | indef |
demonstrative | dem |
exclamative | excl |
interrogative | interrog |
personal | pers |
reflexive | refl |
irreflexive | irrefl |
relative | rel |
substitutive | substit |
Tense | |
conditional | cond |
future | fut |
past | past |
present | pres |
Aspect | |
imperfect | imperf |
perfect | perf |
progressive | prog |
aorist | aor |
pluperfect | pluperf |
Voice | |
active | act |
passive | pass |
middle | mid |
causative | caus |
Mood | |
indicative | indic |
imperative | imp |
subjunctive | subj |
optative | opt |
infinitive | inf |
finite | fin |
gerund | ger |
participle | ptcp |
Person | |
first person | 1 |
second person | 2 |
third person | 3 |
Case | |
nominative | nom |
genitive | gen |
possessive | poss |
dative | dat |
accusative | acc |
locative | loc |
vocative | voc |
instrumental | instr |
absolutive | abs |
ergative | erg |
Gender | |
masculine | masc |
feminine | fem |
neuter | neut |
universal | univ |
Animacy | |
animate | anim |
inanimate | inanim |
human | hum |
nonhuman | nonhum |
Number | |
singular | sing |
plural | pl |
dual | dual |
Honorifics | |
formal | form |
informal | inform |
polite | pol |
royal | roy |
Abbreviation | |
abbreviation | abbr |
contraction | contr |
foreign | foreign |
headline | head |
title | title |
marker | mark |
pronominal | pronom |
truncated | trunc |
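
The recommendation to use only one tag per category can also be checked with a short script. This is a sketch under stated assumptions: CATEGORY_OF covers only a few categories from the table above for brevity, the helper name repeated_categories is hypothetical, and since the rule is a recommendation, a non-empty result is a warning rather than an error.

```python
# A small excerpt of the secondary tag table: tag code -> category.
CATEGORY_OF = {
    "comm": "Noun Type", "prop": "Noun Type",
    "sing": "Number", "pl": "Number", "dual": "Number",
    "cond": "Tense", "fut": "Tense", "past": "Tense", "pres": "Tense",
}


def repeated_categories(code):
    """Return {category: [tags]} for categories used more than once."""
    seen = {}
    # Skip the first tag, which is the Primary tag.
    for tag in code.split(".")[1:]:
        category = CATEGORY_OF.get(tag)
        if category is not None:
            seen.setdefault(category, []).append(tag)
    return {cat: tags for cat, tags in seen.items() if len(tags) > 1}


print(repeated_categories("n.comm.sing"))  # {}
print(repeated_categories("n.sing.pl"))    # {'Number': ['sing', 'pl']}
```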
User-Created Part of Speech Tags
In the <userTags> section of an EPOSX, a user can add up to 64 tag elements that are not specified in the secondary tags list. Codes must not contain spaces, the word separator character, or the ambiguity separator character.
Individual tags need to be in an empty <tag/> element. There are two attributes for this tag: code and name.
Below are some examples of tags that you can make.
<userTags>
<tag code="wh" name="Wh-words" />
<tag code="that" name="That" />
<tag code="of" name="Of" />
<tag code="be" name="Be verbs" />
<tag code="do" name="Do verbs" />
<tag code="have" name="Have verbs" />
</userTags>
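
The constraints on user tags (at most 64 tags; no spaces, word separator, or ambiguity separator in a code) are easy to verify before building a book. A minimal sketch follows; the function check_user_tags is hypothetical and checks only the rules stated in this section.

```python
def check_user_tags(tags, word_separator="_", ambiguity_separator="-",
                    limit=64):
    """Return a list of problems found in {code: name} user tags."""
    problems = []
    if len(tags) > limit:
        problems.append(f"{len(tags)} user tags defined; the limit is {limit}")
    forbidden = (
        (" ", "space"),
        (word_separator, "word separator"),
        (ambiguity_separator, "ambiguity separator"),
    )
    for code in tags:
        for char, label in forbidden:
            if char in code:
                problems.append(f"{code!r} contains a {label}")
    return problems


user_tags = {"wh": "Wh-words", "that": "That", "be": "Be verbs"}
print(check_user_tags(user_tags))  # []
```

Only the code is checked, so a name such as "Wh-words" may contain a hyphen.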
Part of Speech Markups
The <markups> element is the XML element where all markups defined by external taggers are given WordCruncher part of speech code equivalents using the <markup/> element. This element has two required attributes: text and code.
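
Before indexing, it can help to confirm that every markup your tagger produced has an entry in the <markups> element. The sketch below reads the text and code attributes with Python's standard xml.etree.ElementTree; the helper names load_markups and unmapped_markups are hypothetical, and the embedded EPOSX is the basic template from this document.

```python
import xml.etree.ElementTree as ET

EPOSX = """<eposx title="C5" wordSeparator="_" ambiguitySeparator="-">
<userTags>
<tag code="wh" name="Wh- Words" />
</userTags>
<markups>
<markup text="AJC" code="adj.comp" />
</markups>
</eposx>"""


def load_markups(eposx_xml):
    """Return {tagger markup: WordCruncher code} from EPOSX text."""
    root = ET.fromstring(eposx_xml)
    return {m.get("text"): m.get("code") for m in root.iter("markup")}


def unmapped_markups(tagger_markups, table):
    """Return the sorted markups that have no <markup/> entry."""
    return sorted(set(tagger_markups) - set(table))


table = load_markups(EPOSX)
print(table)                                           # {'AJC': 'adj.comp'}
print(unmapped_markups(["AJC", "NN1", "NN1"], table))  # ['NN1']
```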
Example Text
With the EPOSX prepared correctly, the text in the ETAX needs to follow the character separators. An example paragraph should look like this:
<p>The_AT quick_JJ brown_JJ fox_NN1 jumped_VVD over_II the_AT lazy_JJ dog_NN1.</p>
Alternatively, a user can use the default WordCruncher markup for tagging text. If done this way, you won't need to add anything else to the <EPOSX>.
<p>The_art quick_adj brown_adj fox_n.sing jumped_v.past over_prep the_art lazy_adj dog_n.sing.</p>
Miscellaneous Notes
Available Features: Search features and the Neighborhood Report are available for part of speech books. Other reports, such as the Phrase Compare Report, will be updated later to use POS functionality.
Punctuation: Normal punctuation that is not defined as a word separator or ambiguity separator will not be used for parts of speech. That means that it’s okay to have punctuation right after a part of speech tag without adding a zero-space <zs/> tag before the punctuation. For example, if a sentence ends with “dog_NN1.”, then the period at the end will not be indexed as part of the POS tag.
If the word separator or ambiguity separator is part of a word, then a <ch> tag should be wrapped around each such character. For example, my_function_NN0 will cause a problem because the first underscore initiates the part of speech. Since function_NN0 is probably not the desired POS, this should be modified to my<ch>_</ch>function_NN0.
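
When tagged text is generated by a script, the <ch> wrapping can be applied automatically as each word is written. This is a minimal sketch: escape_word and tag_word are hypothetical helper names, and escaping of XML special characters such as & or < is deliberately left out.

```python
def escape_word(word, word_separator="_", ambiguity_separator="-"):
    """Wrap separator characters that occur inside a word in <ch> tags."""
    pieces = []
    for char in word:
        if char in (word_separator, ambiguity_separator):
            pieces.append(f"<ch>{char}</ch>")
        else:
            pieces.append(char)
    return "".join(pieces)


def tag_word(word, markup, word_separator="_", ambiguity_separator="-"):
    """Return the escaped word followed by the separator and its markup."""
    escaped = escape_word(word, word_separator, ambiguity_separator)
    return f"{escaped}{word_separator}{markup}"


print(tag_word("my_function", "NN0"))  # my<ch>_</ch>function_NN0
```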
Attributes, Tag Words, Reference Levels: Any text that is in an attribute, tag word, or reference level will not be indexed for POS tags.