Part of Speech Documentation

Updated as of 24 August 2020.

WordCruncher can attach a part of speech code to each word and include it with the word information in the index (Word Wheel). These part of speech codes are built from predefined tags (see the tag tables below), which can be organized hierarchically. For instance, the part of speech code for a singular common noun can be written with three tags: “n.comm.sing”. WordCruncher provides a rich set of predefined part of speech tags, but they are occasionally insufficient to fully describe a part of speech code. In these cases, user-defined tags can be added.

The Indexer does not currently tag texts for part of speech. We recommend using an external part of speech tagger such as Stanford NLP or TreeTagger to tag your texts. Python libraries such as NLTK, Stanza, and spaCy are also useful for tagging texts. These taggers produce abbreviated markup codes, such as NN1 for singular nouns and NN2 for plural nouns. WordCruncher can accept these markup codes, but an XML file called the EPOSX is needed to define how each markup is translated to a WordCruncher part of speech code.
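Whichever tagger you use, its output eventually has to be written as word and markup pairs joined by the word separator. The following is a minimal sketch of that step, assuming the tagger output has already been collected as a list of (word, markup) pairs. The function name to_wordcruncher_text is hypothetical and not part of WordCruncher or any tagger library.

```python
def to_wordcruncher_text(tagged_tokens, word_separator="_"):
    """Join (word, markup) pairs into 'word_MARKUP' tokens separated by spaces.

    Assumption: tagged_tokens is a list of (word, markup) tuples that you
    built yourself from your tagger's output.
    """
    return " ".join(
        f"{word}{word_separator}{markup}" for word, markup in tagged_tokens
    )


# Illustrative pairs using the NN1/NN2 markups mentioned above.
tagged = [("dog", "NN1"), ("dogs", "NN2")]
print(to_wordcruncher_text(tagged))
```

This sketch does not handle punctuation tokens or words that contain the separator character, so real tagger output usually needs some additional cleanup.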

Below are a few example EPOSX files for download. They can be used or modified to fit your needs.

C5.eposx

TreeTagger-DE.eposx

stanza-en.eposx

ETAX Addition

To include an EPOSX file in an ETAX file, add the eposx attribute to the <etax> element. The value of the attribute is the filename of the EPOSX file.

Example:

<etax id="…" sifx="…" eposx="C5.eposx">

EPOSX Template

We encourage you to start from one of the EPOSX files on our GitHub when creating a custom EPOSX. However, here is the basic template needed for the file.

<eposx title="C5" wordSeparator="_" ambiguitySeparator="-">
  <userTags>
    <tag code="wh" name="Wh- Words" />
  </userTags>
  <markups>
    <markup text="AJC" code="adj.comp" />
  </markups>
</eposx>
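If you maintain your markup mappings in a script, the template can also be generated rather than typed by hand. The sketch below uses Python's standard xml.etree.ElementTree module. The function build_eposx and its parameters are hypothetical, and it writes the title attribute described in the EPOSX Attributes table.

```python
import xml.etree.ElementTree as ET


def build_eposx(title, user_tags, markups,
                word_separator="_", ambiguity_separator="-"):
    """Return an EPOSX document as a string.

    user_tags: dict mapping tag code -> tag name
    markups:   dict mapping tagger markup text -> WordCruncher POS code
    """
    root = ET.Element("eposx", {
        "title": title,
        "wordSeparator": word_separator,
        "ambiguitySeparator": ambiguity_separator,
    })
    user_tags_el = ET.SubElement(root, "userTags")
    for code, name in user_tags.items():
        ET.SubElement(user_tags_el, "tag", {"code": code, "name": name})
    markups_el = ET.SubElement(root, "markups")
    for text, code in markups.items():
        ET.SubElement(markups_el, "markup", {"text": text, "code": code})
    if hasattr(ET, "indent"):  # pretty-printing is available in Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


print(build_eposx("C5", {"wh": "Wh- Words"}, {"AJC": "adj.comp"}))
```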

EPOSX Attributes

| Attribute | Example Values | Description |
|---|---|---|
| title | C5, C8, Stanza-English | Give a name to the type of part of speech, preferably in reference to the name of the tagger used. |
| wordSeparator | _ (Default), / | The character that is used to mark a word with the part of speech. If an underscore is the word separator, then your text should look like “word_NN0”. |
| ambiguitySeparator | - (Default), ? | This character is used within part of speech codes. Taggers may provide two parts of speech if there is ambiguity. If a hyphen is used for the ambiguity separator, then your text should look like “swimming_NN0-VBG”. |

WordCruncher books are prepared with an underscore as the word separator and a hyphen as the ambiguity separator, as shown below:

<eposx title="C5" wordSeparator="_" ambiguitySeparator="-">

<eposx title="C8" wordSeparator="_" ambiguitySeparator="-">
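To see how the two separators divide a token, here is a small sketch that splits a tagged token into its word and its list of markups. The function split_token is hypothetical and only approximates the behavior described here. It assumes the first word separator starts the part of speech, and it does not strip trailing punctuation.

```python
def split_token(token, word_separator="_", ambiguity_separator="-"):
    """Split 'word_POS1-POS2' into ('word', ['POS1', 'POS2']).

    Assumption: the first occurrence of the word separator begins the
    part of speech. A token without a separator has no markups.
    """
    word, found, pos = token.partition(word_separator)
    if not found:
        return token, []
    return word, pos.split(ambiguity_separator)


print(split_token("word_NN0"))
print(split_token("swimming_NN0-VBG"))
```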

Primary Part of Speech Tags

WordCruncher has predefined tags that must be the first tag listed in each part of speech code. We call these “Primary” tags. Primary tags can also be used as “Secondary” tags if needed. For example, “n.adv” can be used to mark an adverbial noun. Below are listed the Primary tags. Use the “Tag Code” when defining a part of speech code. The “Tag Name” is given merely for convenience and readability.

| Tag Name | Tag Code |
|---|---|
| adjective | adj |
| adverb | adv |
| alphabet | alph |
| article | art |
| circumposition | circ |
| classifier | clf |
| clitic | clitic |
| conjunction | conj |
| determiner | det |
| existential | exist |
| interjection | interj |
| noun | n |
| null | null |
| numeral | num |
| other | oth |
| particle | ptcl |
| postposition | postp |
| preposition | prep |
| pronoun | pron |
| punctuation | punct |
| unclassified | uncl |
| verb | v |
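The rule that every part of speech code must begin with a Primary tag is easy to check before indexing. The following is a minimal sketch with a hypothetical helper name. It assumes tags within a code are separated by periods, as in “n.comm.sing”.

```python
# Primary tag codes copied from the table above.
PRIMARY_TAGS = {
    "adj", "adv", "alph", "art", "circ", "clf", "clitic", "conj", "det",
    "exist", "interj", "n", "null", "num", "oth", "ptcl", "postp", "prep",
    "pron", "punct", "uncl", "v",
}


def starts_with_primary_tag(pos_code):
    """Return True if the first tag of a dotted POS code is a Primary tag."""
    first_tag = pos_code.split(".")[0]
    return first_tag in PRIMARY_TAGS


print(starts_with_primary_tag("n.comm.sing"))
print(starts_with_primary_tag("comm.n"))
```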

Secondary Part of Speech Tags

Secondary tags are grouped in this table by the grammatical category usually associated with them. However, they can be used under any category that fits your part of speech schema. It is recommended that only one tag per category be used when defining a part of speech code. User tags can be added to this list of secondary tags if further definition is required.

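| Category | Tag Name | Tag Code |
|---|---|---|
| Verb Type | lexical | lex |
| Verb Type | auxiliary | aux |
| Verb Type | modal | mod |
| Verb Type | semiauxiliary | semiaux |
| Noun Type | common | comm |
| Noun Type | proper | prop |
| Common Noun Type | unit | unit |
| Common Noun Type | direction | dir |
| Common Noun Type | temporal | temp |
| Action Noun Type | subject | sbj |
| Action Noun Type | object | obj |
| Adjective/Adverb Type | comparative | comp |
| Adjective/Adverb Type | superlative | superl |
| Adjective/Adverb Type | evaluative | eval |
| Adjective/Adverb Type | positive | pos |
| Adjective/Adverb Type | negative | neg |
| Adjective/Adverb Type | attributive | attr |
| Adjective/Adverb Type | predicative | pred |
| Adjective/Adverb Type | degree | deg |
| Numeral Type | cardinal | card |
| Numeral Type | ordinal | ord |
| Numeral Type | fraction | frac |
| Conjunction Type | coordinating | coord |
| Conjunction Type | subordinating | subord |
| Conjunction Type | correlative | corr |
| Pronoun/Determiner Type | definite | def |
| Pronoun/Determiner Type | indefinite | indef |
| Pronoun/Determiner Type | demonstrative | dem |
| Pronoun/Determiner Type | exclamative | excl |
| Pronoun/Determiner Type | interrogative | interrog |
| Pronoun/Determiner Type | personal | pers |
| Pronoun/Determiner Type | reflexive | refl |
| Pronoun/Determiner Type | irreflexive | irrefl |
| Pronoun/Determiner Type | relative | rel |
| Pronoun/Determiner Type | substitutive | substit |
| Tense | conditional | cond |
| Tense | future | fut |
| Tense | past | past |
| Tense | present | pres |
| Aspect | imperfect | imperf |
| Aspect | perfect | perf |
| Aspect | progressive | prog |
| Aspect | aorist | aor |
| Aspect | pluperfect | pluperf |
| Voice | active | act |
| Voice | passive | pass |
| Voice | middle | mid |
| Voice | causative | caus |
| Mood | indicative | indic |
| Mood | imperative | imp |
| Mood | subjunctive | subj |
| Mood | optative | opt |
| Mood | infinitive | inf |
| Mood | finite | fin |
| Mood | gerund | ger |
| Mood | participle | ptcp |
| Person | first person | 1 |
| Person | second person | 2 |
| Person | third person | 3 |
| Case | nominative | nom |
| Case | genitive | gen |
| Case | possessive | poss |
| Case | dative | dat |
| Case | accusative | acc |
| Case | locative | loc |
| Case | vocative | voc |
| Case | instrumental | instr |
| Case | absolutive | abs |
| Case | ergative | erg |
| Gender | masculine | masc |
| Gender | feminine | fem |
| Gender | neuter | neut |
| Gender | universal | univ |
| Animacy | animate | anim |
| Animacy | inanimate | inanim |
| Animacy | human | hum |
| Animacy | nonhuman | nonhum |
| Number | singular | sing |
| Number | plural | pl |
| Number | dual | dual |
| Honorifics | formal | form |
| Honorifics | informal | inform |
| Honorifics | polite | pol |
| Honorifics | royal | roy |
| Abbreviation | abbreviation | abbr |
| Abbreviation | contraction | contr |
| Abbreviation | foreign | foreign |
| Abbreviation | headline | head |
| Abbreviation | title | title |
| Abbreviation | marker | mark |
| Abbreviation | pronominal | pronom |
| Abbreviation | truncated | trunc |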
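The recommendation to use only one tag per category can be checked mechanically. The sketch below is illustrative only: it includes just three categories from the table above, the helper name repeated_categories is hypothetical, and it assumes tags are separated by periods with the Primary tag first.

```python
# A small subset of the categories above; extend it to cover your schema.
SECONDARY_CATEGORIES = {
    "Noun Type": {"comm", "prop"},
    "Tense": {"cond", "fut", "past", "pres"},
    "Number": {"sing", "pl", "dual"},
}


def repeated_categories(pos_code):
    """Return the categories that appear more than once in a POS code."""
    secondary_tags = pos_code.split(".")[1:]  # skip the Primary tag
    repeated = []
    for category, codes in SECONDARY_CATEGORIES.items():
        matches = [tag for tag in secondary_tags if tag in codes]
        if len(matches) > 1:
            repeated.append(category)
    return repeated


print(repeated_categories("n.comm.sing"))
print(repeated_categories("n.sing.pl"))
```

Because one tag per category is a recommendation rather than a requirement, a check like this is best treated as a warning, not an error.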

User-Created Part of Speech Tags

In the <userTags> section of an EPOSX, you can add up to 64 tag elements for tags that are not in the secondary tags list. A tag code must not contain spaces, the word separator character, or the ambiguity separator character.

Each tag is defined in an empty <tag/> element, which takes two attributes: code and name.


Below are some examples of tags that you can make.

<userTags>
  <tag code="wh" name="Wh-words" />
  <tag code="that" name="That" />
  <tag code="of" name="Of" />
  <tag code="be" name="Be verbs" />
  <tag code="do" name="Do verbs" />
  <tag code="have" name="Have verbs" />
</userTags>
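The constraints on user tags (at most 64 tags, and no spaces or separator characters in a code) can be verified before the EPOSX is used. This is a minimal sketch, assuming the user tags are held in a dict of code to name. The function validate_user_tags is hypothetical.

```python
def validate_user_tags(user_tags, word_separator="_",
                       ambiguity_separator="-", max_tags=64):
    """Return a list of problems found in a dict of {code: name} user tags."""
    problems = []
    if len(user_tags) > max_tags:
        problems.append(f"too many user tags: {len(user_tags)} > {max_tags}")
    for code in user_tags:
        if any(ch.isspace() for ch in code):
            problems.append(f"code {code!r} contains a space")
        if word_separator in code:
            problems.append(f"code {code!r} contains the word separator")
        if ambiguity_separator in code:
            problems.append(f"code {code!r} contains the ambiguity separator")
    return problems


print(validate_user_tags({"wh": "Wh-words", "be": "Be verbs"}))
print(validate_user_tags({"be verb": "Be verbs", "wh_word": "Wh-words"}))
```

Only the code is checked. A name such as “Wh-words” may contain a hyphen because names are given for readability.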

Part of Speech Markups

Inside the <markups> element, each <markup/> element maps a markup produced by an external tagger to its WordCruncher part of speech code equivalent. The <markup/> element has two required attributes: text and code.

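To illustrate how the text and code attributes pair up, the sketch below reads the <markup/> elements from an EPOSX string into a lookup table using Python's standard xml.etree.ElementTree module. The helper names are hypothetical, and this shows only the mapping idea, not how WordCruncher itself processes the file.

```python
import xml.etree.ElementTree as ET

# A small EPOSX document based on the template in this documentation.
EPOSX_SAMPLE = """<eposx title="C5" wordSeparator="_" ambiguitySeparator="-">
  <userTags>
    <tag code="wh" name="Wh- Words" />
  </userTags>
  <markups>
    <markup text="AJC" code="adj.comp" />
  </markups>
</eposx>"""


def load_markups(eposx_xml):
    """Return a dict mapping each markup text to its WordCruncher POS code."""
    root = ET.fromstring(eposx_xml)
    return {m.get("text"): m.get("code") for m in root.iter("markup")}


def undefined_markups(markups_in_text, table):
    """Return the markups used in a text that the EPOSX does not define."""
    return sorted(set(markups_in_text) - set(table))


table = load_markups(EPOSX_SAMPLE)
print(table["AJC"])
print(undefined_markups(["AJC", "NN1"], table))
```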

Example Text

With the EPOSX prepared correctly, the text in the ETAX must use the word separator and ambiguity separator defined in the EPOSX. An example paragraph should look like this:

<p>The_AT quick_JJ brown_JJ fox_NN1 jumped_VVD over_II the_AT lazy_JJ dog_NN1.</p>

Alternatively, you can tag the text directly with the default WordCruncher part of speech codes. In that case, you won't need to add any markups to the EPOSX.

<p>The_art quick_adj brown_adj fox_n.sing jumped_v.past over_prep the_art lazy_adj dog_n.sing.</p>

Miscellaneous Notes

Available Features: Search features and the Neighborhood Report are available for part of speech books. Other reports, such as the Phrase Compare Report, will be updated later to use POS functionality.

Punctuation: Normal punctuation that is not defined as the word separator or ambiguity separator will not be used for parts of speech. This means that punctuation can directly follow a part of speech tag without a zero space <zs/> tag before it. For example, if a sentence ends with “dog_NN1.”, then the period at the end will not be indexed as part of the POS tag.

If the word separator or ambiguity separator character is part of a word, wrap that character in a <ch> tag. For example, my_function_NN0 will cause a problem because the underscore initiates the part of speech. Since function_NN0 is probably not the desired POS, this should be written as my<ch>_</ch>function_NN0.
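When tagged text is generated by a script, this wrapping can be applied automatically before the markup is appended. The following is a minimal sketch with hypothetical helper names. It wraps each separator character individually, which is an assumption about how <ch> should be applied to consecutive characters.

```python
def escape_word(word, word_separator="_", ambiguity_separator="-"):
    """Wrap separator characters that occur inside a word in <ch> tags."""
    escaped = []
    for ch in word:
        if ch in (word_separator, ambiguity_separator):
            escaped.append(f"<ch>{ch}</ch>")
        else:
            escaped.append(ch)
    return "".join(escaped)


def tag_word(word, markup, word_separator="_", ambiguity_separator="-"):
    """Return the escaped word joined to its markup by the word separator."""
    safe_word = escape_word(word, word_separator, ambiguity_separator)
    return f"{safe_word}{word_separator}{markup}"


print(tag_word("my_function", "NN0"))
```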

Attributes, Tag Words, Reference Levels: Any text that is in an attribute, tag word, or reference level will not be indexed for POS tags.