Part of Speech Documentation

Updated as of 24 August 2020.

WordCruncher can attach a part of speech code to each word and include it with the word information in the index (Word Wheel). These part of speech codes are built from predefined tags (see the charts of tags below), which can be organized hierarchically. For instance, the part of speech code for a singular, common noun can be written with three tags: “n.comm.sing”. While WordCruncher provides a rich set of predefined part of speech tags, they are sometimes insufficient to fully describe a part of speech code. In these cases, user-defined tags can be added.

The Indexer does not currently tag texts for part of speech. We recommend using an external part of speech tagger such as Stanford NLP or TreeTagger to tag your texts. Python resources such as NLTK, Stanza, and spaCy are also useful for tagging texts. These taggers produce abbreviated markup codes, such as NN1 for singular nouns and NN2 for plural nouns. WordCruncher can accept these markup codes directly, but it needs an XML file called the EPOSX to define how they are translated into WordCruncher part of speech codes.
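Most taggers return each word paired with its markup code, so the output has to be joined into tagged text before it goes into a book. The sketch below shows one way to do that in Python. The function name to_tagged_text and the sample pairs are hypothetical, and the default underscore is simply the default word separator described in this document.

```python
def to_tagged_text(pairs, word_separator="_"):
    """Join (word, markup) pairs into 'word_MARKUP' tokens separated by spaces."""
    return " ".join(f"{word}{word_separator}{markup}" for word, markup in pairs)


# Hypothetical tagger output; a real tagger would supply this list.
pairs = [("cat", "NN1"), ("cats", "NN2")]
print(to_tagged_text(pairs))
```

If your EPOSX uses a different word separator, pass it as the second argument, for example to_tagged_text(pairs, "/").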

Below are a few example EPOSX files for download. They can be used or modified to fit your needs.

C5.eposx

TreeTagger-DE.eposx

stanza-en.eposx

ETAX Addition

To include an EPOSX file in an ETAX file, add the eposx attribute to the <ETAX> element, with the filename of the EPOSX file as its value.

Example:
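A minimal Python sketch of this step, assuming the ETAX content is well-formed XML whose root element is <ETAX>. The function name attach_eposx and the sample ETAX string are hypothetical, and a real ETAX file will contain much more than this.

```python
import xml.etree.ElementTree as ET


def attach_eposx(etax_xml, eposx_filename):
    """Return the ETAX markup with an eposx attribute set on the root element."""
    root = ET.fromstring(etax_xml)
    if root.tag != "ETAX":
        raise ValueError(f"expected an <ETAX> root element, found <{root.tag}>")
    root.set("eposx", eposx_filename)
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily simplified ETAX content.
sample = "<ETAX><p>dog_NN1</p></ETAX>"
print(attach_eposx(sample, "C5.eposx"))
# prints <ETAX eposx="C5.eposx"><p>dog_NN1</p></ETAX>
```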

EPOSX Template

We encourage you to start from one of the EPOSX files on our GitHub when creating a custom EPOSX. However, here is the basic template needed for the file.

<eposx title="" wordSeparator="_" ambiguitySeparator="-">
<userTags>
</userTags>
<markups>
</markups>
</eposx>

EPOSX Attributes

Attribute	Example Values	Description
title	C5, C8, Stanza-English	A name for the type of part of speech, preferably in reference to the name of the tagger used.
wordSeparator	_ (default), /	The character that joins a word to its part of speech. If an underscore is the word separator, then your text should look like “word_NN0”.
ambiguitySeparator	- (default), ?	The character that separates alternatives within a part of speech code. Taggers may provide two parts of speech if there is ambiguity. If a hyphen is the ambiguity separator, then your text should look like “swimming_NN0-VBG”.

WordCruncher books are prepared with an underscore as the word separator and a hyphen as the ambiguity separator.
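To make the two separators concrete, here is a small Python sketch that splits a tagged token into its word and its list of part of speech markups. It is an illustration only, not WordCruncher's own parser. The function name parse_token is hypothetical, and splitting at the first word separator is an assumption.

```python
def parse_token(token, word_separator="_", ambiguity_separator="-"):
    """Split 'word_POS1-POS2' into the word and a list of its markups."""
    word, found, pos = token.partition(word_separator)
    if not found:
        # No word separator, so the token carries no part of speech.
        return word, []
    return word, pos.split(ambiguity_separator)


print(parse_token("word_NN0"))          # ('word', ['NN0'])
print(parse_token("swimming_NN0-VBG"))  # ('swimming', ['NN0', 'VBG'])
```

The sketch does not strip trailing punctuation, so a token such as “dog_NN1.” would need that handled before parsing.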

Primary Part of Speech Tags

WordCruncher has predefined tags that must be the first tag listed in each part of speech code. We call these “Primary” tags. Primary tags can also be used as “Secondary” tags if needed. For example, “n.adv” can be used to mark an adverbial noun. Below are listed the Primary tags. Use the “Tag Code” when defining a part of speech code. The “Tag Name” is given merely for convenience and readability.

Tag Name	Tag Code
adjective	adj
adverb	adv
alphabet	alph
article	art
circumposition	circ
classifier	clf
clitic	clitic
conjunction	conj
determiner	det
existential	exist
interjection	interj
noun	n
null	null
numeral	num
other	oth
particle	ptcl
postposition	postp
preposition	prep
pronoun	pron
punctuation	punct
unclassified	uncl
verb	v
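
The rule that every part of speech code must begin with a Primary tag is easy to check before indexing. The Python sketch below does so using the tag codes from the table above. The function name starts_with_primary is hypothetical, and it assumes that tags within a code are separated by periods, as in “n.comm.sing”.

```python
# Primary tag codes copied from the table above.
PRIMARY_TAGS = {
    "adj", "adv", "alph", "art", "circ", "clf", "clitic", "conj",
    "det", "exist", "interj", "n", "null", "num", "oth", "ptcl",
    "postp", "prep", "pron", "punct", "uncl", "v",
}


def starts_with_primary(code):
    """Return True if the first tag of a part of speech code is a Primary tag."""
    first_tag = code.split(".", 1)[0]
    return first_tag in PRIMARY_TAGS


print(starts_with_primary("n.comm.sing"))  # True
print(starts_with_primary("sing.n"))       # False
```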

Secondary Part of Speech Tags

Secondary tags are categorized in this table by the grammatical category that is usually associated with them. However, they can be used in any category based on your part of speech schema. It is recommended that only one tag per category be used when defining a part of speech code. User tags can be added to this list of secondary tags if further definition is required.

Tag Name	Tag Code
Verb Type
lexical	lex
auxiliary	aux
modal	mod
semiauxiliary	semiaux
Noun Type
common	comm
proper	prop
Common Noun Type
unit	unit
direction	dir
temporal	temp
Action Noun Type
subject	sbj
object	obj
Adjective/Adverb Type
comparative	comp
superlative	superl
evaluative	eval
positive	pos
negative	neg
attributive	attr
predicative	pred
degree	deg
Numeral Type
cardinal	card
ordinal	ord
fraction	frac
Conjunction Type
coordinating	coord
subordinating	subord
correlative	corr
Pronoun/Determiner Type
definite	def
indefinite	indef
demonstrative	dem
exclamative	excl
interrogative	interrog
personal	pers
reflexive	refl
irreflexive	irrefl
relative	rel
substitutive	substit
Tense
conditional	cond
future	fut
past	past
present	pres
Aspect
imperfect	imperf
perfect	perf
progressive	prog
aorist	aor
pluperfect	pluperf
Voice
active	act
passive	pass
middle	mid
causative	caus
Mood
indicative	indic
imperative	imp
subjunctive	subj
optative	opt
infinitive	inf
finite	fin
gerund	ger
participle	ptcp
Person
first person	1
second person	2
third person	3
Case
nominative	nom
genitive	gen
possessive	poss
dative	dat
accusative	acc
locative	loc
vocative	voc
instrumental	instr
absolutive	abs
ergative	erg
Gender
masculine	masc
feminine	fem
neuter	neut
universal	univ
Animacy
animate	anim
inanimate	inanim
human	hum
nonhuman	nonhum
Number
singular	sing
plural	pl
dual	dual
Honorifics
formal	form
informal	inform
polite	pol
royal	roy
Abbreviation
abbreviation	abbr
contraction	contr
foreign	foreign
headline	head
title	title
marker	mark
pronominal	pronom
truncated	trunc
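
The recommendation to use only one tag per category can be checked mechanically. The Python sketch below reports any category that appears more than once in a code. It covers only three of the categories from the table above (Tense, Gender, and Number) to keep the example short. The function name repeated_categories is hypothetical, and periods are assumed to separate the tags within a code.

```python
# An excerpt of the secondary tag categories from the table above.
CATEGORIES = {
    "Tense": {"cond", "fut", "past", "pres"},
    "Gender": {"masc", "fem", "neut", "univ"},
    "Number": {"sing", "pl", "dual"},
}


def repeated_categories(code):
    """Map each category used more than once in a code to the tags involved."""
    tags = code.split(".")
    repeated = {}
    for category, members in CATEGORIES.items():
        hits = [tag for tag in tags if tag in members]
        if len(hits) > 1:
            repeated[category] = hits
    return repeated


print(repeated_categories("v.past.sing"))     # {}
print(repeated_categories("n.comm.sing.pl"))  # {'Number': ['sing', 'pl']}
```

Because this is only a recommendation, a schema that deliberately combines tags from one category can ignore the report.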

User-Created Part of Speech Tags

In the <userTags> section of an EPOSX, a user can add up to 64 tag elements that are not specified in the secondary tags list. Codes must not contain spaces, the word separator character, or the ambiguity separator character.

Individual tags need to be in an empty <tag/> element. There are two attributes for this tag: code and name.


Below are some examples of tags that you can make.

</userTags>
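
The constraints on user tags (at most 64 tags, and no spaces or separator characters in a code) can be enforced while generating the <userTags> section. The Python sketch below is one way to do that. The function name build_user_tags and the sample tag “arch” are hypothetical and do not come from any published EPOSX file.

```python
import xml.etree.ElementTree as ET

MAX_USER_TAGS = 64  # limit stated in this section


def build_user_tags(tags, word_separator="_", ambiguity_separator="-"):
    """Build a <userTags> element from a {code: name} mapping, validating each code."""
    if len(tags) > MAX_USER_TAGS:
        raise ValueError(f"at most {MAX_USER_TAGS} user tags are allowed")
    user_tags = ET.Element("userTags")
    for code, name in tags.items():
        if not code or any(char.isspace() for char in code):
            raise ValueError(f"tag code {code!r} is empty or contains a space")
        if word_separator in code or ambiguity_separator in code:
            raise ValueError(f"tag code {code!r} contains a separator character")
        ET.SubElement(user_tags, "tag", {"code": code, "name": name})
    return user_tags


# Hypothetical user tag, for illustration only.
element = build_user_tags({"arch": "archaic"})
print(ET.tostring(element, encoding="unicode"))
```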

Part of Speech Markups

The <markups> element is the XML element where all markups defined by external taggers are given WordCruncher part of speech code equivalents using the <markup/> element. This element has two required attributes: text and code.
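
When a tagger has many markups, it can be easier to generate the EPOSX from a mapping than to write it by hand. The Python sketch below builds a small EPOSX document this way. The function name build_eposx is hypothetical. Placing the title, wordSeparator, and ambiguitySeparator attributes on the root <eposx> element, and nesting <userTags> and <markups> directly inside it, are assumptions inferred from this documentation. The code “n.comm.pl” is composed from the tag tables here for illustration.

```python
import xml.etree.ElementTree as ET


def build_eposx(title, markups, word_separator="_", ambiguity_separator="-"):
    """Build an <eposx> element from a {tagger markup: WordCruncher code} mapping."""
    root = ET.Element("eposx", {
        "title": title,
        "wordSeparator": word_separator,
        "ambiguitySeparator": ambiguity_separator,
    })
    ET.SubElement(root, "userTags")  # left empty in this sketch
    markups_element = ET.SubElement(root, "markups")
    for text, code in markups.items():
        # text is the tagger's markup; code is the WordCruncher equivalent.
        ET.SubElement(markups_element, "markup", {"text": text, "code": code})
    return root


root = build_eposx("C5", {"NN1": "n.comm.sing", "NN2": "n.comm.pl"})
print(ET.tostring(root, encoding="unicode"))
```

Compare the result against one of the downloadable EPOSX files before relying on it.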


Example Text

With the EPOSX prepared correctly, the text in the ETAX must use the word separator and ambiguity separator that the EPOSX defines. An example paragraph should look like this:

<p>The_AT quick_JJ brown_JJ fox_NN1 jumped_VVD over_II the_AT lazy_JJ dog_NN1.</p>

Alternatively, a user can use the default WordCruncher markup for tagging text. If done this way, you won't need to add anything else to the <EPOSX>.

<p>The_art quick_adj brown_adj fox_n.sing jumped_v.past over_prep the_art lazy_adj dog_n.sing.</p>
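
If you prefer to store the default WordCruncher markup in the text itself, the tagger's markups can be rewritten before indexing. The Python sketch below does this for the two example sentences above. The mapping is read off by comparing those two sentences. The function name retag is hypothetical, the default separators are assumed, and unknown markups are left unchanged.

```python
import re

# Mapping inferred from the two example paragraphs above.
MARKUP_TO_CODE = {
    "AT": "art",
    "JJ": "adj",
    "NN1": "n.sing",
    "VVD": "v.past",
    "II": "prep",
}

# A word, an underscore, then one or more markups joined by hyphens.
TOKEN = re.compile(r"([^\W_]+)_([A-Za-z0-9]+(?:-[A-Za-z0-9]+)*)")


def retag(text, mapping):
    """Replace each tagger markup in 'word_MARKUP' tokens with its mapped code."""
    def swap(match):
        word, pos = match.group(1), match.group(2)
        codes = [mapping.get(markup, markup) for markup in pos.split("-")]
        return f"{word}_{'-'.join(codes)}"
    return TOKEN.sub(swap, text)


tagged = "The_AT quick_JJ brown_JJ fox_NN1 jumped_VVD over_II the_AT lazy_JJ dog_NN1."
print(retag(tagged, MARKUP_TO_CODE))
# The_art quick_adj brown_adj fox_n.sing jumped_v.past over_prep the_art lazy_adj dog_n.sing.
```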

Miscellaneous Notes

Available Features: Search features and the Neighborhood Report are available for part of speech books. Other reports, such as the Phrase Compare Report, will be updated later to use POS functionality.

Punctuation: Normal punctuation that is not defined as a word separator or ambiguity separator will not be used for parts of speech. That means it’s okay to have punctuation right after a part of speech tag without adding a zero space <zs/> tag before the punctuation.
For example, if a sentence ends with “dog_NN1.”, the period at the end will not be indexed as part of the POS tag.

If the word separator or ambiguity separator is part of a word, then a <ch> tag should be wrapped around that character. For example, my_function_NN0 will cause a problem because the first underscore initiates the part of speech. Since function_NN0 is probably not the desired POS, this should be modified to my<ch>_</ch>function_NN0.
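
This escaping can be applied automatically when the tagged text is generated. The Python sketch below wraps any separator character inside a word in a <ch> tag before appending the part of speech. The function names escape_separators and tag_word are hypothetical.

```python
def escape_separators(word, separators=("_", "-")):
    """Wrap each separator character found inside a word in a <ch> tag."""
    return "".join(
        f"<ch>{char}</ch>" if char in separators else char for char in word
    )


def tag_word(word, pos, word_separator="_", ambiguity_separator="-"):
    """Return the escaped word joined to its part of speech by the word separator."""
    escaped = escape_separators(word, (word_separator, ambiguity_separator))
    return f"{escaped}{word_separator}{pos}"


print(tag_word("my_function", "NN0"))  # my<ch>_</ch>function_NN0
print(tag_word("well-known", "JJ"))    # well<ch>-</ch>known_JJ
```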

Attributes, Tag Words, Reference Levels: Any text that is in an attribute, tag word, or reference level will not be indexed for POS tags.