<Lexalytics root>/data/tagger


If you are comfortable with regular expressions and the concepts behind a part-of-speech (POS) tagger, you can modify the basic POS tagging to handle complex text strings such as hyphenated product names, dates, and other special strings.

Each file is described in more detail below.

customlexicon.dat: A list of words with known POS tags
gluepatterns.ptn: A pattern file applied during tokenization
gluetokens.dat: A case-sensitive list of words that should not be split into individual tokens
uri_suffixes.dat: A list of extensions used in file and website URIs (uniform resource identifiers)

Refer to the List of supported POS tags for information about the part-of-speech tags used within the data files for Salience Engine.

Customizing the tagger in user/tagger

customlexicon.dat

This file allows you to override the POS tagging for individual words. It uses case-sensitive exact matching. Each entry is of the form:

<word><tab><POS tag>

For example:

eBay NNP
iPhone NNP
www.foo.com URL
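The override behaves like a simple dictionary lookup applied on top of the default tagger's output. A minimal Python sketch of that behavior (the function names here are illustrative, not part of the Salience API):

```python
# Sketch of customlexicon.dat-style overrides: entries are '<word>\t<tag>',
# matched against tokens case-sensitively and exactly.

def load_lexicon(lines):
    """Parse lines of the form '<word>\t<tag>' into a dict."""
    lexicon = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, tag = line.split("\t")
        lexicon[word] = tag
    return lexicon

def apply_overrides(tagged_tokens, lexicon):
    """Replace the tag of any token that has an exact, case-sensitive match."""
    return [(tok, lexicon.get(tok, tag)) for tok, tag in tagged_tokens]

lexicon = load_lexicon(["eBay\tNNP", "iPhone\tNNP", "www.foo.com\tURL"])
tagged = [("sold", "VBD"), ("an", "DT"), ("iPhone", "NN"), ("on", "IN"), ("eBay", "NN")]
print(apply_overrides(tagged, lexicon))  # iPhone and eBay now carry NNP
```

Note that because matching is case-sensitive, an entry for eBay does not affect the token Ebay.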

gluepatterns.ptn

The glue file allows the user to glue together items that are split apart by the tokenizer, overriding their POS tag at the same time. The syntax of the glue file is as follows:

{pattern}=>{LABEL}

For example, to recognize the smiley face :-) you would build a glue pattern like the following:

(=':') (='-') (=')')=>SMILEY

The result of a glue statement like the one above is a part-of-speech-tagged document with a special marker for smiley faces. Given the sentence:

"Bob Smith is one happy camper :-)"

You get the following tagged text:

Bob/NNP Smith/NNP is/VBZ one/CD happy/JJ camper/NN :-)/SMILEY ./.

The same concept can be applied to any specialty text that the user wishes to tag, such as dates or product names.

Note: If no label (such as SMILEY) is given to the string, it will be tagged using the normal tagging rules (lexicon lookup, falling back to a maximum entropy model).


gluetokens.dat

This is a case-sensitive list of words that should not be split up by the tokenizer when they appear contiguously in the text; keeping them whole can also affect how they are tagged. For example:

x-box
x-ray
P&G

This prevents the string x-box from being tokenized as x, -, box; it is instead produced as a single token.

You can make an entry case-insensitive by prefixing it with a ~, so

~p&g

would make all occurrences of P&G be tokenized as a single token regardless of case.
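The effect can be sketched as a protected-word check in front of an ordinary punctuation-splitting tokenizer. This is an illustrative approximation, not the Salience tokenizer itself; the ~ prefix is modeled as a separate case-folded set:

```python
# Sketch of a tokenizer pass that keeps gluetokens.dat entries whole.
# Entries are case-sensitive unless prefixed with '~'.
import re

def load_glue_tokens(lines):
    exact, folded = set(), set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("~"):
            folded.add(line[1:].lower())  # case-insensitive entry
        else:
            exact.add(line)              # case-sensitive entry
    return exact, folded

def tokenize(text, exact, folded):
    # Naive baseline: split on whitespace, then split each chunk at
    # punctuation, unless the chunk is a protected glue token.
    tokens = []
    for chunk in text.split():
        if chunk in exact or chunk.lower() in folded:
            tokens.append(chunk)
        else:
            tokens.extend(t for t in re.split(r"([^\w]+)", chunk) if t)
    return tokens

exact, folded = load_glue_tokens(["x-box", "x-ray", "~p&g"])
print(tokenize("my x-box from P&G", exact, folded))
# ['my', 'x-box', 'from', 'P&G']
```

Without the entries, x-box would come out as the three tokens x, -, box.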


uri_suffixes.dat

This data file provides a list of common Internet domain suffixes and file format extensions.
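The actual contents of uri_suffixes.dat are not reproduced here; conceptually, a suffix list like this lets the tokenizer recognize that a dot-joined token is a URI or filename and should stay in one piece. An illustrative sketch, with a made-up handful of suffixes standing in for the real list:

```python
# Illustrative only: SUFFIXES stands in for the entries of uri_suffixes.dat.
SUFFIXES = {"com", "org", "net", "html", "pdf"}

def looks_like_uri(token):
    """True if the token's final dot-separated part is a known suffix."""
    parts = token.split(".")
    return len(parts) > 1 and parts[-1].lower() in SUFFIXES

print(looks_like_uri("www.foo.com"))  # True: '.com' is a known suffix
print(looks_like_uri("etc."))         # False: trailing dot, no suffix
```

A token that passes this check would be left intact (and could then be tagged URL via customlexicon.dat-style rules) rather than split at each period.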