<Lexalytics root>/data/tokenizer

This directory contains the files that control the tokenizer, which identifies the individual tokens in a block of content. A token is generally a single word, but punctuation, contractions, and symbols are also tokens.

With the exception of tokenizer.dat, the following files may be customized by users in the tokenizer section of a user directory. Each file is described in more detail below.

breaks.dat               A list of words containing periods (common abbreviations)
complexstems.dat         A list of words that are commonly expanded in social media content to emphasize sentiment
sentencepunctuation.dat  List of common characters used as terminating sentence punctuation
subwords.dat             A list of words that can be found concatenated in social media content, particularly in hashtags
suffixbreaks.dat         A list of contractions recognized by the tokenizer
tokenizer.dat            A tokenizer-specific data file

Customizing tokenizer functionality in user/tokenizer

breaks.dat

These are words that end in a sentence-terminating character, such as a period, but should not be treated as ending a sentence. For example:

Mr.
U.S.

These entries prevent text like "The U.S.-based company, owned by Mr. Catlin and Mr. Marshall, ..." from being broken up into multiple sentences.
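
The effect of a break list can be illustrated with a minimal sketch. This is not Salience's implementation: the function name split_sentences, the whitespace-based word splitting, and the terminators parameter (standing in for the contents of sentencepunctuation.dat) are all assumptions made for illustration.

```python
def split_sentences(text, breaks, terminators=".!?"):
    """Split text into sentences, skipping terminators that belong to a listed word.

    Hypothetical helper: `breaks` plays the role of breaks.dat entries and
    `terminators` plays the role of sentence-terminating punctuation.
    """
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        # End the sentence only if the word ends in a terminator
        # and is not one of the protected abbreviations.
        if word[-1] in terminators and word not in breaks:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


breaks = {"Mr.", "U.S."}
text = "Mr. Catlin moved to the U.S. last year. He stayed."
for sentence in split_sentences(text, breaks):
    print(sentence)
```

With an empty break list, the same text would be split after "Mr." and after "U.S.", which is the behavior the entries in breaks.dat are meant to avoid.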


complexstems.dat

These are words that are commonly expanded in social media content to emphasize sentiment, for example by repeating letters. This file allows a sentiment multiplier to be applied whenever an expanded form of one of these words is encountered.

The format of the file is:

<word-that-could-be-expanded> <tab> <sentiment-multiplier>

Example:

love    1.2

This would apply a multiplier of 1.2 to sentiment detected for the word "love" in occurrences such as "I loooove Salience!".
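
As a rough illustration of the file format and its effect, the sketch below parses tab-separated entries and looks up the multiplier for an expanded token. It is not Salience's implementation. The function names are hypothetical, and two behaviors are assumptions: an expansion is taken to mean a word with one or more letters repeated, and the unexpanded word itself gets a multiplier of 1.0.

```python
import re


def load_complex_stems(lines):
    """Parse '<word><TAB><multiplier>' lines into a dict (hypothetical helper)."""
    stems = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        word, multiplier = line.rstrip("\n").split("\t")
        stems[word.strip().lower()] = float(multiplier)
    return stems


def expansion_pattern(word):
    """Build a regex matching `word` with each letter repeated one or more times."""
    return re.compile("".join(re.escape(ch) + "+" for ch in word))


def emphasis_multiplier(token, stems):
    """Return the multiplier for an expanded form of a listed word, else 1.0."""
    token = token.lower()
    for word, multiplier in stems.items():
        # Assumption: only expanded forms are boosted, not the plain word.
        if token != word and expansion_pattern(word).fullmatch(token):
            return multiplier
    return 1.0


stems = load_complex_stems(["love\t1.2"])
print(emphasis_multiplier("loooove", stems))  # 1.2
print(emphasis_multiplier("love", stems))     # 1.0
```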


sentencepunctuation.dat

This datafile contains common sentence-terminating punctuation. Users can extend it in user/tokenizer if the content being analyzed by Salience uses additional terminating punctuation.


subwords.dat

These are words that are commonly concatenated in social media content, particularly in hashtags. Salience uses this file to split concatenated hashtag phrases back into individual words so that sentiment can be applied to them.
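
One way to picture hashtag deconstruction is a dictionary-based segmentation over a word list. The sketch below is illustrative only and is not Salience's algorithm: the function name split_hashtag, the sample word list, and the choice to return None when no full split exists are all assumptions.

```python
def split_hashtag(tag, subwords):
    """Split a hashtag into known words, or return None if no full split exists.

    Hypothetical helper: `subwords` plays the role of subwords.dat entries.
    """
    body = tag.lstrip("#").lower()
    # best[i] holds one segmentation of body[:i], or None if none was found.
    best = [None] * (len(body) + 1)
    best[0] = []
    for end in range(1, len(body) + 1):
        for start in range(end):
            piece = body[start:end]
            if best[start] is not None and piece in subwords:
                best[end] = best[start] + [piece]
                break
    return best[len(body)]


subwords = {"best", "day", "ever"}
print(split_hashtag("#bestdayever", subwords))
```

Once the hashtag is split into individual words, each word can be scored for sentiment like any other token.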


suffixbreaks.dat

These are contractions that the tokenizer recognizes and splits into a token separate from the root (generally a pronoun) to improve part-of-speech (POS) tagging. Users can extend this datafile in user/tokenizer if the content being analyzed by Salience contains additional contractions.
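
The splitting described above can be sketched as follows. The entry format of suffixbreaks.dat is not documented here, so the suffix strings below, the function name split_contraction, and the longest-suffix-first matching are assumptions for illustration rather than a description of the actual tokenizer.

```python
def split_contraction(token, suffixes):
    """Split a token into [root, suffix] if it ends in a listed suffix.

    Hypothetical helper: `suffixes` stands in for suffixbreaks.dat entries.
    Tokens with no listed suffix are returned unchanged as a one-item list.
    """
    lowered = token.lower()
    # Try longer suffixes first so a short entry cannot shadow a longer one.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if lowered.endswith(suffix) and len(token) > len(suffix):
            return [token[:-len(suffix)], token[-len(suffix):]]
    return [token]


suffixes = {"'ll", "'ve", "'re", "n't"}
for token in ["they'll", "We've", "don't", "cat"]:
    print(split_contraction(token, suffixes))
```

Separating the suffix lets a POS tagger label the root and the contracted part independently.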


tokenizer.dat

This is a datafile specific to the Salience tokenizer and should not be modified or overridden.