<Lexalytics root>/data/chunker


This directory contains files that support the chunker. For files that can be overridden or customized by users, see the per-file details below.

auxiliary.dat A list of verbs that commonly begin longer verb phrases, e.g. "planned to go", "need to think", "gets to come"
copulas.dat Externalization of linking verbs considered by the chunker.
describers.dat Used by sentencetype.dat in identifying sentence types
discourse.dat Data file containing words that weight sentiment phrases following them in situations of mixed sentiment
instructive_modal.dat Used by sentencetype.dat in identifying sentence types
intensifiers.dat Data file of words that are a direct multiplier of adjacent sentiment-bearing phrases
negationbreaks.dat Externalization of patterns used by chunker to impact negation of chunks.
negations.dat Externalization of the negators considered by the chunker.

These files may be customized within a chunker section of a user directory; however, this is not recommended.

Customizing chunker functionality in user/chunker

auxiliary.dat

Certain verbs commonly occur in longer verb phrases. This file lists verbs that fall into this category to ensure the two verbs are part of the same chunk.


copulas.dat

This file contains a list of verbs which the chunker uses as linking verbs. Note that this file does not contain all forms of individual verbs. For example, the verb "to be" in English is conjugated as "I am, you are, he/she/it is, we are, they are"; in this file you see the forms "are" and "be".


describers.dat

This file is used by sentencetype.dat in the identification of different sentence types, such as imperative sentences. If sentencetype.dat is overridden in a user directory, this file may also be needed.


discourse.dat

This file is used to specify words that weight the sentiment of phrases following them. For example:

The restaurant was nice and clean, but the food was awful.

The word "but" indicates a change in sentiment. The weight for "but" in discourse.dat gives slightly higher weighting to the sentiment phrases found at the end of the sentence, based on observations that they convey the true sentiment intended.
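
The effect can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the phrase scores, the weight of 1.25 for "but", the plain dictionary standing in for discourse.dat (whose file format is not described here), and the simple summing of scores. It is not the engine's actual scoring code.

```python
def weighted_sentiment(tokens, sentiment, discourse_weights):
    """Sum phrase scores, scaling every phrase that follows a discourse word.

    Hypothetical model: once a discourse word is seen, its weight applies
    to all sentiment phrases up to the end of the sentence.
    """
    total = 0.0
    weight = 1.0
    for token in tokens:
        key = token.lower()
        if key in discourse_weights:
            weight = discourse_weights[key]
            continue
        total += sentiment.get(key, 0.0) * weight
    return total


# Illustrative values only; none of these come from the shipped data files.
sentiment = {"nice": 0.5, "clean": 0.5, "awful": -0.75}
tokens = ["nice", "clean", "but", "awful"]

unweighted = weighted_sentiment(tokens, sentiment, {})
weighted = weighted_sentiment(tokens, sentiment, {"but": 1.25})
```

With these made-up numbers the unweighted total is mildly positive, while the weighted total moves toward the negative phrase that follows "but".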


instructive_modal.dat

This file is used by sentencetype.dat in the identification of different sentence types, such as imperative sentences. If sentencetype.dat is overridden in a user directory, this file may also be needed.


intensifiers.dat

This file contains a list of words that modify the sentiment score of the next token only. The file format is:

<word> <tab> <intensifier-multiplier-amount>

Note: If a word is both an intensifier and a sentiment phrase, then it WILL NOT contribute its sentiment score to the document.
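
As a rough sketch of how this format and the note above could be interpreted, the following reads tab-separated entries and applies each multiplier to the next token only. The function names, the sample entries, and the scores are hypothetical, and the sketch assumes every line carries both fields; it is not the engine's implementation.

```python
def parse_intensifiers(text):
    """Parse '<word><tab><multiplier>' lines into a dict (assumes both fields exist)."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, _, amount = line.partition("\t")
        table[word.strip().lower()] = float(amount)
    return table


def score_tokens(tokens, sentiment, intensifiers):
    """Sum sentiment scores, scaling only the token right after an intensifier."""
    total = 0.0
    multiplier = 1.0
    for token in tokens:
        key = token.lower()
        if key in intensifiers:
            # An intensifier adds no sentiment of its own, even if the
            # same word also appears in the sentiment table.
            multiplier = intensifiers[key]
            continue
        total += sentiment.get(key, 0.0) * multiplier
        multiplier = 1.0  # the multiplier covers the next token only
    return total


# Hypothetical entries and scores, for illustration only.
intensifiers = parse_intensifiers("very\t2.0\nslightly\t0.5\n")
sentiment = {"good": 0.5, "food": 0.25, "very": 0.75}
```

Here score_tokens(["very", "good", "food"], sentiment, intensifiers) doubles "good" but leaves "food" at its base score, and the sentiment entry for "very" is ignored because the word is also an intensifier.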


negationbreaks.dat

This file contains a list of words which will stop negation in the middle of a chunk.


negations.dat

This file contains a list of words that can negate (or invert) sentiment. The file format is:

<word-or-construction> <tab> <negation-multiplier-amount>

In all cases in the out-of-the-box default negations file, the inversion is an exact mirror. For example, the sentiment of "I enjoy watching baseball." is inverted when the following negation is encountered: "I never enjoy watching baseball."

Note: Entries without a negation multiplier amount are automatically assigned a value of -1.0.
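
A minimal sketch of reading this format, including the -1.0 default for entries without a multiplier, might look like the following. The function names and the sample entries (in particular "hardly" with -0.5) are hypothetical, and treating negation as multiplying the phrase score by the amount is an assumption based on the field name.

```python
DEFAULT_NEGATION = -1.0  # default stated in the note above


def parse_negations(text):
    """Parse '<word-or-construction><tab><multiplier>' lines into a dict."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        phrase, _, amount = line.partition("\t")
        amount = amount.strip()
        # Entries with no multiplier fall back to the default of -1.0.
        table[phrase.strip().lower()] = float(amount) if amount else DEFAULT_NEGATION
    return table


def negate(score, negator, negations):
    """Assumed behavior: multiply the phrase score by the negator's amount."""
    return score * negations[negator.lower()]


# Hypothetical file content, for illustration only.
negations = parse_negations("never\nnot\t-1.0\nhardly\t-0.5\n")
mirrored = negate(0.5, "never", negations)
```

With a multiplier of -1.0, a score of 0.5 for "enjoy" becomes -0.5, which matches the exact-mirror inversion described for the default file.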