<Lexalytics root>/data/themes


The themes directory contains the files that control theme extraction at all levels: documents, entities, and collections.

The following files may be customized by users in the themes section of a user directory. Each file is described in more detail below.

rules.ptn            Part-of-speech patterns that define theme extraction
stopwords.dat        A list of words that should not be considered for themes
normalization.dat    Rules for relating themes together

Customizing theme extraction in user/themes

rules.ptn

This file controls the part-of-speech (POS) rules that determine whether a combination of words is a theme. It uses the Pattern File format.


stopwords.dat

This file is used to eliminate phrases that would match the POS rules contained within rules.ptn but are too common to be considered useful, for example last week.

The file is a single-column .dat file. It can contain both single words and multi-word phrases.

Single words will act as a stop on any phrase containing them:

hello will stop any phrase that contains the word hello

Phrases will act as a stop on that particular phrase:

next week will stop next week; it will not stop sometimes next week

NOTE: stopwords.dat is case insensitive.
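The matching rules above can be illustrated with a short Python sketch. This is not Salience's implementation; it only models the documented behavior. The function names load_stop_entries and is_stopped are hypothetical, and splitting on whitespace is an assumption about how phrases break into words.

```python
def load_stop_entries(lines):
    """Split stopword entries into single words and multi-word phrases.

    Entries are lowercased because stopwords.dat is case insensitive.
    """
    words, phrases = set(), set()
    for raw in lines:
        entry = " ".join(raw.split()).lower()  # trim and collapse whitespace
        if not entry:
            continue  # skip blank lines
        if " " in entry:
            phrases.add(entry)
        else:
            words.add(entry)
    return words, phrases


def is_stopped(candidate, words, phrases):
    """Return True if a candidate theme would be stopped.

    A phrase entry stops only that exact phrase; a single-word entry
    stops any candidate containing that word.
    """
    tokens = candidate.lower().split()  # assumption: whitespace tokenization
    if " ".join(tokens) in phrases:
        return True
    return any(token in words for token in tokens)


words, phrases = load_stop_entries(["hello", "next week"])
print(is_stopped("Hello world", words, phrases))
print(is_stopped("next week", words, phrases))
print(is_stopped("sometimes next week", words, phrases))
```

With the entries hello and next week, the first two candidates are stopped and sometimes next week is not, which matches the examples given above.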


normalization.dat

NOTE: Salience does NOT ship with a normalization.dat by default.

If you create a normalization.dat, you can normalize multiple different themes into the same theme. This is useful for rolling related themes up into a single category. For example, you could normalize poor sound, great sound and good quality speakers into "audio quality".

To enable theme normalization, create a normalization.dat under /data/user/themes with each entry in the format:

  • [stemmed_theme]<tab>[normalized_form]
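As a rough illustration of this format, the Python sketch below builds and re-reads tab-separated entries. The helper names format_entries and parse_entries are hypothetical, and the left-hand strings are placeholders: the real file needs the stemmed form of each theme as Salience produces it, which may differ from the surface text shown here.

```python
def format_entries(mapping):
    """Render a {stemmed_theme: normalized_form} mapping as file content.

    Each entry becomes one line: stemmed_theme<TAB>normalized_form.
    """
    lines = []
    for stemmed, normalized in mapping.items():
        if "\t" in stemmed or "\t" in normalized:
            raise ValueError("entries must not contain tab characters")
        lines.append(f"{stemmed}\t{normalized}\n")
    return "".join(lines)


def parse_entries(text):
    """Read tab-separated entries back into a dict, skipping blank lines."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        stemmed, normalized = line.split("\t", 1)
        mapping[stemmed] = normalized
    return mapping


# Placeholder themes; the actual stemmed forms are an assumption.
entries = {
    "poor sound": "audio quality",
    "great sound": "audio quality",
}
content = format_entries(entries)
print(content, end="")
```

The resulting text could then be saved as normalization.dat in the themes section of the user directory.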