The themes directory contains the files that control theme extraction at all levels: documents, entities, and collections.
The following files may be customized by users in the themes section of a user directory. Click the name of the file for more information below.
|rules.ptn|Part-of-speech patterns that define theme extraction|
|stopwords.dat|A list of words that should not be considered for themes|
|normalization.dat|Rules for relating themes together|
This file controls the part-of-speech (POS) rules that determine whether a combination of words is a theme. It uses the Pattern File format.
This file is used to eliminate phrases that would match the POS rules contained within rules.ptn but are too common to be considered useful, for example "last week".
The file is a single-column .dat file. It can contain both single words and multi-word phrases.
Single words will act as a stop on any phrase containing them:
"hello" will stop any phrase that contains the word "hello"
Phrases will act as a stop on that particular phrase only:
"next week" will stop "next week"; it will not stop "sometimes next week"
NOTE: stopwords.dat is case insensitive.
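The matching behavior described above can be sketched in a few lines of Python. This is only an illustration of the documented semantics, not Salience's implementation: the function names `load_stop_entries` and `is_stopped` are hypothetical, and splitting phrases on whitespace is an assumption.

```python
def load_stop_entries(lines):
    """Split single-column entries into single words and multi-word phrases."""
    words, phrases = set(), set()
    for line in lines:
        # Lowercase because stopwords.dat is case insensitive;
        # collapsing whitespace is an assumption made for this sketch.
        entry = " ".join(line.lower().split())
        if not entry:
            continue  # skip blank lines
        if " " in entry:
            phrases.add(entry)
        else:
            words.add(entry)
    return words, phrases


def is_stopped(candidate, words, phrases):
    """Return True if a candidate theme phrase would be stopped."""
    tokens = candidate.lower().split()
    # A phrase entry stops only that exact phrase.
    if " ".join(tokens) in phrases:
        return True
    # A single-word entry stops any phrase containing that word.
    return any(token in words for token in tokens)


words, phrases = load_stop_entries(["hello", "next week"])
print(is_stopped("Hello world", words, phrases))
print(is_stopped("next week", words, phrases))
print(is_stopped("sometimes next week", words, phrases))
```

With the two entries shown, "Hello world" and "next week" are stopped, while "sometimes next week" is not, because the phrase entry matches only the exact phrase.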
NOTE: Salience does NOT ship with a normalization.dat by default.
If you create a normalization.dat, you can normalize multiple different themes into the same theme. This is useful if you want to roll themes up into a broader category. For example, you could normalize "poor sound", "great sound", and "good quality speakers" into "audio quality".
To enable theme normalization, create a normalization.dat under /data/user/themes with each entry in the format:
NOTE: The theme can be given in either its unstemmed or stemmed form.
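To make the roll-up idea concrete, here is a minimal Python sketch of what normalization achieves conceptually, using the audio quality example. It does not reproduce the normalization.dat entry syntax or Salience's own processing: the `ROLLUP` dictionary and the functions `normalize_theme` and `roll_up` are hypothetical, and the exact-match lookup (no stemming) is an assumption.

```python
from collections import Counter

# Hypothetical in-memory mapping from extracted themes to a roll-up theme.
# This is NOT the normalization.dat file syntax.
ROLLUP = {
    "poor sound": "audio quality",
    "great sound": "audio quality",
    "good quality speakers": "audio quality",
}


def normalize_theme(theme, rollup=ROLLUP):
    """Return the roll-up theme if one is defined, else the theme unchanged."""
    return rollup.get(theme, theme)


def roll_up(themes, rollup=ROLLUP):
    """Count themes after normalization, so related themes are aggregated."""
    return Counter(normalize_theme(theme, rollup) for theme in themes)


counts = roll_up(["poor sound", "great sound", "battery life"])
print(counts)
```

Here the two sound-related themes are counted together under "audio quality", while "battery life" has no mapping and is left unchanged.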