A "user directory" is a folder structure containing copies of specific Salience data files with modifications in order to override the default data files found in the Salience data directory. The structure of a user directory mimics that of the data directory, containing folders for customizations to POS tagger operations, chunker operations, and higher level operations such as entity extraction or sentiment analysis. An example user directory is provided in the Salience distribution within the data directory as the folder labeled user. This example user directory does not contain any files to override the default data files, only the structure needed for a user directory.
The primary purpose behind a user directory is to customize the out-of-the-box analysis of content. A user directory allows these customizations to be organized anywhere on the filesystem, separate from the default data directory files provided in the Salience installation. Additionally, you can maintain multiple user directories to organize different customization needs, using the appropriate user directory as demanded by your content or business rules.
For example, let's assume I am analyzing customer surveys for three different customers. I store the free text responses from the surveys in a relational database along with the associated meta-data for each survey. I may extract the surveys from the last month for client A, and there are certain entity extraction and sentiment analysis customizations for this client. When I instantiate a Salience session, I specify the default data directory and the user directory containing my customizations for client A. If I wish to then process a batch of content for client B, I simply set the user directory to use for processing to one I have created that contains the customizations for client B.
Creating a user directory is easy: create a copy of data/user and name it as needed for your purposes. As mentioned above, multiple user directories can be created on your filesystem to support your business needs. To use a user directory, specify its path when instantiating a Salience session, or set it on an existing session after initialization using the User Directory option.
Let's take the example above of analyzing customer surveys for three different customers. We'll assume a default installation on Windows, where it has been installed to C:\Program Files\Lexalytics.
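1. Copy C:\Program Files\Lexalytics\data\user to C:\<path>\clientA
2. Copy any default data files to be customized from C:\Program Files\Lexalytics\data to the corresponding subfolder of C:\<path>\clientA
3. Repeat to create C:\<path>\clientB and C:\<path>\clientC
4. When processing content for a client, set path to the user directory for that client.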
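The copy step can also be scripted. The following is a minimal sketch using only the Python standard library, not part of the Salience distribution; the function name create_user_directories and the temporary folders in the demo are hypothetical stand-ins for a real installation and target location.

```python
import shutil
import tempfile
from pathlib import Path


def create_user_directories(data_dir: Path, target_root: Path, clients: list[str]) -> dict[str, Path]:
    """Copy the data/user skeleton once per client and return the new paths."""
    skeleton = data_dir / "user"
    created = {}
    for client in clients:
        destination = target_root / client
        # copytree raises FileExistsError if the destination already exists,
        # so an existing client directory is never silently overwritten.
        shutil.copytree(skeleton, destination)
        created[client] = destination
    return created


if __name__ == "__main__":
    # Demo against a throwaway folder that stands in for the real data
    # directory (assumption: the skeleton contains tagger and salience folders).
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp) / "Lexalytics" / "data"
        for subfolder in ("tagger", "salience"):
            (data_dir / "user" / subfolder).mkdir(parents=True)

        user_dirs = create_user_directories(
            data_dir, Path(tmp) / "custom", ["clientA", "clientB", "clientC"]
        )
        for client, path in user_dirs.items():
            print(client, "->", path)
```

Against a real installation, data_dir would point at the Salience data directory and target_root at wherever the client folders should live.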
The contents of a user directory are largely driven by the subfolders, which replicate the structure of the Salience data directory. The files that can be overridden within the user directory are documented in the Salience 6 Data Directory section of this wiki. For each subfolder (e.g. /tagger), the documentation specifies which files can be overridden through the user directory.
Overriding the default POS tagging of a particular token is a good example of using the user directory. Let's take the following excerpt from an article that appeared in the NYTimes on August 11, 2015:
"Google was founded as a company that did Internet search. Over time, it has broadened into areas as varied as drones, pharmaceuticals and venture capital, none of which make much money, and some of which have spooked investors."
Most, if not all, of the part-of-speech tagging on this text looks correct. However, I notice that "Internet" is tagged as a common noun, not a proper noun. Either classification is justifiable, but let's say I want "Internet" to always be tagged as a proper noun, that there is one and only one Internet.
The documentation for the /tagger directory lists the files found in the default data directory data/tagger and indicates which can be overridden in a user directory. One of the files listed is customlexicon.dat. Here's how we can use the user directory to enforce this customization for client A mentioned above:
1. Copy C:\Program Files\Lexalytics\data\tagger\customlexicon.dat to C:\<path>\clientA\tagger\customlexicon.dat
2. Edit C:\<path>\clientA\tagger\customlexicon.dat and add this line:
   Internet<tab>NNP
3. Set the user directory to C:\<path>\clientA and that customization is applied.
If I'm analyzing data for client B or client C with their respective user directories, Internet retains the default part-of-speech tag of "NN" (common noun).
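The lexicon override can be scripted in the same spirit. This is a sketch under stated assumptions: add_lexicon_override is a hypothetical helper, the file is assumed to be UTF-8 text with one tab-separated token and tag per line (as in the Internet<tab>NNP entry), and nothing here calls the Salience API.

```python
import shutil
from pathlib import Path


def add_lexicon_override(data_dir: Path, user_dir: Path, token: str, tag: str) -> Path:
    """Copy customlexicon.dat into the user directory if needed, then add one entry."""
    default_file = data_dir / "tagger" / "customlexicon.dat"
    override_file = user_dir / "tagger" / "customlexicon.dat"

    # Start from the default file so existing entries are preserved.
    override_file.parent.mkdir(parents=True, exist_ok=True)
    if not override_file.exists():
        shutil.copyfile(default_file, override_file)

    # Assumption: UTF-8 text, one "token<tab>tag" entry per line.
    existing = override_file.read_text(encoding="utf-8")
    entry = f"{token}\t{tag}"
    if entry not in existing.splitlines():
        # Make sure the new entry starts on its own line.
        separator = "" if existing == "" or existing.endswith("\n") else "\n"
        with override_file.open("a", encoding="utf-8") as handle:
            handle.write(f"{separator}{entry}\n")
    return override_file
```

For client A this would be called as add_lexicon_override(data_dir, client_a_dir, "Internet", "NNP"). The default file under the data directory is only read, never modified, so client B and client C keep the default tagging.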
NOTE: There are certain files that can reside anywhere on the filesystem, because they are specified via options. However, for ease of organization, it's recommended to retain them in your user directory.
<path to user directory>/salience/entities, specified via User Entity List option
<path to user directory>/salience/sentiment, specified via Sentiment Dictionary option
<path to user directory>/salience/tags, specified via Topic List option
<path to user directory>/salience/concepts, specified via Concept Topic List option
<path to user directory>/salience/classification, specified via Classification model option
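To keep these locations consistent across several user directories, the list above can be captured in a small lookup. This is only an organizational sketch: the dictionary keys repeat the option names listed above, while recommended_locations is a hypothetical helper and does not set any option on a Salience session.

```python
from pathlib import Path

# Option name -> recommended subfolder, as listed in the note above.
RECOMMENDED_SUBFOLDERS = {
    "User Entity List": "salience/entities",
    "Sentiment Dictionary": "salience/sentiment",
    "Topic List": "salience/tags",
    "Concept Topic List": "salience/concepts",
    "Classification model": "salience/classification",
}


def recommended_locations(user_dir: Path) -> dict[str, Path]:
    """Return the recommended folder for each option-specified file type."""
    return {
        option: user_dir / subfolder
        for option, subfolder in RECOMMENDED_SUBFOLDERS.items()
    }
```

Calling recommended_locations with the client A user directory yields the folder where, for instance, that client's sentiment dictionary would be kept; the file path itself still has to be passed through the corresponding option.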