Feature description: User Directories

What is a "User Directory"?

A "user directory" is a folder structure containing copies of specific Salience data files with modifications in order to override the default data files found in the Salience data directory. The structure of a user directory mimics that of the data directory, containing folders for customizations to POS tagger operations, chunker operations, and higher level operations such as entity extraction or sentiment analysis. An example user directory is provided in the Salience distribution within the data directory as the folder labeled user. This example user directory does not contain any files to override the default data files, only the structure needed for a user directory.

Why/when would I use a user directory?

The primary purpose behind a user directory is to customize the out-of-the-box analysis of content. A user directory allows these customizations to be organized anywhere on the filesystem, separate from the default data directory files provided in the Salience installation. Additionally, you can maintain multiple user directories to organize different customization needs, using the appropriate user directory as demanded by your content or business rules.

For example, let's assume I am analyzing customer surveys for three different customers. I store the free-text responses from the surveys in a relational database along with the associated metadata for each survey. I may extract the surveys from the last month for client A, and there are certain entity extraction and sentiment analysis customizations for this client. When I instantiate a Salience session, I specify the default data directory and the user directory containing my customizations for client A. If I then wish to process a batch of content for client B, I simply set the user directory to the one I have created that contains the customizations for client B.

How do I create and use a user directory?

Creating a user directory is easy: make a copy of data/user and name it as needed for your purposes. As mentioned above, you can create multiple user directories on your filesystem to support your business needs. To use a user directory, specify its path when instantiating a Salience session, or set it on an existing session after initialization using the User Directory option.

Let's take the example above of analyzing customer surveys for three different customers. We'll assume a default installation on Windows, where it has been installed to C:\Program Files\Lexalytics.

  1. Copy C:\Program Files\Lexalytics\data\user to C:\<path>\clientA
  2. Copy any files to override from C:\Program Files\Lexalytics\data to C:\<path>\clientA
  3. Repeat to create C:\<path>\clientB
  4. Repeat to create C:\<path>\clientC
  5. At runtime, use business rules to determine which user directory to use when instantiating a Salience object to analyze content, specifying the appropriate path to the user directory.
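
The filesystem side of the steps above can be sketched in Python using only the standard library. This is a minimal sketch, not part of the Salience distribution: the function names (create_user_directory, add_override, user_directory_for) are hypothetical, and the sketch deliberately stops short of instantiating a Salience session, since that call depends on the wrapper you use.

```python
import shutil
from pathlib import Path
from typing import Mapping


def create_user_directory(data_dir: Path, target: Path) -> Path:
    """Copy the empty template at <data_dir>/user to a new user directory."""
    template = data_dir / "user"
    # copytree raises FileExistsError if the target already exists,
    # which protects an existing set of customizations from being clobbered.
    shutil.copytree(template, target)
    return target


def add_override(data_dir: Path, user_dir: Path, relative_file: str) -> Path:
    """Copy one default data file to the same relative path inside user_dir."""
    source = data_dir / relative_file
    destination = user_dir / relative_file
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)
    return destination


def user_directory_for(client: str, user_dirs: Mapping[str, Path]) -> Path:
    """Placeholder business rule: look up the user directory for a client."""
    try:
        return user_dirs[client]
    except KeyError:
        raise ValueError(f"no user directory configured for {client!r}") from None
```

With the default Windows installation described above, data_dir would be Path(r"C:\Program Files\Lexalytics\data"), and the path returned by user_directory_for is what you would pass as the user directory when instantiating the Salience session.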

What can I put in a user directory?

The contents of a user directory are largely driven by its subfolders, which replicate the structure of the Salience data directory. The files that can be overridden within the user directory are documented in the Salience 6 Data Directory section of this wiki. For each subfolder (e.g. /tagger), the documentation specifies which files can be overridden through the user directory.
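
Because a user directory mirrors the layout of the data directory, a quick way to audit one is to list the files in it that share a relative path with a default data file. The helper below is a hypothetical sketch using only the Python standard library. It matches on paths only; whether a given file is actually honored as an override is determined by the data directory documentation, not by this check.

```python
from pathlib import Path
from typing import List


def list_overrides(data_dir: Path, user_dir: Path) -> List[str]:
    """Return relative paths of files in user_dir that shadow a file in data_dir."""
    overrides = []
    for path in sorted(user_dir.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(user_dir)
        # A file counts as a candidate override only if the default data
        # directory has a file at the same relative location.
        if (data_dir / relative).is_file():
            overrides.append(relative.as_posix())
    return overrides
```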

Overriding the default POS tagging of a particular token is a good example of how to use a user directory. Let's take the following excerpt from an article that appeared in the NYTimes on August 11, 2015:

"Google was founded as a company that did Internet search. Over time, it has broadened into areas as varied as drones, pharmaceuticals and venture capital, none of which make much money, and some of which have spooked investors."

Most, if not all, of the part-of-speech tagging on this text looks correct. However, I notice that "Internet" is tagged as a common noun, not a proper noun. Either classification is justifiable, but let's say I want "Internet" to always be tagged as a proper noun, on the grounds that there is one and only one Internet.

The documentation for the /tagger directory lists the files found in the default data directory data/tagger and indicates which can be overridden in a user directory. One of the files listed is customlexicon.dat. Here's how we can use the user directory to enforce this customization for client A mentioned above:

  1. Copy C:\Program Files\Lexalytics\data\tagger\customlexicon.dat to C:\<path>\clientA\tagger\customlexicon.dat
  2. Edit C:\<path>\clientA\tagger\customlexicon.dat and add this line:
     Internet<tab>NNP
     This instructs the POS tagger to apply the proper noun tag "NNP" to any occurrences of "Internet" (for more information on POS tags, see Supported part-of-speech tags).
  3. When initializing Salience to analyze content for client A, I specify the user directory C:\<path>\clientA and that customization is applied.

If I'm analyzing data for client B or client C with their respective user directories, "Internet" retains the default part-of-speech tag of "NN" (common noun).
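
If you maintain several client lexicons, the edit in step 2 can be scripted. The following is a minimal sketch in Python. It assumes the lexicon is a UTF-8 text file with one tab-separated token and tag per line, as in the example entry above; the function name add_lexicon_entry is hypothetical, and the encoding is an assumption you should check against your own copy of the file.

```python
from pathlib import Path


def add_lexicon_entry(lexicon_path: Path, token: str, tag: str) -> bool:
    """Append 'token<TAB>tag' to the lexicon unless that exact line exists.

    Returns True if the file was modified, False if the entry was already there.
    """
    entry = f"{token}\t{tag}"
    existing = ""
    if lexicon_path.exists():
        existing = lexicon_path.read_text(encoding="utf-8")  # encoding is assumed
    if entry in existing.splitlines():
        return False
    # Make sure the new entry starts on its own line.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    with lexicon_path.open("a", encoding="utf-8") as handle:
        handle.write(f"{separator}{entry}\n")
    return True
```

For client A, this would be called as add_lexicon_entry(Path(r"C:\<path>\clientA\tagger\customlexicon.dat"), "Internet", "NNP"), with <path> replaced by the actual location of your user directory.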

NOTE: There are certain files that can reside anywhere on the filesystem, because they are specified via options. However, for ease of organization, it's recommended to retain them in your user directory.
