Salience 6.0 introduced the ability to process content through multiple "configurations" at the same time. A configuration is a user directory that is registered with a Salience session under a label. That label can then be used to retrieve results specific to the customizations in that user directory, independent of any other customizations loaded in the same session.
If you are not familiar with user directories, and their use in Salience, please refer to the feature article on User Directories.
Note: Because a Salience session supporting multiple configurations is still instantiated with a language-specific data directory, the customizations in the various configurations must all apply to a single target language.
Many of our customers are running general content streams (like Twitter, or a news feed) through Salience, but have different clients with different needs from the analysis of this general stream. Different clients may have different entity extraction needs, different sentiment tunings, and different categorization via topics. The goal is to provide a flexible methodology to submit the content for processing to one Salience session, but get the benefit of retrieving results as if it had been submitted to multiple Salience sessions instantiated with different user directories. The multiple configurations feature in Salience 6 supports this exact case.
Differences and benefits in content processing
So let's take that use case of processing a news feed for multiple clients. For the sake of this example, we'll call them A, B, and C. If I have customizations that are particular to each individual client, I need to process each document three times: once for each client, using that client's user directory and other customizations such as custom sentiment phrases (see Feature description: Sentiment Tuning for more information on custom sentiment phrase dictionaries).
I could have three parallel Salience sessions, each initialized with the user directory and other options specific to one client. Each piece of content would be sent to all three sessions, and results gathered from each individually. But running three parallel Salience sessions incurs a much larger memory footprint. To avoid that, I could run a single session and process the same piece of content three times, swapping the user directory each time. But that incurs a significant performance hit, both for reprocessing the same content multiple times and for the user directory switching. The best I can do is batch process a set of content with one user directory, and then repeat for the other user directories.
Enter multiple configurations. By setting up all three as separate configurations within the same Salience session, I process the content once. The initial step of "preparing" the text, which comprises the majority of processing for a single document, takes place once for each piece of content. The subsequent requests for entities, sentiment phrases, etc. can be performed on a per-configuration basis just as if the content had been processed in a single Salience session initialized with the applicable user directory.
Continuing with our example of processing a feed of content for clients A, B, and C, we're going to state these initial assumptions:
- Processing on Windows, with a default installation of Salience in
- User directories containing customizations for each client, placed in a directory
- These user directories also contain customizations to data files that are not required to be in the user directory, such as sentiment files and query topic definitions. These can reside anywhere on the filesystem because they are specified by path via option methods in the API. But for the sake of organization, they will reside in the user directory (e.g.
With all of that set up, let's use these configurations in a single Salience session. The code snippets provided below are based on the Salience .Net wrapper; a similar approach exists in the other wrappers provided in the Salience distribution.
1) Start by initializing a Salience session with the path to your license file and path to the default data directory
Engine = new Salience(licensePath, dataPath);
2) For each client:
2a) Add a Salience configuration using the path to the user directory for that customer
2b) Set the current option session so that subsequent option calls apply to this configuration
Engine.CurrentOptionSession = "clientA";
2c) Set the query topic file to use for this configuration
Engine.TopicList = "C:/SalienceConfigs/clientA/salience/tags/clientA_queries.dat";
2d) Set the sentiment phrase dictionary to use for this configuration
Engine.AddSentimentDictionary("C:/SalienceConfigs/clientA/salience/sentiment/clientA_phrases.hsd", false, "clientA");
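Putting steps 2a through 2d together, the per-client setup can be sketched as a loop. This is a sketch only: the configuration-registration call in step 2a is shown here as AddSalienceConfiguration, mirroring the underlying C API call (lxaAddSalienceConfiguration); the exact wrapper method name and signature may differ in your version, and the paths follow the directory layout assumed in this example.

```csharp
// Sketch: method name for step 2a and directory layout are assumptions.
string[] clients = { "clientA", "clientB", "clientC" };
foreach (string client in clients)
{
    string userDir = "C:/SalienceConfigs/" + client;

    // 2a) Register the user directory as a configuration under the client's label
    Engine.AddSalienceConfiguration(userDir, client);

    // 2b) Direct subsequent option calls at this configuration
    Engine.CurrentOptionSession = client;

    // 2c) Query topic file for this configuration
    Engine.TopicList = userDir + "/salience/tags/" + client + "_queries.dat";

    // 2d) Sentiment phrase dictionary for this configuration
    Engine.AddSentimentDictionary(userDir + "/salience/sentiment/" + client + "_phrases.hsd", false, client);
}
```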
Now we're ready to run data. For each piece of content that we process, we can request results specific to a configuration:
// Get sentiment for individual clients
SalienceSentiment sentimentA = Engine.GetDocumentSentiment(true,"clientA");
SalienceSentiment sentimentB = Engine.GetDocumentSentiment(true,"clientB");
SalienceSentiment sentimentC = Engine.GetDocumentSentiment(true,"clientC");
// Get entities for client B
List<SalienceEntity> entities = Engine.GetNamedEntities("clientB");
// Get topics for client A
List<SalienceTopic> topics = Engine.GetQueryDefinedTopics("clientA");
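For completeness, here is a sketch of the per-document flow, showing that the expensive preparation step runs once per document while results are pulled per configuration. The text-preparation call is shown as PrepareText, mirroring lxaPrepareText in the underlying C API; the wrapper method name and the newsFeed collection are assumptions for illustration.

```csharp
// Sketch: newsFeed is a hypothetical IEnumerable<string> of documents.
foreach (string document in newsFeed)
{
    // The "prepare" step, the bulk of the processing cost, runs once per document
    // and is shared by all configurations in the session.
    Engine.PrepareText(document);

    // Per-configuration results, as if each client had its own dedicated session
    SalienceSentiment sentimentA = Engine.GetDocumentSentiment(true, "clientA");
    List<SalienceEntity> entitiesB = Engine.GetNamedEntities("clientB");
    List<SalienceTopic> topicsA = Engine.GetQueryDefinedTopics("clientA");
}
```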
When running a Salience session with a specific data directory and user directory, the memory footprint is about 2GB. The majority of that footprint comes from the default data files, such as the POS tagging model and the entity extraction model, with the largest contributor being the Concept Matrix™. The files that comprise a user directory are generally very small. It should also be noted that there is some variance when working in different languages; most have smaller Concept Matrix™ binaries.
Based on the example above, system memory to support 3 configurations in a single Salience session will be roughly the ~2GB footprint of the shared default data files plus the comparatively tiny user directory files for each configuration, versus roughly 6GB (3 × ~2GB) for three parallel sessions.
As always, your mileage may vary. However, this is fairly minimal overhead for the benefit of a streamlined operation.