Sizing your Salience Five deployment

December 22nd, 2011 by Carl Lambrecht

This is another extract from our customer files. Not something that comes up all the time, but often enough that it warranted a blog article with a good worked example.

In general, Salience Engine has been and continues to be very economical in terms of hardware requirements. Text analytics with Salience Engine is more CPU intensive than I/O or memory intensive, though the inclusion of the Concept Matrix™ in Salience Five has increased the memory footprint.

So let’s say you’re looking to process 2 million documents per day, where half are tweets and half are news articles of 4KB or less. What kind of hardware spec are you looking at? Read on to see how you could spec out handling this volume of content with Salience Five.

Throughput

Let’s start with determining how many CPUs we will need to handle the design volume of content. A single Salience session can handle a certain volume, and we can run multiple Salience sessions in parallel to maximize throughput.

NOTE: Lexalytics recommends no more than one Salience session per available CPU core because of the CPU-intensive nature of the text analytics processing.

There are two throughput benchmarks to bear in mind. Salience Five can chew through tweets much faster than news or blog articles, because tweets are small. The general benchmarks for throughput are as follows:

200 tweets per second per core

4 docs per second per core

These benchmarks are measured on an Intel i7 CPU, with a test rig written against our C API, processing each item for entities, themes, and sentiment, where a document averages about 4KB. So to handle the volume we are looking for:

200 tweets/sec/core = 17.28M tweets/day/core

Not a problem. We can allocate a single CPU core on a server for processing tweets and comfortably handle our 1 million per day. Now let’s look at documents:

4 docs/sec/core = 345.6K docs/day/core

These take a bit more horsepower to handle, but 3 Salience sessions, each running on its own CPU core, will handle it. If we need to chew through them faster, we could add more CPUs. An 8 CPU server appears to have plenty of horsepower to handle the volume of content we’re expecting, with room for growth.
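That arithmetic is easy to script. Here’s a minimal back-of-the-envelope sketch in Python (purely illustrative, not the Salience API); the per-core rates are the benchmarks above and the volumes are our design numbers:

    import math

    SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

    def cores_needed(items_per_day, items_per_sec_per_core):
        # Daily capacity of one core at the benchmark rate, and the cores required.
        per_core_per_day = items_per_sec_per_core * SECONDS_PER_DAY
        return math.ceil(items_per_day / per_core_per_day), per_core_per_day

    tweet_cores, tweet_capacity = cores_needed(1000000, 200)
    doc_cores, doc_capacity = cores_needed(1000000, 4)

    print(tweet_capacity, tweet_cores)  # 17280000 tweets/day/core -> 1 core
    print(doc_capacity, doc_cores)      # 345600 docs/day/core -> 3 cores

With 1 core for tweets and 3 for documents, only 4 of the 8 cores are spoken for.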

Memory

In addition to the computing power needed by Salience, we also need to consider the memory footprint of a Salience session. The memory used by a single Salience session is driven by its datafiles. Salience Five, in particular, adds a very large datafile for Concept Matrix™ functionality. In situations where Concept Matrix™ functionality is not being used, this datafile can be removed to reduce the memory footprint. First, let’s consider what we need if we keep it in.

The base memory footprint for a single Salience session with the Concept Matrix™ is 1006MB, or roughly 1GB.

Additional memory usage on the order of 1MB per concept topic has been observed, which can become a factor as you increase the number of concept topics being used.

The additional memory footprint for a document being processed is approximately 120% of the size of the document. For a tweet, which is 140 characters in length, this overhead is negligible. For our design non-Twitter document size of 4KB, the memory footprint of a document being processed is still not a significant factor in memory sizing.
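As a quick sanity check on that claim (a hypothetical calculation using the 120% figure above, not a Salience API call):

    doc_size_kb = 4.0
    per_doc_overhead_kb = 1.2 * doc_size_kb  # roughly 4.8KB per document in flight
    # Even with 4 documents being processed at once, that is about 20KB,
    # which is negligible next to the roughly 1GB base footprint.
    print(per_doc_overhead_kb)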

There are two methods to deploy multiple Salience sessions in parallel: multiple threads or multiple processes.

With multiple threads, you can achieve a lower memory footprint thanks to shared memory across threads within the same process. This comes with the risk that an exception in one thread halts the parent process, and thus all other child threads:

Memory footprint for a single Salience Session: 1GB

Memory footprint for each additional Salience session: 54MB (round up to 60MB)

Total memory for 4 threads: 1GB + (3*60MB) = 1.2GB

Multiple processes are more fault tolerant, but each process requires the full Salience memory footprint:

Memory footprint for a single Salience Session: 1GB

Memory footprint for each additional Salience session: 1GB

Total memory for 4 processes: (4*1GB) = 4GB

Yikes, that sounds a bit big. But to handle our throughput, we were looking at 1 Salience session for tweets and 3 Salience sessions for documents. So overall, we can still handle this with 4GB of RAM. My 64-bit laptop has 8GB of RAM. Thanks to Moore’s Law, we’re still in the right ballpark. But it does indicate that Salience Five is better suited to 64-bit platforms than 32-bit platforms.
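Put as a small sketch (the footprints are the figures above; the helper names are just for illustration), the two deployment models compare like this:

    def threaded_memory_mb(sessions, base_mb=1006, extra_mb=60):
        # One process: only the first session pays the full footprint;
        # each additional in-process session adds the smaller incremental cost.
        return base_mb + (sessions - 1) * extra_mb

    def process_memory_mb(sessions, base_mb=1006):
        # One process per session: every session pays the full footprint.
        return sessions * base_mb

    print(threaded_memory_mb(4))  # 1186MB, roughly 1.2GB
    print(process_memory_mb(4))   # 4024MB, roughly 4GB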

Let’s say we want to strip Salience Five down to the basics, though. No Concept Matrix™, just straight entities, themes, and sentiment. In that case, we can remove the Concept Matrix™ data files, and the memory footprint is much smaller.

Memory footprint for a single Salience Session: 130MB

Memory footprint for each additional Salience session: 130MB

Total memory for 4 processes: (4*130MB) = 520MB
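The same arithmetic with the stripped-down footprint (again, just the figures from the list above):

    base_no_cmx_mb = 130       # per-session footprint without the Concept Matrix datafiles
    print(4 * base_no_cmx_mb)  # 520MB for 4 independent processes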

Conclusion

Hopefully this worked example shows you how to size your hardware based on the volume of content you are looking to analyze with Salience Five, and some of the tradeoffs and decisions you can make.

The big takeaway should be that this is very complex and valuable processing that can still be run on commodity hardware, or can scale onto some heavy iron for major production deployments with high content volumes.
