Feature highlight: Complex stems in Salience 5.1

November 16th, 2012 by Carl Lambrecht

With every release of Salience, we work to enhance the engine’s ability to extract meaning from the ever-changing landscape of unstructured content. That might sound like a mouthful of marketing mumbo-jumbo, but this is a blog site for developers, so what do I mean by that? I mean adding new methods and techniques, and tweaking existing ones, so the engine handles all kinds of text gracefully and derives meaning from it. One such feature, added in Salience 5.1, is an option called Complex Stems. Let’s have a look at how and when you might use it.

Let’s begin with the content we’re trying to analyze. Over the last few years, the analysis of online text has swung away from the more traditional sources of news sites and blogs that use reasonably clean and proper syntax and grammar. Twitter, Facebook, and review sites have become massive venues for anyone and everyone to express themselves. But ZOMG, ya gotta looooove da stuf U see! Some of it can be hard for a human to decode, so imagine what it looks like to a poor innocent i7 CPU.

One of the real strengths of Salience is the openness of the data directory. There are data files that allow customers to preprocess common misspellings or acronyms into correct forms (such as changing “da” to “the”), and other data files to specify the correct part-of-speech tag for jargon (such as flagging “U” as a pronoun). But what about those cases of extended words like “looooove”? Wouldn’t it be nice if Salience could automagically compress them down?
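To make the substitution idea concrete, here’s a toy Python sketch of dictionary-driven normalization. The entries and format here are hypothetical illustrations of the concept, not Salience’s actual data-file format:

# Toy illustration of token-level substitution; the entries and the
# format are hypothetical, not Salience's actual data-file format.
SUBSTITUTIONS = {'da': 'the', 'ur': 'your'}

def normalize(text):
    # Swap each known token for its canonical form; leave others alone.
    return ' '.join(SUBSTITUTIONS.get(tok, tok) for tok in text.split())

print(normalize('ya gotta love da stuf'))  # -> 'ya gotta love the stuf'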

Enter complex stems. Salience 5.1 introduces a new data file, data/tokenizer/complexstems.dat (see what I did there? I gave you a link right into our developer documentation for that file). The contents of this data file work hand in hand with a new option added to the API, SALIENCEOPTION_COMPLEXSTEMS (there I go, linking into the documentation again). With this option turned on, extensions of the word “love” are detected, and a multiplier is applied to the sentiment weight of each occurrence.

This is what it looks like in practice using the Python API:

>>> import saliencefive as se5
>>> session = se5.openSession(r'c:\program files\Lexalytics\license.v5', r'c:\program files\Lexalytics\data')
>>> ret = se5.prepareText(session, 'I loooove these new features!')
>>> sentiment = se5.getDocumentSentiment(session)
>>> sentiment['score']
0
>>> ret = se5.setOption_ComplexStems(session,1)
>>> ret = se5.prepareText(session, 'I loooove these new features!')
>>> sentiment = se5.getDocumentSentiment(session)
>>> sentiment['score']
0.7200000286102295
>>> sentiment['phrases'][0]['phrase']
'loooove'
>>> ret = se5.prepareText(session, 'I loooovvveee these new features!')
>>> sentiment = se5.getDocumentSentiment(session)
>>> sentiment['score']
0.7200000286102295

The option works equally well regardless of how many characters are repeated, or which ones. It all boils down to the simple stem indicated in the data file, and the multiplier is applied.
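If you’re curious how this kind of collapsing can work in general, here’s a toy Python sketch of the technique. It’s my own illustration, not Salience’s internals, and the stem list is a hypothetical stand-in for entries in complexstems.dat:

import re

# Hypothetical stem list standing in for entries in complexstems.dat.
STEMS = {'love', 'cool'}

def collapse_to_stem(token):
    # Collapse runs of repeated characters, first to two and then to one,
    # stopping as soon as the result matches a known stem. Trying two
    # first keeps legitimate double letters (as in 'cool') intact.
    for max_run in (2, 1):
        candidate = re.sub(r'(.)\1+', r'\1' * max_run, token.lower())
        if candidate in STEMS:
            return candidate
    return token

print(collapse_to_stem('looooove'))     # -> 'love'
print(collapse_to_stem('loooovvveee'))  # -> 'love'
print(collapse_to_stem('cooool'))       # -> 'cool'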

The multiplier is an important aspect of the functionality, because when users extend a word, it’s often done to add emphasis to what is being expressed. But, as with many things in Salience, the multiplier is in your hands: you are free to tune that sensitivity up or down as you see fit through the data file.
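To put rough numbers on it: if the base sentiment weight for “love” were 0.6 and the data file assigned a multiplier of 1.2 (both figures are assumptions for illustration, not documented defaults), an extended occurrence would score 0.6 × 1.2 = 0.72, which lines up with the kind of score shown in the session above.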

The last thing to note is that this option is off by default. If you start a Salience session in “short mode”, however, Salience sets certain options that are more appropriate for handling short content, and this is one of the options that gets turned on.
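And if you’re not running in short mode, you can still switch it on yourself with setOption_ComplexStems, exactly as in the Python session above.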
