Multi-language support in Salience

October 12th, 2012 by Carl Lambrecht

At Lexalytics, we know it’s not only a global marketplace, but a multilingual global marketplace. That understanding has driven us to extend the capabilities of Salience beyond the analysis of English content alone. This article details the evolution and current state of our support for performing text analytics on English and non-English content.

Salience was originally developed by Lexalytics to analyze English text, traditionally from online news articles and blogs. As the product has matured, we have added and improved increasingly complex analysis of online content, including Twitter.

Language   Core NLP  Entities  Themes  Sentiment     Query Topics  Other Features
EN         Y         Y         Y       Phrase/Model  Y             Y

2010: French support introduced

The first non-English language we tackled was French. Our first release of support for French provided almost all of the features found in our analysis of English: part-of-speech tagging, novel named entity extraction, theme extraction, summarization, and model-based sentiment analysis. The only feature of our English language support that was not available in this initial release was phrase-based sentiment analysis. An update followed shortly afterward that improved the handling of French social media content.

Language   Core NLP  Entities  Themes  Sentiment     Query Topics  Other Features
EN         Y         Y         Y       Phrase/Model  Y             Y
FR (2010)  Y         Y         Y       Model         Y             Y

2011: Spanish and Portuguese introduced

The next languages we tackled were Spanish and Portuguese, again providing model-based sentiment analysis. Neither language has yet received an update that specifically addresses its unique social media complications, but both benefit from the techniques built into Salience itself for handling short content.

Language   Core NLP  Entities  Themes  Sentiment     Query Topics  Other Features
EN         Y         Y         Y       Phrase/Model  Y             Y
FR (2010)  Y         Y         Y       Model         Y             Y
ES (2011)  Y         Y         Y       Model         Y             Y
PT (2011)  Y         Y         Y       Model         Y             Y

2012: German introduced

With Salience 5.0, Lexalytics introduced the Concept Matrix™ as a core feature: a model of semantic relationships derived from Wikipedia. At the same time, we were developing support for German. Additional research and development allowed us to provide phrase-based sentiment analysis for German, making it our first non-English language pack to include this capability.

Language   Core NLP  Entities  Themes  Sentiment     Query Topics  Concepts  Other Features
EN         Y         Y         Y       Phrase/Model  Y             Y         Y
FR (2010)  Y         Y         Y       Model         Y             N         Y
ES (2011)  Y         Y         Y       Model         Y             N         Y
PT (2011)  Y         Y         Y       Model         Y             N         Y
DE (2012)  Y         Y         Y       Phrase/Model  Y             Y         Y

2012: Concept Matrix™ and phrase-based sentiment introduced for non-English languages

Over the summer of 2012, we developed updates to our existing language packs to bring them all up to the same feature level. Updates have now been released for French (June 2012), Spanish (September 2012), and Portuguese (October 2012), so every currently supported language offers the same level of functionality.

Language   Core NLP  Entities  Themes  Sentiment     Query Topics  Concepts  Other Features
EN         Y         Y         Y       Phrase/Model  Y             Y         Y
FR (2012)  Y         Y         Y       Phrase/Model  Y             Y         Y
ES (2012)  Y         Y         Y       Phrase/Model  Y             Y         Y
PT (2012)  Y         Y         Y       Phrase/Model  Y             Y         Y
DE (2012)  Y         Y         Y       Phrase/Model  Y             Y         Y

2013: Chinese

While our current language packs will see ongoing improvements (better novel entity extraction, better quotation extraction), we have our sights firmly set beyond the Romance and Germanic languages. We are hard at work applying our experience and techniques to Chinese, and we aim to release our initial support in early 2013.
