LexUtils

Lex Utils currently contains two utility functions that can be helpful when preparing content for use with Salience: language detection and HTML extraction. Like Salience, Lex Utils is provided as a .so (Linux) or .dll (Windows), with wrappers in C, Java, Python and .NET.

The Lex Utils objects are not thread safe, but they are small, so one can be created on each thread to support multithreaded environments.
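The one-object-per-thread pattern can be sketched in Python with thread-local storage. This is a minimal sketch, not part of Lex Utils: `session_for_current_thread` is a hypothetical helper, and `open_session` stands for whichever wrapper call creates the object (it is passed in rather than imported, since the wrapper's module name is not given here).

```python
import threading

# Per-thread storage: each thread sees its own attributes on this object.
_thread_state = threading.local()


def session_for_current_thread(open_session, data_path):
    """Return the calling thread's session, opening it on first use.

    open_session is an assumption: any callable that takes a data path
    and returns a new Lex Utils object for the wrapper in use.
    """
    cache = getattr(_thread_state, "sessions", None)
    if cache is None:
        cache = {}
        _thread_state.sessions = cache
    key = (open_session, data_path)
    if key not in cache:
        # First request on this thread: create a session that no
        # other thread will ever touch.
        cache[key] = open_session(data_path)
    return cache[key]
```

Each worker thread calls the helper instead of sharing a single object, so no locking is needed around the Lex Utils calls themselves.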

Language Detection

LexLanguageUtilities classifies text into one of the languages supported by Salience. Language detection only works for supported languages; content in other languages will not be correctly identified.

First, a LexLanguageUtilities object must be created:

.NET new LexLanguageUtilities(string dataPath)
Java new LexLanguageUtilities(string dataPath)
Python session = openLanguageSession(string dataPath)
C lxaOpenLexLanguageSession(char* dataPath, LexLanguageSession ** ppSession)

If you wish to perform language detection on a machine that will not have Salience installed, please contact Lexalytics Support to obtain a languages.bin file. Once you have languages.bin, pass the full path to that file to any of the constructors above in place of the path to the Salience data directory.
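As an illustration of choosing between the two kinds of path, the sketch below picks the value to hand to a constructor. The helper name `language_data_path` and its parameters are hypothetical; the only thing taken from the text above is that either the Salience data directory or the full path to languages.bin may be supplied.

```python
import os


def language_data_path(salience_data_dir=None, languages_bin=None):
    """Pick the path to pass to a language-detection constructor.

    Prefers a standalone languages.bin when one is given, otherwise
    falls back to the Salience data directory. (Hypothetical helper.)
    """
    if languages_bin is not None:
        if not os.path.isfile(languages_bin):
            raise FileNotFoundError("languages.bin not found: " + languages_bin)
        # The full path to the file is what gets passed along.
        return os.path.abspath(languages_bin)
    if salience_data_dir is not None:
        if not os.path.isdir(salience_data_dir):
            raise NotADirectoryError("not a directory: " + salience_data_dir)
        return salience_data_dir
    raise ValueError("provide a data directory or a languages.bin path")
```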

Once a session has been opened, a LanguageRecommendation object can be obtained for any text:

.NET LanguageRecommendation LexLanguageUtilities.GetLanguage(string text)
Java LanguageRecommendation LexLanguageUtilities.GetLanguage(string text)
Python dict getLanguage(LanguageSession session, string text)
C lxaGetLanguage(LexLanguageSession* pSession, char* acText, lxaLanguageRecommendation** pResults)

The results are split into a best match and a list of how each candidate language scored. Each language is reported as an (internal) code number, a language name, the score for that language, and the optimal ("perfect") score this text could have achieved. Text whose best score is very low compared to the perfect score may be gibberish or in an unsupported language. If the text mixes multiple languages, you will get several language results with similar scores, and the ratio of each score to the perfect score approximates the share of the text in that language.
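The score-to-perfect-score comparison can be sketched independently of any wrapper. In the sketch below, `interpret` and `score_ratio` are hypothetical helpers, and the `min_ratio` cutoff of 0.2 is an illustrative assumption, not a documented threshold; a suitable value would need to be tuned on your own content.

```python
def score_ratio(score, perfect_score):
    """Fraction of the perfect score that a language achieved."""
    if perfect_score <= 0:
        return 0.0
    return score / perfect_score


def interpret(scores, perfect_score, min_ratio=0.2):
    """Interpret the scores for one piece of text.

    scores maps language name -> score. Returns (best, ratios), where
    best is None when even the top language falls below min_ratio
    (possibly gibberish or an unsupported language).
    min_ratio is an illustrative cutoff, not a documented value.
    """
    ratios = {name: score_ratio(s, perfect_score) for name, s in scores.items()}
    if not ratios:
        return None, ratios
    best = max(ratios, key=ratios.get)
    if ratios[best] < min_ratio:
        return None, ratios
    return best, ratios
```

For mixed-language text, the values in `ratios` give a rough estimate of how much of the text is in each language.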

.NET

LanguageResult

nLanguageCode internal code representing this language
sLanguageName name of the language
fScore measure of how likely it is that this is the language in question
fPerfectScore the score to compare fScore against

LanguageRecommendation

Recommendation A single LanguageResult object representing the most likely language
vAllResults A vector of LanguageResults for all languages considered

Java

LanguageResult

nLanguageCode internal code representing this language
sLanguageName name of the language
fScore measure of how likely it is that this is the language in question
fPerfectScore the score to compare fScore against

LanguageRecommendation

BestLanguage A LanguageResult object representing the most likely language
vAllResults A vector of LanguageResults for all languages considered

Python

bestMatch Language code for recommended language
bestMatchName Name of recommended language
bestMatchScore Score for recommended language
perfectScore Score to compare individual scores against
otherLanguages List of all language results
otherLanguages\language Language code for other languages
otherLanguages\languageName Name of other languages
otherLanguages\score Measure of how well the other languages matched
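A short sketch of reading the Python result follows. It assumes that `otherLanguages` is a list of dicts carrying the `languageName` and `score` keys listed above; the exact nesting is an assumption, and `ranked_languages` is a hypothetical helper. The sample values are made up for illustration.

```python
def ranked_languages(result):
    """Return [(language name, score), ...] sorted best first.

    Assumes result["otherLanguages"] is a list of dicts with the
    "languageName" and "score" keys (nesting is an assumption).
    """
    rows = [(entry["languageName"], entry["score"])
            for entry in result["otherLanguages"]]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Hand-written sample in the assumed shape; values are invented.
sample = {
    "bestMatch": 1,
    "bestMatchName": "English",
    "bestMatchScore": 42.0,
    "perfectScore": 50.0,
    "otherLanguages": [
        {"language": 2, "languageName": "French", "score": 7.5},
        {"language": 1, "languageName": "English", "score": 42.0},
    ],
}

for name, score in ranked_languages(sample):
    print(name, score)
```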

C

lxaLanguageRecommendation

nRecommended language code for recommended language
fRecommendedScore score for the recommended language
nChoices number of languages considered, also the length of the arrays
pPossibleLanguages array of language codes for each language considered
pLanguageScores array of language scores for each language, corresponding to indices in pPossibleLanguages
fWordCount Total words and bigrams considered, for use in interpreting scores

C provides an additional function:

lxaGetLanguageName(int nLanguageCode, char** acOutText);

This transforms a language code into its name.

HTML Extraction

LexHtmlUtilities removes HTML tags from a document and attempts to strip out unrelated content such as ads and sidebars. Stripping out unrelated content will sometimes remove some of the article text, but not significant portions. This is particularly noticeable if you provide non-HTML content: in that case you will get the text back with some sentences removed for being 'off topic'. Separating HTML from non-HTML content before using the HTML extractor is therefore recommended.
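One way to separate HTML from non-HTML content before extraction is a simple tag-counting heuristic. This is a rough sketch, not something Lex Utils provides: `looks_like_html`, `text_for_analysis`, and the `min_tags` threshold are all hypothetical, and `extract` stands for whichever wrapper call performs the extraction.

```python
import re

# Heuristic: a doctype, or an opening/closing tag such as <p>, </div>, <br/>.
_TAG_PATTERN = re.compile(
    r"<!doctype\s+html|</?[a-zA-Z][a-zA-Z0-9]*(\s[^<>]*)?/?>",
    re.IGNORECASE,
)


def looks_like_html(content, min_tags=2):
    """Guess whether content is HTML by counting tag-like matches.

    min_tags is an illustrative threshold; a single stray "<b>" in
    plain text should not be enough to count as HTML.
    """
    count = 0
    for _ in _TAG_PATTERN.finditer(content):
        count += 1
        if count >= min_tags:
            return True
    return False


def text_for_analysis(content, extract):
    """Run extract only on content that looks like HTML.

    extract is an assumption: any callable that takes HTML text and
    returns the stripped text. Plain text is returned unchanged so
    that no sentences are dropped as 'off topic'.
    """
    if looks_like_html(content):
        return extract(content)
    return content
```

The heuristic will misclassify some inputs (for example, plain text that quotes markup), so it is best treated as a first-pass filter.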

First, a LexHtmlUtilities object must be created:

.NET new LexHtmlUtilities(string dataPath)
Java new LexHtmlUtilities(string dataPath)
Python session = openHtmlSession(string dataPath)
C lxaOpenLexHtmlSession(char* dataPath, LexHtmlSession ** ppSession)

Then simply pass in content to the extraction function to get stripped text back:

.NET string LexHtmlUtilities.StripHtml(string htmlText)
Java string LexHtmlUtilities.ExtractTextFromHtml(string htmlText)
Python string extractText(HTMLSession session, string htmlText)
C lxaExtractTextFromHtml(LexHtmlSession* pSession, char* acInText, char** acOutText)