To-may-to, to-mah-to. To Salience it’s all the same

November 15th, 2012 by Carl Lambrecht

Soon after we started releasing support for non-English languages, we started getting questions about what dialects of French or Spanish or Portuguese we were able to analyze. It makes sense, as there are distinct differences within these major languages spoken in different parts of the world. Even in English, you find subtle differences between “American English” and “UK English”. Luckily, for Salience, many of the differences have little impact and those that do can be easily addressed. Let’s look at where these differences appear.

“You say to-may-to, I say to-mah-to”

This is the main kind of difference we come across within a language such as English or Spanish: the spoken language varies more from region to region than the written language does.

As an example, in Madrid, the word for a town square is pronounced “la plah-tha”. But in Mexico City, the word for a town square is pronounced “la plah-sa”. Guess what? In both Madrid and Mexico City, the word is spelled “la plaza”. Because Salience analyzes written text, the pronunciation that accounts for most regional differences within a language simply doesn’t come into play.

Color vs. Colour and Truck vs. Lorry

Where a difference does show up in written text is in regional spelling and jargon. Take a random piece of news content from the BBC and a random piece of news content from the New York Times. Both are in English. Strip away all the chrome and look at just the text: how do you tell which is US English and which is UK English? By slight differences in spelling and the odd word here or there. Repeat the same experiment with Portuguese gathered from O Globo (Brazil) versus Público (Portugal).

If part-of-speech (POS) tagging were performed purely by looking words up in very large dictionaries, any word missing from the dictionary would leave a gap in your ability to tag. But the POS tag model takes multiple context clues into account in determining the correct part of speech for each word. So luckily, even though our POS tagger has been trained on US English content and has likely never seen the word “lorry”, within the context of a sentence the model will still have clues that indicate this is a noun. At a basic NLP level, Salience does not need to know that “lorry” is equivalent to “truck” in the States.
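To make the idea of context clues concrete, here is a deliberately tiny sketch in Python. It is not the Salience tagger, which is a trained model; the word lists, the tag names and the `tag` function are all made up for illustration. It only shows how a word that is missing from the vocabulary can still be labelled from its neighbours.

```python
# Toy illustration only: hand-written context rules, not the statistical
# model Salience uses. All names and word lists here are hypothetical.

DETERMINERS = {"the", "a", "an", "this", "that"}

# A tiny "US English" vocabulary that does not contain "lorry".
KNOWN_TAGS = {
    "truck": "NOUN",
    "driver": "NOUN",
    "parked": "VERB",
    "red": "ADJ",
    "outside": "ADV",
}


def tag(tokens):
    """Tag tokens left to right, guessing unknown words from context."""
    tags = []
    for i, token in enumerate(tokens):
        word = token.lower()
        if word in DETERMINERS:
            tags.append("DET")
        elif word in KNOWN_TAGS:
            tags.append(KNOWN_TAGS[word])
        else:
            # Unknown word: look at the tag before it and the word after it.
            prev_tag = tags[-1] if tags else None
            nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else None
            next_tag = "DET" if nxt in DETERMINERS else KNOWN_TAGS.get(nxt)
            if prev_tag in ("DET", "ADJ") and next_tag != "NOUN":
                # "the red ___ outside": most likely a noun.
                tags.append("NOUN")
            elif prev_tag == "DET" and next_tag == "NOUN":
                # "the ___ truck": most likely an adjective.
                tags.append("ADJ")
            else:
                tags.append("UNK")
    return list(zip(tokens, tags))


if __name__ == "__main__":
    for token, pos in tag("The driver parked the red lorry outside".split()):
        print(token, pos)
```

Even though “lorry” is not in the vocabulary, the preceding adjective and the following adverb are enough for this toy to call it a noun. A real model weighs many more clues than these two rules.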

That said, when developing support for French, Spanish and Portuguese, we did gather training content from a variety of sources representing different global locations.

In cases where the local jargon can make a difference, the power of the Salience data directory comes into play. Recently, at the Text Analytics World conference, I sat in on a talk by a vendor who was analyzing tweets in Spanish and had come across slang like “pajuo”, which is used in Venezuela to mean “silly”. Our Spanish POS model could identify this word properly as an adjective based on other contextual clues. But if a customer needed to guarantee that the word is always tagged that way, a simple addition to customlexicon.dat does the trick.
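The general pattern behind such an override can be sketched in a few lines of Python. This does not show the actual format of customlexicon.dat or any Salience API, neither of which is described here; the dictionary, the stub tagger and the function names below are hypothetical stand-ins. The point is only that a user-supplied entry takes precedence over whatever the model guesses.

```python
# Sketch of the override pattern only. CUSTOM_LEXICON is a hypothetical
# stand-in for user-supplied entries, not the real customlexicon.dat format.

CUSTOM_LEXICON = {"pajuo": "ADJ"}


def stub_model(tokens):
    """Placeholder for a real POS tagger: labels every token as unknown."""
    return [(token, "UNK") for token in tokens]


def tag_with_overrides(tokens, model_tagger, overrides):
    """Run the model, then let any override entry replace the model's tag."""
    result = []
    for token, model_tag in model_tagger(tokens):
        result.append((token, overrides.get(token.lower(), model_tag)))
    return result


if __name__ == "__main__":
    print(tag_with_overrides(["Ese", "tipo", "es", "pajuo"], stub_model, CUSTOM_LEXICON))
```

Words without an override keep whatever tag the model assigned, so the custom list only needs to contain the handful of terms a customer cares about.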

Conclusion

As our analysis of text continues to mature and expand to different languages and global markets, we’re keeping a close eye on the language aspects that could help or harm Salience’s ability to deliver accurate insight. One of our first updates to non-English language support extended the out-of-the-box data files for French to include a significant number of adjustments for the way that French is expressed on Twitter. The Salience architecture is flexible enough for either Lexalytics or customers to make these adjustments to accommodate the content they encounter.
