German language support in Salience

Like English, German is an Indo-European language of the Germanic branch. As such, many of the same tokenization and core NLP approaches apply to both. However, as with any language, German has certain unique aspects that must be considered to provide a true natural language approach.

Release history

Support for German is distributed as a separate data directory download from the Customer Support portal.

Release | Release Date | Notes
r7302 | 28 Aug 2013 | Update to remove invalid English data file from chainer directory, resulting in improved summaries, increased theme topics, and improvement in document sentiment.
r6877 | 01 Feb 2013 | Update to enhance phrase-based sentiment analysis. Expanded intensifiers, negators, and built-in phrase dictionary.
r6752 | 14 Dec 2012 | Update to address line-ending incompatibility in pattern files on Linux.
r6518 | 03 Jun 2012 | Initial release. Requires Salience 5.1.

Capitalization

One of the most noticeable differences between German and English is that German capitalizes far more words: all nouns are capitalized, not only proper nouns. This creates a problem when attempting to identify proper nouns, as capitalization is a major clue for distinguishing proper nouns in English (and many other languages). Consider the following example:

Die bisherigen Überlegungen sehen vor, ein Museum zum Kalten Krieg in einem Bürokomplex entstehen zu lassen, der auf einer noch unbebauten Fläche am Checkpoint Charlie entstehen soll.

Among other words in this sentence, Bürokomplex is capitalized. It might refer to a specific office complex, but it is not a specifically named place and thus not a proper noun or named entity.
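
To make the problem concrete, the following is a minimal sketch of a naive capitalization heuristic applied to the example sentence. It is not part of Salience and does not use the Salience API; the function name naive_proper_noun_candidates is hypothetical and exists only for illustration.

```python
import re


def naive_proper_noun_candidates(sentence: str) -> list:
    """Return every capitalized token except the sentence-initial one.

    This is the kind of capitalization-only rule that works tolerably
    in English. It is a hypothetical illustration, not Salience code.
    """
    tokens = re.findall(r"\w+", sentence)
    # Skip the first token, which is capitalized only because it starts the sentence.
    return [tok for tok in tokens[1:] if tok[0].isupper()]


german = (
    "Die bisherigen Überlegungen sehen vor, ein Museum zum Kalten Krieg "
    "in einem Bürokomplex entstehen zu lassen, der auf einer noch "
    "unbebauten Fläche am Checkpoint Charlie entstehen soll."
)

candidates = naive_proper_noun_candidates(german)
print(candidates)
```

The heuristic returns eight candidates, including common nouns such as Überlegungen, Museum, Bürokomplex, and Fläche, even though only Kalten Krieg and Checkpoint Charlie plausibly name specific things. Capitalization alone therefore over-generates proper nouns in German.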

The approach to addressing this aspect of German is not unique to the language. Across all of the languages it supports, rather than relying on massive dictionaries of all possible common nouns, verbs, and other parts of speech, Salience uses a model-based approach in which a large corpus of words is annotated with parts of speech in context. The result, for German, is a part-of-speech tagger that is more conservative about tagging a word as a proper noun on the basis of capitalization alone.