Japanese language support in Salience

Japanese support in Salience rounds out our support for the "CJK" languages (Chinese, Japanese, and Korean).


Release history

Support for Japanese is distributed as a separate data directory download from the Customer Support portal.

Release  Release Date  Notes
r8118    01 Sep 2014   Initial release. Requires Salience 5.2.0.7881.

Writing systems

Japanese is an East Asian language spoken primarily in Japan, where it is the national language. It has four writing systems: kanji, hiragana, katakana, and romaji.

Kanji (used for Japanese words and occasionally other East Asian words) and katakana (used for foreign words and onomatopoeia) carry most of the meaning-bearing pieces of the language: nouns, adjectives, verbs, and so on.

Hiragana is used for all syntactic information—tense information on verbs and i-adjectives, all particles denoting grammatical function of words, etc.—as well as for words that have no kanji representation.

Romaji (or Roman characters) are used for abbreviations and occasionally foreign words.

Both Arabic and kanji numerals are used in writing, though Arabic numerals are now generally preferred, especially for dates.
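The script of a character can usually be inferred from its Unicode code point. The sketch below is a minimal illustration and not part of Salience; the function names `script_of` and `scripts_in` are hypothetical, and the ranges cover only the basic Hiragana, Katakana, and CJK Unified Ideographs blocks.

```python
from typing import List


def script_of(ch: str) -> str:
    """Classify a single character by the writing system it belongs to."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:          # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:          # Katakana block (includes ・ and ー)
        return "katakana"
    # 々 (U+3005) is the repetition mark; it lies outside the main ideograph block.
    if 0x4E00 <= code <= 0x9FFF or ch == "々":
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "romaji"
    if ch.isdigit():
        return "numeral"
    return "other"


def scripts_in(text: str) -> List[str]:
    """Return the distinct scripts found in text, in order of first appearance."""
    seen: List[str] = []
    for ch in text:
        label = script_of(ch)
        if label not in seen:
            seen.append(label)
    return seen
```

Note that kanji numerals fall in the ideograph block, so this sketch labels them as kanji rather than as numerals.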

Parts of speech

Nouns do not take articles, have no grammatical gender, and do not change for number, though some nouns can be pluralized by taking a collective suffix such as たち or ら.

高橋 教授 ら
takahashi atsushi kyouju ra
Prof. Takahashi Atsushi & his group

Other nouns can be pluralized by repetition (e.g., ひと means ‘person’ and ひとびと means ‘people’). In kanji, the repetition mark 々 is used (e.g., 人, 人々).

Nouns are usually written in kanji. Borrowed nouns are written in katakana, and a small number of words are only ever written in hiragana.

Nouns may become verbs or adjectives with the appropriate suffixation.

They can become verbs by adding the verb する (‘to do’):

勉強 する
benkyou suru
to study (lit. to do study)

They can become adjectives by the addition of the particle の:

木 の 建物
ki no tatemono
wooden building

Pronouns as such do not exist, although there are some pronominal words. However, these pronominal words aren’t grammatically distinct from other nouns (for instance, they may take adjectives).

NA-adjectives are adjectives that are grammatically similar to nouns and, like nouns, are usually written using kanji, with very few exceptions. They modify nouns when followed by the particle な:

元気 な 人
genki na hito
healthy/energetic person

These adjectives may become adverbs with the addition of the particle に:

元気 に 走る
genki ni hashiru
to run energetically

I-adjectives are grammatically similar to verbs. They inflect for tense, politeness, and honorific speech and, unlike NA-adjectives, always end in –ai, –ii, –oi, or –ui, the i part of which is always in hiragana. These adjectives can become adverbs through inflection: the final い becomes く.

すごく 大きい 熊
sugoku ookii kuma
amazingly large bear
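The い-to-く inflection described above is a simple suffix replacement. The following is a minimal sketch, assuming the input is an I-adjective in its plain non-past form; the function name `i_adjective_to_adverb` is hypothetical and not part of Salience.

```python
def i_adjective_to_adverb(adjective: str) -> str:
    """Turn a plain non-past I-adjective into its adverbial form (い -> く)."""
    if not adjective.endswith("い"):
        raise ValueError(f"not an I-adjective in plain form: {adjective!r}")
    # Drop the final い and attach く.
    return adjective[:-1] + "く"


print(i_adjective_to_adverb("すごい"))  # すごく
print(i_adjective_to_adverb("大きい"))  # 大きく
```

The sketch does not check that the word really is an I-adjective. An NA-adjective that happens to end in い, such as きれい, would be converted incorrectly, and the irregular いい (whose adverb is よく) is not handled.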

Verbs come in three kinds: ichidan, godan, and fukisoku (irregular). All verbs inflect for tense, negation, mood, aspect, politeness, and honorific speech. Conjugation is very regular, even for the so-called irregular verbs. The two tenses are perfective (often considered past tense) and non-past. The two most common formality levels are plain and polite; of the two, plain appears to be the more common in both news articles and tweets.

Form                          Endings
vowel-stem verbs (ichidan)    All of these end with (い)る or (え)る, but some with that ending are consonant stem verbs.
consonant-stem verbs (godan)  End in う, く, ぐ, す, つ, ぬ, ぶ, む or る.
irregular verbs               Only two verbs: する (suru, to do) and 来(く)る (kuru, to come).
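The endings in the table above can be turned into a rough classifier. This is a heuristic sketch only, assuming the verb is given in dictionary (plain non-past) form; the name `guess_verb_class` and its return labels are hypothetical. Because some (い)る and (え)る verbs are consonant-stem verbs, the sketch reports such verbs only as candidates, and it reports a verb as ambiguous when a kanji hides the vowel before る.

```python
I_ROW = set("いきぎしじちぢにひびぴみり")    # hiragana ending in -i
E_ROW = set("えけげせぜてでねへべぺめれ")    # hiragana ending in -e
GODAN_ONLY_ENDINGS = set("うくぐすつぬぶむ")  # endings that never mark ichidan
IRREGULAR = {"する", "来る", "くる"}


def is_hiragana(ch: str) -> bool:
    return "\u3041" <= ch <= "\u3096"


def guess_verb_class(verb: str) -> str:
    """Guess the class of a verb given in dictionary form."""
    if verb in IRREGULAR:
        return "irregular"
    if len(verb) < 2:
        return "unknown"
    last, before = verb[-1], verb[-2]
    if last in GODAN_ONLY_ENDINGS:
        return "godan"
    if last == "る":
        if before in I_ROW or before in E_ROW:
            return "ichidan-candidate"   # may still be a godan exception
        if is_hiragana(before):
            return "godan"               # e.g. -aru, -uru, -oru
        return "ambiguous"               # kanji before る: reading is needed
    return "unknown"
```

A production system would consult a lexicon instead, since spelling alone cannot separate ichidan verbs from the godan exceptions.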

Particles, also called postpositions, show noun case—subject, object, indirect object, tool, etc. Particles follow the noun they modify.

  • wa (は, topic)
  • ga (が, subject)
  • o (を, direct object)
  • no (の, possession)
  • ni (に, indirect object marker, to, in, etc.)
  • kara (から, from)
  • made (まで, until, as far as)
  • de (で, using, at)

Some particles are used after sentences instead:

  • ka (か, question marker)
  • yo (よ, exclamatory marker)
  • to (と, quotation marker)

Particles also turn nouns into adjectives (の) and NA-adjectives into noun modifiers (な) or adverbs (に), as noted above.
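Because a particle follows the noun it marks, the role of a noun can be read from the token that comes after it. The sketch below assumes the text has already been segmented into tokens (Japanese is not written with spaces) and uses only the case particles listed above; `CASE_PARTICLES` and `attach_particles` are hypothetical names, not Salience identifiers.

```python
from typing import List, Optional, Tuple

# Case particles and glosses taken from the list above.
CASE_PARTICLES = {
    "は": "topic",
    "が": "subject",
    "を": "direct object",
    "の": "possession",
    "に": "indirect object, to, in",
    "から": "from",
    "まで": "until, as far as",
    "で": "using, at",
}


def attach_particles(tokens: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each word with the role of the particle that follows it, if any."""
    result: List[Tuple[str, Optional[str]]] = []
    i = 0
    while i < len(tokens):
        word = tokens[i]
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if following in CASE_PARTICLES:
            result.append((word, CASE_PARTICLES[following]))
            i += 2  # consume the word and its particle
        else:
            result.append((word, None))
            i += 1
    return result
```

For the tokens これ, は, 本, だ, the sketch marks これ as the topic and leaves 本 and だ unmarked. It does not disambiguate particles with several functions, such as に.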

Specific POS tags

Certain part-of-speech (POS) tags were added to the Salience tagset to support Japanese.

JJX
Adjectival suffix. Most of the words that are taken to be adjectives in Japanese are either variants of verbs or nouns; this suffix causes stems to semantically function as adjectives.
NNX
Dependent noun, which acts like a suffix in that it affixes to the end of a noun stem and cannot stand on its own, but is semantically like a full NN (e.g., 溶岩_NN流_NNX ‘lava flow’).
PREFIX
An affix placed at the beginning of a word that modifies the stem in the following ways:
* Contributes additional meaning (e.g., 反, similar to English ‘anti-’).
* Honorific prefixes:
  * ご signals that a word (typically a noun) should be read using its Chinese pronunciation, while お signals that a word (typically an adjective or verb) should be read using its traditional Japanese pronunciation. Both are used as polite markers.
  * み is rare and applies to words pertaining to Shinto or the emperor; it is sometimes used to express things poetically.
* Negation:
  * 未 indicates that something has not yet happened or that something will happen in the future (e.g., 未払い, ‘unpaid’).
  * 不 conveys negative polarity (e.g., 不可能, ‘impossible’).
  * 無 expresses nonexistence or absence (e.g., 無理, ‘unreasonable’).
  * 非 expresses a meaning similar to that of ‘non-’ (e.g., 非政府, ‘non-governmental’).
PUNC
Interpunct, or katakana middle dot (・), is a dot used for inter-word separation. Its uses include, but are not limited to:
* To separate listed items, instead of a comma (e.g., 小・中学校, ‘elementary and junior high schools’).
* To separate foreign words and names when written in katakana (e.g., パーソナル・コンピューター, ‘personal computer’; アルベルト・アインシュタイン, ‘Albert Einstein’).
* To separate titles, names, and positions (e.g., 部長補佐・鈴木, ‘Assistant Vice President Suzuki’).
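The NNX example above writes tagged text as a surface form, an underscore, and a tag, with tokens run together (溶岩_NN流_NNX). The sketch below splits text in that notation into (surface, tag) pairs. Whether Salience emits text in this form is not stated here, so treat the parser as illustrative only; `parse_tagged` is a hypothetical name.

```python
import re
from typing import List, Tuple

# A surface form (anything but an underscore), then "_", then an uppercase tag.
TAGGED_TOKEN = re.compile(r"([^_]+)_([A-Z]+)")


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split text in surface_TAG notation into (surface, tag) pairs."""
    return TAGGED_TOKEN.findall(text)


print(parse_tagged("溶岩_NN流_NNX"))  # [('溶岩', 'NN'), ('流', 'NNX')]
```

The pattern relies on the surface form not beginning with an uppercase ASCII letter directly after a tag; a token such as TOYOTA would need an explicit delimiter between tokens.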

The full set of Salience POS tags can be found on the page Supported part-of-speech tags.

Syntax

Japanese is an SOV (Subject-Object-Verb) language, referred to in linguistics as a head-final language. Because particles mark the grammatical function of each phrase, Japanese word order is relatively flexible, but sentences generally have the following structure:

Sentence Topic (or Subject) – [Time – Location – Subject (or Topic)] – Indirect Object – Direct Object – Verb

Japanese sentences are highly sensitive to context; thus, words or phrases obvious to both the speaker and the listener are often omitted, leading to such minimalist sentences as:

遅い osoi (you’re) late.

Japanese sentences also vary with formality. The same sentence can look very different depending on the level of formality or politeness:

This is a book.

Formality      Japanese               Romanization
informal       これは本だ             kore wa hon da.
polite         これは本です           kore wa hon desu.
formal         これは本である         kore wa hon de aru.
polite formal  これは本でございます   kore wa hon de gozaimasu.

This variance in sentence content and structure results in the following difficulties in sentiment analysis:

  1. If everything but the verb is omitted from a sentence, it is hard to tell what sentiment the sentence carries.
  2. Polite forms may seem to be more positive, since a polite speaker is trying to beautify a sentence for the listener.
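One way to keep politeness from being mistaken for positive sentiment is to identify the formality level separately. The sketch below recognizes only the four copula endings from the "This is a book." example above; `COPULA_ENDINGS` and `copula_formality` are hypothetical names, and this is not a description of how Salience handles formality.

```python
from typing import Optional

# Longest endings first, so でございます is not mistaken for a shorter ending.
COPULA_ENDINGS = [
    ("でございます", "polite formal"),
    ("である", "formal"),
    ("です", "polite"),
    ("だ", "informal"),
]


def copula_formality(sentence: str) -> Optional[str]:
    """Return the formality level implied by a sentence-final copula, if any."""
    text = sentence.rstrip("。.!?！？ ")  # ignore trailing punctuation
    for ending, level in COPULA_ENDINGS:
        if text.endswith(ending):
            return level
    return None
```

The check is purely string-based. A sentence ending in a plain past verb such as 読んだ would also be reported as informal, which happens to be the right level but for the wrong reason, and sentences without a copula, such as 遅い, return `None`.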

Named entities

Person names come in two forms: foreign names and Japanese names. Foreign names are almost always written in katakana and frequently have a ・ between the given name and the surname, although they are sometimes written in romaji. Foreign names rarely take honorific suffixes such as さん and 氏.

ダニエル・ラドクリフ Daniel Radcliffe

Japanese names are almost always written in kanji, although occasionally the first name is written in hiragana. Sometimes, nicknames are written in katakana, although this is rare. The surname is always first in Japanese names.

安倍晋三 Abe Shinzo
亀梨かずや Kamenashi Kazuya
マサ斎藤 MASA Saitou (MASA, written in katakana, is his ring name.)

Company names are significantly more variable. Companies' full names (including the company designation) are occasionally used, but more often the names are shortened as far as possible without impeding understanding. Thus, Tokyo Electric Power Company can appear in newspaper articles as 東京電力株式会社, 東京電力会社, 東京電力, 東電, 東電会社, or TEPCO. Toyota can be written as トヨタ自動車株式会社, トヨタ自動車, トヨタ, TOYOTA, or even TMC.

Company names occasionally occur in quotation marks, e.g., 「BスカイB」 (British Sky Broadcasting).
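Variation of this kind is commonly handled by mapping every known variant to one canonical name. The sketch below builds such a table from the variants listed above and strips surrounding 「」 quotation marks before lookup. The names `VARIANTS`, `ALIASES`, and `normalize_company` are hypothetical, and the approach is an illustration, not a description of the Salience entity data.

```python
from typing import Dict, List

# Canonical name -> variants, taken from the examples above.
VARIANTS: Dict[str, List[str]] = {
    "Tokyo Electric Power Company": [
        "東京電力株式会社", "東京電力会社", "東京電力", "東電", "東電会社", "TEPCO",
    ],
    "Toyota": [
        "トヨタ自動車株式会社", "トヨタ自動車", "トヨタ", "TOYOTA", "TMC",
    ],
}

# Invert the table so each variant points at its canonical name.
ALIASES: Dict[str, str] = {
    variant: canonical
    for canonical, variants in VARIANTS.items()
    for variant in variants
}


def normalize_company(mention: str) -> str:
    """Map a company mention to its canonical name; unknown mentions pass through."""
    stripped = mention.strip().strip("「」")
    return ALIASES.get(stripped, stripped)
```

Very short variants such as 東電 or TMC can collide with other words or organizations, so a real alias table would need context checks that this sketch omits.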

Place names also show great variability. Foreign place names are generally written in katakana, although some foreign places, such as the US, have Japanese names and are written in kanji. As with company names, place names can be shortened as far as possible without impeding understanding.

Thus, the United States of America can be アメリカ合衆国, アメリカ, 米国, 合衆国, or just 米.

Japanese place names are almost always written in kanji and frequently include the type of place (prefecture, city, ward, village) in the name, e.g., 広島県, hiroshima-ken, Hiroshima Prefecture. That said, the context sensitivity of Japanese holds here as well: a place whose type was previously mentioned may appear without it in later sentences, e.g., 広島, Hiroshima.
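Since the place type is a suffix, it can be separated from the base name so that 広島県 and 広島 resolve to the same place. The sketch below covers only the four place types named above, using their usual suffix characters (県, 市, 区, 村); `PLACE_TYPES` and `split_place` are hypothetical names.

```python
from typing import Optional, Tuple

# Suffix character -> place type, limited to the four types named above.
PLACE_TYPES = {
    "県": "prefecture",
    "市": "city",
    "区": "ward",
    "村": "village",
}


def split_place(name: str) -> Tuple[str, Optional[str]]:
    """Split a place name into (base name, place type or None)."""
    if len(name) > 1 and name[-1] in PLACE_TYPES:
        return name[:-1], PLACE_TYPES[name[-1]]
    return name, None


print(split_place("広島県"))  # ('広島', 'prefecture')
print(split_place("広島"))    # ('広島', None)
```

A suffix match alone cannot tell whether the final character is a place type or part of the name itself, so the result is best treated as a candidate to confirm against a gazetteer.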

Places such as museums or power plants are almost always written in kanji (or, if non-Japanese, katakana) and almost always have the word for museum or power plant suffixed to the place’s name, e.g., 近代美術館, kindaibijutsukan, Museum of Modern Art, or 高浜原発, takahamagenpatsu, Takahama Nuclear Power Plant.

Product names have great variability. One constant is that they frequently (though not always) appear in quotation marks:

「アサヒドライゼロ」
「AdSmart」
「アサヒスーパードライ」
「スカイ+HD」
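Because product names so often appear inside 「」, quoted spans make useful product-name candidates. The sketch below extracts them with a regular expression; `quoted_spans` is a hypothetical name. Company names can also appear in quotation marks, so the spans are candidates only and need further filtering.

```python
import re
from typing import List

# Text between 「 and 」, without nested corner brackets.
QUOTED = re.compile(r"「([^「」]+)」")


def quoted_spans(text: str) -> List[str]:
    """Return every span enclosed in 「」, in order of appearance."""
    return QUOTED.findall(text)


print(quoted_spans("「アサヒドライゼロ」"))  # ['アサヒドライゼロ']
```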

Sentiment

In Japanese, sentiment is most likely to be revealed in the adjectives and adverbs (which are constructed from adjectives) of a phrase. Nouns and verbs can also lend sentiment to a sentence, but in the majority of cases they tend to be fairly neutral.
