Support for Japanese is distributed as a separate data directory download from the Customer Support portal.
|r8118||01 Sep 2014||Initial release. Requires Salience 184.108.40.20681.|
Japanese is an East Asian language spoken primarily in Japan, where it is the national language. It has four writing systems: kanji, hiragana, katakana, and romaji.
Kanji (for Japanese and occasionally other East Asian words) and katakana (for foreign words and onomatopoeia) are used for most meaning-carrying parts of the language, such as nouns, adjectives, and verbs.
Hiragana is used for all syntactic information—tense information on verbs and i-adjectives, all particles denoting grammatical function of words, etc.—as well as for words that have no kanji representation.
Romaji (or Roman characters) are used for abbreviations and occasionally foreign words.
Both Arabic and Kanji numerals are used in writing, though lately Arabic numerals have been used preferentially, especially for dates.
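The writing systems described above occupy distinct Unicode ranges, so a character can be assigned to a script with simple range checks. The following is a minimal Python sketch, not part of Salience. The function name `script_of` and the label strings are assumptions made for illustration. Only the basic Hiragana, Katakana, and CJK Unified Ideographs blocks are covered, and full-width Latin letters are not normalized.

```python
def script_of(ch: str) -> str:
    """Classify one character by writing system using Unicode code points."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:          # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:          # Katakana block (also holds ・ and ー)
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF or ch == "々":  # basic kanji plus the repetition mark
        return "kanji"
    if ch.isascii() and ch.isalpha():     # unaccented Latin letters
        return "romaji"
    if ch.isascii() and ch.isdigit():     # 0-9
        return "arabic numeral"
    return "other"


print(script_of("東"))  # kanji
print(script_of("で"))  # hiragana
print(script_of("ト"))  # katakana
print(script_of("T"))   # romaji
print(script_of("7"))   # arabic numeral
```

Note that kanji numerals such as 三 are reported as kanji, since they live in the same Unicode block as other kanji.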
Nouns do not take articles, have no grammatical gender, and do not change for number, though some nouns can be pluralized by taking a collective suffix such as たち or ら.
|Prof. Takahashi Atsushi & his group|
Other nouns can be pluralized by repetition (e.g., ひと means person and ひとびと means people). When written in kanji, the repetition mark 々 is used (e.g., 人, 人々).
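One way to make the repetition mark explicit when processing text is to replace each 々 with the character it repeats. The sketch below is an illustration only. The function name `expand_repetition_mark` is an assumption, and whether such expansion helps a given tokenizer or dictionary lookup is also an assumption. It handles only single-character repetition and does not model the change in reading (ひと to びと).

```python
def expand_repetition_mark(text: str) -> str:
    """Replace each 々 with the character that precedes it."""
    out = []
    for ch in text:
        if ch == "々" and out:
            out.append(out[-1])   # repeat the previous character
        else:
            out.append(ch)        # ordinary character (or a leading 々)
    return "".join(out)


print(expand_repetition_mark("人々"))  # 人人
```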
Nouns are usually written in kanji, although borrowed nouns are written in katakana, and a few words are only ever written in hiragana.
Nouns may become verbs or adjectives with the appropriate suffixation.
They can become verbs by suffixing the verb する (‘to do’):
|to study (lit. to do study)|
They can become adjectives by the addition of the particle の:
Pronouns as such do not exist, although there are some pronominal words. However, these pronominal words aren’t grammatically distinct from other nouns (for instance, they may take adjectives).
NA-adjectives are adjectives that are grammatically similar to nouns and, like nouns, are usually written using kanji, with very few exceptions. They modify nouns when followed by the particle な:
These adjectives may become adverbs with the addition of the particle に:
|to run energetically|
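The suffixation patterns described in this section (する for verbs, の for adjectives, な for noun modification, に for adverbs) amount to string concatenation. A minimal Python sketch follows. The function names and the example words 勉強 (study) and 元気 (energetic) are chosen here for illustration and are not part of Salience.

```python
def noun_to_verb(noun: str) -> str:
    """Noun + する ('to do') gives a verb."""
    return noun + "する"


def noun_to_adjective(noun: str) -> str:
    """Noun + particle の gives an adjectival form."""
    return noun + "の"


def na_adjective_to_modifier(adjective: str) -> str:
    """NA-adjective + な modifies a following noun."""
    return adjective + "な"


def na_adjective_to_adverb(adjective: str) -> str:
    """NA-adjective + に gives an adverb."""
    return adjective + "に"


print(noun_to_verb("勉強"))            # 勉強する (to study)
print(na_adjective_to_adverb("元気"))  # 元気に (energetically)
```

In the other direction, a tokenizer has to decide whether a trailing する, の, な, or に belongs to the preceding word or stands as its own token. That decision is not shown here.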
I-adjectives are grammatically similar to verbs. They inflect for tense, politeness, and honorific speech and, unlike NA-adjectives, always end in -ai, -ii, -oi, or -ui; the final i is always written in hiragana (い). These adjectives can become adverbs through inflection, in which the い becomes く:
|amazingly large bear|
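The い to く inflection can be sketched as a one-character replacement. This is an illustrative Python snippet under stated assumptions. The function name is hypothetical and the input is assumed to be a known i-adjective in dictionary form. The check on the final character cannot tell i-adjectives apart from other words that happen to end in い (for example the NA-adjective きれい). The irregular adverb よく for いい is handled as a special case.

```python
def i_adjective_to_adverb(adjective: str) -> str:
    """Turn an i-adjective (dictionary form) into its adverbial form."""
    if not adjective.endswith("い"):
        raise ValueError(f"not an i-adjective dictionary form: {adjective!r}")
    if adjective == "いい":          # irregular: いい (good) becomes よく
        return "よく"
    return adjective[:-1] + "く"     # regular: final い becomes く


print(i_adjective_to_adverb("すごい"))  # すごく
print(i_adjective_to_adverb("大きい"))  # 大きく
```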
Verbs come in three kinds: ichidan, godan, and fukisoku (irregular). All verbs inflect for tense, negation, mood, aspect, politeness, and honorific speech. Verb conjugation is highly regular, even for the so-called irregular verbs. The two tenses are perfective (often considered past tense) and non-past. The two most common formality levels are plain and polite; of the two, plain appears to be the more common in both news articles and tweets.
|vowel-stem verbs (ichidan)||All of these end with (い)る or (え)る, but some with that ending are consonant stem verbs.|
|consonant-stem verbs (godan)||End in う, く, ぐ, す, つ, ぬ, ぶ, む or る.|
|irregular verbs||Only two verbs: する (suru, to do) and 来(く)る (kuru, to come).|
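The endings listed in the table can be turned into a rough first-pass classifier. The Python sketch below is a heuristic and does not describe how Salience classifies verbs. The function name and return labels are assumptions. The input is assumed to be the dictionary form written entirely in hiragana, because the vowel before る cannot be read off a kanji stem. As the table notes, some verbs ending in (い)る or (え)る are actually consonant-stem verbs, so that branch returns only a candidate label.

```python
I_ROW = set("いきぎしじちぢにひびぴみり")   # kana ending in the vowel i
E_ROW = set("えけげせぜてでねへべぺめれ")   # kana ending in the vowel e
GODAN_ENDINGS = set("うくぐすつぬぶむる")
IRREGULAR = {"する", "くる", "来る"}


def guess_verb_class(reading: str) -> str:
    """Guess the conjugation class of a verb from its hiragana dictionary form."""
    # する compounds (noun + する) conjugate like する. This check also
    # misfires on the few godan verbs whose reading ends in する.
    if reading in IRREGULAR or reading.endswith("する"):
        return "irregular"
    if len(reading) >= 2 and reading[-1] == "る" and reading[-2] in (I_ROW | E_ROW):
        return "ichidan-candidate"   # could still be a godan exception
    if reading and reading[-1] in GODAN_ENDINGS:
        return "godan"
    return "unknown"


print(guess_verb_class("たべる"))  # ichidan-candidate
print(guess_verb_class("かく"))    # godan
print(guess_verb_class("くる"))    # irregular
```

Resolving the candidate cases reliably requires a lexicon of exceptions such as かえる (帰る, to return), which conjugates as a consonant-stem verb.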
Particles, also called postpositions, show noun case—subject, object, indirect object, tool, etc. Particles follow the noun they modify.
Some particles are used after sentences instead:
Particles also turn nouns into adjectives and NA-adjectives into adverbs, as noted above.
Certain part-of-speech (POS) tags were added to the tagset used by Salience to accommodate the needs of support for Japanese.
The full set of Salience POS tags can be found on the page Supported part-of-speech tags.
Japanese is an SOV (Subject-Object-Verb) language, which linguists refer to as a head-final language. Because particles mark grammatical function, Japanese word order is relatively flexible, but sentences generally have the following structure:
Sentence Topic (or Subject) – [Time – Location – Subject (or Topic)] – Indirect Object – Direct Object – Verb
Japanese sentences are highly sensitive to context; thus, words or phrases obvious to both the speaker and the listener are often omitted, leading to such minimalist sentences as:
Japanese also varies by level of formality. The same sentence can look very different depending on the level of formality or politeness:
This is a book.
|kore wa hon da.||kore wa hon desu.||kore wa hon de aru.||kore wa hon de gozaimasu.|
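The four variants above differ only in the sentence-final copula, so one way to compare them is to strip the copula and keep the remainder. The following Python sketch works on the romanized forms shown in the table. The function name `strip_copula` is an assumption, and production text would use the kana forms (だ, です, である, でございます) and would normally be matched on tokens instead of raw strings.

```python
from typing import Optional, Tuple

# Longest forms first so that "de gozaimasu" is not cut short by a shorter match.
COPULAS = ("de gozaimasu", "de aru", "desu", "da")


def strip_copula(sentence: str) -> Tuple[str, Optional[str]]:
    """Return (sentence without its final copula, the copula found or None)."""
    text = sentence.strip().rstrip(".").strip()
    for copula in COPULAS:
        if text.endswith(" " + copula):
            return text[: -len(copula)].rstrip(), copula
    return text, None


for s in ("kore wa hon da.", "kore wa hon desu.",
          "kore wa hon de aru.", "kore wa hon de gozaimasu."):
    print(strip_copula(s))
# Each call returns "kore wa hon" as the first element.
```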
This variance in sentence content and structure results in the following difficulties in sentiment analysis:
Person names come in two forms: foreign names and Japanese names. Foreign names are almost always written in katakana and frequently have a ・ (middle dot) between the given name and the surname. Occasionally, however, foreign names are written in romaji. Foreign names rarely take honorific suffixes such as さん and 氏.
Japanese names are almost always written in kanji, although occasionally the first name is written in hiragana. Sometimes, nicknames are written in katakana, although this is rare. The surname is always first in Japanese names.
|マサ斎藤||MASA Saitou (MASA, written in Katakana, is his ringname.)|
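The conventions above (an optional honorific suffix, and ・ between the parts of a foreign name) can be sketched as a small parser. This Python snippet is illustrative only. The function name, the dictionary keys, and the example name ジョン・スミス (John Smith) are assumptions. Japanese names carry no delimiter, so they are returned unsplit; separating surname from given name would need a name lexicon.

```python
HONORIFICS = ("さん", "氏")


def parse_person_name(mention: str) -> dict:
    """Strip a trailing honorific and split foreign names at the last ・."""
    name = mention
    for suffix in HONORIFICS:
        if name.endswith(suffix) and len(name) > len(suffix):
            name = name[: -len(suffix)]   # drop the honorific
            break
    if "・" in name:
        # Foreign order: given name(s) first, surname last.
        given, _, surname = name.rpartition("・")
        return {"given": given, "surname": surname, "unsplit": None}
    return {"given": None, "surname": None, "unsplit": name}


print(parse_person_name("ジョン・スミス"))
# {'given': 'ジョン', 'surname': 'スミス', 'unsplit': None}
print(parse_person_name("斎藤さん"))
# {'given': None, 'surname': None, 'unsplit': '斎藤'}
```

Suffix stripping of this kind will misfire on the rare name that genuinely ends in 氏, so it is best treated as a heuristic.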
Company names are significantly more variable. Companies' full names (including their company designation) are occasionally used, but more often the names are shortened as far as possible without impeding understanding. Thus, Tokyo Electric Power Company can be seen in newspaper articles as 東京電力株式会社, 東京電力会社, 東京電力, 東電, 東電会社, or TEPCO. Toyota can be written as トヨタ自動車株式会社, トヨタ自動車, トヨタ, TOYOTA, or even TMC.
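Variation of this kind is commonly handled with an alias table that maps every surface form to one canonical entry. The Python sketch below uses only the variants listed above. The choice of canonical keys, the function name, and the table layout are assumptions for illustration and do not describe the format of Salience data files.

```python
import re

# Canonical name -> known surface forms (taken from the examples above).
COMPANY_ALIASES = {
    "東京電力": ["東京電力株式会社", "東京電力会社", "東京電力", "東電", "東電会社", "TEPCO"],
    "トヨタ": ["トヨタ自動車株式会社", "トヨタ自動車", "トヨタ", "TOYOTA", "TMC"],
}

# Reverse lookup: surface form -> canonical name.
LOOKUP = {alias: canon for canon, aliases in COMPANY_ALIASES.items() for alias in aliases}

# Longest aliases first, so 東京電力株式会社 wins over its prefix 東京電力.
PATTERN = re.compile(
    "|".join(re.escape(a) for a in sorted(LOOKUP, key=len, reverse=True))
)


def find_companies(text: str) -> list:
    """Return the canonical name for every company mention found in text."""
    return [LOOKUP[m.group()] for m in PATTERN.finditer(text)]


print(find_companies("東電とトヨタ自動車"))  # ['東京電力', 'トヨタ']
```

Short Latin aliases such as TMC can match inside unrelated strings, so a real system would add boundary checks or context rules.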
Company names occasionally occur in quotation marks, e.g. 「ＢスカイＢ」 (British Sky Broadcasting).
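Names set off in 「」 quotation marks can be pulled out with a regular expression and used as entity candidates. A minimal Python sketch follows. The function name `quoted_spans` is an assumption, and only the single corner brackets 「」 are handled, not other quotation styles such as 『』.

```python
import re

# Text between 「 and 」, excluding nested brackets.
QUOTED = re.compile(r"「([^「」]+)」")


def quoted_spans(text: str) -> list:
    """Return every string that appears inside 「」 quotation marks."""
    return QUOTED.findall(text)


print(quoted_spans("「ＢスカイＢ」(British Sky Broadcasting)"))  # ['ＢスカイＢ']
```

Quoted spans are only candidates: quotation marks also enclose speech and titles, so the result still needs to be checked against other evidence.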
Place names also show great variability. Foreign place names are generally written in katakana, although some foreign places, such as the US, have Japanese names and are written in kanji. As with company names, place names can also be shortened as far as possible without impeding understanding.
Thus, the United States of America can be アメリカ合衆国, アメリカ, 米国, 合衆国, or just 米.
Japanese place names are almost always written in kanji and frequently include the type of place (prefecture, city, ward, village) in the name, e.g. 広島県, hiroshima-ken, Hiroshima prefecture. That said, the context sensitivity of Japanese holds true here as well. A place whose type was previously mentioned may appear without it in later sentences, e.g. 広島, Hiroshima.
Places such as museums or power plants are almost always written in kanji (or, if non-Japanese, katakana) and almost always have the word for museum or power plant suffixed to the place’s name, e.g. 近代美術館, kindaibijutsukan, Museum of Modern Art, or 高浜原発, takahamagenpatsu, Takahama Nuclear Power Plant.
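Because the place type may or may not be present, it can help to separate it from the base name so that 広島県 and 広島 can be matched to each other. The Python sketch below is an illustration under stated assumptions. The function name and the suffix table are hypothetical, and the table covers only the four place types named above (other suffixes such as 都, 府, and 町 also exist). Some place names legitimately end in one of these characters, so the split is a heuristic.

```python
# Place-type suffix -> English gloss (only the types mentioned in the text).
PLACE_TYPES = {"県": "prefecture", "市": "city", "区": "ward", "村": "village"}


def split_place_type(name: str) -> tuple:
    """Split a trailing place-type suffix from a place name, if present."""
    if len(name) > 1 and name[-1] in PLACE_TYPES:
        return name[:-1], PLACE_TYPES[name[-1]]
    return name, None


print(split_place_type("広島県"))  # ('広島', 'prefecture')
print(split_place_type("広島"))    # ('広島', None)
```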
Product names have great variability. One constant is that they frequently (though not always) appear in quotation marks:
In Japanese, sentiment is most likely to be revealed in the adjectives and adverbs (which are constructed from adjectives) of a phrase. Nouns and verbs can also lend sentiment to a sentence, but in the majority of cases they tend to be fairly neutral.