Supported Languages Limitations
Japanese syllabic characters:
Half-width kana characters are not correctly indexed.
For example,
anata written in kana:
あなた
is correctly tokenized and indexed, i.e. each character is
indexed as one word.
The same word written in half-width kana:
ｱﾅﾀ
is indexed as one word, although it consists of three
characters.
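The difference between the two spellings can be inspected outside the indexer. The following Python sketch is an illustration only and is not part of the product. It uses the standard unicodedata module to show that the half-width form consists of three separate code points, and that NFKC normalization maps them to full-width katakana. Normalizing text this way before it is indexed is an assumed workaround, not a documented feature, and the function names are hypothetical.

```python
import unicodedata

# "anata" in half-width katakana, written with escapes so the
# code points survive copy and paste: U+FF71 U+FF85 U+FF80.
HALF_WIDTH_ANATA = "\uFF71\uFF85\uFF80"


def normalize_halfwidth_kana(text: str) -> str:
    """Hypothetical pre-processing step: NFKC folds half-width
    katakana into their full-width equivalents."""
    return unicodedata.normalize("NFKC", text)


def east_asian_widths(text: str) -> list[str]:
    """Return the East Asian Width property of each character
    ("H" = half-width, "W" = wide)."""
    return [unicodedata.east_asian_width(ch) for ch in text]


full_width = normalize_halfwidth_kana(HALF_WIDTH_ANATA)

# Both strings contain three characters; only the width class differs.
print(len(HALF_WIDTH_ANATA), east_asian_widths(HALF_WIDTH_ANATA))
print(len(full_width), east_asian_widths(full_width))
```

NFKC also folds other compatibility characters, such as full-width Latin letters, so if it is used at all it would have to be applied to the indexed text and to the query text alike.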
Japanese logographic characters:
IDEOGRAPHIC HALF FILL SPACE (U+303F) characters are not treated as
whitespace.
If kanji text contains
IDEOGRAPHIC HALF FILL SPACE characters, these characters are indexed
as well.
For example,
anata written in kanji:
貴方
is correctly tokenized and indexed.
The same word written in kanji, but with an
IDEOGRAPHIC HALF FILL SPACE inserted between the kanji characters:
貴〿方
is tokenized and indexed as three characters, because the
IDEOGRAPHIC HALF FILL SPACE is indexed as a character of its own.
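In the Unicode character database, U+303F has the general category So (symbol, other) rather than Zs (space separator). Whether this is the reason for the indexer's behavior is an assumption, not something stated here. The Python sketch below shows the classification with the standard unicodedata module and compares it with U+3000 IDEOGRAPHIC SPACE. The helper strip_half_fill_space is hypothetical and only illustrates one possible way to clean text before it is indexed.

```python
import unicodedata

HALF_FILL_SPACE = "\u303F"    # IDEOGRAPHIC HALF FILL SPACE
IDEOGRAPHIC_SPACE = "\u3000"  # IDEOGRAPHIC SPACE, for comparison


def strip_half_fill_space(text: str) -> str:
    """Hypothetical pre-processing step: drop every U+303F so that
    it cannot be indexed as a character of its own."""
    return text.replace(HALF_FILL_SPACE, "")


# U+303F is classified as a symbol, U+3000 as a space separator.
print(unicodedata.name(HALF_FILL_SPACE))
print(unicodedata.category(HALF_FILL_SPACE), HALF_FILL_SPACE.isspace())
print(unicodedata.category(IDEOGRAPHIC_SPACE), IDEOGRAPHIC_SPACE.isspace())

# "anata" in kanji with U+303F between the two characters.
sample = "\u8CB4\u303F\u65B9"
cleaned = strip_half_fill_space(sample)
print(len(sample), len(cleaned))
```

Removing the character joins the two kanji again. Replacing it with an ordinary space instead would split the text into two tokens, so the choice depends on how the surrounding text should be searched.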