Tokenization
Tokenization is the process of splitting a text into words.
Important: Recommind’s applications rely on tokenization with the default settings. Never change any of these settings without first discussing your use case with Customer Support.
During data loading, text in whitespace-separated scripts is split into words at the white space. Chinese and Japanese scripts, as well as the Korean Hanja script, do not separate words with white space, so CORE cannot identify the words in these scripts. Instead, text in these scripts is split into individual characters, that is, each character becomes one word.
Additionally, words are split at so-called delimiter characters, for example, the comma or the hyphen. The delimiter characters themselves are not stored as words. Therefore, you cannot search for delimiter characters. The set of delimiter characters is configurable.
Words containing numbers are split into fragments consisting of letters or numbers only. This allows content to be stored more efficiently. For example, abc123 is split into abc and 123.
The words in search queries are split in the same way. The resulting fragments are converted into a directed proximity search if a fragment contains a wildcard operator, or into a phrase search otherwise.
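The splitting rules described so far can be illustrated with a minimal Python sketch. It is not CORE's implementation: the delimiter set, the Unicode ranges used to detect Chinese and Japanese characters, and the function name tokenize are assumptions chosen for illustration only.

```python
import re

# Hypothetical delimiter set; in the product, the set is configurable.
DELIMITERS = ",-.@'"

# Assumed Unicode ranges: Hiragana/Katakana and CJK Unified Ideographs.
CJK = "\u3040-\u30ff\u4e00-\u9fff"

# A fragment is a run of digits, a single CJK character,
# or a run of anything else (letters, in practice).
FRAGMENT = re.compile(f"[0-9]+|[{CJK}]|[^0-9{CJK}]+")


def tokenize(text: str) -> list[str]:
    """Split text into words following the rules described above."""
    # Delimiters separate words but are not stored themselves.
    table = str.maketrans({d: " " for d in DELIMITERS})
    words = []
    for chunk in text.translate(table).split():
        words.extend(FRAGMENT.findall(chunk))
    return words


print(tokenize("abc123"))             # ['abc', '123']
print(tokenize("red-brown fox, 42"))  # ['red', 'brown', 'fox', '42']
```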
Example:
Query: *brown@recommind.com
After the delimiter characters are removed, the query is split into three fragments:
- *brown
- recommind
- com
The fragment with the wildcard is expanded, for example, to mrbrown OR brown. The query as a whole is then processed as:
(mrbrown OR brown) pre/1 "recommind com"
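The rewriting in this example can be sketched in Python as follows. This is only an illustration under assumptions: the delimiter set, the function names, and the expand callback that stands in for the wildcard expansion are hypothetical, and grouping consecutive plain fragments into one phrase is inferred from the single example above.

```python
import re

DELIMITERS = "@.,-"  # hypothetical delimiter set for this sketch


def split_fragments(term: str) -> list[str]:
    """Split a query term at delimiter characters, dropping empty pieces."""
    return [f for f in re.split("[" + re.escape(DELIMITERS) + "]", term) if f]


def rewrite_term(term: str, expand) -> str:
    """Rewrite one query term into a phrase / directed proximity expression.

    expand is a hypothetical callback that returns the words
    matching a wildcard fragment.
    """
    parts = []   # finished sub-expressions, in order
    plain = []   # consecutive fragments without a wildcard

    def flush():
        # Several plain fragments in a row become one phrase search.
        if plain:
            parts.append(('"' + " ".join(plain) + '"') if len(plain) > 1 else plain[0])
            plain.clear()

    for fragment in split_fragments(term):
        if "*" in fragment:
            flush()
            parts.append("(" + " OR ".join(expand(fragment)) + ")")
        else:
            plain.append(fragment)
    flush()
    # Sub-expressions must follow each other directly, in this order.
    return " pre/1 ".join(parts)


print(rewrite_term("*brown@recommind.com", lambda f: ["mrbrown", "brown"]))
# (mrbrown OR brown) pre/1 "recommind com"
```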
Special delimiter characters
Because a phrase search takes more time than a plain keyword search, a subset of the delimiter characters can be declared special. At a special delimiter character, the query is still split, but the fragments are not turned into a phrase search or a directed proximity search.
Example:
The apostrophe is a special delimiter character by default. If you search for:
i don't like monday's meetings
the apostrophes are removed, but the resulting fragments are not turned into phrase queries. The query is processed as:
i don t like monday s meetings
If you removed the apostrophe from the list of special delimiter characters, the same query would be processed as:
i "don t" like "monday s" meetings
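The difference between special and ordinary delimiter characters can be sketched in Python. The delimiter sets and function names below are assumptions for illustration and do not reflect the product's actual configuration or code.

```python
import re


def split_at(text: str, chars: str) -> list[str]:
    """Split text at any character in chars, dropping empty fragments."""
    if not chars:
        return [text] if text else []
    return [f for f in re.split("[" + re.escape(chars) + "]", text) if f]


def rewrite_query(query: str, delimiters: str = "'-,", special: str = "'") -> str:
    """Turn fragments split at ordinary delimiters into phrases.

    Fragments split at special delimiters stay plain keywords.
    Both delimiter sets are hypothetical defaults for this sketch.
    """
    ordinary = "".join(d for d in delimiters if d not in special)
    out = []
    for term in query.split():
        for piece in split_at(term, special):
            fragments = split_at(piece, ordinary)
            if len(fragments) > 1:
                out.append('"' + " ".join(fragments) + '"')
            else:
                out.extend(fragments)
    return " ".join(out)


query = "i don't like monday's meetings"
print(rewrite_query(query))              # i don t like monday s meetings
print(rewrite_query(query, special=""))  # i "don t" like "monday s" meetings
```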
Example:
The term ID12345 is treated as two words, ID and 12345. As such, searching for the term ID12345 always searches for documents that contain ID and 12345, either combined without any space or separated by only one space or a delimiter character, such as a comma. The default treatment of alphanumerical terms reduces the memory usage for searches and thus makes them quicker. However, splitting alphanumerical terms is not always desired.
Deactivating the splitting of alphanumerical words ensures that letter-number combinations remain intact. The term ID12345 is then treated as one word, and searching for ID12345 only finds documents that contain the complete term.
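The effect of this setting on search results can be illustrated with a small Python sketch. The function names and the split_alnum flag are hypothetical and do not correspond to a real CORE setting name. The sketch treats only white space and the comma as separators and, as a simplification, does not enforce the limit of a single separator between the two fragments.

```python
import re


def tokenize(text: str, split_alnum: bool = True) -> list[str]:
    """Split at white space and commas; optionally split letters from digits."""
    words = []
    for chunk in re.split(r"[\s,]+", text):
        if not chunk:
            continue
        if split_alnum:
            words.extend(re.findall(r"[0-9]+|[^0-9]+", chunk))
        else:
            words.append(chunk)
    return words


def matches(query: str, document: str, split_alnum: bool = True) -> bool:
    """True if the query's words occur consecutively in the document."""
    q = tokenize(query, split_alnum)
    d = tokenize(document, split_alnum)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


# With splitting (the default), all three spellings are found.
print(matches("ID12345", "ref ID12345 filed"))   # True
print(matches("ID12345", "ref ID 12345 filed"))  # True
print(matches("ID12345", "ref ID,12345 filed"))  # True

# Without splitting, only the complete term is found.
print(matches("ID12345", "ref ID 12345 filed", split_alnum=False))  # False
print(matches("ID12345", "ref ID12345 filed", split_alnum=False))   # True
```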