Tokenization

Tokenization is the process of splitting a text into words.

Important: Recommind’s applications rely on the default tokenization settings. Never change any of them without first discussing your use case with Customer Support.

During data loading, text in scripts that separate words with white space is split into words at the white space. Chinese and Japanese scripts, as well as the Korean Hanja script, do not separate words with white space, so CORE cannot identify the words in these scripts. Text in these scripts is instead split into individual characters, that is, each character becomes one word.

Additionally, words are split at so-called delimiter characters, for example, the comma or the hyphen. The delimiter characters themselves are not stored as words. Therefore, you cannot search for delimiter characters. The set of delimiter characters is configurable.

Words containing numbers are split into fragments consisting of letters only or numbers only. This allows content to be stored more efficiently. For example, abc123 is split into abc and 123.
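
The splitting rules above can be illustrated with a minimal Python sketch. It is not CORE's implementation. The delimiter set, the character ranges used to detect Chinese, Japanese, and Hanja characters, and the function name tokenize are assumptions chosen for illustration only.

```python
import re

# Assumed delimiter set for illustration; the real set is configurable.
DELIMITERS = ",-.@'"

# Rough approximation of scripts written without white space:
# Hiragana, Katakana, and CJK Unified Ideographs (which include Hanja).
CJK = "\u3040-\u30ff\u4e00-\u9fff"

# One piece is a single CJK character, a run of digits,
# or a run of anything else (letters).
PIECES = re.compile("[" + CJK + "]|\\d+|[^\\d" + CJK + "]+")


def tokenize(text, delimiters=DELIMITERS):
    """Split text into words following the rules described above."""
    # Delimiter characters separate words and are not stored themselves.
    cleaned = "".join(" " if ch in delimiters else ch for ch in text)
    words = []
    for chunk in cleaned.split():  # split at white space
        words.extend(PIECES.findall(chunk))
    return words


print(tokenize("abc123"))           # letters and numbers are separated
print(tokenize("red, green-blue"))  # delimiters are dropped
print(tokenize("東京2020"))          # each CJK character becomes one word
```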

The words in search queries are split in the same way. The resulting fragments are converted into a directed proximity search if a fragment contains a wildcard operator, or into a phrase search otherwise.

Example:  

Query: *brown@recommind.com

After the delimiter characters are removed, the query is split into three fragments:

*brown
recommind
com

The fragment with the wildcard is expanded, for example, to mrbrown OR brown. The complete query is then processed as:

(mrbrown OR brown) pre/1 "recommind com"
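
The rewriting in this example can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual query processor: the function name rewrite_query_term, the delimiter set, and the vocabulary list used to expand the wildcard are hypothetical. In the sketch, consecutive fragments without a wildcard are grouped into one phrase, and the resulting parts are joined with pre/1.

```python
import fnmatch
import re


def rewrite_query_term(term, vocabulary, delimiters="@.,-"):
    """Rewrite one query term into a phrase or directed proximity search."""
    # Split at the (assumed) delimiter characters and drop empty fragments.
    pattern = "[" + re.escape(delimiters) + "]"
    fragments = [f for f in re.split(pattern, term) if f]

    parts = []  # finished parts of the rewritten query
    plain = []  # consecutive fragments without a wildcard

    def flush():
        # Turn collected plain fragments into a keyword or a phrase.
        if plain:
            if len(plain) > 1:
                parts.append('"' + " ".join(plain) + '"')
            else:
                parts.append(plain[0])
            plain.clear()

    for fragment in fragments:
        if "*" in fragment:
            flush()
            # Expand the wildcard against a hypothetical list of indexed words.
            hits = [w for w in vocabulary if fnmatch.fnmatchcase(w, fragment)]
            parts.append("(" + " OR ".join(hits or [fragment]) + ")")
        else:
            plain.append(fragment)
    flush()

    # Assumption: the parts are connected by a directed proximity of 1.
    return " pre/1 ".join(parts)


indexed_words = ["mrbrown", "brown", "green"]
print(rewrite_query_term("*brown@recommind.com", indexed_words))
```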

Special delimiter characters

Because a phrase search takes more time than a plain keyword search, a subset of the delimiter characters can be declared special. When a query is split at a special delimiter character, the fragments are not turned into a phrase search or a directed proximity search.

Example:  

The apostrophe is a special delimiter character by default. If you search for:  

i don't like monday's meetings

the apostrophes are removed, but the resulting fragments are not turned into phrase queries. The query is processed as:  

i don t like monday s meetings

If you removed the apostrophe from the list of special delimiter characters, the same query would be processed as:

i "don t" like "monday s" meetings
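
The difference between the two results can be reproduced with a short Python sketch. The function name rewrite_query and both character sets are assumptions for illustration. The sketch also assumes that a term is turned into a phrase as soon as it contains at least one delimiter that is not special, and it leaves wildcards aside.

```python
import re


def rewrite_query(query, delimiters="',-.@", special="'"):
    """Split query terms at delimiters; build phrases unless the delimiter is special."""
    pattern = "[" + re.escape(delimiters) + "]"
    out = []
    for term in query.split():
        fragments = [f for f in re.split(pattern, term) if f]
        used = {ch for ch in term if ch in delimiters}
        if len(fragments) > 1 and not used <= set(special):
            # A non-special delimiter was used: keep the fragments together.
            out.append('"' + " ".join(fragments) + '"')
        else:
            # Only special delimiters (or none): plain keywords.
            out.extend(fragments)
    return " ".join(out)


query = "i don't like monday's meetings"
print(rewrite_query(query))              # apostrophe is special
print(rewrite_query(query, special=""))  # apostrophe is an ordinary delimiter
```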

Example:  

The term ID12345 is treated as two words, ID and 12345. Consequently, a search for ID12345 always finds documents that contain ID followed by 12345, either combined without any space, or separated by only one space or a delimiter character, such as a comma. This default treatment of alphanumerical terms reduces the memory usage for searches and thus makes them quicker. However, splitting alphanumerical terms is not always desired.

Deactivating the splitting of alphanumerical words ensures that letter-number combinations remain intact. The term ID12345 is then treated as one word, and searching for ID12345 finds only documents that contain the complete term.
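
The effect of this setting on search results can be sketched in Python. The names words and matches and the split_alphanumeric parameter are hypothetical and do not correspond to actual configuration options. The sketch only compares word sequences, treats the comma as the only delimiter, and does not model how many separators may appear between two words.

```python
import re


def words(text, split_alphanumeric=True):
    """Split text at white space and commas, optionally splitting letters from digits."""
    result = []
    for chunk in re.split(r"[\s,]+", text):
        if not chunk:
            continue
        if split_alphanumeric:
            # Separate runs of digits from runs of non-digits.
            result.extend(re.findall(r"\d+|\D+", chunk))
        else:
            result.append(chunk)
    return result


def matches(term, document, split_alphanumeric=True):
    """Return True if the words of term occur consecutively in document."""
    needle = words(term, split_alphanumeric)
    haystack = words(document, split_alphanumeric)
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


# Default behavior: ID12345 also matches "ID 12345" and "ID,12345".
print(matches("ID12345", "see ID 12345 for details"))
# Splitting deactivated: only the complete term matches.
print(matches("ID12345", "see ID 12345 for details", split_alphanumeric=False))
print(matches("ID12345", "see ID12345 for details", split_alphanumeric=False))
```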