Splitting Examples
The examples below show how documents and their attachments and embeddings are handled under different split settings.
Searchable documents may also contain embedded non-searchable images, such as screen captures. Splitting off embeddings allows you to determine which embeddings are candidates for optical character recognition (OCR).
By default, images of different sizes exist as separate Embedding types:
- icons are images smaller than 10 KB;
- smallIcons are images smaller than 7 KB; and
- largeImages are images larger than 10 KB.
Split rules are then defined for each, so an image of a specific size can be handled differently depending on the Container type it is associated with. For instance, by default, when contained in Microsoft Office items, images smaller than 10 KB are ignored, while images larger than 10 KB are not split.
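The size thresholds and default Microsoft Office behavior described above can be sketched as a small lookup. This is only an illustration, not the product's implementation: the function names, the rule table, the action labels, and the assumption that 1 KB equals 1024 bytes are all hypothetical. The text does not say how an image of exactly 10 KB is treated, so the sketch returns no action in that case.

```python
from typing import List, Optional

KB = 1024  # assumption: 1 KB = 1024 bytes

def embedding_types(size_bytes: int) -> List[str]:
    """Return every default image Embedding type whose size range matches."""
    types = []
    if size_bytes < 7 * KB:
        types.append("smallIcons")
    if size_bytes < 10 * KB:
        types.append("icons")
    if size_bytes > 10 * KB:
        types.append("largeImages")
    return types

# Hypothetical rule table for the Microsoft Office Container type,
# mirroring the defaults stated in the text.
OFFICE_RULES = {
    "icons": "ignore",           # images smaller than 10 KB are ignored
    "largeImages": "not_split",  # images larger than 10 KB stay in the parent
}

def office_action(size_bytes: int) -> Optional[str]:
    """Look up the default action for an image inside an Office item."""
    for embedding_type in embedding_types(size_bytes):
        if embedding_type in OFFICE_RULES:
            return OFFICE_RULES[embedding_type]
    return None  # the text does not define this case

print(office_action(5 * KB))   # ignore
print(office_action(20 * KB))  # not_split
```

Keeping the rules in a table per Container type reflects the idea that the same image size can be handled differently depending on where the image is embedded.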
You can change how images of different sizes are handled. These settings are located on the Embeddings node of the data source configuration.
Suppose you have a Microsoft Outlook email with two attachments. If you disable splitting for Outlook email, the result is as follows:
The email has the storage type File. The attachments have no storage type because they were not split. The text of the attachments is included in the email text and indented.
Outlook email with attachments; splitting disabled
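As a rough illustration of the unsplit case, the following sketch builds the text of a single indexed item by appending each attachment's text, indented, to the email text. The function name, the four-space indent, and the sample strings are assumptions made for this example and do not reflect the product's actual output format.

```python
import textwrap
from typing import List

def merge_unsplit(email_text: str, attachment_texts: List[str],
                  indent: str = "    ") -> str:
    """Combine an email and its attachments into one text, attachments indented."""
    parts = [email_text]
    for attachment_text in attachment_texts:
        # textwrap.indent prefixes every non-blank line of the attachment text.
        parts.append(textwrap.indent(attachment_text, indent))
    return "\n".join(parts)

merged = merge_unsplit(
    "Please find the reports attached.",
    ["Report A\nFirst quarter figures", "Report B\nSecond quarter figures"],
)
print(merged)
```

Because only one item is produced, the attachments carry no storage type of their own; their content is only reachable through the email.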
With the default settings for Microsoft Word documents, a Word document that contains an embedded Microsoft Excel sheet, several images that are each smaller than 7 KB, and one image that is larger than 10 KB is handled as follows:
- Default splitting applies to the embedded Excel sheet. This means the Excel sheet is split off, indexed as a separate document, and becomes part of the Word document's family. In this case, splitting off the embedding ensures that the Excel spreadsheet receives a full review, so hidden content is not missed.
- The split rule for icons applies to the small images because, at less than 7 KB each, they are below the 10 KB threshold for icons. This means these images are ignored.
- The split rule for largeImages applies to the image that is larger than 10 KB. This means the image is not split, but is kept in the Word document.
Search results differ depending on whether documents are split. For example, suppose the subject of an email contains the word tomato, its attachment contains content about soup, and you search for tomato soup.
- If attachments are split from emails (default setting):
Neither the email nor the attachment is returned as part of the search result because neither contains both words: tomato and soup.
- If attachments are not split from emails:
The email is returned as part of the search result because the content of the attachment was added to the index of the email.
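The tomato soup example can be reproduced with a toy index. This is a minimal sketch that assumes the search requires every query word to appear in the same indexed item; the function names, the dictionaries standing in for the index, and the sample texts are hypothetical.

```python
from typing import Dict, List

def matches_all(text: str, query: str) -> bool:
    """True if every word of the query occurs in the text (case-insensitive)."""
    words = set(text.lower().split())
    return all(term in words for term in query.lower().split())

def search(index: Dict[str, str], query: str) -> List[str]:
    """Return the names of all indexed items that contain every query word."""
    return [name for name, text in index.items() if matches_all(text, query)]

# Attachments split from emails: two separate indexed items.
split_index = {
    "email": "Subject: tomato",
    "attachment": "A recipe for soup",
}

# Attachments not split: the attachment text is added to the email's index entry.
unsplit_index = {
    "email": "Subject: tomato\n    A recipe for soup",
}

print(search(split_index, "tomato soup"))    # []
print(search(unsplit_index, "tomato soup"))  # ['email']
```

Searching for a single word, such as soup, would return the attachment in the split case and the email in the unsplit case, which shows how the split setting determines which item a hit is attributed to.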