Junk Detection Introduction

During a data source crawl and during indexing, the system can optionally detect documents that may contain junk content.

When an index engine indexes junk content as plain text, the number of unique words grows quickly. This inflates the word map considerably and impairs overall performance. To avoid this, you can use the following types of junk detection:

  • Crawler junk detection detects binary data. It does not apply to CSV data load. Crawler junk detection replaces the document with an exception document.
  • Index engine junk detection detects unusually long strings that are potentially binary data or terms that are not known to a junk detection dictionary. This type of junk detection can be applied to all loaded files, including CSV data load. It allows full indexing of all document parts that are not detected as junk. Files erroneously marked as junk can be reloaded using exception resolution, which automatically disables this type of junk detection so that the documents are loaded without junk detection exceptions. Index engine junk detection consumes more memory than crawler junk detection.
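
The difference between the two types can be illustrated with a minimal Python sketch. This is not the product's implementation: the function names, the length limit, the byte threshold, and the small dictionary are all hypothetical values chosen for illustration, and the sketch assumes that a token counts as junk if it is either unusually long or missing from the dictionary.

```python
import re
from typing import List, Tuple

# Hypothetical values for illustration only; the product's real limits
# and dictionary contents are not described in this document.
MAX_TOKEN_LENGTH = 40
KNOWN_TERMS = {"quarterly", "report", "revenue", "for", "the", "region"}

TOKEN_PATTERN = re.compile(r"\S+")


def looks_binary(data: bytes, threshold: float = 0.30) -> bool:
    """Crawler-style check: flag the whole document if it looks binary.

    Simplification: only printable ASCII, tab, LF, and CR count as text,
    so heavily non-ASCII text would also be flagged by this sketch.
    """
    if not data:
        return False
    if b"\x00" in data:
        return True
    text_bytes = set(range(32, 127)) | {9, 10, 13}
    non_text = sum(1 for byte in data if byte not in text_bytes)
    return non_text / len(data) > threshold


def split_junk(text: str) -> Tuple[List[str], List[str]]:
    """Index-engine-style check: separate indexable tokens from junk tokens.

    A token is treated as junk if it is unusually long or unknown to the
    dictionary (an assumed reading of the rule); everything else is kept.
    """
    keep: List[str] = []
    junk: List[str] = []
    for token in TOKEN_PATTERN.findall(text):
        word = token.strip(".,;:!?()[]\"'").lower()
        if not word:
            continue  # token consisted of punctuation only
        if len(word) > MAX_TOKEN_LENGTH or word not in KNOWN_TERMS:
            junk.append(token)
        else:
            keep.append(token)
    return keep, junk
```

In this sketch, looks_binary rejects a document as a whole, which mirrors how crawler junk detection replaces the document with an exception document. split_junk works per token, so the remaining parts of the document can still be indexed.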

Junk detection marks detected documents as exception documents. You can filter for them with the Exception Type and Exception Class Smart Filters.

Junk detection can be enabled for data loading or for publishing. For best system performance, we recommend applying junk detection when loading data.