Fast and Low-memory Ingestion: Metadata Index

With Metadata Index, you can ingest documents with as little memory and disk usage as possible and in minimal time. Full processing is done later, during the publish.

Metadata Index is useful if you have to ingest large data volumes but expect that only a part of them will be subject to review, and no text-based culling is required to identify the documents to be published.

In that case, it does not make sense to fully index the document content. Instead, you can postpone a large part of document processing to the publish, and only for those documents that will actually be subject to review.

For Metadata Index, you create an application of the special type Metadata Only Ingestion. For this application, you create a file system data source and configure filters, if needed.
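
How this application is set up depends on your environment and is normally done in the administration interface. The following Python sketch only illustrates the kind of settings such an application combines; all names and fields are assumptions for illustration, not the product's actual API or configuration format:

    # Hypothetical sketch only: the kind of settings a Metadata Only Ingestion
    # application combines. Names and fields are assumptions, not a real API.
    metadata_only_application = {
        "application_type": "Metadata Only Ingestion",  # the special application type
        "data_sources": [
            {
                "type": "file_system",
                "root_path": r"\\fileserver\evidence\custodian_a",  # example location
                "filters": {
                    "nist_filter": True,              # skip known system files (stubs are kept)
                    "excluded_mime_types": ["application/x-msdownload"],
                    "index_archive_files": True,      # archive containers get stub documents
                },
            }
        ],
    }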

Once you have ingested documents, you can do some early case assessment based on document metadata, for example, delete documents or define saved searches for relevant documents. You can also run exception resolution.
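
As an illustration of what metadata-based early case assessment can look like, the following Python sketch filters ingested records by custodian, MIME type, and date, the way a saved search might. The field names and values are assumptions, not the product's actual schema or query language:

    from datetime import date

    # Illustrative only: the kind of metadata criteria an early case assessment
    # or saved search can use before any document text is indexed.
    ingested_documents = [
        {"custodian": "Doe, Jane", "mime_type": "message/rfc822", "sent_date": date(2023, 3, 14)},
        {"custodian": "Roe, Richard", "mime_type": "application/pdf", "sent_date": date(2022, 7, 2)},
    ]

    def matches_saved_search(meta: dict) -> bool:
        """Hypothetical saved search: Jane Doe's email sent during 2023."""
        return (
            meta.get("custodian") == "Doe, Jane"
            and meta.get("mime_type") == "message/rfc822"
            and date(2023, 1, 1) <= meta.get("sent_date", date.min) <= date(2023, 12, 31)
        )

    review_candidates = [d for d in ingested_documents if matches_saved_search(d)]
    print(len(review_candidates))  # -> 1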

Ingestion results include:

  • standalone documents and their metadata
  • documents extracted from email archives, including their metadata and attachment family ID, but without extracting the attachments

  • stub documents representing filtered documents, to guarantee that no information is missing for the chain of custody:
    • stub documents for documents that are ignored due to filter settings in the data source configuration, such as NIST filters or MIME type filters. These placeholder documents only exist in the index. The original file is not copied.
    • stub documents for archive files (if indexing archive files is enabled). These placeholder documents only exist in the index. The original file is not copied.
  • hash value calculation for indexed documents
  • MIME type detection
  • native file copies, except for filtered documents
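
Conceptually, the list above amounts to a per-document index entry that carries metadata plus a few computed fields, while the content itself stays in the native file. The following Python sketch summarizes such an entry; the class and field names are assumptions for illustration, not the product's real schema:

    from dataclasses import dataclass
    from typing import Optional

    # Illustrative summary of the list above, not the product's real schema.
    @dataclass
    class MetadataOnlyRecord:
        document_id: str
        metadata: dict                     # file system or email metadata captured at ingestion
        hash_value: str                    # hash calculated for the indexed document
        mime_type: str                     # detected MIME type
        family_id: Optional[str] = None    # links an item extracted from an email archive to its family
        is_stub: bool = False              # True for filtered documents and archive containers
        native_path: Optional[str] = None  # path of the native file copy; None for stubs (not copied)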

Ingestion results do not include:

  • Attachment splitting

    All attachments are still contained in the native file of the root document.

  • Embedding processing

    All embeddings are still contained in the native file of the root document.

  • Searchable content, that is, document text

    The document text is still in the native files.

When ingestion is done, you can set up the matter and publish documents. Publishing includes attachment and embedding processing, language detection, and so on, that is, everything needed to fully review the documents. Compared to a publish that follows an ingestion with full document processing, such a publish takes approximately 30% more time. However, because you publish only a subset of the documents, the overall time saving is still high.
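
To see why the overall saving remains large despite the slower publish, consider a rough back-of-the-envelope calculation in Python. Only the roughly 30% publish overhead comes from this section; the document counts and per-document timings below are invented for illustration:

    # Back-of-the-envelope comparison, using the ~30% publish overhead from above.
    # All other numbers (document counts, per-document costs) are assumptions.
    total_docs = 1_000_000
    reviewed_docs = 100_000          # assumed: only 10% ever reach review

    ingest_full = 10.0               # assumed ms/doc for full-processing ingestion
    ingest_meta = 2.0                # assumed ms/doc for metadata-only ingestion
    publish_after_full = 5.0         # assumed ms/doc for publish after full ingestion
    publish_after_meta = publish_after_full * 1.3   # ~30% more per published document

    full_pipeline = total_docs * ingest_full + reviewed_docs * publish_after_full
    meta_pipeline = total_docs * ingest_meta + reviewed_docs * publish_after_meta

    print(f"full processing: {full_pipeline / 3.6e6:.1f} h")   # ~2.9 h
    print(f"metadata only:   {meta_pipeline / 3.6e6:.1f} h")   # ~0.7 h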