Duplicate Detection

Duplicate detection is the automatic detection of identical documents.

Identical documents are called duplicates. The system detects them by comparing hash values that are computed during data loading. For electronic communication items, such as emails, chats, calendar entries, and contacts, hash values are calculated from specific fields. For all other documents, duplicate detection is based on MD5 hash values.
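The two hashing strategies can be illustrated with a minimal Python sketch. It is not the product's implementation. The field list EMAIL_FIELDS, the separator, the function names, and the use of MD5 for the field-based hash are assumptions made for illustration only.

```python
import hashlib

# Hypothetical field selection: the fields the product actually hashes
# are not listed in this topic.
EMAIL_FIELDS = ("sender", "recipients", "subject", "sent", "body")


def content_hash(data: bytes) -> str:
    """MD5 hash of a document's bytes (used here for non-communication items)."""
    return hashlib.md5(data).hexdigest()


def communication_hash(item: dict) -> str:
    """Hash built from selected fields only, so other metadata is ignored.

    Assumption: MD5 over the field values joined with a unit separator.
    """
    joined = "\x1f".join(str(item.get(field, "")) for field in EMAIL_FIELDS)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```

In this sketch, two emails that agree in the selected fields receive the same hash even if other metadata, such as a storage path, differs.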

By default, duplicates are detected across the complete document collection. Duplicate detection can also be restricted to documents that belong to the same custodian.
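The difference between the two scopes can be sketched as follows. This is an illustration under assumptions, not the product's logic: documents are modeled as dictionaries with hypothetical keys "id", "custodian", and "hash", and find_duplicates is a made-up helper name.

```python
from collections import defaultdict


def find_duplicates(docs, per_custodian=False):
    """Group document IDs that share a hash value.

    With per_custodian=True, the custodian becomes part of the grouping key,
    so identical documents held by different custodians are not grouped.
    """
    groups = defaultdict(list)
    for doc in docs:
        key = (doc["custodian"], doc["hash"]) if per_custodian else doc["hash"]
        groups[key].append(doc["id"])
    # Only groups with more than one member contain duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]
```

For example, if two custodians each hold one copy of the same file, the default scope reports the two copies as duplicates, while the custodian scope reports none.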

There are two types of duplicate detection.

  • In Axcelerate Ingestion, the system detects duplicates only if they have the same formal characteristics. If a document is a standalone document, its duplicate must also be a standalone document. If a document is part of a family, its duplicate must also be part of a family. Attachments and embeddings inherit the duplicate status from the parent document of their family. Users initiate this type of duplicate detection.
  • During publishing or matter export, the system automatically identifies duplicates regardless of whether a document is a standalone document or part of a family. Only the hash values must match.
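The two comparison rules can be contrasted in a short sketch. It is a simplification under assumptions: the Doc class, its fields, and both function names are hypothetical, and the comparison is shown only for standalone documents and family parents, since attachments and embeddings inherit their status from the parent.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    hash_value: str
    in_family: bool  # False for a standalone document


def same_at_ingestion(a: Doc, b: Doc) -> bool:
    """Ingestion rule: hashes match and both are standalone or both are in a family."""
    return a.hash_value == b.hash_value and a.in_family == b.in_family


def same_at_publishing(a: Doc, b: Doc) -> bool:
    """Publishing or matter export rule: only the hash values must match."""
    return a.hash_value == b.hash_value
```

A standalone file and an identical file that belongs to a family therefore count as duplicates under the second rule only.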

Note: For new projects, it is recommended to normalize date fields. Normalized dates allow the system to identify more duplicates in electronic communication items.
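Why normalization helps can be shown with a small sketch. What the product's normalization does in detail is not described in this topic, so the following is an assumption: dates are converted to UTC and truncated to whole seconds before they enter the hash. The function name normalize_date is hypothetical.

```python
from datetime import datetime, timezone


def normalize_date(value: str) -> str:
    """Convert an ISO 8601 timestamp to UTC with second precision."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).replace(microsecond=0).isoformat()
```

With this kind of normalization, two copies of the same email whose sent dates were stored with different time zone offsets or different sub-second precision yield the same date string, and therefore the same field-based hash.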

 

Normalize fields for de-duplication

Basic Hash Computation Fields