Duplicate Detection
Duplicate detection automatically identifies identical documents, which are called duplicates.
The system detects duplicates by comparing hash values that are computed during data loading. For emails, you can define which fields are used to compute the hash value.
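The effect of the field selection for emails can be illustrated with a minimal sketch. It is not the product's implementation: the function `email_hash`, the field names, and the choice of SHA-256 are assumptions made for illustration only.

```python
import hashlib

def email_hash(email: dict, fields: list[str]) -> str:
    """Hash only the selected fields (hypothetical helper, not a product API)."""
    h = hashlib.sha256()
    for name in fields:
        value = str(email.get(name, ""))
        # Separate name and value with a NUL byte so that adjacent
        # fields cannot run together and produce the same input.
        h.update(name.encode("utf-8") + b"\x00")
        h.update(value.encode("utf-8") + b"\x00")
    return h.hexdigest()

# Two copies of the same message that differ only in a transport-level field.
a = {"from": "ann@example.com", "subject": "Q3 report",
     "body": "See attached.", "message_id": "<1@x>"}
b = {"from": "ann@example.com", "subject": "Q3 report",
     "body": "See attached.", "message_id": "<2@y>"}

content_fields = ["from", "subject", "body"]
same_without_id = email_hash(a, content_fields) == email_hash(b, content_fields)
same_with_id = (email_hash(a, content_fields + ["message_id"])
                == email_hash(b, content_fields + ["message_id"]))
```

In this sketch, the two emails count as duplicates when only the content fields are hashed, and as distinct documents when the message ID is included. This is why the choice of fields changes how many duplicates are found.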
By default, duplicates are detected across the complete document collection. Alternatively, duplicate detection can be restricted to the documents that belong to a specific custodian.
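The difference between the two scopes can be sketched as a change of grouping key. The record layout (`id`, `custodian`, `hash`) and the function `find_duplicates` are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

def find_duplicates(docs, per_custodian=False):
    """Group document IDs that share a hash (illustrative sketch).

    With per_custodian=True, documents are compared only within the
    same custodian; otherwise across the whole collection.
    """
    groups = defaultdict(list)
    for doc in docs:
        key = (doc["custodian"], doc["hash"]) if per_custodian else doc["hash"]
        groups[key].append(doc["id"])
    # Keep only groups that contain more than one document.
    return sorted(ids for ids in groups.values() if len(ids) > 1)

docs = [
    {"id": 1, "custodian": "alice", "hash": "h1"},
    {"id": 2, "custodian": "bob",   "hash": "h1"},
    {"id": 3, "custodian": "alice", "hash": "h1"},
    {"id": 4, "custodian": "bob",   "hash": "h2"},
]

collection_wide = find_duplicates(docs)
custodian_only = find_duplicates(docs, per_custodian=True)
```

Here, `collection_wide` contains one group with documents 1, 2, and 3, whereas `custodian_only` contains only documents 1 and 3, because document 2 belongs to a different custodian.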
There are two types of duplicate detection:
- In Axcelerate Ingestion, the system detects duplicates only if they have the same formal characteristics. If a document is a standalone document, its duplicate must also be a standalone document. If a document is part of a family, its duplicate must also be part of a family. Attachments and embeddings inherit the duplicate status from the parent document of their family. Users initiate this type of duplicate detection.
- During publishing or matter export, the system automatically detects duplicates regardless of whether a document is a standalone document or part of a family. Only the hash values must match.
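The two types can be contrasted in a simplified sketch. The record layout (`id`, `parent`, `has_children`, `hash`) and both functions are assumptions for illustration. In particular, the sketch assumes that `parent` always points to the top-level document of the family and that the first document encountered is the one kept as the original.

```python
def family_aware_duplicates(docs):
    """Ingestion-style check: standalone matches standalone, family matches family."""
    seen, status = set(), {}
    for doc in docs:
        if doc["parent"] is None:
            # The key combines the formal characteristic with the hash.
            key = (doc["has_children"], doc["hash"])
            status[doc["id"]] = key in seen
            seen.add(key)
    for doc in docs:
        if doc["parent"] is not None:
            # Attachments and embeddings inherit the parent's status.
            status[doc["id"]] = status[doc["parent"]]
    return status

def hash_only_duplicates(docs):
    """Publishing-style check: only the hash value matters."""
    seen, status = set(), {}
    for doc in docs:
        status[doc["id"]] = doc["hash"] in seen
        seen.add(doc["hash"])
    return status

docs = [
    {"id": "mail1", "parent": None,    "has_children": True,  "hash": "m"},
    {"id": "att1",  "parent": "mail1", "has_children": False, "hash": "a"},
    {"id": "loose", "parent": None,    "has_children": False, "hash": "a"},
    {"id": "mail2", "parent": None,    "has_children": True,  "hash": "m"},
    {"id": "att2",  "parent": "mail2", "has_children": False, "hash": "a"},
]

ingestion = family_aware_duplicates(docs)
publishing = hash_only_duplicates(docs)
```

In this example, the standalone file `loose` has the same hash as the attachment `att1`. The family-aware check does not mark it as a duplicate, because one document is standalone and the other belongs to a family. The hash-only check does mark it.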
Note: For new projects, it is recommended to normalize date fields. Normalized dates allow the system to identify more duplicates in electronic communication items.
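Why normalization can help is shown in the following sketch. It assumes that normalization means converting timestamps to UTC before they are hashed. The source does not specify what the product's normalization does, and `normalize_date` is a hypothetical helper.

```python
from datetime import datetime, timezone

def normalize_date(value: str) -> str:
    """Convert an ISO 8601 timestamp to UTC (illustrative assumption)."""
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat()

# The same moment in time, recorded with different UTC offsets.
sent_a = "2021-03-04T10:15:30+01:00"
sent_b = "2021-03-04T09:15:30+00:00"

raw_match = sent_a == sent_b
normalized_match = normalize_date(sent_a) == normalize_date(sent_b)
```

The raw strings differ, so a hash that includes the date field would differ as well. After normalization, both values are identical, and the two items can be recognized as duplicates.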