Near Duplicate Detection

Near duplicates are documents whose content is very similar to that of the current document. The system detects them automatically by comparing document text.

Near duplicates may differ in metadata, content, formatting, numbers, typos, and other small text snippets.

Numbers are treated in a specific way that you should be aware of: each number is replaced by a generic NUMBER marker, so documents with nearly identical text but very different numbers are still considered near duplicates. In contrast, if two documents have nearly identical text but only one of them also contains numbers, they are probably not considered near duplicates.
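
The effect of the NUMBER marker can be illustrated with a small sketch. This is not the system's actual algorithm, which is not described here. It assumes, purely for illustration, that similarity is measured as the Jaccard overlap of three-word shingles; the names normalize_numbers, shingles, and similarity and the regular expression for numbers are hypothetical.

```python
import re

# Assumption: a "number" is a run of digits, optionally with . or , separators.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def normalize_numbers(text: str) -> str:
    """Replace every number with the generic NUMBER marker."""
    return NUMBER_RE.sub("NUMBER", text)


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of the number-normalized text."""
    # Lowercase first so the NUMBER marker stays distinct from the word "number".
    words = normalize_numbers(text.lower()).split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two shingle sets (illustrative measure only)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


doc_a = "Invoice 1001 total 250.00 EUR due in 30 days"
doc_b = "Invoice 2087 total 13,75 EUR due in 14 days"
doc_c = "Invoice total EUR due in days"

# Same text, different numbers: identical after normalization.
print(similarity(doc_a, doc_b))
# Same text, but only one document contains numbers: low overlap.
print(similarity(doc_a, doc_c))
```

In this sketch, doc_a and doc_b become the same string once their numbers are replaced, so they score as identical. doc_c has no NUMBER markers at all, so most of its shingles differ from those of doc_a and the score drops sharply, which mirrors the behavior described above.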

Near duplicate detection takes place during publishing.