Basic Hash Computation Fields

List of standard fields in electronic communication that can be used for hash computation.

Location: document model: Categorization > Content analysis > Deduplication Re-Hash > Deduplication per document type > <Type name>

Text type

The display name of the field that is used for email hash computation.

Allowed values: listed names

Default value:

None

Remove whitespace

Hash value computation must be invariant with respect to whitespace in the body and subject fields of emails. This includes regular spaces, tabs, newlines, and line feeds. The industry specification for deduplication indicates that all whitespace should be removed from the Subject and Body fields.

If this check box is activated for a text type, whitespace is ignored for hash value computation. By default, whitespace removal is activated for the Subject and Body fields. This ensures that, for example, emails whose subjects or bodies differ only in whitespace are assigned the same hash value and are treated as duplicates.

If the check box is deactivated, whitespace is included in the hash value computation.

Note: If whitespace is included, emails that are otherwise identical can receive different hash values and are then not recognized as duplicates.

Allowed values:

false
true

Default value:

depends on field
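
The effect of the Remove whitespace option can be illustrated with a minimal sketch. The function name hash_without_whitespace and the use of SHA-256 are assumptions made for illustration only; this section does not state which hash algorithm the product uses.

```python
import hashlib
import re


def hash_without_whitespace(text: str, remove_whitespace: bool = True) -> str:
    """Hash a field value, optionally stripping all whitespace first.

    Hypothetical helper: SHA-256 is an assumed stand-in for the
    product's actual hash algorithm.
    """
    if remove_whitespace:
        # \s covers spaces, tabs, newlines, and other whitespace characters.
        text = re.sub(r"\s+", "", text)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


subject_a = "Quarterly  report\n"
subject_b = "Quarterly report"

# With whitespace removal, both subjects produce the same hash value.
same = hash_without_whitespace(subject_a) == hash_without_whitespace(subject_b)

# Without whitespace removal, the hash values differ.
different = hash_without_whitespace(subject_a, False) != hash_without_whitespace(subject_b, False)
```

In this sketch both subjects reduce to the same string once whitespace is removed, so they would be treated as duplicates.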

Content Extraction pattern

Specifies a Perl regular expression that selects the content used for hash computation. Only content that matches the pattern is included. If this field is left empty, the complete content is used.

Allowed values: regular expression

Default value:

depends on field
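
The following sketch shows one way a content extraction pattern could narrow the input to the hash. It uses Python's re module, whose syntax is similar to but not identical with Perl regular expressions. The function name extract_for_hash, the choice to concatenate all matches, and the example pattern are assumptions; how the product combines multiple matches is not described here.

```python
import re


def extract_for_hash(content: str, pattern: str = "") -> str:
    """Return the part of the content that would be hashed.

    Hypothetical helper: an empty pattern keeps the complete content;
    otherwise all matches are concatenated (assumed behavior).
    """
    if not pattern:
        return content
    return "".join(match.group(0) for match in re.finditer(pattern, content))


# Only the digits are kept, so surrounding text no longer affects the hash input.
print(extract_for_hash("Ref 123, item 45", r"\d+"))

# An empty pattern leaves the content unchanged.
print(extract_for_hash("Ref 123, item 45"))
```

Two values that differ only outside the matched parts yield the same extracted string and therefore the same hash input.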

Ignore

If the check box is activated, differences between upper and lower case are ignored for hash value computation.

If the check box is not activated, texts that differ only in case lead to different hash values.

Allowed values:

false
true

Default value:

true
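
As a minimal sketch of case-insensitive hashing, the text can be normalized to a single case before the hash is computed. The function name hash_ignoring_case, the use of str.casefold for normalization, and SHA-256 are assumptions for illustration, not the product's documented behavior.

```python
import hashlib


def hash_ignoring_case(text: str, ignore_case: bool = True) -> str:
    """Hash a field value, optionally folding case first.

    Hypothetical helper: SHA-256 and casefold() are assumed choices.
    """
    if ignore_case:
        # casefold() is a more aggressive form of lower() intended for comparisons.
        text = text.casefold()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# With the option active, case differences do not change the hash value.
equal_when_ignored = hash_ignoring_case("Meeting Notes") == hash_ignoring_case("MEETING NOTES")

# With the option inactive, the same two texts hash differently.
differ_when_kept = hash_ignoring_case("Meeting Notes", False) != hash_ignoring_case("MEETING NOTES", False)
```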