OCR (Image Text Extraction)

- CORE Administrator
If the HTML or Text view of a document contains no text, there is often a simple reason: the native document is an image or a PDF file that contains only images, or there is no native document at all, only an image ingested with CSV load. In such cases, you can try to extract searchable text from the native document or image using OCR.
The OCR function is available in CORE Administration.
Important: If you want the system to use OCR text for learning and phrase detection, run OCR before publishing documents.
Caution: Text that is replaced with OCR text cannot be restored.

In most cases, OCR applies to documents for which exceptions occurred during ingestion.
Only documents with these exception classes are excluded from OCR:
- File format
- Archive
- Password protected/encrypted
To identify these documents, use the Exception Class Smart Filter.
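
The exclusion rule above can be illustrated with a minimal sketch. This is not part of the product and does not use any product API: the `Document` record, the `is_excluded_from_ocr` helper, the document IDs, and the "Some other exception class" label are all hypothetical and only show how the three excluded exception classes separate documents that cannot be processed with OCR from those that can.

```python
from dataclasses import dataclass
from typing import Optional

# Exception classes that the text above lists as excluded from OCR.
EXCLUDED_EXCEPTION_CLASSES = {
    "File format",
    "Archive",
    "Password protected/encrypted",
}


@dataclass
class Document:
    """Hypothetical stand-in for a document record; not a product type."""
    doc_id: str
    exception_class: Optional[str] = None  # None means no ingestion exception


def is_excluded_from_ocr(doc: Document) -> bool:
    """Return True if the document's exception class rules out OCR."""
    return doc.exception_class in EXCLUDED_EXCEPTION_CLASSES


# Made-up sample data for illustration only.
docs = [
    Document("A-001", "File format"),
    Document("A-002", "Archive"),
    Document("A-003", "Password protected/encrypted"),
    Document("A-004", "Some other exception class"),  # placeholder label
    Document("A-005"),
]

# Keep only the documents that the exclusion rule does not rule out.
ocr_candidates = [d.doc_id for d in docs if not is_excluded_from_ocr(d)]
print(ocr_candidates)
```

In the application itself, the equivalent selection is made with the Exception Class Smart Filter rather than with code.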

Certain indexed documents contain no document text, only metadata. However, native documents that are image files may contain text that could not be retrieved during data loading, and you may want to use this text for a matter. OCR is typically useful in the following situations:
- You notice that files that usually contain text are mostly composed of images.
- After data loading from a CSV data source, if the CSV file references image files.
- After a CSV Merge that updates images.