How to help AI get the most from legacy archives

AI applications for business intelligence processing can extract information from unstructured “born digital” documents, but many archives include years (or decades) of low-quality PDF files and older scanned documents that contain important business information.

About the author: Thomas Zellmann has been working in electronic data processing (EDP) for more than 30 years and has extensive experience with classic and modern IT solutions. Prior to joining LuraTech/Foxit in 2001 he worked …

Thomas Zellmann
November 20, 2019


There's an accelerating trend towards using Artificial Intelligence (AI) solutions to improve business processes related to existing data and documents. Modern AI algorithms offer many possibilities but need good data to achieve meaningful results. Solutions exist for both scanned documents and digital documents that lack extractable text and relevant document structures such as headings, lists and tables.

Image-based content

[Figure: Screenshot of a redacted page showing an email address missed by OCR software.]

AI applications can extract and utilize data from well-made but unstructured "born digital" documents comparatively easily. However, many archives include decades of older scanned documents, and there's always incoming paper mail to be digitized on a daily basis.

Often, scans are stored in Adobe's TIFF format, a proprietary technology that retains documents as mere clouds of pixels, with no awareness of document characteristics. In other cases, existing digital content such as email is partially destroyed by printing and scanning it simply to send it into the same archival workflow as paper mail. This type of content presents major difficulties for AI processes.

Step 1 for so-called "pixel cloud" formats such as TIFF is optical character recognition (OCR), which makes these documents searchable through character and word recognition.
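Before OCR can run, an archive first has to find its "pixel cloud" files. The following is a minimal sketch of that triage step, assuming the archive is a folder of files and that every TIFF should be queued for OCR. The function names and the routing rule are illustrative assumptions, not part of any particular product; only the leading bytes of the TIFF and PDF formats are taken from the formats themselves.

```python
from pathlib import Path

# Leading bytes ("magic numbers") of the formats discussed in this article.
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")  # little-endian and big-endian TIFF
PDF_MAGIC = b"%PDF-"


def classify_header(header: bytes) -> str:
    """Return a coarse format label based on a file's first bytes."""
    if header.startswith(TIFF_MAGICS):
        return "tiff"
    if header.startswith(PDF_MAGIC):
        return "pdf"
    return "other"


def needs_ocr_first(path: Path) -> bool:
    """Assumed triage rule for this sketch: every TIFF goes to the OCR queue."""
    with path.open("rb") as handle:
        return classify_header(handle.read(8)) == "tiff"
```

Checking the file content rather than the file extension is a deliberate choice here, since legacy archives often contain misnamed files. A PDF that passes this check may still consist only of scanned images, so it would need a further look at its text layer.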

The marketplace, including many PDF Association members, offers many solutions for OCRing TIFFs and other formats, resulting in conversion to searchable PDF files. When choosing an OCR solution, the quality of recognition for the type of content you process is of course a decisive criterion.

Consider OCRing the "born digital" content too

In the context of AI applications whose scope is limited to letters and words, or if input files are of poor quality, it is sometimes useful to consider OCR processing even for "born digital" content. The end-product of OCR processing may be more reliable and plainer, and thus more appropriate for the AI's needs, than results garnered from attempts to extract text from the source PDF file.
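One way to decide whether a "born digital" file deserves OCR is to inspect the text that was extracted from it. The sketch below is a rough heuristic only: it assumes the text has already been extracted by some other tool, and the thresholds (`min_chars`, `min_ratio`) and the punctuation set are arbitrary illustrative values that would need tuning against real archive content.

```python
# Punctuation accepted as "ordinary" text in this sketch (an assumption).
COMMON_PUNCTUATION = set(".,;:!?'\"()-@/%&")


def text_layer_looks_usable(text: str, min_chars: int = 50,
                            min_ratio: float = 0.8) -> bool:
    """Guess whether extracted text is clean enough to pass to an AI process.

    Returns False when there is almost no text (typical of image-only pages)
    or when too many characters are neither alphanumeric nor common
    punctuation (typical of broken font encodings).
    """
    stripped = "".join(text.split())  # ignore all whitespace
    if len(stripped) < min_chars:
        return False
    good = sum(1 for ch in stripped
               if ch.isalnum() or ch in COMMON_PUNCTUATION)
    return good / len(stripped) >= min_ratio
```

Files for which this returns `False` would be candidates for the OCR route described above, while the rest can continue with their existing text.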

Text alone is good, but text + structure is much better

By itself, OCR yields letters and words without information about the document's structure (headings, tables, lists). This structure can be included via the "tagged PDF" feature, commonly used for accessibility so that blind users and others with print disabilities can read the document.

Likewise, one can think of AI applications as "blind" in that they need some assistance in understanding the plain text stream extracted from documents. A simple example is a table. Without the information that establishes the table's columns and rows, neither a blind user nor an AI application will understand the purpose of the content.
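The table example can be made concrete with a small sketch. The invoice data below is invented purely for illustration, and `flatten` only imitates what plain text extraction does to a table; it is not the behavior of any specific extraction tool.

```python
# Invented sample table: one header row and two data rows.
header = ["Invoice", "Customer", "Amount"]
rows = [
    ["1001", "Acme", "250.00"],
    ["1002", "Birch", "75.50"],
]


def flatten(header, rows):
    """Imitate plain text extraction: cells in reading order, structure lost."""
    return " ".join(header + [cell for row in rows for cell in row])


def lookup(header, rows, key_column, key, value_column):
    """Answer a question that is only possible when rows and columns are known."""
    key_index = header.index(key_column)
    value_index = header.index(value_column)
    for row in rows:
        if row[key_index] == key:
            return row[value_index]
    return None


print(flatten(header, rows))
print(lookup(header, rows, "Invoice", "1002", "Amount"))
```

In the flattened string, nothing indicates that "75.50" is an amount or that it belongs to invoice 1002. With the row and column structure preserved, as tagged PDF allows, the same question becomes a simple lookup.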

Some of our members' solutions support auto-tagging, wherein the product automatically recognizes as much structure as software can. As with OCR, auto-tagging alone doesn't result in 100% accuracy, but many structures, including tables, are often well recognized and can therefore be better handled by the AI.

Conclusion

Organizations with existing document collections in "pixel cloud" formats can deploy modern OCR and auto-tagging to prepare critical business information as "feedstock" for AI applications.
