Data – Eyes and Ears of the AI

Carsten Luedtge // July 1, 2020


The global volume of data continues to grow strongly. Above all, unstructured data in the form of photos, audio files and videos as well as presentations and text documents will grow disproportionately - according to the market research institute IDC by an average of 62 percent annually. By 2022, this data type is expected to account for around 93 percent of the total volume.¹

According to a Gartner definition, unstructured data includes "all content that does not correspond to a specific, predefined data model. It's usually human-generated and person-related content that doesn't fit well into databases." Yet such data often contains valuable customer and behavioural information whose evaluation can be important for well-founded decisions.

In addition, in-depth analysis of unstructured data forms the basis for better and expanded services, which can even lead to completely new business models. IDC expects companies that analyze all relevant data by 2020 to achieve a productivity gain of $430 billion over less analytically oriented competitors.²

Currently, companies are still looking for truly efficient solutions to convert unstructured data into structured data. They face a number of challenges, ranging from the question of geographic location, the type of data storage and governance, to securing and analyzing this information in local and cloud environments. So it is hardly surprising that the MIT Sloan Group classifies 80 percent of all data as untrustworthy, inaccessible or not analyzable. IDC estimates that by 2020 the "digital universe" will contain up to 37 percent of information that could be valuable if analyzed.³

Digitization Means Automation

One thing is certain: structured and analyzable data is the basic prerequisite for the next stage of digitization in customer communication. This refers to the extensive automation and standardization of processes, so that "human intervention" is less and less necessary ("dark processing"). Routine tasks such as service invoicing, confirmation of address and tariff changes or appointment scheduling are already handled by software solutions, voice assistants and chatbots based on AI algorithms (self-learning systems).

What's more, even content with a high creative share, such as technical essays and the like, will sooner or later be generated by AI systems. Already today there are programs that can produce simple Wikipedia articles with simple syntax and grammar. You define certain reference points such as structure and keywords (for a text about a city, for example, the number of inhabitants, year of foundation, town twinning and geographical data). The system then retrieves the necessary data from Wikidata, fills it into stored text modules that follow a simple grammar (subject - predicate - object) and merges everything into a finished text.

Many still remember the appearance of Google CEO Sundar Pichai at the I/O developer conference in May 2018, when he introduced the voice assistant "Duplex": the chatbot is able to make phone calls on its own without the person called noticing that they are dealing with an "artificial intelligence".⁴

With other processes, such as the cancellation of an insurance policy or the release of an invoice for more than 50,000 euros, it is certain - partly due to regulatory requirements - that a clerk will continue to review them for the time being. But it is only a matter of time before such sensitive areas are also automated. The more reliable the systems become, the higher the threshold for automated processing can ultimately be set. However, this requires correct handling of the data.

Harald Grumser, founder and CEO of Compart AG, puts it in a nutshell: "Digital processes need access to the content of documents, and artificial intelligence also needs eyes and ears. It is therefore becoming increasingly important to obtain the data required for automation right from the start, to provide it with a structure and to store it correctly."

Documents Are the Human-readable Representation of Data

This applies in particular to document and output management as the interface between classical (paper-bound) and electronic communication. Typically, digital data is converted into analog data on the output side, for example when printing, but also when transforming text content into audio files ("text-to-speech").

On the other hand, there is the situation in the inbox (input management), where exactly the opposite happens: Analog data is converted into electronic documents (e.g. when scanning, but also when converting audio/video files into readable content) - albeit not necessarily in a very high-quality form.

The challenge now is to transform the information and data generated in all areas of inbound and outbound communication into a structured form and store it in the right "data pots" so that it is available for all processes of document and output management - from the capture of incoming messages (input management) to the creation and processing of documents and their output.

It is irrelevant on which digital or analog medium a document is sent or displayed: it is always about the data, because a document is ultimately only a representation of that data in a form readable by humans - whereby a distinction must be made here between non-coded and coded documents.

In this context, two major trends should be mentioned, which are becoming more and more important and have almost replaced other developments:

  • XML (Extensible Markup Language) as a markup language for complex, hierarchical data, and
  • JSON (JavaScript Object Notation) as a compact data format (similar to XML, only simpler), which today is mainly used in web services (see also the glossary).

Both technologies have proven themselves for the description and definition of structured data and will certainly play an even greater role.
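To make the relationship between the two formats concrete, the following minimal sketch describes the same record once in XML and once in JSON and checks that both yield the same structured data. The customer record, its field names and the helper `customer_from_xml` are assumptions made purely for illustration; they do not come from any particular product.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical customer record, once as hierarchical XML ...
xml_text = """
<customer id="4711">
  <name>Erika Mustermann</name>
  <address>
    <city>Berlin</city>
    <zip>10115</zip>
  </address>
</customer>
"""

# ... and once as compact JSON.
json_text = """
{"id": "4711",
 "name": "Erika Mustermann",
 "address": {"city": "Berlin", "zip": "10115"}}
"""


def customer_from_xml(text):
    """Read the (assumed) customer layout from XML into a plain dict."""
    root = ET.fromstring(text.strip())
    return {
        "id": root.get("id"),
        "name": root.findtext("name"),
        "address": {
            "city": root.findtext("address/city"),
            "zip": root.findtext("address/zip"),
        },
    }


from_xml = customer_from_xml(xml_text)
from_json = json.loads(json_text)

# Both notations carry the same structured data.
print(from_xml == from_json)
```

The JSON variant needs no mapping code at all, which is one reason it is popular in web services; XML in turn offers attributes, namespaces and schema languages for more complex documents.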

Data Must Be Checked, Transferred and Stored Correctly

To ensure that the structured data is actually available for automated processing, it is important that it is stored correctly. Here, non-relational (NoSQL) databases, including graph databases and RDF stores, now offer new possibilities. Their great advantage over relational databases is that they can manage data even in very complex contexts and thus enable very specific queries (see also the "Glossary").

One of the best known applications for this is Wikidata, the knowledge database of the online encyclopedia Wikipedia, in which tens of millions of facts are now stored. If, for example, you want to know how many Bundesliga players who were born in Berlin are married to Egyptian women, you will certainly find what you are looking for here. Certainly - a very unusual example, but one that makes the significance of the subject clear.
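How such a query works in principle can be sketched with a handful of facts stored as subject - predicate - object triples, the basic model behind graph and RDF databases. This is only an in-memory illustration: the data, the predicate names and the helper functions are invented, and a real knowledge base such as Wikidata would be queried through its own query service rather than with Python sets.

```python
# Hypothetical facts as (subject, predicate, object) triples.
TRIPLES = {
    ("player_a", "plays_in", "Bundesliga"),
    ("player_a", "born_in", "Berlin"),
    ("player_a", "spouse", "person_x"),
    ("person_x", "citizen_of", "Egypt"),
    ("player_b", "plays_in", "Bundesliga"),
    ("player_b", "born_in", "Hamburg"),
    ("player_b", "spouse", "person_y"),
    ("person_y", "citizen_of", "Egypt"),
    ("player_c", "plays_in", "Bundesliga"),
    ("player_c", "born_in", "Berlin"),
}


def objects(subject, predicate):
    """All objects o for which (subject, predicate, o) is a stored fact."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    """All subjects s for which (s, predicate, obj) is a stored fact."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def berlin_born_players_with_egyptian_spouse():
    # Intersect two simple patterns, then follow the "spouse" edge.
    candidates = subjects("plays_in", "Bundesliga") & subjects("born_in", "Berlin")
    return sorted(
        player
        for player in candidates
        if any(
            "Egypt" in objects(spouse, "citizen_of")
            for spouse in objects(player, "spouse")
        )
    )


result = berlin_born_players_with_egyptian_spouse()
print(result)  # ['player_a']
```

The point of the sketch is that the question is answered by chaining relationships between entities, not by joining predefined tables.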

The aim is to derive new connections and knowledge from structured data by means of algorithms (ontologies). This is where artificial intelligence (AI) comes into play, which can then be used to formulate complex queries.

A further important topic in this context is that stored structured data must be checked - something that is often not done today. An XML schema, for example, is a proven means of verifying the correctness and completeness of an XML file. Errors caused by unchecked data can be very serious.
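The idea of checking data before it is processed can be illustrated with a small hand-written check. Note that this is deliberately not XML Schema validation, which requires a schema-aware library; it is a simplified stand-in using only the Python standard library. The rule table `REQUIRED`, the element names and the five-digit postal code rule are assumptions for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: element path -> check applied to the element's text.
REQUIRED = {
    "name": lambda v: len(v) > 0,
    "address/city": lambda v: len(v) > 0,
    "address/zip": lambda v: v.isdigit() and len(v) == 5,
}


def validate(xml_text):
    """Return a list of error messages; an empty list means the record passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    errors = []
    for path, is_valid in REQUIRED.items():
        value = root.findtext(path)
        if value is None:
            errors.append(f"missing element: {path}")
        elif not is_valid(value.strip()):
            errors.append(f"invalid value in {path}: {value!r}")
    return errors


good = ("<customer><name>Erika Mustermann</name>"
        "<address><city>Berlin</city><zip>10115</zip></address></customer>")
bad = ("<customer><name>Erika Mustermann</name>"
       "<address><zip>1011</zip></address></customer>")

print(validate(good))  # []
print(validate(bad))
```

A record that fails such a check can be rejected or routed to a clerk before it causes errors in downstream processes.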

Consistent data verification is therefore essential. Last but not least, it must also be possible to convert data from one format into another using rules. There are many options for this today; one of the best known is certainly the transformation language XSLT. But there are also other rule languages.
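What "conversion using rules" means can be shown with a small mapping table that is kept separate from the code applying it. This is not XSLT, only a rough Python analogue of the same idea; the rule table `RULES`, the target key names and the sample record are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical mapping rules: target JSON key -> path in the source XML.
RULES = {
    "customerName": "name",
    "city": "address/city",
    "postalCode": "address/zip",
}


def transform(xml_text, rules):
    """Apply the mapping rules to an XML record and return a flat dict."""
    root = ET.fromstring(xml_text)
    return {target: root.findtext(path) for target, path in rules.items()}


source = ("<customer><name>Erika Mustermann</name>"
          "<address><city>Berlin</city><zip>10115</zip></address></customer>")

result = transform(source, RULES)
print(json.dumps(result, indent=2))
```

Because the rules are data rather than program logic, a new target format only requires a new rule table, not new code.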

Instead of Destroying Content...

Anyone who wants to further increase the degree of automation of processes in customer communication in the sense of the next stage of digitization must ensure structured, consistent and centrally available data. For document and output management, this means preserving the content of documents as completely as possible right from the start instead of destroying it - as is often observed in the electronic inbox of companies, for example.

The problem here: in many companies, incoming e-mails are still "rasterized", i.e. converted into an image format, only for parts of the document content to be made interpretable again afterwards by means of OCR technology. This is the "darkest document Middle Ages". It wastes resources unnecessarily, especially when you consider that e-mail attachments today can be quite complex documents with tens of pages.

Above all, however, this media discontinuity is tantamount to a worst-case "data disaster": electronic documents (e-mails), which in themselves could be read and processed by IT systems, are first converted into TIFF, PNG or JPG files. Content thus turns into "pixel clouds". In other words, the actual content is first reduced to raster images and then laboriously made "readable" again using Optical Character Recognition (OCR). This is accompanied by the loss of semantic structural information, which is necessary for later reuse.

How nice would it be, for example, if you could convert e-mail attachments of any type into structured PDF files immediately after receipt? This would lay the foundation for long-term, audit-proof archiving; after all, the conversion from PDF to PDF/A is only a small step.

...Better to Preserve It as the Basis for Further Automation

Consider the following example: a leading German insurance group receives tens of thousands of e-mails daily via a central electronic mailbox, both from end customers and from external and internal sales partners. Immediately after receipt, the system automatically "triggers" the following processes:

  • Conversion of the actual e-mail ("body") to PDF/A
  • Individual conversion of the e-mail attachment (e.g. various Office formats, image files such as TIFF, JPG, etc.) to PDF/A
  • Merging of the e-mail body with the corresponding attachments and generation of a single PDF/A file per business transaction
  • At the same time, all important information is read from the file (extracted) and stored centrally for downstream processes (e.g. generation of reply letters on an AI basis, case-closing processing, archiving).
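The first half of such a workflow, reading an incoming e-mail directly instead of rasterizing it, can be sketched with Python's standard `email` package. The sample message, the addresses and the shape of the resulting record are assumptions for this illustration, and the actual conversion to PDF/A is only marked as a placeholder because it requires a dedicated conversion component.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample_email():
    """Create a hypothetical incoming e-mail with one attachment."""
    msg = EmailMessage()
    msg["From"] = "customer@example.com"
    msg["To"] = "inbox@example.com"
    msg["Subject"] = "Change of address, policy 12345"
    msg.set_content("Please update my address to Musterstrasse 1, 10115 Berlin.")
    msg.add_attachment(
        b"%PDF-1.7 placeholder bytes",
        maintype="application",
        subtype="pdf",
        filename="form.pdf",
    )
    return msg.as_bytes()


def process(raw_bytes):
    """Extract body, metadata and attachments without any image conversion."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    record = {
        "sender": str(msg["From"]),
        "subject": str(msg["Subject"]),
        "body": body.get_content().strip() if body else "",
        "attachments": [],
    }
    for part in msg.iter_attachments():
        payload = part.get_payload(decode=True)
        record["attachments"].append({
            "filename": part.get_filename(),
            "content_type": part.get_content_type(),
            "size": len(payload),
        })
        # Placeholder: a real pipeline would pass `payload` to a PDF/A
        # converter here and merge the results into one file per transaction.
    return record


record = process(build_sample_email())
print(record["subject"], record["attachments"])
```

Because the text and the attachment metadata are read from the message itself, nothing has to be recovered by OCR, and the extracted record can be stored centrally for downstream processes.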

Everything runs automatically and without media discontinuity. The clerk receives the document in a standardized format, without having to worry about preparation (classification, making legible).

The insurer could additionally "split" the workflow into dark and interactive processing. During dark processing, every incoming e-mail plus attachment is automatically converted into a PDF/A file, transferred to the clerk and finally archived.

Interactive processing, on the other hand, involves the "intelligent" compilation of e-mail documents of different file formats into an electronic dossier (customer file/process). The clerk first opens the e-mail and the attachment in the mail client (Outlook, Lotus Notes, etc.) or in a specialist case-processing application and decides what needs to be processed. The normal workflow then applies as with dark processing: conversion - forwarding - processing - archiving.

The interactive variant is particularly useful if not all documents have to be archived. Modern input management systems are now capable of automatically recognizing all common formats of e-mail attachments and converting them into a predefined standard format (e.g. PDF/A or PDF/UA). In addition, they extract all necessary data from the documents at the same time and store it centrally.

Such scenarios can be implemented, for example, with systems such as DocBridge® Conversion Hub, whose linchpin is a central conversion instance. Its core is a kind of "dispatcher", which analyses every incoming message (e-mail, fax, SMS, messenger service, letter/paper) and automatically decides which format is optimal for the document in question and how further processing is to take place. DocBridge® Conversion Hub also includes an OCR (Optical Character Recognition) function for extracting content and metadata.

¹ CIO online, 09/23/2019 („KI ebnet den Weg zu unstrukturierten Informationen“)
² Ibid.
³ Ibid.
⁴ The example of an agreement for a hairdresser's appointment showed the new dimension of intelligent speech systems such as "Duplex": Previous systems can usually be recognized as "robots" within a few words (unnatural sounding voice, wrong emphasis, choppy sentences, wrong or no response to requests). Not so AI tools of the new generation: They are quite able to capture content with complex syntax and "talk" so skilfully with people that they do not notice who or what their counterpart is.

See also https://www.compart.com/en-US/digital-inbox-inbound-communication


