RPA-based office workflows using PDF

Dietrich von Seggern // August 12, 2020

Robots are already well established in the manufacturing industry, where they are programmed to know exactly what they have to do, while people are present only to control and monitor the automated processes. For this to work smoothly, the processes and the materials being processed have to be exactly the same every time. In other words, they must be standardized. Only then can the robot 'colleague' build the same part in the same place on a production line, for example for a vehicle.

In everyday office life, however, RPA (robotic process automation) involves no physical robot. RPA tools are software solutions that 'look over people's shoulders' and learn from their work processes, but that only works if the processes are standardized. To achieve this, the files to be processed should be as homogeneous as possible. In practice, this is not always easy to achieve. After all, the goal is to process as many files as possible using the same automated processes, ideally including externally supplied data. For this reason, it often makes sense to put a standardization (or, better in this case, a 'normalization') step for the data at the beginning of the process. This will often begin with format conversions, e.g. Office to PDF. To ensure a certain quality for all PDFs, it may also make sense to normalize them to PDF/A first.
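The normalization step described above can be sketched as a small routing function. This is only an illustration: the function name, the extension list and the step names are hypothetical, and the actual conversions would be carried out by whatever conversion tools the workflow uses.

```python
from pathlib import PurePath

# Hypothetical list of Office extensions that need conversion to PDF first.
OFFICE_EXTENSIONS = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".odt"}


def plan_normalization(filename: str, require_pdfa: bool = True) -> list[str]:
    """Return the ordered normalization steps for one incoming file.

    The step names are placeholders for real conversion jobs.
    """
    suffix = PurePath(filename).suffix.lower()
    steps: list[str] = []
    if suffix in OFFICE_EXTENSIONS:
        steps.append("convert-office-to-pdf")
    elif suffix != ".pdf":
        # Unknown formats go to a person instead of being guessed at.
        return ["manual-review"]
    if require_pdfa:
        steps.append("convert-to-pdfa")
    return steps


if __name__ == "__main__":
    for name in ("Invoice.DOCX", "scan.pdf", "photo.tiff"):
        print(name, plan_normalization(name))
```

Sending unknown formats to manual review rather than guessing keeps the automated path limited to inputs that are known to be homogeneous.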

Most will agree that PDF is a really good, if not the only sensible, candidate for the basic format of file-related RPA processes - at least when files are provided from outside. But there can be requirements that go beyond the format itself and concern the quality of the PDF files. One example is scanned files saved as PDF. Without OCR, they contain no searchable text and are nothing more than a pixel-filled image. As a rule, RPA applications cannot do much with them, other than run OCR on them, with its known limitations. In PDF, displaying characters on the monitor, which is essential for electronic paper, is separate from deriving the meaning of the characters (their semantics). For example, to render a lowercase 'c' on the monitor, its shape in the respective font is required. However, for the 'c' to be found by text search and correctly interpreted when copied, its meaning must also be recorded: the glyph has to be mapped to the Unicode character 'Latin small letter c'. If an RPA process relies on text search, it must be ensured that these semantics are actually present in the files.
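The difference between showing a glyph and knowing its meaning can be illustrated with a toy model. It does not parse real PDF files: the glyph codes and the mapping table are hypothetical stand-ins for the codes a font uses and the Unicode mapping a PDF may or may not carry.

```python
import unicodedata


def extract_text(glyph_codes, to_unicode):
    """Map glyph codes to Unicode characters.

    Returns None as soon as one code has no mapping, because the text
    is then not reliably searchable or copyable.
    """
    chars = []
    for code in glyph_codes:
        if code not in to_unicode:
            return None
        chars.append(to_unicode[code])
    return "".join(chars)


if __name__ == "__main__":
    shown = [1, 2, 3]                       # glyph codes drawn on the page
    mapping = {1: "c", 2: "a", 3: "t"}      # hypothetical code-to-Unicode table

    print(extract_text(shown, mapping))     # text is recoverable
    print(extract_text(shown, {}))          # glyphs render, but no semantics
    print(unicodedata.name("c"))            # the Unicode name of the character
```

In both calls the page would look identical on screen; only the first file could be searched for the word.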

Beyond text semantics, metadata embedded in a PDF file can be a very important tool for advanced RPA processes. In PDF, metadata can be associated with the document, with individual pages, or even with page components. Processing information can be derived from a database and, if the documents come from an external source, added to the PDF as metadata with detailed descriptions that RPA can use. The software robot then reads the information contained in the PDF and processes the file accordingly. This approach is standard in today's prepress production workflows!
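As a rough sketch of how such processing information might be carried, the following builds and reads back an RDF/XML fragment of the kind used in XMP metadata. The 'rpa' namespace URI and the property names are assumptions made up for this example, and embedding the result in a PDF, which a PDF library would have to do, is not shown.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RPA_NS = "http://example.com/ns/rpa/1.0/"  # hypothetical custom namespace


def build_processing_metadata(properties: dict[str, str]) -> str:
    """Serialize processing hints as a simple RDF description."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("rpa", RPA_NS)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": ""}
    )
    for name, value in properties.items():
        ET.SubElement(desc, f"{{{RPA_NS}}}{name}").text = value
    return ET.tostring(rdf, encoding="unicode")


def read_processing_metadata(xml_text: str) -> dict[str, str]:
    """Collect all properties from the hypothetical 'rpa' namespace."""
    prefix = f"{{{RPA_NS}}}"
    result: dict[str, str] = {}
    root = ET.fromstring(xml_text)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        for child in desc:
            if child.tag.startswith(prefix):
                result[child.tag[len(prefix):]] = child.text or ""
    return result


if __name__ == "__main__":
    packet = build_processing_metadata(
        {"department": "accounting", "action": "archive"}
    )
    print(packet)
    print(read_processing_metadata(packet))
```

A software robot would only need the reading half; the writing half belongs to the step that looks up the processing information in the database.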

The highest degree of structural information beyond page rendering in PDF is achieved via tagging. From an RPA perspective, tagging can also provide the information needed, for example, to extract content for reuse in a database. Tags (markers) define the semantics of text parts, such as headings, paragraphs, captions or tables. They are also the basis for a defined reading order, e.g. in multi-column layouts. Unfortunately, correctly tagged PDF files are still a rarity today - even though it is undisputed that they are the best-suited basic format for some RPA processes.
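To make the idea of content extraction from tags concrete, here is a minimal sketch that models a tag tree in memory and collects selected text parts in reading order. The `Tag` class and both functions are hypothetical; reading the structure tree from a real PDF would require a PDF library and is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    role: str                 # structure type, e.g. "H1", "P", "Caption"
    text: str = ""
    children: list["Tag"] = field(default_factory=list)


def walk(tag):
    """Yield tags depth-first, which here stands for the logical reading order."""
    yield tag
    for child in tag.children:
        yield from walk(child)


def extract(root, roles):
    """Return (role, text) pairs for all tags whose role is in `roles`."""
    return [(t.role, t.text) for t in walk(root) if t.role in roles]


if __name__ == "__main__":
    doc = Tag("Document", children=[
        Tag("H1", "Quarterly report"),
        Tag("Sect", children=[
            Tag("P", "Revenue rose."),
            Tag("Caption", "Figure 1"),
        ]),
        Tag("P", "Outlook"),
    ])
    for role, text in extract(doc, {"H1", "P"}):
        print(role, text)
```

Because the order comes from the tree and not from positions on the page, the same walk works for multi-column layouts.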

Cross-media publishing of PDF documents is one use case: among other things, companies make their advertising available both as printed brochures and on their website. While the 'ready-to-print' PDF must contain suitably high-resolution images or crop marks, the online PDF involves removing existing print marks, limiting the visible page area, and reducing complexity and file size for fast display. Creating different PDFs from the source layout is often impractical, as there is always the risk of losing last-minute changes. Using RPA, color spaces in the print PDF can be converted for online use, image resolutions can be reduced, pages can be cropped, and complex page areas can be converted to images in advance to enable display at acceptable speed - even on older tablets.
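The arithmetic behind two of these steps, reducing image resolution and cropping the page, can be sketched as follows. The function names and the example values are assumptions for illustration; the only fixed fact used is that PDF page units are points, with 72 points to the inch.

```python
import math

POINTS_PER_INCH = 72


def effective_ppi(pixel_width: int, placed_width_pt: float) -> float:
    """Resolution of an image at the size it is placed on the page."""
    return pixel_width / (placed_width_pt / POINTS_PER_INCH)


def downsampled_width(pixel_width: int, placed_width_pt: float,
                      target_ppi: float) -> int:
    """Pixel width after reducing to target_ppi; never upsamples."""
    if effective_ppi(pixel_width, placed_width_pt) <= target_ppi:
        return pixel_width
    return math.ceil(placed_width_pt / POINTS_PER_INCH * target_ppi)


def crop_box(page_box, margin_pt):
    """Shrink (x0, y0, x1, y1) by a margin that holds the print marks."""
    x0, y0, x1, y1 = page_box
    return (x0 + margin_pt, y0 + margin_pt, x1 - margin_pt, y1 - margin_pt)


if __name__ == "__main__":
    # An 1800 px wide image placed 432 pt (6 inches) wide.
    print(effective_ppi(1800, 432))
    print(downsampled_width(1800, 432, target_ppi=150))
    print(crop_box((0, 0, 700, 900), margin_pt=50))
```

Images that are already at or below the target resolution are left unchanged, so the same rule can be applied to every image in the file.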

As mentioned before, RPA with PDF files is already used in prepress workflows at a very advanced and highly automated level. Companies that want to benefit from RPA should first create the conditions for smoothly functioning RPA-based applications. Prepress has additional requirements that office workflows do not, but in both cases the quality of a PDF file has to be checked to see whether it qualifies for automation, and the file has to be normalized for this purpose where possible.


ABOUT THE AUTHORS

Dietrich von Seggern

Dietrich von Seggern received his degree as a printing engineer, and in 1991 started his professional career as head of desktop prepress production in a reproduction house. He became involved in research projects for digital transmission of print files, and moved to the German Newspaper Marketing Organisation (ZMG). There Dietrich was responsible for a project to enable the digital transmission of …

© 2020 Association for Digital Document Standards e.V.