Public and private enterprises like to keep up with the times; they launch projects and ride on the crest of the digitization wave. Infrastructures for centralizing archives and worldwide online research are created to raise productivity and lower costs, as is always the case with these projects. But do we actually have digitization under control? Are we not creating new risks? What do we need to know to prevent projects from turning into nightmares?
Suitable scanners and corresponding software are prerequisites for keeping digitization projects from turning into nightmares. A good consultant to define the ideal workflow is also an advantage. But basic knowledge of the digitization process is undoubtedly helpful in making projects a success. This article provides an insight into some aspects of specialized digitization software and is intended to facilitate its selection and use in concrete projects.
What does scan software do?
Scan software carries out a number of steps along the path from a paper document to an archivable one, regardless of architecture and scope; some of these steps are optional, whilst others remain invisible to the user.
Image acquisition: The scanner creates a black-and-white or color raster image of the scanned paper and hands it over to the scan software via a TWAIN, ISIS or FAX interface. The format and resolution of the raster image are selected at this point. Documents received by fax hardly differ from scanned documents and can usually be processed using the same software.
Automatic image processing: Images can be prepared for a quality inspection: blotches and empty pages are removed and the brightness and contrast adjusted to achieve optimum legibility, to name but a few of the steps in this process.
Quality inspection: The scan operator can carry out a visual inspection, intervene where necessary and repeat the scanning process for individual pages or the complete batch. Simple classification data such as the batch number are often entered at this point (operator workstation).
Text recognition and barcodes: The conditioned images are now processed by OCR software (OCR = Optical Character Recognition). The pages are first rotated to the reading direction, after which the text and barcodes are recognized and allocated to the images.
Classification: The text and barcodes recognized by the software can be used to classify the document. It can differentiate between invoices, delivery notes and other transaction documents, for instance, or assign a tax declaration to the declaring person. This step in the process can be carried out manually (index workstation) if automatic classification is partially or entirely impossible.
Metadata input (indexing): Information from manual classification, barcodes and other sources is summarized as metadata (index data) and assigned to the documents.
Segmentation and compression: Scanned raw image data requires considerable storage space (around 45 MB for one A4 page in color at 400 dpi). Efficient compression processes reduce the amount of data significantly (to around 200 kB). A special process known as MRC (Mixed Raster Content) can reduce the data further still (to around 20 kB). This process is based on segmentation: splitting the image into individual components such as background, text and photos.
PDF/A generation: The processed and compressed images of each page, the recognized text and the metadata are combined with the scanner's color characterization (ICC color profile) to generate a PDF/A document. Metadata is often subjected to additional separate processing (index file).
Digital signature: A digital signature can be applied so that the document's condition at the time of receipt can be legally verified.
Validation: The conformity of the generated document with the PDF/A standard and the validity of the digital signature can be verified and the results documented in a log.
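The figures quoted in the segmentation and compression step can be checked with a little arithmetic. The following is a minimal sketch, assuming 24-bit color (3 bytes per pixel) and the standard A4 dimensions of 210 x 297 mm; the function name and the constants for the compressed sizes are illustrative and simply restate the approximate values given above.

```python
# Rough size estimate for an uncompressed color scan of one A4 page.
# Assumptions (not from the article): 24-bit RGB, i.e. 3 bytes per pixel.

MM_PER_INCH = 25.4
A4_MM = (210.0, 297.0)  # width, height of an A4 page in millimetres


def raw_image_bytes(width_mm, height_mm, dpi, bytes_per_pixel=3):
    """Return the uncompressed size in bytes of a page scanned at `dpi`."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px * bytes_per_pixel


raw = raw_image_bytes(*A4_MM, dpi=400)

# Approximate compressed sizes quoted in the text.
CONVENTIONAL_BYTES = 200_000  # "around 200 kB"
MRC_BYTES = 20_000            # "around 20 kB" with Mixed Raster Content

print(f"raw size:           {raw / 1_000_000:.1f} MB")
print(f"conventional ratio: about {raw // CONVENTIONAL_BYTES}:1")
print(f"MRC ratio:          about {raw // MRC_BYTES}:1")
```

Under these assumptions the raw page comes to roughly 46 million bytes, which agrees with the "45 MB" order of magnitude. The quoted compressed sizes then correspond to reductions by factors in the hundreds and thousands respectively.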
The product of digitization: PDF/A
PDF/A is an ISO standard for using the PDF format in the long-term storage of electronic documents. Its first part was published on October 1, 2005 as ISO 19005-1. The standard defines a file format based on PDF, called PDF/A, that represents electronic documents in such a way that their visual appearance is preserved over an extended period, independent of the tools and systems used to produce, save and reproduce them. PDF/A is not a new format; rather, the standard defines the requirements that PDF documents need to fulfill for reliable long-term storage. Parts 2 and 3 of the standard have since been published to ensure the format stays abreast of developments.
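A PDF/A file states which part of the standard it claims to follow in its embedded XMP metadata, using the PDF/A identification properties `pdfaid:part` and `pdfaid:conformance`. The sketch below reads that claim from an XMP packet that has already been extracted from the file. The function name and the sample packet are illustrative, and extracting the packet from the PDF is not shown. Reading the declaration is not validation: it only reports what the file claims, not whether the file actually conforms.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"  # PDF/A identification schema


def declared_pdfa_level(xmp_packet):
    """Return (part, conformance) as declared in the XMP, or None if absent.

    XMP allows simple properties to be written either as child elements
    or as attributes of rdf:Description, so both forms are checked.
    """
    root = ET.fromstring(xmp_packet)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        part = (desc.get(f"{{{PDFAID_NS}}}part")
                or desc.findtext(f"{{{PDFAID_NS}}}part"))
        conformance = (desc.get(f"{{{PDFAID_NS}}}conformance")
                       or desc.findtext(f"{{{PDFAID_NS}}}conformance"))
        if part:
            return int(part), (conformance or "").strip().upper()
    return None


# Hypothetical sample packet for illustration only.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
      <pdfaid:part>2</pdfaid:part>
      <pdfaid:conformance>B</pdfaid:conformance>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

print(declared_pdfa_level(SAMPLE_XMP))
```

A check like this can serve as a quick plausibility test when sorting incoming files; a full conformity check still requires a dedicated PDF/A validator.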
Doesn't the popular TIFF format offer the same features? Yes, at first glance. Both formats can store scanned raster images. However, PDF/A is the more up-to-date format and offers numerous advantages. The most important are:
Architecture: Local or central?
The choice of architecture depends greatly on the type, scope and regularity of processing. A simple multi-function scanner with integrated scan software is sufficient for occasional personal use. Such circumstances hardly call for comprehensive digitization projects. The question is rather how to combine all the different needs of an enterprise to create a uniform scan strategy. Multi-functional devices (MFPs) located in each department cater for the personal needs of employees, whilst scanning lanes with batch scanners in service centers regularly process high document quantities. The specialized software for each scanner is usually installed locally, often as a part of the device itself. This could be a reason for the growing popularity of multi-functional devices. However, local solutions are less popular for high-performance scanners because they are expensive and their decentralized architecture can slow down processing. Complex and costly scan software functions are therefore centralized to counteract these problems. The functions are often split as follows:
This distribution increases the scalability of the architecture, which in turn results in lower acquisition and operating costs and greater throughput for high document volumes.
Source: www.bit-news.de, February 2013
Author: Dr. Hans Bärfuss (PDF Tools AG)