These days, many cultural institutions (research and public libraries as well as state, private and church archives) are digitizing valuable cultural assets such as books, prints and maps. Besides making the material available for broad public or scholarly use, digitization protects valuable originals from direct handling, so that they can be preserved and stored securely under optimal environmental conditions.
In addition, the aim is to digitize these originals at high quality and resolution, ideally at the highest quality and resolution that current technology allows. In Germany, the rules of the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) for retrospective digitization require black-and-white originals to be scanned at a minimum resolution of 600 ppi, and grayscale and colour originals at a minimum resolution of 300 ppi.
Particularly for very valuable originals, the goal is the highest resolution that current technology permits, so that the widest possible range of uses can be offered. Take the Beethoven-Haus in Bonn as an example. When digitizing documents on site, SRZ used a particularly high-quality, high-resolution scanner. Its capture head adapts to the size of the original, so that the full resolution of the camera is used for each document. The resulting digital masters were saved as uncompressed TIFF files and may be several hundred megabytes in size. They serve as the source format for derivatives for various applications, such as printouts and web display.
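The file sizes mentioned above follow directly from the scan parameters. As a rough illustration, the Python sketch below estimates the raw size of an uncompressed scan from its physical dimensions, resolution and bit depth. The function name and the 30 cm x 40 cm sheet are assumptions chosen for the example, not figures from the Beethoven-Haus project, and the estimate ignores file headers and row padding.

```python
CM_PER_INCH = 2.54


def uncompressed_size_bytes(width_in, height_in, ppi, bits_per_pixel):
    """Estimate the raw pixel data size of a scan (no headers, no compression)."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Hypothetical original: 30 cm x 40 cm, 24-bit colour, scanned at 600 ppi.
size = uncompressed_size_bytes(30 / CM_PER_INCH, 40 / CM_PER_INCH, 600, 24)
print(f"about {size / 1_000_000:.0f} MB uncompressed")
```

Under these assumptions a single colour page already comes to roughly 200 MB, which is why derivatives in compressed formats are generated for everyday use.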
But what kind of data is created when documents from libraries and archives are digitized? Put simply, the first kind is image data. Because these images are meant to be presented in context, users also need information about a document's origin, its temporal and contextual relationships, its creators and its current location.
In addition to image data, a whole range of other information is therefore collected. This begins with the bibliographic metadata, in other words the descriptive data for the document, such as the author or creator, the place and date of publication, publisher, printer, edition, etc.
Then there is the metadata that relates to contents and structure. This involves, for example, recording an existing abstract or writing a new one. In addition, it is now common to run OCR on all suitable documents and to save the results in uncorrected form. This provides the basis for running a fuzzy search over the text and for highlighting search results in the facsimile when it is presented on the web or in other applications. However, this only works if the positions of the recognized words on the page are saved as well.
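To make the last point concrete, here is a minimal Python sketch of OCR output that keeps word positions, together with a simple fuzzy lookup. The `OcrWord` record, the `fuzzy_find` helper, the sample words and the 0.8 threshold are all hypothetical, and the standard library's `difflib` merely stands in for whatever search engine a real presentation system would use.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class OcrWord:
    """One recognized word and its bounding box on the page image (pixels)."""
    text: str
    page: int
    x: int
    y: int
    width: int
    height: int


def fuzzy_find(words, query, threshold=0.8):
    """Return the words whose text is similar enough to the query."""
    query = query.lower()
    hits = []
    for word in words:
        score = SequenceMatcher(None, query, word.text.lower()).ratio()
        if score >= threshold:
            hits.append(word)
    return hits


words = [
    OcrWord("Sonate", 1, 120, 340, 210, 48),
    OcrWord("Beethouen", 1, 350, 340, 260, 48),  # uncorrected OCR error
    OcrWord("Bonn", 1, 630, 340, 130, 48),
]
hits = fuzzy_find(words, "Beethoven")
```

Each hit carries the page number and bounding box, which is exactly what a viewer needs to draw a highlight over the facsimile, even though the OCR text itself contains an error.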
Structural metadata is created, for example, by recording tables of contents and linking their entries to the physical start of the chapters in the image data, or to other parts of the work such as the index, index of places, index of persons, graphics, volumes and so on. To do this, you must map the existing pagination of the work to the physical files, and you must also specify structural elements such as titles, headings and similar elements. Structural elements may be recorded all the way down to page areas such as margins, images or footnotes.
It is also normal to gather technical metadata about the creation and the physical attributes of the digital representation, in order to document the history of the digital documents. This metadata includes, for example, resolution, bit depth, compression, recording date, the institution that gathers and owns the information, scan software, scanner hardware and similar information.
Today, all of this descriptive, content-related and structural metadata is gathered and saved in a specific XML schema. The schemas most widely used around the world are the Metadata Encoding and Transmission Standard (METS) for libraries and Encoded Archival Description (EAD) in the world of archives (see http://www.loc.gov/standards/mets/ and http://www.loc.gov/ead/).
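As an illustration of how pagination can be mapped to physical files in such a schema, the following Python sketch builds a heavily simplified METS skeleton with a file section and a physical structure map. The `build_mets` helper, the file names and the ID pattern are assumptions made for this example; a real METS document would also carry descriptive and administrative sections and should be validated against the official schema.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)


def build_mets(pages):
    """pages: list of (file_name, printed_page_label) in physical order."""
    root = ET.Element(f"{{{METS}}}mets")
    file_sec = ET.SubElement(root, f"{{{METS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS}}}fileGrp", USE="MASTER")
    struct_map = ET.SubElement(root, f"{{{METS}}}structMap", TYPE="PHYSICAL")
    volume = ET.SubElement(struct_map, f"{{{METS}}}div", TYPE="volume")
    for order, (file_name, label) in enumerate(pages, start=1):
        file_id = f"FILE_{order:04d}"  # hypothetical ID pattern
        file_el = ET.SubElement(
            file_grp, f"{{{METS}}}file", ID=file_id, MIMETYPE="image/tiff"
        )
        ET.SubElement(
            file_el,
            f"{{{METS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK}}}href": file_name},
        )
        # ORDER is the physical sequence, ORDERLABEL the printed page number.
        page = ET.SubElement(
            volume, f"{{{METS}}}div",
            TYPE="page", ORDER=str(order), ORDERLABEL=label,
        )
        ET.SubElement(page, f"{{{METS}}}fptr", FILEID=file_id)
    return root


root = build_mets([("0001.tif", "i"), ("0002.tif", "1")])
xml_text = ET.tostring(root, encoding="unicode")
```

The structure map keeps the physical order of the images separate from the printed page labels, so a front-matter page numbered "i" and the first text page numbered "1" can both point to the correct scan.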
Various compression methods are used for image data. Bitonal images are usually saved as TIFF files with lossless fax Group 4 compression. TIFF is also used for digital masters in grayscale and colour, which are stored either uncompressed or with lossless LZW compression. Derivatives for different purposes use formats such as JPEG, GIF and PNG at various resolutions. The JPEG 2000 format, which was approved as an ISO standard at the start of this decade, is becoming more and more popular because it achieves considerably higher compression at noticeably better quality than traditional JPEG. JPEG 2000 also offers a lossless variant.
You don't just want to archive colour images of valuable originals at the highest possible resolution; you also want to reproduce the colours on screen and in print so that the human eye perceives them as matching the original. This colour fidelity is achieved through colour management with colour profiles. Colour charts, whose RGB or CMYK colours are stored as numerical reference values, are used to measure how a capture device deviates from the standard, and these deviations are saved. This then becomes the so-called colour profile. The profile is attached to the associated image, either embedded in the image or as a separate file. The deviations of output devices such as printers and monitors from the standard are determined and saved in the same way, so that their output can be adjusted by comparison with the reference.
Let's first summarize which data is involved in digitization and should be taken into account during long-term archiving:
For the long-term archiving of each library or archive unit, the data, saved in different formats, is combined into a single technical unit, for example a TAR archive, and then saved to a suitable archive medium.
To check the integrity of the data at a later point, an additional checksum file (generated with a suitable checksum algorithm) is usually saved alongside it.
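The packaging and integrity check described above can be sketched in a few lines of Python using only the standard library. The function names, the `.sha256` file suffix and the choice of SHA-256 are assumptions for this example; the text does not prescribe a particular checksum algorithm. The TAR file must be written outside the directory that is being packaged.

```python
import hashlib
import tarfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_unit(source_dir, tar_path):
    """Bundle a directory into an uncompressed TAR and write a checksum file."""
    source_dir = Path(source_dir)
    tar_path = Path(tar_path)
    with tarfile.open(tar_path, "w") as archive:
        archive.add(source_dir, arcname=source_dir.name)
    checksum_path = tar_path.with_name(tar_path.name + ".sha256")
    checksum_path.write_text(f"{sha256_of(tar_path)}  {tar_path.name}\n")
    return checksum_path


def verify_unit(tar_path):
    """Recompute the checksum and compare it with the stored value."""
    tar_path = Path(tar_path)
    checksum_path = tar_path.with_name(tar_path.name + ".sha256")
    expected = checksum_path.read_text().split()[0]
    return sha256_of(tar_path) == expected
```

Calling `package_unit("unit_0001", "unit_0001.tar")` would produce the archive and a separate `unit_0001.tar.sha256` file, and `verify_unit("unit_0001.tar")` returns `False` as soon as a single byte of the archive changes. Note that the archive and its checksum remain two separate files that have to be kept together.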
We are clearly dealing with a rather complex information package: it contains formats that differ greatly, it has to consist of two units of information (the archive and its checksum file), and it cannot necessarily be read by every TAR program, particularly in the Windows world.
In contrast to all of the formats mentioned above, PDF/A is fully documented and is a defined ISO standard, the first ISO standard that is not subject to any time restrictions. A PDF/A file is a normal PDF that can be opened and displayed correctly by any program that can display PDF. PDF/A does not depend on any operating system, because PDF readers exist for almost every operating system environment.
How does PDF/A cope with the mixed bag of information that makes up a digital representation, as described above?
An additional advantage is that the full text obtained through OCR is not only saved in the PDF/A file but can also be placed as a searchable layer behind the page image, so that search hits can be highlighted in the facsimile in a user-friendly way.
The data integrity check can be built directly into the PDF/A file by means of digital signatures of various levels of assurance (from simple to qualified), rather than being kept as separate information.
Independent of any device or operating system: Can be reliably displayed on various systems and devices
Self-contained: Contains all of the components that are required to display the data
Self-documenting: Contains descriptions for the integrated data
Freely accessible: Does not contain any technical access protection
Openly available: The authoritative format specification is publicly available in full
Wide distribution: Wide usage is perhaps the best protection for the readability of long-term archives (see http://www.aiim.org/documents/standards/19005-1_FAQ.pdf)
My talk on retrospective digitization using PDF/A mentions some examples:
All evidence suggests that the considerable advantages that PDF/A generally has over other data formats for long-term archiving will lead to the wider adoption of this standard. More and more applications whose data must be archived securely for long periods use PDF/A.