Once Upon a Time Before PDF/A

The history of archiving and formats, from its origins up till the present day.

I would like to share a few facts about the history of archiving and file formats, from their origins up to the present day. I hope you will accept my invitation to a short journey through time.

First stop: 1430 BC – The secret archive

Let’s start our journey through archive formats around 1430 BC. At that time, one of the most famous and mysterious archives of all time was founded. I’m sure everybody remembers the well-known story in which God called Moses and told him something like: “Hey, come up to Mount Sinai! I wrote down 10 commandments for you on two slabs of stone.” Because Moses and his people were on their exodus from Egypt, and transporting two solid slabs was not easy in those times, God gave Moses detailed instructions for building a container to transport and store these important documents.

But two things went really wrong with this archive, known as the Ark of the Covenant:
Everybody who tried to retrieve the documents from the archive died immediately. And I don’t know of any vendor today who has adopted this feature in a current archiving system.

There were some problems with the backup of the documents, and the Ark of the Covenant, including its contents, was probably destroyed in 587 BC. But even today there are still adventurers who continue the search for it.

Now let’s ask the following question: “Was everything bad about this archive?” I don’t think so! Did everybody in the audience catch the fact that Moses’ documents were stored for more than 800 years (from about 1430 BC to 587 BC) without format conversion? More than 800 years without any thought of ISO standardisation or migration.

Not bad!

Second stop: 1986 – A star is born

After a short time warp our journey continues in 1986. In that year, the first computer virus started to spread, the space shuttle Challenger disintegrated 73 seconds after launch and the Chernobyl nuclear plant exploded. But aside from these disasters, a new file format called TIFF (Tagged Image File Format) was developed by Aldus (now part of Adobe Systems), HP and Microsoft. And I think that, when we talk about all the buzzwords like DMS, ECM, EIM or GRC, we must not forget the good old TIFF.

Third stop: 1997 – Project ZERO

When we started with “electronic archiving” in 1997 at the company I was working for, there was only one standardised file format available for scanned documents: TIFF. Fortunately, our users did not make extensive use of such great features as electronic mark-ups, yellow sticky notes on documents or redaction (blackening). What happened was this: TIFF was a standard, but some vendors started to extend it. This resulted in some, let’s call them “standardised-consistency”, problems, and suddenly not every TIFF viewer could display every TIFF. This was bad news for end users but good news for some vendors, because a lossless data migration from one system to another was sometimes impossible. We will talk about migration again later.
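
To illustrate how a "tagged" format can drift between vendors, here is a minimal Python sketch that lists the tag numbers in the first image directory of a TIFF and picks out those in the private range. It rests on assumptions: the TIFF 6.0 specification reserves tag numbers from 32768 upward for private use, but the story above does not say that private tags were the actual cause of our viewer problems. The function names, the sample bytes and the tag number 40000 are hypothetical and chosen only for illustration.

```python
import struct

# TIFF 6.0 reserves tag numbers from 32768 upward for private (vendor) use.
PRIVATE_TAG_START = 32768


def list_tags(data: bytes) -> list[int]:
    """Return the tag numbers found in the first image file directory (IFD)."""
    # The first two bytes select little-endian ("II") or big-endian ("MM").
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("missing TIFF magic number 42")
    (entry_count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    tags = []
    for index in range(entry_count):
        # Each IFD entry is 12 bytes long; the tag number is its first 2 bytes.
        start = ifd_offset + 2 + 12 * index
        (tag,) = struct.unpack(order + "H", data[start:start + 2])
        tags.append(tag)
    return tags


def private_tags(data: bytes) -> list[int]:
    """Return only the tags that fall in the private range."""
    return [tag for tag in list_tags(data) if tag >= PRIVATE_TAG_START]


# Hypothetical sample: a header plus one IFD with two baseline tags
# (256 = ImageWidth, 257 = ImageLength) and one made-up private tag (40000).
sample = (
    b"II" + struct.pack("<HI", 42, 8)
    + struct.pack("<H", 3)
    + struct.pack("<HHII", 256, 3, 1, 100)
    + struct.pack("<HHII", 257, 3, 1, 50)
    + struct.pack("<HHII", 40000, 3, 1, 1)
    + struct.pack("<I", 0)
)

print(list_tags(sample))
print(private_tags(sample))
```

A viewer that only understands the baseline tags has no way to interpret what a vendor stored under a private tag, which is one plausible way for two "standard" TIFF products to disagree about the same file.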

Fourth stop: 1998 – The big files

In 1998 we started to archive something called “COLD” data (Computer Output to Laser Disc). Banks have many large lists, generated by mainframe programs. These lists were formerly stored on microfilm, and you needed special hardware to read the data on the films. Storing the lists electronically was, in some cases, a very tricky project because some lists had a file size greater than 2 gigabytes. This caused problems in older Windows file systems. And even if you had highly standardised ASCII data, not every viewer would show you the data the same way.
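
A short piece of arithmetic shows why 2 gigabytes is such a typical breaking point. The sketch below assumes that the file size or offset is held in a signed 32-bit integer, which is one common source of a 2 GB limit; the story above does not say which limit we actually hit, and the function name is hypothetical.

```python
# Largest value a signed 32-bit integer can hold: 2,147,483,647.
INT32_MAX = 2**31 - 1


def fits_signed_32bit(size_in_bytes: int) -> bool:
    """Return True if a file size can be stored in a signed 32-bit integer."""
    return 0 <= size_in_bytes <= INT32_MAX


two_gib = 2 * 1024**3  # 2,147,483,648 bytes

print(fits_signed_32bit(two_gib - 1))  # one byte below 2 GiB
print(fits_signed_32bit(two_gib))      # exactly 2 GiB
```

Under that assumption, a list of exactly 2 GiB is one byte too large for any component along the chain that counts bytes in a signed 32-bit field.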

Fifth stop: 2002 – Legal impacts

In 2002 a new legal regulation called “GDPdU” (Grundsätze zum Datenzugriff und zur Prüfbarkeit digitaler Unterlagen – principles for accessing and verifying digital documents) was enacted in Germany. Many vendors tried to use this regulation for marketing purposes: “Electronic archiving is mandatory in Germany”, you could read on many stands at DMS 2002 and 2003.

In a new project, we implemented these regulations in our company. We found 15 IT systems that produced tax-relevant data. According to the GDPdU, we would have to keep these data for about 10 years in the original system or export them to another system with at least the same reporting capabilities. In addition, the tax auditor had the right to inspect and export the data. To solve this problem, we decided to store all tax-relevant data in our archiving system and make them analysable there. With Wincor Nixdorf and its software “Taxnet” we found a solution that met all our requirements. We also used this project to redesign our complete archive infrastructure.

Sixth stop: 2003 – First PDF in archive

In 2003 we received a request to scan checks. Scanning checks in black and white is not really a good idea, so we scanned them in greyscale. Up to 200 checks were included in a single multi-page document, because the checks could not be uniquely indexed. Because the JPEG file format does not support multi-page images, TIFF was under discussion again. But TIFF resulted in large files, which were difficult to handle in every viewer we tested. At that time, I saw a presentation at the DMS fair about highly compressed PDF documents. The result: two years before PDF/A was officially born, we bought software from LuraTech to compress our greyscale and colour images into PDF.

Seventh stop: 2004 – Goodbye to optical disks

Let’s take a closer look at the topic of migrating archives again, because today you don’t have the same luck that Moses did: a period of over 800 years without data migration. When we started in 1997, optical media in jukeboxes were the way to go. Every expert advised us to buy WORM media only from vendors who guaranteed a lifetime of at least 50 years. And again the idea sounded very good: store TIFF on optical disks and you won’t have to touch them before 2047. Now, twelve years later, all maintenance agreements have been terminated and the technology will die out in the next few years. In 2004 we started to migrate all our data from optical disks to Centera CAS (Content Addressed Storage). If you ask me today, it was a good decision: much faster than optical media in jukeboxes, no more cache partitions needed and considerably more secure, especially if you want to store data at more than one location. Unfortunately PDF/A had not yet been released at that time, so we copied the TIFF files from the old media to the new media.
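
For readers who have not met the term, the idea behind content addressed storage can be sketched in a few lines of Python. This is a generic illustration under my own assumptions, not the Centera interface: the class name, the in-memory dictionary and the choice of SHA-256 as the hash function are all hypothetical.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store that files each object under a hash of its content."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself, not chosen by the caller.
        address = hashlib.sha256(data).hexdigest()
        # Identical content always maps to the same address, so it is kept once.
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]
        # Re-hashing on retrieval reveals any change to the stored bytes.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored content does not match its address")
        return data


store = ContentAddressedStore()
address = store.put(b"scanned loan file, page 1")
print(address)
print(store.get(address))
```

Because the address depends only on the content, a copy held at a second location can be checked against the same address, which fits the multi-location use mentioned above.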

Eighth stop: 2005 – A new, much brighter star is born

In 2005 we conducted a pilot study for archiving loan files. Incoming paper post was to be scanned first and then distributed electronically. In addition, about 20 million pages of paper were to be scanned by a service provider.

Fortunately, the PDF/A specification ISO 19005-1 was published on October 1, 2005, so we decided not to use TIFF anymore. We bought some more licenses for the LuraTech Document Compressor, which we had already been using successfully for several years to convert greyscale checks. By then, the software could also convert TIFF to PDF/A. So when we started our loan archive in May 2006, we were one of the first companies in Germany using PDF/A.

Ninth stop: 2009 – A short summary

Today we use PDF/A for the following functions:

  • Convert all scanned documents (black/white, grey, colour) into highly compressed PDF/A format
  • Create an alternative PDF/A version of all Office-based file formats (Word, Excel, PowerPoint)
  • Create PDF/A extracts from text files
  • Create an alternative image-based PDF/A from “foreign” PDF files (e.g. sent by e-mail or downloaded from the web)

Based on our positive experiences I can recommend using PDF/A in all of these cases.

In a few other cases PDF/A is not yet ideal. Because the standardisation process is very long, you have to wait a long time for new features. The current version of PDF/A is based on PDF 1.4, so some “newer” features such as JPEG2000 compression are still not supported in PDF/A. In addition, there are some limitations when using PDF/A in scenarios with early scanning and document processing.

What comes next?

Most current requests in our company relate to working on documents (such as yellow sticky notes, bookmarks and highlighting) and to process-oriented business workflows. At the moment we are talking to Adobe and our archive vendor, Wincor Nixdorf, about implementing Adobe Lifecycle Server in our solutions. Our goal is to reach a high level of standardisation in this area.

I hope I have convinced you of the benefits of using PDF/A, and with that it is time to finish. If you are still not convinced, I can only recommend the one other format that is proven to last such a long time: get a chisel, a hammer and a slab of stone!
