Server-side Mass Conversion from Various Source Formats to PDF/A

The conditions

A large international energy provider needed to archive documents relating to nuclear power plant operations. Legal and regulatory requirements mean that some of these documents must be retained for a very long time. The archive system used a repository with a web client. Users stored all documents there in their native formats, including TIFF, Word, Excel and many others. Some of these documents also had to be stored on microfilm, which involved a large amount of manual work. Almost 200,000 documents, some of them multi-page and large-format, were added to the repository every month.

The task

These native formats, which varied widely and were not inherently suited to archiving, had to be replaced by a single standard format: PDF/A.

This meant attaching a component to the web server to convert incoming documents to PDF/A. Incoming PDF/A documents had to be validated and, where necessary, converted as well. Because conversion is time-intensive and the web interface user should never have to wait for it, incoming files were to undergo rapid tests to predict whether they could be converted. In addition, non-searchable files or sections of files were to be made searchable using OCR. The PDF/A files generated by the component were then to be brought to a uniform resolution and converted to black-and-white TIFF files so that microfilms could be produced automatically.
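The rapid tests mentioned above are not described in detail. As a minimal sketch of what such a pre-check could look like, the following Python snippet inspects only the first bytes and the size of an upload. The function names, the signature table and the size limit are assumptions made for illustration, not part of the delivered component.

```python
from typing import Optional, Tuple

# Hypothetical signature table; a real deployment would cover many more formats.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"II*\x00", "tiff"),                           # little-endian TIFF
    (b"MM\x00*", "tiff"),                           # big-endian TIFF
    (b"PK\x03\x04", "zip-container"),               # DOCX, XLSX or plain ZIP
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy DOC, XLS
]


def sniff_format(head: bytes) -> Optional[str]:
    """Return a coarse format name based on the leading bytes, or None."""
    for magic, name in MAGIC_SIGNATURES:
        if head.startswith(magic):
            return name
    return None


def quick_precheck(head: bytes, size: int,
                   max_size: int = 500 * 1024 * 1024) -> Tuple[bool, str]:
    """Cheap test run at upload time; returns (accepted, format or reason)."""
    if size == 0:
        return False, "empty file"
    if size > max_size:  # assumed limit, chosen only for this example
        return False, "file too large"
    fmt = sniff_format(head)
    if fmt is None:
        return False, "unrecognised format"
    return True, fmt
```

A check of this kind only predicts convertibility. The full conversion can still fail later and would have to be reported to the user asynchronously.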

The solution

First, the incoming original files were analysed. Because many source files used transparency, which PDF/A-1 does not permit but PDF/A-2 does, the customer was advised to use PDF/A-2b instead of the originally planned PDF/A-1b. A number of in-house components and technologies were selected for document conversion, while components from other suppliers were selected for validation and for converting PDF to PDF/A. The customer was also advised on which rapid tests were reasonable and necessary for the web interface to perform well for the user.

LuraTech then defined the interfaces to the web client and the microfilm exposure units and integrated the various technologies and products into a single .NET component. After intensive testing, the customer was presented with a fully integrated, highly scalable solution able to handle large volumes of documents in the future. The solution was built within two months and went into operation on a highly available server. Since then, millions of documents have been successfully converted.

After the project was completed, the solution was expanded to handle additional formats that had not been part of the original specification, including ZIP-compressed uploads.
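How ZIP uploads were handled is not described here. One plausible approach, sketched below in Python with the standard `zipfile` module, is to unpack the archive in memory and pass each member to the normal conversion path. The function name `iter_zip_members` and the size cap are assumptions made for this example.

```python
import io
import zipfile
from typing import Iterator, Tuple


def iter_zip_members(data: bytes,
                     max_total_bytes: int = 100 * 1024 * 1024
                     ) -> Iterator[Tuple[str, bytes]]:
    """Yield (filename, content) for each file in a ZIP upload.

    The declared uncompressed size is capped (assumed limit) so that a
    hostile or broken archive cannot exhaust server memory.
    """
    total = 0
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no document content
            total += info.file_size
            if total > max_total_bytes:
                raise ValueError("archive expands beyond the allowed size")
            yield info.filename, archive.read(info)
```

Each yielded member could then go through the same pre-checks and conversion steps as a directly uploaded file.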


