The Challenges in Archiving E-Mails with PDF/A

Dr. Bernd Wild, intarsys consulting GmbH, Karlsruhe, Germany

Abstract

Legal requirements for archiving business documents increasingly extend to e-mail correspondence. This raises the question of which format is best suited to the long-term archiving of e-mails. The PDF/A-1 standard opens up new possibilities for a uniform long-term format that preserves the document character of an e-mail, supports full-text search and, at the same time, can carry important metadata. A concept for converting e-mails to PDF/A documents is introduced, with particular attention to the handling of file attachments.

Motivation for E-Mail Archiving

Business communication via e-mail has overtaken classic postal delivery by far. Without e-mail, many business processes could not be completed in the time available or at the level of quality required. Originally intended as a transport medium for notifications, e-mail systems have developed into something closer to a document management system (DMS), thanks to their ability to exchange documents and information of any type and to store them flexibly in e-mail accounts and folders. Along with the file system, e-mail systems are therefore among the most important document storage and management systems. Unless the life cycle of a document (creation, revisions, completion, release) is managed using a genuine DMS, it can often only be retraced through the mail history.

Current studies assume that between 35% and 70% of all business communication and information in a company is now transported and stored using e-mail. Because of its importance for transaction processing, this medium is receiving more attention in legal regulations, which place e-mails on an equal footing with paper-based documents with regard to retention obligations and the burden of proof. In addition to legal restrictions (data protection), archiving all e-mail traffic also faces technical restrictions: data volumes, the ability to search the saved e-mails and the handling of spam must all be addressed. Given the commercial-law obligation to archive data for at least ten years, the main issues are the choice of a suitable storage format and the handling of file attachments.

Architecture

An e-mail consists of three parts: the header, the body and the optional attachments. While the body comprises the contents of an e-mail that can actually be read, the header consists of attribute-value pairs that contain meta information about the e-mail. In addition to the date the e-mail was sent and the sender address, this includes the destination address and the subject of the message. As well as these standard attributes defined in RFC 5322 [i], the header often contains routing information about the mail gateways involved in the transport (envelope sender), specifications for the encoding of the mail text and a message identification number. Since e-mail transport using the SMTP protocol was originally based purely on ASCII, all additional formatting and enhancements must be encoded on an ASCII basis. MIME is the established standard for this; its core is defined in RFC 2045 to RFC 2049, and the multipart/related content type is defined in RFC 2387 [ii]. The e-mail body may be encoded as text, HTML or MIME. For compatibility between mail clients, a plain-text representation is often inserted alongside the message that is encoded in HTML or MIME. This is particularly the case for formatted e-mails. If the sender also wants to send file attachments (non-text items such as images, PDF files, Office files etc.) as well as the actual message, these items are also encoded in MIME. The mail client programs involved often have no knowledge of the attached file formats; they simply perform the MIME encoding or decoding. Specific application software is therefore responsible for displaying e-mail attachments, and this software must be available on the target client system.
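The three-part structure described above can be illustrated with a minimal sketch based on Python's standard `email` package. The sample message, the addresses and the function name `split_email` are assumptions made for illustration only; they are not part of any particular archiving product.

```python
from email import message_from_string
from email.policy import default

# Hypothetical sample message: a plain-text body plus one base64-encoded attachment.
RAW = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly report
Date: Mon, 02 Mar 2009 10:15:00 +0100
Message-ID: <1234@example.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="us-ascii"

Please find the report attached.
--XYZ
Content-Type: application/pdf; name="report.pdf"
Content-Disposition: attachment; filename="report.pdf"
Content-Transfer-Encoding: base64

JVBERi0xLjQK
--XYZ--
"""


def split_email(raw):
    """Split a raw message into header dict, body text and attachment list."""
    msg = message_from_string(raw, policy=default)

    # Header: attribute-value pairs carrying the meta information.
    headers = {name: str(value) for name, value in msg.items()}

    # Body: prefer the plain-text representation, fall back to HTML.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""

    # Attachments: MIME-decoded bytes; the mail layer does not interpret them.
    attachments = [
        (part.get_filename(), part.get_content_type(), part.get_payload(decode=True))
        for part in msg.iter_attachments()
    ]
    return headers, body, attachments


headers, body, attachments = split_email(RAW)
```

Note that the sketch only decodes the attachment bytes; as the text points out, interpreting them is left to the application software for the respective file type.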

 

Fig. 1: System architecture in e-mail archiving


As shown in Fig. 1, when archiving e-mail traffic we can choose between a client-side approach (1) and a server-side approach (2). In client-side archiving, the user selects within the e-mail client program which e-mails are to be archived. This can be done by manual selection or by rules based on, for example, the date created, the date received or the recipient addresses. When archiving is requested, the affected mails are stored completely in the mail archive. Depending on the implementation, a reference to the e-mail's location in the archive may remain in the e-mail system. A web application is usually used to search for archived e-mails. This application should offer both full-text search and search by specific criteria from the e-mail. The e-mails that are found are then displayed as HTML pages. Any attachments remain unchanged, which is why the original programs are required at the workstation in order to open them. This approach cannot provide comprehensive archiving, because the end user makes the individual decisions about what is archived.

In server-side archiving, the entire incoming and outgoing mail traffic is stored according to rules. In addition to an efficient spam filter, care must be taken over who is allowed to search and access the centrally archived e-mails.

PDF/A and E-Mails

Most of the e-mail archiving solutions currently available save the e-mail either in its original format (text, MIME) or as an HTML page. In contrast to other electronic documents, the visual appearance of the body is not usually the most important aspect of an e-mail. The majority of exchanged e-mails are based on simple text and can be read, created and processed using command-line tools, so there is no preset layout that has to be preserved. The proportion of formatted e-mails is increasing but, because e-mails are predominantly informational in character, their graphical appearance is not prescribed. The most prominent attribute of the PDF/A format, the faithful reproduction of visual appearance, therefore carries less weight here. The metadata of an e-mail, by contrast, plays an important role and is frequently used for verification purposes. Here PDF/A, supported by XMP (Extensible Metadata Platform), is a powerful tool for the structured storage of metadata and can reproduce both the envelope-sender attributes and attributes that are specific to the mail system. A PDF/A file can therefore reproduce the complete range of information from an e-mail. Thanks to the XMP structure, targeted searches for e-mail metadata can be performed irrespective of the archive system being used. When a PDF/A e-mail is stored in the archive, its meta information can also be used for the specific index management of the archive system.

File attachments require special attention because they cannot always be properly converted into PDF/A. While the automatic conversion of word-processing documents (for example, Microsoft Word or OpenOffice) is possible and useful, it is very problematic for files from programs such as Microsoft Project or Microsoft Excel. This is either because print areas can only be defined interactively or because the dynamic attributes of the file are essential to its information content. For media formats such as MP3 or MPG, no conversion can be performed at all. Many image formats, however, can be converted to PDF/A.

A PDF/A-based Approach

Fig. 2 shows a basic approach to e-mail archiving in PDF/A.

 

Fig. 2: An approach for converting e-mails to PDF/A


The three main components of an e-mail are each handled in a particular way. The header and body are fed into a converter. For the body, this converter converts the text or MIME-encoded contents into PDF/A. The metadata from the e-mail header is converted into an XMP structure and embedded into the PDF/A document. An XMP schema is a prerequisite for this conversion: it defines the attributes that can be used and their semantics in the form of tags. An important requirement of the PDF/A-1 standard [iii] is that the schema must be embedded. For attributes not yet covered by the schema, either the schema can be generated automatically or the attributes can be stored, in condensed form, in a single extension attribute.
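As a rough illustration of the header-to-XMP step, the following Python sketch maps a few header fields to properties in an RDF description and condenses all remaining fields into one extension property. The namespace URI, the `mail` prefix and all property names (including `otherHeaders`) are hypothetical placeholders, not a published schema. The surrounding XMP packet wrapper and the embedded extension schema description that PDF/A-1 requires are deliberately omitted.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MAIL_NS = "http://example.com/ns/mail/1.0/"  # hypothetical schema namespace

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("mail", MAIL_NS)

# Hypothetical mapping from header fields to XMP property names.
HEADER_TO_PROPERTY = {
    "From": "from",
    "To": "to",
    "Subject": "subject",
    "Date": "date",
    "Message-ID": "messageId",
}


def headers_to_xmp(headers):
    """Serialize a header dict as an rdf:RDF element (XMP body only)."""
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": ""}
    )

    # Known attributes become individual properties.
    for header, prop in HEADER_TO_PROPERTY.items():
        if header in headers:
            ET.SubElement(desc, f"{{{MAIL_NS}}}{prop}").text = headers[header]

    # Unknown attributes are condensed into a single extension property.
    extra = [f"{k}: {v}" for k, v in headers.items() if k not in HEADER_TO_PROPERTY]
    if extra:
        ET.SubElement(desc, f"{{{MAIL_NS}}}otherHeaders").text = "\n".join(extra)

    return ET.tostring(rdf, encoding="unicode")


xmp = headers_to_xmp(
    {
        "From": "alice@example.com",
        "Subject": "Quarterly report",
        "Message-ID": "<1234@example.com>",
        "X-Mailer": "ExampleMail 1.0",
    }
)
```

Condensing unknown fields into one property keeps the schema stable, at the cost of making those fields harder to query individually; generating schema entries automatically is the alternative mentioned above.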

Attachments must be dealt with in a special way. The decision about whether or not to perform a conversion is based on the conversion matrix and the file type of the attachment (see Fig. 3).

 

Fig. 3: Case distinction when converting file attachments


Experience shows that most file attachments are Word, OpenOffice or PDF files, and these formats can be converted to PDF/A. As a result, only a small number of file attachments must be saved as unchanged source documents in the archive. For many convertible source documents, however, it may be useful to save the original as well as the PDF/A equivalent, particularly if dynamic contents should be retained.
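The case distinction for attachments could be sketched as a simple lookup, shown here in Python. The table entries, the `Action` names and the function `decide` are illustrative assumptions derived from the examples in the text; a real conversion matrix would be configured per installation and would probably inspect the actual file type rather than only the file extension.

```python
from enum import Enum
from pathlib import PurePath


class Action(Enum):
    CONVERT = "convert to PDF/A"
    CONVERT_AND_KEEP = "convert to PDF/A and keep the original"
    KEEP_ORIGINAL = "archive the original unchanged"


# Hypothetical conversion matrix: file extension -> convertible to PDF/A?
CONVERTIBLE = {
    ".doc": True,   # word processing
    ".odt": True,
    ".pdf": True,
    ".jpg": True,   # image formats
    ".png": True,
    ".tif": True,
    ".xls": False,  # print areas, dynamic content
    ".mpp": False,  # Microsoft Project
    ".mp3": False,  # media formats
    ".mpg": False,
}


def decide(filename, also_keep_original=frozenset()):
    """Return the archiving action for one attachment.

    Extensions listed in also_keep_original are converted, but the source
    file is archived as well (for example to retain dynamic contents).
    Unknown file types are treated as not convertible.
    """
    ext = PurePath(filename).suffix.lower()
    if not CONVERTIBLE.get(ext, False):
        return Action.KEEP_ORIGINAL
    if ext in also_keep_original:
        return Action.CONVERT_AND_KEEP
    return Action.CONVERT
```

For example, `decide("Report.DOC")` yields `Action.CONVERT`, while `decide("plan.mpp")` yields `Action.KEEP_ORIGINAL`. Defaulting unknown types to the unchanged original is the conservative choice, since no information is lost.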

By breaking down and converting a complex e-mail with attachments into one or more PDF/A files, the integrity of the e-mail is lost. In the original e-mail, this integrity was guaranteed by the MIME container. Since, in accordance with the PDF/A-1 standard, PDF/A files can contain neither embedded files (file collections) nor references that are resolved externally, the archive system must perform this task. It must preserve the unity of the original e-mail across the PDF/A file of the mail body and the converted or unconvertible file attachments. Current archive systems include the relevant precautions for this. This should become easier when the PDF/A-2 standard is released, because embedded files can then be used in PDF/A-2 files. However, in this case, only PDF/A-1 or PDF/A-2 documents can be embedded.
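One conceivable way for an archive system to keep the parts of a decomposed e-mail together is a manifest that lists every component with a checksum. The following Python sketch is only an assumption about how such a record might look; the field names and the functions `build_manifest` and `verify_manifest` are hypothetical and do not describe any existing archive product.

```python
import hashlib


def _entry(name, role, data):
    """Describe one stored component by name, role, size and SHA-256 digest."""
    return {
        "name": name,
        "role": role,  # "pdfa" for converted files, "original" for unchanged ones
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def build_manifest(message_id, body_pdf, attachments):
    """Tie the body PDF/A file and all attachment files to one e-mail.

    attachments is a list of (name, role, data) tuples.
    """
    return {
        "message_id": message_id,
        "body": _entry("body.pdf", "pdfa", body_pdf),
        "attachments": [_entry(n, r, d) for n, r, d in attachments],
    }


def verify_manifest(manifest, stored):
    """Check that every listed component exists in stored and is unchanged.

    stored maps component names to their bytes as read from the archive.
    """
    entries = [manifest["body"], *manifest["attachments"]]
    return all(
        e["name"] in stored
        and hashlib.sha256(stored[e["name"]]).hexdigest() == e["sha256"]
        for e in entries
    )
```

Stored alongside the components, such a manifest takes over the grouping role that the MIME container played in the original e-mail, and the checksums make a missing or altered part detectable.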

Conclusion

The PDF/A-1 standard opens up new opportunities for the long-term archiving of e-mails. By converting e-mails to PDF/A-1 instead of saving them in their various original formats, a uniform archive format can be used. At the same time, the mail metadata can be fully integrated into the archive document. When a document is retrieved from the archive, the header information therefore remains associated with the e-mail and a formatted display is possible. For most Office formats, the file attachments can be converted, and this conversion yields a “frozen” rendering of the attachment. Where conversion does not make sense, the original file attachment must be archived.



[i]    IETF, RFC 5322 (2008).

[ii]   IETF, RFC 2387 (1998).

[iii]  PDF/A Competence Center, TechNote 0009: XMP Extension Schemas in PDF/A-1 (2008).

About Bernd Wild

Dr. Bernd Wild is a member of the board of the PDF Association. Dr. Bernd Wild, 47, is originally a graduate physicist. After completing his studies, he worked for several years at a computer science research center in the field of artificial intelligence and its possible applications in industrial processes. Upon obtaining his PhD, Dr. Wild was responsible for the organization and management of C/S software development at an IT service provider in the banking sector. Together with some partners, he founded intarsys consulting GmbH in Karlsruhe in 1996. Dr. Wild now concentrates on consulting and providing assistance for complex system integration projects. Document technology has increasingly become a focal point during the past few years. This includes not only the creation of documents from source data, but also the entire document life cycle through to archiving. Technologies like electronic signatures, intelligent forms and document standards are at the core of his activities. In addition, intarsys offers products and software components that support these technologies and can be used to design customer-specific solutions easily and reliably.
