This paper describes how PDF/A is used in the PLM systems of Siemens I IA (Industry Automation) & DT (Drives Technology). It covers the problems and solutions from the introduction of PDF/A in 2007 through to the plans for the coming years.
2. Product Lifecycle Management and PDF/A at Siemens I IA&DT
The Siemens Industry sector is the biggest sector after the Energy and Healthcare sectors. IA&DT is the largest division in this sector and consists of ten business units. These ten units use one PLM/PDM system to manage all product-specific information during the lifecycle of a product. It covers information about the development, production and distribution of a product and all of its parts. This information is managed by the system, with the data being collected from many different sources including CAD, CAE, controlling and distribution. The PLM system also prepares and delivers the data to other systems such as SAP, the long-term archive and production systems.
The preparation and archiving of the data in particular create the need for neutral file formats. The archiving formats currently permitted for 2D files are TIFF G4 and PDF/A. PDF/A has been an official archive standard for Siemens since 2007. In Q2 2007 we introduced PDF/A to our PLM system Cadim.
3. PDF/A in Business Today
The PLM system we currently use at Siemens I IA&DT is Agile 5.1 (Cadim). Cadim has been the official PLM system of I IA&DT (formerly A&D) since 1997 and was launched with 1,600 users. Now we have over 9,000 active users and 16,000 inactive and information users at over 30 sites worldwide. Information users (e.g. suppliers) have no write access to the system but are informed when data they need has been changed. The 9,000 active users created about 1 million logins to the system last year. The user count has increased by between 10% and 40% annually over the past 10 years. We administer more than 4.1 million documents with some 6.4 million files. The total size of all files is over 5,000 GB, with 3,700 GB in the long-term file archive.
The system manages different 2D file types for technical data (cgm, plt, …) and textual data (doc, xls, ppt, txt, …) as well as software archives (zip, tar). To release and archive documents containing technical or textual data, Cadim creates a copy of the files in the long-term archiving format TIFF G4 or PDF/A.
3.1 Format conversion in the I IA&DT PLM system
The format conversion in Cadim takes place before a document can be released. In a pre-release workflow the format conversion is initiated and the document files are sent to the converter. After successful conversion the converted files are checked into Cadim and the original documents can be released. The release of the documents is based on the converted files and not on the original 2D construction files. It is executed by a special engineer with releasing rights.
The conversion is done on five servers: three for technical documents and two for MS-Office documents. The technical documents are converted on Linux machines running Red Hat, using gXconvert from Seal Systems to convert 2D drawings and PDF files to TIFF G4. Each of these servers has three Intel Xeon™ CPUs at 3.00 GHz. The office conversion is done on two Windows 2003 servers, each with two Intel Xeon™ CPUs at 3.00 GHz.
At the moment we have 3,500 conversions of technical documents per day. The number of files per document is on average slightly less than two, so we can say that we have around 6,000 file conversions for technical documents daily.
For the MS-Office conversions we have approximately 30 documents per day with an average number of 1.5 files per document. The maximum number of documents we have converted in one day (7 hours) was 4,000 when we imported entire projects that were created before we introduced the project document feature into Cadim in 2007. This year we had a peak in April with 2,000 files in 3 hours. These peaks appear when whole projects are released at once. That is not unusual and is not included in the average of 30 documents daily, but the conversion architecture has to deal with such peaks without increasing the average waiting time for a document in the queue to more than 10 minutes.
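To put the office conversion load into perspective, the figures quoted above can be turned into rates. The following is only a back-of-the-envelope sketch: the variable names are ours, and the even split of the load across the two office conversion servers is an assumption, not a measured value.

```python
# Figures quoted in the text above.
avg_docs_per_day = 30
files_per_doc = 1.5
peak_files, peak_hours = 2_000, 3            # peak in April
migration_docs, migration_hours = 4_000, 7   # busiest day of the project import
office_servers = 2


def per_minute(count, hours):
    """Average throughput per minute over the given number of hours."""
    return count / (hours * 60)


avg_files_per_day = avg_docs_per_day * files_per_doc
peak_rate = per_minute(peak_files, peak_hours)
migration_rate = per_minute(migration_docs, migration_hours)
# Assumption: the load is spread evenly over both office servers.
peak_rate_per_server = peak_rate / office_servers

print(f"average: {avg_files_per_day:.0f} files per day")
print(f"April peak: {peak_rate:.1f} files per minute")
print(f"migration peak: {migration_rate:.1f} documents per minute")
print(f"April peak per server: {peak_rate_per_server:.2f} files per minute")
```

The average of about 45 files per day is what the servers handle on a normal day, while during a peak they have to sustain roughly 11 files per minute for several hours.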
3.2 Development to use PDF/A in Cadim
After 10 years of Cadim history the main archive format for technical documents is still TIFF G4. We watched the ongoing discussions about PDF/A from the beginning with interest, because in some cases TIFF G4 doesn’t satisfy the needs of our users. At the beginning of 2007 a PDF/A task force was founded with the goal of introducing PDF/A as a further archive format for Siemens. The task force approved PDF/A as an additional format to TIFF G4 in May 2007.
In parallel with the PDF/A task force there was a requirement to integrate the management of projects into Cadim. It was also requested that project documents be archived in our long-term file archive. Project documents consist primarily of MS-Office files, e-mails and PDF files. TIFF G4 is not the best archive format for these file types because of its poor usability, especially in the office environment. We therefore chose the new PDF/A standard for the office conversions. The release of the first PDF tools able to create PDF/A coincided with our development work on this Cadim feature, and we decided to use one of these tools to create our PDF/A files for the long-term archive.
3.3 PDF/A creation in Cadim
We have integrated the PDF/A creation directly into our workflow using the print function for office documents in Cadim. When a document enters the release workflow it is automatically sent to the printer queue on one of the two office conversion servers. The Cadim instance on the server opens the document and starts the conversion of the files one by one. Only Word, Excel, PowerPoint and PDF files are converted into PDF/A. As soon as the PDF/A file is created it is checked in to the document in Cadim under the name “old_name_CONV.pdf” to differentiate it from the original file.
The conversion process opens the files in the related office application and starts the PDF creation with PDF/A-1b as the output format. The integration into Cadim gives us direct control over the conversion process and the internal error detection mechanism, without the need for an external conversion mechanism such as the one used for technical documents. We can apply the integrated error handling to the error-prone office conversion.
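The file selection and naming rule described above can be sketched as follows. This is an illustration, not the Cadim implementation: the function names and the list of file extensions are our assumptions, as is the choice to drop the original extension before appending "_CONV.pdf".

```python
from pathlib import PurePath

# Assumption: illustrative extensions for Word, Excel, PowerPoint and PDF files.
CONVERTIBLE_SUFFIXES = {".doc", ".xls", ".ppt", ".pdf"}


def is_convertible(filename: str) -> bool:
    """Return True if the file type is one that is converted to PDF/A."""
    return PurePath(filename).suffix.lower() in CONVERTIBLE_SUFFIXES


def converted_name(filename: str) -> str:
    """Build the name of the converted copy: '<old_name>_CONV.pdf'."""
    return f"{PurePath(filename).stem}_CONV.pdf"


for name in ["spec.doc", "costs.XLS", "certificate.pdf", "sources.zip"]:
    if is_convertible(name):
        print(f"{name} is converted to {converted_name(name)}")
    else:
        print(f"{name} is not converted")
```

With this rule a PDF original such as certificate.pdf and its PDF/A copy certificate_CONV.pdf can be stored side by side in the same document.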
Figure 1. Cadim server architecture simplified
4. Experiences and Problems
Introducing PDF/A at the same time as a new feature was easier than replacing the TIFF G4 conversion for technical documents (see more about the introduction of PDF/A to technical documents in section 6). We could define a completely new process without having to worry about old processes that needed to use the conversion results. Convincing our users to use PDF/A instead of TIFF G4 was quite easy, because they didn’t want TIFF due to the difficulty of handling it in the office environment. So PDF/A was the best option for archiving office documents.
The first year with the office conversion in the productive application was quite difficult because at the same time we had the main migration of old projects into the system. This migration included 200,000 files which had to be converted. Before this migration we had no experience with possible errors and problems with the conversion of office documents to PDF/A. We only knew about the problems with encrypted files and non-embedded fonts when converting PDF files to PDF/A. We were unaware of other problems because we had no experience with converting office documents.
4.1 Problems with PDF to PDF/A conversion
The main problems encountered when converting PDF files to PDF/A are encryption (49% of all errors) and non-embedded unusual fonts (49% of all errors), e.g. Chinese fonts and special engineering fonts. These problems occur mainly with external documents such as certificates and documents from suppliers.
Certification authorities in particular use encrypted files that do not meet the PDF/A standard. They use encryption, non-embedded fonts and PDF versions later than 1.4. Not all of them are willing or able to change their PDF creation processes to meet the PDF/A standard, or our employees don’t ask for PDF/A conforming certificate files. The other two percent of the errors occur because of defects in the structure of the PDF files or corrupted embedded images.
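Two of the problems named above, encryption and PDF versions later than 1.4, can often be spotted before a file is sent to the converter. The following is a rough heuristic sketch, not the check used in Cadim: the function name is hypothetical, it only looks at the file header and searches the raw bytes for an /Encrypt entry, and it does not detect non-embedded fonts or structural defects.

```python
import re

# The PDF header line looks like "%PDF-1.4" near the start of the file.
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def precheck_pdf(data: bytes) -> list[str]:
    """Return a list of likely obstacles to PDF/A conversion (heuristic only)."""
    problems = []
    match = HEADER_RE.search(data[:1024])
    if match is None:
        problems.append("no PDF header found")
    else:
        version = (int(match.group(1)), int(match.group(2)))
        if version > (1, 4):
            problems.append(
                f"PDF version {version[0]}.{version[1]} is later than 1.4"
            )
    # Encrypted files reference an encryption dictionary via /Encrypt.
    if b"/Encrypt" in data:
        problems.append("file appears to be encrypted")
    return problems


sample = b"%PDF-1.6\n...\ntrailer\n<< /Root 1 0 R /Encrypt 5 0 R >>\n"
for problem in precheck_pdf(sample):
    print(problem)
```

A check of this kind could be used to return such files to the sender with a concrete reason instead of letting them fail in the conversion queue.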
4.2 Problems with MS-Office to PDF/A translation
After the introduction of the office conversion, Word and Excel files caused most of the errors we had. The reasons for the errors vary from document to document but we could identify the following main causes: macros, Visual Basic scripts, read-only files, repagination in Word documents in files with more than 500 pages, and large files with more than 100 MB and more than 5,000 pages. Most of the stoppages during the office conversion were caused by files having one of these elements. A complete stoppage of the format conversion also causes a stoppage of all pending release workflows and requires manual intervention by a Cadim system administrator to restart the conversion.
Forbidding macros and scripts is not an option for project documents because they make it easier to create and work with the documents before they are put into a Cadim workflow. We found no user acceptance for such a measure, and it was also not an option for the 200,000 files of the initial migration we had to convert. The only option was for the conversion process to learn how to deal with such behaviour of the office documents.
We analysed the macros and scripts in more depth and found that most of them require user interaction or launch info and dialog boxes. But we also found more complex Visual Basic applications, such as small ftp clients or version management applications, embedded in Word files.
To solve the problem with information, dialog and application boxes we use an open-source scripting tool that can simulate mouse and keyboard interactions. This tool recognises window titles and performs different actions if a window with a predefined title appears. For example, some Excel macros inform the user that only the green fields are editable. This message comes up in an info box with the title “Info”. The scripting tool catches this dialog and sends the equivalent of a return key to the window to end the dialog. A dialog with the same title can sometimes appear in a different context, so we send another action key to the dialog if it is still there after the first try. Today we have 15 scripts in use on the server, most of them covering a group of similar dialogs.
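The rule-based handling of dialogs described above can be sketched as follows. This is not the scripting tool we use: all class and function names are hypothetical, the desktop is simulated so that the example runs anywhere, and the choice of ESC as the second action key is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DialogRule:
    title: str    # window title the rule reacts to
    keys: tuple   # keys tried in order until the dialog is gone


@dataclass
class FakeDialog:
    title: str
    closes_on: str  # the key that dismisses this simulated dialog


@dataclass
class FakeDesktop:
    """Stand-in for the real GUI layer (assumption, for illustration only)."""
    dialogs: list = field(default_factory=list)

    def find(self, title):
        return next((d for d in self.dialogs if d.title == title), None)

    def send_key(self, dialog, key):
        if key == dialog.closes_on:
            self.dialogs.remove(dialog)


def dismiss_dialogs(desktop, rules):
    """Apply each rule once and return the (title, key) actions that were sent."""
    log = []
    for rule in rules:
        for key in rule.keys:
            dialog = desktop.find(rule.title)
            if dialog is None:
                break  # dialog is gone, no further key needed
            desktop.send_key(dialog, key)
            log.append((rule.title, key))
    return log


rules = [DialogRule("Info", ("ENTER", "ESC"))]
desktop = FakeDesktop([FakeDialog("Info", closes_on="ESC")])
print(dismiss_dialogs(desktop, rules))
```

In this example the first key does not close the "Info" dialog because it appeared in a different context, so the second key of the rule is sent as a fallback.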
With these scripts we reduced the error rate from 25% at the beginning of the migration to less than 2% today. But this rate is considerably higher than the 0.1% we have in the conversion of technical documents. A reason for the higher error rate is that the office applications and the Cadim clients crash sometimes without a recognisable reason or the office applications try to recover documents with no defects.
5. Plans for the future
At the moment we are replacing our current PLM system Cadim with Teamcenter 2007. In Cadim all development has been stopped since 2008 except for error correction. That’s one of the reasons why we still use TIFF G4 as an archive format for all technical documents. The other reason is that we have one big business unit that retrieves all the production data directly from our long-term archive using the stored TIFF G4 files. They can’t change their processes to use PDF/A and TIFF G4 simultaneously at the moment. But the production processes in the future will need to use both file formats when they want to use the old data provided only in TIFF G4 and the new data provided only in PDF/A. This major change will take a minimum of three years to implement.
5.1 Development for Cadim
Within the next three months we plan to introduce PDF/A for the first technical documents: the system-generated parts lists. This will be our last innovation in Cadim. The reason for this change is a requirement to revise the layout of the parts lists. This gives us an opportunity to change the output format of the lists at the same time. At the moment we create PDF files and convert them to TIFF G4. Parts lists account for over 15% of the daily format conversions. The tool that creates the parts lists can’t create PDF/A, but we hope to fix some problems with the created PDF and adjust it to PDF/A. The main problem is that the tool can’t embed fonts, and it uses some unusual fonts that will have to be replaced.
5.2 Development for Teamcenter
Teamcenter will be introduced in three main releases, with each of them replacing parts of the Cadim functions. The first release is currently in productive use but it covers only a small percentage of the Cadim functionality and is only used by one business unit. The next release is planned for December 2009 and will take over the full project document feature from Cadim with PDF/A as the archive format. In Teamcenter, all system-generated files are PDF/A or will be PDF/A if the creation feature is planned for one of the later releases. Whether or not we’re able to replace TIFF G4 with PDF/A during the introduction phase of Teamcenter is still open and depends on our business units. Some of them need the converted files to be available immediately and others need them only for archive reasons. For the second group it will be easier to introduce PDF/A as the archive format. The introduction will probably be done step by step, which is usual for an environment as heterogeneous as that of our PLM system. We can’t introduce certain features and changes with a big bang.
The introduction of a new archive format while at the same time implementing other new features, as was the case with the project documents, was quite easy in general. But we needed more than a year to reach a tolerable error rate for this completely new conversion process. Now it is working fine and only requires a bit of fine-tuning from time to time. The acceptance of the PDF/A file format with the users is good and the conversion process is stable, after a learning period on both the user and application support sides.
Changing the old processes currently in place for technical documents is not as easy and will need a lot of further work and time. This will be done step by step over the next few years.