The Publications Office Digital Library Project

High Volume Book Scanning creating PDF/A-1b

Abstract

The European Publications Office (‘the Office’) is currently digitising its historic archive of around 130 000 publications dating back to 1952. The aim is to make the entire collection accessible for free download and online consultation by October 2009. Outsourced industrial, non-destructive mass digitisation enables the production, delivery and upload of 1.5 million pages per month. The main delivery is a PDF/A-1b file containing bookmarks, basic metadata and a background text layer.

1. Content and selection

In 2006, notices for around 65 000 publications were made available on the EU Bookshop website (http://bookshop.europa.eu) and a scan-on-demand service was offered. However, the popularity of the service meant that the capacity of the in-house scanning workshop was soon reached and a backlog of several months began to accumulate. The Office therefore decided to launch a digitisation project, not only to respond to this increasing demand for individual publications but also to create a complete digital archive of its publications, available on the EU Bookshop site. A content-based selection was considered but rejected, as the requests received were as inhomogeneous as the content offered. The archival material was regarded as more valuable as a complete collection than as single parts taken out of context.

The collection to be scanned contained some 130 000 publications produced between 1952 and 2002 by 370 so-called ‘author services’, e.g. EU institutions, agencies and other bodies, mainly, but not exclusively, in 11 languages. This heterogeneous content is estimated to contain 13 million pages.

The produced files serve three purposes: long-term preservation, a print-on-demand capability and online presentation.

2. Technical requirements & implementation

2.1 Quality of the delivery

The specifications require the production of TIFF 6.0 files at a minimum native resolution of 300 dpi, in 24 bit RGB or 8 bit greyscale, with a gamma of 2.2, geometrically corrected, de-skewed, with shadows eliminated, brightness variation removed, operators’ fingers masked, cropped, at original size, correctly rotated to enhance readability, and ZIP compressed. This constitutes one delivery, received on external disks.
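
As an illustration only, the following minimal sketch (in Python, using the Pillow library; the file name, the accepted compression values and the check logic are assumptions, not the Office’s actual tooling) shows how such delivery parameters might be verified automatically:

    # Hypothetical sketch: verify basic TIFF delivery parameters with Pillow.
    # Thresholds and accepted values are illustrative assumptions.
    from PIL import Image

    MIN_DPI = 300
    ALLOWED_MODES = {"RGB", "L"}  # 24 bit colour or 8 bit greyscale

    def check_tiff(path: str) -> list[str]:
        """Return a list of problems found in a delivered TIFF page."""
        problems = []
        with Image.open(path) as img:
            if img.format != "TIFF":
                problems.append(f"not a TIFF file: {img.format}")
            if img.mode not in ALLOWED_MODES:
                problems.append(f"unexpected bit depth/mode: {img.mode}")
            dpi = img.info.get("dpi", (0, 0))
            if min(dpi) < MIN_DPI:
                problems.append(f"resolution below {MIN_DPI} dpi: {dpi}")
            if img.info.get("compression") not in ("tiff_adobe_deflate", "tiff_deflate"):
                problems.append("not ZIP (Deflate) compressed")
        return problems

    # Example usage (hypothetical file name):
    # print(check_tiff("sample_page_0001.tif"))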

The TIFF images described above also have to be embedded into a PDF/A-1b file that is at the same time compliant with PDF/X-3, to guarantee print-on-demand capability. Pure text pages have to be binarised in order to save storage space. The metadata are encoded in the XMP stream and in the PDF properties. The PDF also contains an ICC profile, volume optimised to reduce file size. The number of embedded typefaces is also limited to further reduce file size. An OCR background layer, basic metadata and bookmarks are added. This PDF/A-1b is sent via Internet/FTP to the Office, in a ZIP package that also contains the metadata files (separately in XML) and a JPEG thumbnail of the cover page. These files are also separately delivered on DVD.
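
A delivery package of this kind could, in simplified form, be assembled as in the following sketch; the file names and the flat package layout are assumptions for illustration, not the real naming convention:

    # Hypothetical sketch: assemble one delivery package as a ZIP containing
    # the PDF/A-1b, the separate XML metadata file and a JPEG thumbnail of
    # the cover page. Names are invented for illustration.
    import zipfile
    from pathlib import Path

    def build_package(pdf: Path, metadata_xml: Path, thumbnail_jpg: Path,
                      out_zip: Path) -> None:
        with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            for part in (pdf, metadata_xml, thumbnail_jpg):
                zf.write(part, arcname=part.name)

    # Example usage (hypothetical catalogue number):
    # build_package(Path("CAT-0000-EN.pdf"), Path("CAT-0000-EN.xml"),
    #               Path("CAT-0000-EN.jpg"), Path("CAT-0000-EN.zip"))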

2.2 Quality control

Several stages of quality control have been implemented both on the contractors’ side and on the Office’s side. Quality control at the Office is split into two main parts:

Automatic quality control (100%): The reception chain validates several pre-defined parameters (whether the publication is listed, checksum, file naming convention, PDF/A-1b standard compliance, bookmark quality, OCR recognition report entries, resolution of the embedded TIFF and of the thumbnail, metadata schema compliance). Files failing any one of these criteria are automatically rejected.
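
The reject-on-any-failure principle of this reception chain can be sketched as follows; the individual checks shown are illustrative stubs (real PDF/A-1b, bookmark and schema validation would rely on dedicated tools) and the naming pattern is a placeholder:

    # Hypothetical sketch of the "fail any one criterion, reject the package"
    # logic of the automatic reception chain.
    import hashlib
    import re
    from pathlib import Path

    def is_listed(catalogue_no: str, catalogue: set[str]) -> bool:
        return catalogue_no in catalogue

    def checksum_ok(pkg: Path, expected_md5: str) -> bool:
        return hashlib.md5(pkg.read_bytes()).hexdigest() == expected_md5

    def naming_ok(pkg: Path) -> bool:
        # The real naming convention is defined in a separate schema;
        # this pattern is only a placeholder.
        return re.fullmatch(r"[A-Za-z0-9-]+\.zip", pkg.name) is not None

    def accept_package(pkg: Path, catalogue_no: str,
                       catalogue: set[str], expected_md5: str) -> bool:
        checks = [
            is_listed(catalogue_no, catalogue),
            checksum_ok(pkg, expected_md5),
            naming_ok(pkg),
            # ... followed by PDF/A-1b compliance, bookmark quality, OCR report,
            # TIFF/thumbnail resolution and metadata schema checks
        ]
        return all(checks)  # failing any single criterion rejects the delivery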

Manual quality control: The batch volume for the manual quality control is determined manually on a page basis (usually, several days of deliveries are accumulated to reach batch sizes close to 500 000 pages). Applying ISO 2859-1:1999, Sampling procedures for inspection by attributes, 800 pages for first-line and 80 pages for second-line image control, plus 125 complete publications, are then sampled by an automatic procedure. No human manipulation is allowed; the sampled pages and publications are communicated automatically to the commercial partner by XML via FTP. If the required quality level of 97.5 % is not met (i.e. 22 rejected pages or 8 rejected publications), the whole batch is rejected. The results, including all annotations for correction, are also communicated by XML via FTP to the commercial partner. Rejected batches have to be re-delivered within 4 weeks.
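
The automatic sampling and the accept/reject decision might look roughly like the sketch below; the sample sizes and thresholds follow the figures quoted above, while the exact interpretation of the rejection limits and the data structures are assumptions:

    # Hypothetical sketch of the automatic sampling and accept/reject decision.
    import random

    FIRST_LINE_PAGES = 800
    SECOND_LINE_PAGES = 80
    SAMPLE_PUBLICATIONS = 125
    REJECT_AT_PAGES = 22          # assumed: 22 rejected pages reject the batch
    REJECT_AT_PUBLICATIONS = 8    # assumed: 8 rejected publications reject the batch

    def draw_sample(page_ids: list[str], publication_ids: list[str], seed: int):
        """Draw the inspection sample automatically, without manual selection."""
        rng = random.Random(seed)
        pages = rng.sample(page_ids, FIRST_LINE_PAGES + SECOND_LINE_PAGES)
        publications = rng.sample(publication_ids, SAMPLE_PUBLICATIONS)
        return pages, publications

    def batch_accepted(rejected_pages: int, rejected_publications: int) -> bool:
        return (rejected_pages < REJECT_AT_PAGES
                and rejected_publications < REJECT_AT_PUBLICATIONS)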

Viewing conditions: 24-inch screens have been purchased for the verifiers at the Office.

2.3 Collection management

Organisation of the files: The PDF/A-1b files are stored in double redundancy on WORM (write once, read many) media (magneto-optical disks); a third backup system uses tapes. A numbering scheme has been applied that reflects the unique identifiers (‘catalogue numbers’) already used in an existing, separate cataloguing system. The TIFF files are stored on magnetic disks in double redundancy.

Naming convention: A detailed schema describing the naming of the files has been written.

Use of metadata, data management techniques: 50% (around 65 000) of the notices already existed in MARC 21 (Machine Readable Cataloguing) format. The Office decided to request the production of a very basic set[i] of Dublin Core metadata[ii]. For publications which already have a notice, the basic metadata produced will be used to verify and, where necessary, further enrich those existing notices. If no notice exists for a publication, the basic metadata will enable the publication to be displayed and retrieved on the web. A complementary project to further enrich all these notices is anticipated for the future.
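
By way of illustration, a basic Dublin Core record of the kind listed in note [i] could be produced as in the following sketch; element choices and sample values (including the catalogue number) are invented and do not reproduce the Office’s actual schema:

    # Hypothetical sketch of a basic Dublin Core record; purely illustrative.
    import xml.etree.ElementTree as ET

    DC = "http://purl.org/dc/elements/1.1/"
    ET.register_namespace("dc", DC)

    def basic_record(catalogue_no: str, title: str, language: str,
                     corporate_author: str, year: str, pages: int) -> str:
        record = ET.Element("record")
        fields = [("identifier", catalogue_no), ("title", title),
                  ("language", language), ("creator", corporate_author),
                  ("date", year), ("format", f"{pages} p.")]
        for tag, value in fields:
            ET.SubElement(record, f"{{{DC}}}{tag}").text = value
        return ET.tostring(record, encoding="unicode")

    print(basic_record("CAT-0000-EN", "Sample title", "en",
                       "European Commission", "1975", 48))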

Document encoding: XML (eXtensible Markup Language) encoding described in schemas is generally used.

3. Project management

3.1 Phases

Introduction: The project has been split into two phases. The pilot project scanned more than 6 100 publications containing 1.1 million pages within 8 weeks. The main project runs between September 2008 and July 2009 and will scan some 124 000 publications containing around 12 million pages. An already existing framework contract, awarded through an open international competition, has been used.

Needs assessment: The main hurdles to be overcome were non-destructive scanning, the production of bookmarks and the encoding of metadata, combined with the sheer volume of data to be handled, generated, transmitted and stored.

Performance indicators: Production volumes were fixed as displayed below. Official holiday periods reduced the required volume during certain stages of the full production; delivery delays triggered financial penalties. Rejected batches had to be re-delivered within 4 weeks.

Production calendar:

  • Kick-off: 07.08.2008
  • Start-up phase (production weeks 1–6): preparation, service level agreement & testing
  • Production Volume 1 (weeks 7–10): minimum 500 000 pages per 4-week period
  • Production Volume 2 (weeks 11–14): minimum 750 000 pages per 4-week period
  • Production Volume 3 (weeks 15–18): minimum 1 000 000 pages per 4-week period
  • Full Production (weeks 19–47, until 26.07.09): minimum 1 500 000 pages per 4-week period

August and September 2009 will be reserved for completing the final quality control, accepting any outstanding re-deliveries, ingestion for long-term preservation, downgrading deliveries to web-optimised files and publication on the EU Bookshop website.

Staff involved: The project has required the direct or indirect contribution of two thirds of the units of the Office, but due to the externalisation of the production only 7 full-time staff[iii] are exclusively devoted to the project.

Time frame: The project started in October 2007 with the preparation of the tendering procedure for the pilot project and will end in November 2009 with the delivery of the final documentation.

3.2 Management of the digitization cycle

Preparation of publications: Since 1952, two copies of each publication have been stored side by side in the basement of the Office’s main premises. In December 2007, preparations began to separate the two copies[iv], package a single copy of each publication into cartons, prepare the cartons for the bi-weekly pick-up for scanning, handle any exceptions[v] and verify the completeness, good order and condition of the returned publications.

Data management: The PDF delivery packages are delivered via the Internet (2 x 8 Mbit/s) and received by a modem after passing a first firewall, then collected in a cache where the delivery date and time are logged. The files are transferred daily via a second firewall to the validation system and, after having passed the automatic control, are kept in a second cache. Finally, after having been selected for a batch and accepted, the packages are dismantled and ingested into the two archiving systems (XML metadata[vi] and PDF including thumbnails). The appearance of the relevant files in the two systems triggers the publication of the individual notices on the EU Bookshop website, after uploading of the metadata and downgrading of the PDF, which is then kept in the cache of the website.
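
The reception and logging step of this chain might, in simplified form, look like the following sketch; directory names, the log format and the daily transfer function are invented for illustration:

    # Hypothetical sketch of the reception step: log the arrival time of each
    # incoming package in the cache, then move it on to the validation system.
    import logging
    import shutil
    from datetime import datetime, timezone
    from pathlib import Path

    RECEPTION_CACHE = Path("/data/reception")
    VALIDATION_QUEUE = Path("/data/validation")

    logging.basicConfig(filename="reception.log", level=logging.INFO)

    def transfer_daily() -> None:
        for package in sorted(RECEPTION_CACHE.glob("*.zip")):
            logging.info("received %s at %s", package.name,
                         datetime.now(timezone.utc).isoformat())
            shutil.move(str(package), str(VALIDATION_QUEUE / package.name))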

3.3 Managing the workflow

A specifically developed system enables:

  • The establishment of a tracking system for project review. It co-ordinates and records the workflow in a database (a minimal sketch of such a tracking record follows this list), reflecting:
    • the beginning and end date of each activity,
    • the processing steps undergone by each package, e.g. date of reception, quality control, change history and ingestion. The digital preservation record and the web publishing are monitored separately.
  • The supervision of the quality control. The quality control is entirely integrated into the workflow system; it sets a consistent standard of delivery and tracks the status of processing. Furthermore, reports enable the regular documentation of progress.
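
A minimal sketch of such a per-package tracking record, using SQLite with invented table and column names (the Office’s actual workflow system is not shown), could look like this:

    # Hypothetical sketch of a per-package tracking record.
    import sqlite3

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS package_status (
        catalogue_no   TEXT PRIMARY KEY,
        received_on    TEXT,   -- date of reception
        qc_status      TEXT,   -- e.g. 'pending', 'accepted', 'rejected'
        change_history TEXT,   -- free-text log of processing steps
        ingested_on    TEXT    -- ingestion into the archiving systems
    );
    """

    def record_reception(db: sqlite3.Connection, catalogue_no: str, date: str) -> None:
        db.execute(
            "INSERT OR REPLACE INTO package_status (catalogue_no, received_on, qc_status) "
            "VALUES (?, ?, 'pending')",
            (catalogue_no, date),
        )
        db.commit()

    if __name__ == "__main__":
        con = sqlite3.connect("workflow.db")
        con.executescript(SCHEMA)
        record_reception(con, "CAT-0000-EN", "2008-09-15")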

Access to the contractor’s production and delivery database enables full control of the delivery cycle. A weekly progress report is presented to the Office’s management, introducing a level of accountability to the project team.

The chart outlines how the project workflow is organised.

4. Experiences

4.1 Time planning

The start-up phase (4 weeks for the pilot project and 6 weeks for the main project) was extremely tight and insufficient for both parties to fully develop and test their workflow systems. The recruitment and training of personnel and the purchase and installation of additionally required hardware and software were also affected.

4.2 Open source vs. proprietary

Both parties decided to use proprietary software for the production of the background text layer (OCR), the creation of the PDF/A-1b (compliant with PDF/X), data storage and the creation of the web-optimised PDF. Performance and service had to be guaranteed at all times; therefore both parties considered the use of open source software too risky for a production of this volume under extremely tight deadlines.

4.3 Standards

The resolution of 300 dpi was seen as a good compromise between readability, printability, general resolution standards of mass book scanners and file size.

Bit depth: 24 bit (colour) and 8 bit (greyscale) were chosen to retain as much information as possible and to avoid re-scanning in the future. A single coloured dot or line on a page required scanning in 24 bit.

ISO standards: TIFF, PDF/A and the sampling standard are well established and documented.

4.4 OCR quality level

The pilot project required an accuracy of 99.5 % compared to the original text. Discussion soon started about the interpretation of the encodings, as OCR software reports only the uncertainties it detects itself. For instance, a title printed in spaced-out text in the physical publication was also spaced in the PDF background text layer and therefore could not be found when searched for. A very detailed description of the expected encoding and a rigorous quality control would have been necessary, which could not have been achieved in such a short time and at justifiable cost. Therefore, the following approach was adopted for the main project:

Production of a basic OCR layer: This was the standard quality chosen for the mass production, sufficient for indexing the searchable full-text background layer. The minimum was set to 65% character accuracy per page, as recognised by the OCR tool. A report, created before the manual quality enhancement, is part of every delivered package and also serves as a general quality indicator of the scan. A second round of OCR layer production in some years is envisaged, depending on the advancement of OCR text recognition tools. This is the main reason why the 8 bit greyscale version of the images has been kept, in the hope of better results than with the binarised 1 bit text versions currently kept in the PDF.

Production of an enhanced quality OCR layer: The quality has been set to a minimum of 99.35% character accuracy per page. Images, graphics and complicated formulas are excluded from OCR treatment, although titles have to be OCR treated. It is intended to request this quality at a later stage for publications which are frequently downloaded, depending on the financial resources available.

Production of a high quality OCR layer: The quality has been set to a minimum of 97.5% word accuracy per page. Copy and paste of text and tables should be possible without major re-work, and spaced-out printed text must be searchable by word. This most expensive and time-consuming encoding will be ordered only for the most requested publications and, of course, depending on the financial resources available.
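
For illustration, character and word accuracy per page could be measured against a reference transcription roughly as in the sketch below; real OCR evaluation is more involved, and the thresholds simply mirror the three quality levels described above:

    # Hypothetical sketch: measure OCR accuracy with a simple similarity ratio.
    from difflib import SequenceMatcher

    def character_accuracy(ocr_text: str, reference: str) -> float:
        return SequenceMatcher(None, ocr_text, reference).ratio()

    def word_accuracy(ocr_text: str, reference: str) -> float:
        return SequenceMatcher(None, ocr_text.split(), reference.split()).ratio()

    def quality_level(ocr_text: str, reference: str) -> str:
        """Map a page to the highest quality level it meets (illustrative thresholds)."""
        if word_accuracy(ocr_text, reference) >= 0.975:
            return "high quality OCR layer"
        if character_accuracy(ocr_text, reference) >= 0.9935:
            return "enhanced quality OCR layer"
        if character_accuracy(ocr_text, reference) >= 0.65:
            return "basic OCR layer"
        return "below minimum"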

4.5 RGB vs. CMYK

The Office opted for RGB. The printed colours are in CMYK, but the cameras capture in RGB. Converting the delivered files to CMYK would have reduced quality for both long-term preservation and web dissemination because of the smaller colour space.

4.6 Quality control

External vs. internal: For the pilot project the Office used an external partner with an internal second-line control, but for the main project the quality control was internalised due to the lack of a usable contract. The Office had good experiences with both approaches, but favours the internal two-level control for this type of project as it permits direct control and complete integration into its own workflow systems.

Timing considerations: The pilot project foresaw 2 weeks for the acceptance of a batch; however, this was considered too short. The 4-week period for quality control and re-delivery foreseen under the main project provides sufficient time for these tasks.

4.7 Data management

The transmission of data on external disks, which requires exact hand-over/take-over arrangements, is not as convenient as electronic transmission, which allows continuous transmission and reception as well as automatic logging and processing. The human resources required for data administration should not be underestimated.

4.8 Personal commitment

This project requires a high degree of effort from our commercial partners as well as from staff at the Office. Production starts every week at midnight on Monday and runs continuously, day and night, in 3 shifts until it stops on Saturday at 8 p.m. During the preparation period and the first months, working late into the night and at weekends was standard for many of the people involved. It is at this point that I would like to acknowledge the invaluable input of everybody who has contributed to the success of this project, which was long regarded as a ‘mission impossible’.

5. Follow-up projects

Collection management: The two current collections (PDF/A-1b 24/1 bit and TIFF 24/8 bit) will probably be unified by replacing the current 24/1 bit images with 24/8 bit ones. This is currently not possible due to the high cost of the WORM media used for long-term preservation. The creation of PDF/A-1b files larger than 2 GB, as well as print-on-demand capabilities for managing files above this size, will also have to be investigated. It is still to be decided whether the raw, uncompressed TIFF files will also be stored.

Further OCR enhancement: Any future approach will strongly depend on how this published collection will be received by the users. Download figures will be analysed and financial resources made available, if necessary and possible.

Metadata enhancement: It is intended to further enhance the quality of the existing metadata with future projects.

Other collections: Our ‘author services’, e.g. EU institutions, agencies and other bodies, hold further collections of unknown extent. There is high interest in identifying and scanning these in order to make them available on the EU Bookshop website as well.

This article was written by

Anton Zagar, EU Publications Office

© European Communities, 2009

The views expressed in this article are those of the author(s)

and do not necessarily reflect the official position of the European Commission.



[i]    Unique identifier (‘catalogue number’), language version(s), title and subtitle(s), number of pages, corporate author, year of publication, identifiers (ISBN, ISSN, EUR – a specific scientific collection identifier) and series (if applicable).

[ii]   This information is delivered in a separate XML file and also encoded in the XMP stream of the PDF; the PDF properties display a non-coded version of the corporate author and of the language version(s).

[iii]  1 project coordinator, 6 quality controllers, 1 scanning operator and 1 bookbinder

[iv]  The second copy is now stored 35 kilometres away.

[v]   Publications without identifiers, single copies, bad condition, bad prints etc.

[vi]  A script verifies the metadata and, if necessary, replaces old information with new.
