XMP Metadata in PDF/A – European Patent Documents in PDF/A

Up to 4,000 new patent applications and specifications are published by the European Patent Office every week. Since its foundation in 1978, the European Patent Office has published around 3.3 million patent documents. The global intellectual property database espacenet, which stores all electronically available patent documents worldwide, currently contains around 80 million patent documents and grows by more than 2 million documents a year. Could PDF/A be a solution for them?

June 19, 2009


The European Patent Office (EPO) publishes patent documents in XML format with embedded images in TIFF Fax G4, and additionally in PDF format. In accordance with the European Patent Convention, new patent applications and specifications are published every Wednesday at 2 p.m. on data carriers and online via the publication server of the EPO (http://www.epo.org/publication-server). The number of new weekly publications depends on the application and granting activity, but it can reach up to 4,000 documents. Since its foundation in 1978, when EP No. 1 was published on the 12th of December, the European Patent Office has published around 3.3 million patent documents. The global intellectual property database espacenet, which stores all electronically available patent documents worldwide, currently contains around 80 million patent documents and grows by more than 2 million documents a year.

Publication Standards

The World Intellectual Property Organization (WIPO) defines recommendations for patent documentation standards worldwide. Currently, the XML coding of patent documentation is laid down in the WIPO standard ST.36. This definition is well accepted and used around the world for patent publication documents. A patent document consists of a bibliography, description, claims, drawings and, optionally, a search report. Patent offices handle the details of the XML coding differently.

EPO Publication Rules

The EPO’s document type definition, which follows WIPO ST.36, contains nearly 1,000 elements for all kinds of patent markup. More than half of these elements serve the many different kinds of bibliographic information.

The European Patent Office codes all textual content of a patent document in XML; from mid-2009 onward this includes the search report. Tables are marked up according to the OASIS exchange model, and mathematical formulae are coded in MathML in addition to being stored as the corresponding image in TIFF Fax G4.

The bibliographic information is marked up in so-called B-tags. As raw XML code cannot be inserted as is into XMP metadata, the European Patent Office has developed a proposal for a PDF profile that defines which elements of the existing XML metadata should be transferred into which XMP property of a patent-related XMP schema. This PDF profile has been presented to and discussed with our member states in Europe. Implementations are still limited, but the profile might be introduced within the next few years. Hopefully, a WIPO recommendation could be based on this profile.
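The selection step described above, picking a few elements out of the XML bibliography and flattening them into named properties, can be sketched as follows. This is only an illustration under stated assumptions: the sample fragment is invented and merely imitates the B-tag naming style, and the `PROFILE` mapping and the property names are hypothetical, not taken from the EPO proposal.

```python
import xml.etree.ElementTree as ET

# Invented bibliographic fragment. The element names only imitate the
# B-tag style; exact names and nesting are assumptions for this sketch.
SAMPLE = """
<biblio>
  <B110>1234567</B110>
  <B130>A1</B130>
  <B140><date>20090617</date></B140>
  <B540><B541>en</B541><B542>Example device</B542></B540>
  <B999>not covered by the profile</B999>
</biblio>
"""

# Hypothetical profile: path in the source XML -> flat property name.
PROFILE = {
    "B110": "publicationNumber",
    "B130": "kindCode",
    "B140/date": "publicationDate",
    "B540/B542": "title",
}


def select_fields(xml_text, profile):
    """Return only the fields named in the profile as a flat dict."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for path, prop in profile.items():
        node = root.find(path)
        # Skip elements that are missing or empty in this document.
        if node is not None and node.text and node.text.strip():
            result[prop] = node.text.strip()
    return result


fields = select_fields(SAMPLE, PROFILE)
print(fields)
```

Anything not listed in the mapping is simply left out, which mirrors the idea that the profile carries a chosen subset of the bibliography rather than the raw XML.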

PDF/A Profile

The current PDF/A profile defines which metadata item should go into which XMP property. Where existing XMP metadata schemas such as Dublin Core are not sufficient for patent-specific metadata, the PDF/A profile proposes fields in a patent-specific XMP metadata schema.

Out of the more than 300 different bibliographic metadata tags of patent documents, around 20 of the most common and important items are represented in the patent-specific XMP metadata schema. Examples are the publication number and date, the application number and date, priority numbers and dates, kind codes, applicant names, inventor names, proprietor names, representative names, international classifications, titles and abstracts.
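To make the idea of a patent-specific schema more concrete, the following sketch builds a small XMP fragment in RDF/XML with Python's standard library. The namespace URI `http://example.org/ns/patent/1.0/`, the `pat` prefix and all property names are placeholders invented for this example; the actual schema of the EPO proposal is not reproduced here. The sketch also leaves out the `xpacket` wrapper and the extension schema description that PDF/A expects for custom schemas.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
X = "adobe:ns:meta/"
PAT = "http://example.org/ns/patent/1.0/"  # hypothetical namespace

for prefix, uri in (("rdf", RDF), ("x", X), ("pat", PAT)):
    ET.register_namespace(prefix, uri)


def build_xmp(simple_fields, inventors):
    """Serialize simple and multi-valued patent fields as an XMP fragment."""
    xmpmeta = ET.Element(f"{{{X}}}xmpmeta")
    rdf = ET.SubElement(xmpmeta, f"{{{RDF}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF}}}Description", {f"{{{RDF}}}about": ""})
    # Single-valued properties become plain child elements.
    for name, value in simple_fields.items():
        ET.SubElement(desc, f"{{{PAT}}}{name}").text = value
    # Multi-valued property: an ordered array (rdf:Seq) of names.
    inventor_prop = ET.SubElement(desc, f"{{{PAT}}}inventorName")
    seq = ET.SubElement(inventor_prop, f"{{{RDF}}}Seq")
    for person in inventors:
        ET.SubElement(seq, f"{{{RDF}}}li").text = person
    return ET.tostring(xmpmeta, encoding="unicode")


# Invented sample values.
xmp = build_xmp(
    {"publicationNumber": "EP1234567", "kindCode": "A1",
     "publicationDate": "2009-06-17"},
    ["Jane Example", "John Sample"],
)
print(xmp)
```

Fields such as inventor or applicant names can occur several times per document, which is why the sketch models them as an array rather than a single string.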

Difficulties in practice

In practice, difficulties arise because the meaning of a metadata field can be interpreted in different ways. For example: what is the title of a patent document? The first page of a patent document contains the bibliographic information, each item associated with a code called INID, which stands for “Internationally agreed Numbers for the Identification of bibliographic Data”. The content of INID code 12, shown at the top of each patent document, reads “European Patent Application” or “European Patent Specification”, which looks like a title. INID code 54 is actually named “title” and seems more appropriate, but its content is often published in several languages. Which language should appear in the PDF’s title property? Another example: who is the dc:author of a patent document? The patent office, the applicant, the inventor? And what about dc:creator, dc:producer and dc:description?
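For the language question, it may help to recall that XMP represents dc:title as a language alternative, an rdf:Alt whose items carry an xml:lang attribute, with an "x-default" item as the fallback. The sketch below shows one conceivable way to store a title in several languages. Which language becomes the default is exactly the kind of decision a profile would have to make, so the `default_lang` parameter and the sample titles are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)


def title_lang_alt(titles, default_lang):
    """Build a dc:title language alternative from {language: title}."""
    title = ET.Element(f"{{{DC}}}title")
    alt = ET.SubElement(title, f"{{{RDF}}}Alt")
    # The x-default item comes first and repeats the chosen default title.
    default = ET.SubElement(alt, f"{{{RDF}}}li", {XML_LANG: "x-default"})
    default.text = titles[default_lang]
    # One further item per language in which the title is available.
    for lang, text in titles.items():
        ET.SubElement(alt, f"{{{RDF}}}li", {XML_LANG: lang}).text = text
    return title


# Invented sample titles.
titles = {"de": "Beispielvorrichtung", "en": "Example device",
          "fr": "Dispositif exemple"}
dc_title = title_lang_alt(titles, default_lang="en")
print(ET.tostring(dc_title, encoding="unicode"))
```

This keeps every language in the XMP metadata, but a reader that shows only a single title string still needs one language to be picked, so the open question in the text remains a matter of agreement between offices.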

Harmonization

Because the aspects discussed above are interpreted differently, the result may differ from one patent office to another. Examples are given for DE, CH, MX and EP documents, which fill the Dublin Core schema with different content. A solution might be the PDF/A profile definition. A patent-specific metadata schema avoids the misinterpretation, because patent offices share a well-defined understanding of what an applicant name is, but not of what the author of a patent document is. Application number, publication number, kind codes and so on are equally well understood. The PDF/A profile might resolve the remaining ambiguity by also defining the content of the PDF properties and of other XMP schemas such as Dublin Core.
