Metadata has been described as the business card of a particular digital document. Metadata often comprises a set of properties, where each property has specific meaning in the context of the document, such as the title and creator of a PDF document or the GPS position where an image was taken. Metadata plays a crucial role for handling digital data during its lifetime.
The common format for metadata, the Extensible Metadata Platform (XMP), is based on XML. It was designed by Adobe in 2001 and standardized as ISO 16684-1:2012. XMP metadata travels with the file and can be embedded in many file formats besides PDF, such as TIFF or JPEG. With XMP, metadata can even survive format conversions, e.g. from scanned TIFF to PDF. XMP is implemented in all Adobe publishing products and supported by dozens of independent software vendors and user groups.
Metadata properties are grouped in schemas. In addition to predefined schemas (e.g. Dublin Core), custom schemas can be defined to cover company- or industry-specific metadata requirements. There are various ISO standards which specify PDF subsets for certain application domains, such as archiving (PDF/A) or printing (PDF/X-4 and PDF/X-5). Except for the older standards PDF/X-1 and PDF/X-3, they all provide for XMP metadata, and most of them make it mandatory.
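To make the idea of schemas more concrete, the following minimal sketch parses a small hand-written XMP document and reads two Dublin Core properties from it. It uses only Python's standard library, not a PDFlib product. The sample title and creator, as well as the helper name read_dublin_core, are made up for illustration; the namespace URIs are the ones commonly used for XMP, RDF and Dublin Core.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the XPath-style queries below.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample for illustration only (values are hypothetical).
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title>
    <rdf:Alt><rdf:li xml:lang="x-default">Annual Report</rdf:li></rdf:Alt>
   </dc:title>
   <dc:creator>
    <rdf:Seq><rdf:li>Jane Doe</rdf:li></rdf:Seq>
   </dc:creator>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def read_dublin_core(xmp_text):
    """Return the first dc:title and all dc:creator entries found in xmp_text."""
    root = ET.fromstring(xmp_text)
    title = None
    creators = []
    # Properties may be spread over several rdf:Description elements.
    for desc in root.findall(".//rdf:Description", NS):
        li = desc.find("dc:title/rdf:Alt/rdf:li", NS)
        if li is not None and title is None:
            title = li.text
        for item in desc.findall("dc:creator/rdf:Seq/rdf:li", NS):
            creators.append(item.text)
    return {"title": title, "creators": creators}


if __name__ == "__main__":
    print(read_dublin_core(SAMPLE_XMP))
```

The sketch only handles the element-based serialization shown in the sample. Real-world XMP may also express simple properties as XML attributes or use language alternatives beyond x-default, which a full XMP parser has to account for.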
PDFlib products offer extensive support for XMP in PDF:
PDFlib product family: With PDFlib you can create PDF documents with XMP metadata on document, page or image level. PDFlib adds user-friendly support for XMP extension schemas according to PDF/A without any struggle with XMP internals. Advanced users can directly feed all predefined XMP metadata schemas to PDFlib to be included in the generated PDF documents. The output is guaranteed to conform to PDF/A. Since PDFlib is available on all relevant operating systems and does not require any third-party products, it brings XMP support to all platforms.
Injecting XMP in PDF with PLOP and PLOP DS: With PDFlib PLOP and PLOP DS you can insert XMP in existing PDF documents in case PDF documents do not contain all required metadata properties. This is particularly useful in PDF/A workflows since XMP support in PLOP and PLOP DS is PDF/A-aware. For example, custom XMP with extension schemas can be injected in PDF/A documents from workflows which do not support extension schemas.
Extracting XMP with pCOS: PDFlib pCOS is the PDFlib tool for retrieving all kinds of information from PDF documents. pCOS offers a simple programming method for extracting XMP metadata from PDF on document, page or image level. XMP metadata is normalized to Unicode so that you don't have to worry about encoding issues. XMP retrieval works regardless of compression, encryption, and PDF object structure. As pCOS follows the PDF object structure in all cases, the correct XMP metadata blocks are always retrieved.
Searching for XMP metadata with TET PDF IFilter: PDFlib TET PDF IFilter implements Microsoft's IFilter interface and makes XMP metadata searchable with various Microsoft and third-party desktop and enterprise search products, such as Microsoft Search, Microsoft SharePoint, or SQL Server. In addition to page contents, TET PDF IFilter indexes XMP metadata as well as standard or custom document info entries. TET PDF IFilter optionally integrates metadata in the indexed raw text. As a result, even full-text search engines without metadata support (e.g. SQL Server) can search for metadata.
PDFlib Text and Image Extraction Toolkit (TET): includes XMP in the XML that is created from PDF documents.
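The pCOS description above stresses that following the PDF object structure is what makes XMP retrieval reliable. The sketch below shows the naive alternative for contrast: scanning the raw file bytes for XMP packet markers. It is not how pCOS works and uses no PDFlib API; the function name find_plain_xmp_packets and the sample bytes are hypothetical.

```python
import re
from typing import List

# An XMP packet is wrapped in "<?xpacket begin=... ?>" and "<?xpacket end=... ?>"
# processing instructions; the pattern captures everything in between.
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>",
    re.DOTALL,
)


def find_plain_xmp_packets(data: bytes) -> List[bytes]:
    """Return the bodies of all XMP packets that are visible as plain bytes."""
    return [match.group(1).strip() for match in PACKET_RE.finditer(data)]


if __name__ == "__main__":
    # Hypothetical fragment of an unencrypted PDF with an uncompressed packet.
    sample = (
        b"%PDF-1.7\n1 0 obj\n<< /Type /Metadata /Subtype /XML >>\nstream\n"
        b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
        b'<x:xmpmeta xmlns:x="adobe:ns:meta/"/>\n'
        b'<?xpacket end="w"?>\nendstream\nendobj\n'
    )
    for packet in find_plain_xmp_packets(sample):
        print(packet.decode("utf-8"))
```

Such a scan finds nothing when the metadata stream is compressed or encrypted, and it cannot tell whether a packet belongs to the document, a page or an image, or whether it is a stale copy left behind by an earlier revision of the file. These are the cases where resolving the metadata through the PDF object structure matters.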
Thomas Merz is managing director of PDFlib GmbH, the software company he founded in 2000 in Munich (Germany). Since obtaining his master’s degree in mathematics in 1990, Thomas Merz has been occupied with computer graphics and cryptography. His membership in Adobe’s Developers Association brought him in contact with the earliest versions of Acrobat and PDF. Subsequently he authored several books on …