Deriving HTML from PDF

A usage specification for tagged ISO 32000-2 files

From the introduction…

In the modern world of small devices, IoT and connected systems, where interchange and reuse of data is critical, it is reasonable to question the continued relevance of PDF’s core value proposition. In particular, search engines, machine learning and artificial intelligence systems focus on accessing information contained in documents over visual representation. In other cases, document producers wish to deliver data in a form that is suitable for automated processing while using a PDF file as a record for trust purposes. End users want electronic documents that adapt smoothly to viewing on diverse small devices.

By describing the algorithm that produces conforming HTML from a tagged PDF, this document shows how well-tagged PDF documents, containing both traditional fixed-layout content and the semantic structures leveraged by modern devices and software, can be reliably and consistently reused as HTML to support better user experiences and renew PDF’s value proposition.

HTML was chosen as a derivation target because HTML is consumed on all platforms and supported by all major vendors. With small modifications, developers can use this document to export content from well-tagged PDF to any format.

INFO

Comments on this document are welcome; please email derivation@pdfa.org.

Published by the PDF Association
June 11, 2019


© 2019 Association for Digital Document Standards e.V.