
Deriving HTML from PDF

A usage specification for tagged ISO 32000-2 (PDF 2.0) files, developed by the PDF Association’s Deriving HTML from PDF Technical Working Group (TWG).

From the introduction…

In the modern world of small devices, IoT and connected systems, where interchange and reuse of data is critical, it is reasonable to question the continued relevance of PDF’s core value proposition. In particular, search engines, machine learning and artificial intelligence systems focus on accessing information contained in documents over visual representation. In other cases, document producers wish to deliver data in a form that is suitable for automated processing while using a PDF file as a record for trust purposes. End users want electronic documents that adapt smoothly to viewing on diverse small devices.

By describing the algorithm that produces conforming HTML from a tagged PDF, this document shows how well-tagged PDF documents, containing both traditional fixed-layout content and the semantic structures leveraged by modern devices and software, can be reliably and consistently reused as HTML to support better user experiences and renew PDF’s value proposition.

HTML was chosen as a derivation target because HTML is consumed on all platforms and supported by all major vendors. With small modifications, developers can use this document to export content from well-tagged PDF to any format.

Lead author: Roman Toda (Digital Documents)

Contributors: Boris Doubrov (Dual Lab), Olaf Drümmer (callas software), Matthew Hardy (Adobe), Duff Johnson (PDF Association), Leonard Rosenthol (Adobe)


RESOURCE INFO


Comments on this document are welcome; please email derivation@pdfa.org.

Published by the PDF Association

Published: June 11, 2019

