About the contributor
PDF Association

Mission Statement: To promote Open Standards-based electronic document implementations using PDF technology through education, expertise and shared experience for stakeholders worldwide.
What is Google parsing?


Google filetype search for Opinion

As we noted, the Sophos blog has a long piece about modern-day link-farming with PDF documents. Less scrupulous marketers have discovered that Google trusts PDF documents more than HTML pages; they’ve been “poisoning Google search results” accordingly.

The notion that authors of PDF documents are innately purer of heart than authors of HTML pages is doubtless being re-evaluated right now, especially since PDF files make up an enormous proportion of important web content, and interest in PDF continues to grow.

Apart from tweaking search algorithms so that PDF files aren’t receiving undue credit just because they are PDF files, what should Google (or other search engine developers) do about PDF? What are, for example, the benefits awaiting search-engine and other application developers that leverage high-quality PDF files?

What’s possible if you handle PDF qua PDF?

Once the PDF specification is fully supported (it’s an ISO standard; it won’t bite!), lots of things get both easier and better.

An idea for Google and other search engine developers: to really impress people with your acumen in handling PDF documents, go beyond simply treating PDF as a page-description model, and support high-quality tagged PDF!

What might be possible if search engines were savvy to PDF’s model for semantics and logical reading order?

  • Inputs for the indexer could be mapped to HTML constructs complete with semantic information (headings, tables, lists, alt. text, etc.) and a reliable logical reading order. On its own, this change would dramatically enhance search functionality with PDF, enabling deep, rich indexing and reporting, such as:
    • leveraging headings and other semantics to understand document structure and content
    • locating all content in a specific language in a multilingual document
    • utilizing author-provided alternative text for images
    • reliably extracting tabular data from tables
    • reducing reliance on heuristics by reusing the author’s intent with confidence
    • delivering comprehensive accessibility solutions for PDF content
  • In principle, the same algorithms that detect and defeat link farms in HTML would be much better able to detect them in PDF documents, especially tagged PDF.
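
As a rough illustration of the mapping described above, here is a minimal sketch in Python. It assumes the structure tree has already been parsed into a hypothetical `StructElem` class (this is not a real library type, and reading the tree out of an actual PDF file is out of scope). The structure type names (`H1`, `P`, `L`, `LI`, `Table`, `Figure`) are PDF standard structure types; the HTML element chosen for each is a simplification, and real list items, for example, carry `Lbl` and `LBody` children that this sketch ignores.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List, Optional

# Partial, simplified mapping from PDF standard structure types to HTML elements.
TAG_MAP = {
    "Document": "article", "Sect": "section", "P": "p",
    "H1": "h1", "H2": "h2", "H3": "h3",
    "L": "ul", "LI": "li",
    "Table": "table", "TR": "tr", "TH": "th", "TD": "td",
    "Figure": "img",
}


@dataclass
class StructElem:
    """Hypothetical, already-parsed structure element (illustration only)."""
    type: str                    # structure type (the /S entry), e.g. "H1"
    text: str = ""               # text of the element, in logical reading order
    alt: Optional[str] = None    # author-provided alternative text (/Alt)
    lang: Optional[str] = None   # natural language (/Lang), inherited by kids
    kids: List["StructElem"] = field(default_factory=list)


def to_html(elem: StructElem) -> str:
    """Serialize a structure element and its kids as plain HTML."""
    tag = TAG_MAP.get(elem.type, "div")  # unknown types fall back to <div>
    lang = f' lang="{escape(elem.lang)}"' if elem.lang else ""
    if tag == "img":
        return f'<img alt="{escape(elem.alt or "")}"{lang}>'
    inner = escape(elem.text) + "".join(to_html(kid) for kid in elem.kids)
    return f"<{tag}{lang}>{inner}</{tag}>"


def text_in_language(elem: StructElem, lang: str,
                     inherited: Optional[str] = None) -> List[str]:
    """Collect text whose effective language equals `lang`."""
    current = elem.lang or inherited
    found = [elem.text] if elem.text and current == lang else []
    for kid in elem.kids:
        found.extend(text_in_language(kid, lang, current))
    return found


# A tiny multilingual sample document.
doc = StructElem("Document", lang="en", kids=[
    StructElem("H1", "Annual report"),
    StructElem("P", "Guten Tag", lang="de"),
    StructElem("Figure", alt="Sales chart"),
])

print(to_html(doc))
print(text_in_language(doc, "de"))
```

No heuristics are involved: the headings, the language of each fragment and the image description all come straight from what the author tagged.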

Why not favor tagged PDF over plain page-description PDF?

Perhaps Google should indeed favor tagged PDF (and especially files claiming conformance with PDF/UA), just as it does responsive websites, and for essentially the same reasons.
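
To make the distinction concrete, here is a hedged sketch of how a crawler might classify a file before deciding how much weight to give it. It assumes the document catalog has already been parsed into a plain Python `dict` and that the XMP packet is available as a string; both inputs, the function names and the three labels are hypothetical. The checks themselves rely on two conventions: a tagged PDF sets `Marked` to true in the catalog's `MarkInfo` dictionary and has a `StructTreeRoot`, and a file claiming PDF/UA conformance declares `pdfuaid:part` in its XMP metadata. Neither check proves that the tagging is any good.

```python
import xml.etree.ElementTree as ET
from typing import Optional

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFUAID_NS = "http://www.aiim.org/pdfua/ns/id/"


def is_tagged(catalog: dict) -> bool:
    """True if the catalog declares marked content and has a structure tree."""
    mark_info = catalog.get("MarkInfo", {})
    return bool(mark_info.get("Marked")) and "StructTreeRoot" in catalog


def pdfua_part(xmp_packet: str) -> Optional[str]:
    """Return the declared pdfuaid:part value, or None if there is none."""
    root = ET.fromstring(xmp_packet)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        # XMP allows a simple property as an attribute or as a child element.
        value = desc.get(f"{{{PDFUAID_NS}}}part")
        if value is None:
            child = desc.find(f"{{{PDFUAID_NS}}}part")
            value = child.text if child is not None else None
        if value:
            return value.strip()
    return None


def structure_signal(catalog: dict, xmp_packet: Optional[str] = None) -> str:
    """Classify a file as 'untagged', 'tagged' or 'claims PDF/UA'."""
    if not is_tagged(catalog):
        return "untagged"
    if xmp_packet and pdfua_part(xmp_packet):
        return "claims PDF/UA"
    return "tagged"


# Sample inputs for illustration only.
SAMPLE_CATALOG = {"MarkInfo": {"Marked": True}, "StructTreeRoot": {}}
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdfuaid="http://www.aiim.org/pdfua/ns/id/">
      <pdfuaid:part>1</pdfuaid:part>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

print(structure_signal(SAMPLE_CATALOG, SAMPLE_XMP))
```

A claim of conformance is only a claim, so a real ranking system would presumably treat it as one signal among many rather than as proof of quality.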

Benefits beyond search engines

Users’ experience of PDF on the web

Although PDF is a page description format, it can include all the necessary instructions to allow consuming software to make other choices. Supporting tagged PDF (ISO 32000-1:2008, 14.8, download it for free), by itself, would enable fairly dramatic new features in browsers.

Accurate abstraction of tagged PDF’s content to vanilla HTML, much as callas’s pdfGoHTML does today (sadly, it requires Adobe Acrobat), would facilitate total flexibility in using tagged PDF on mobile devices. Apple’s iOS browser, Safari, effectively does this today on some HTML pages with “Reader View” – why not also for PDF?

Beyond the browser

Besides improved indexing for search and the ability to reliably reuse PDF content in web browsers, complete support for PDF technology would deliver substantial value to content management systems and end users alike through features such as:

  • Form data (AcroForm or XFA);
  • XMP metadata for both documents and page-objects (per image, etc);
  • Embedded files;
  • Annotations;
  • Encryption and digital signature features;
  • …and much more.

It’s all ISO-standardized, and thus, inherently interoperable.
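
XMP metadata is a good example of what that standardization buys. An XMP packet is RDF/XML, so once the packet has been extracted from the file (not shown here), a generic XML parser can read it. The sketch below uses only Python's standard library to pull the Dublin Core title and creator out of a sample packet; the sample content and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def titles_by_language(xmp_packet: str) -> Dict[str, str]:
    """Map each language tag to its dc:title (a language alternative array)."""
    root = ET.fromstring(xmp_packet)
    titles = {}
    for li in root.findall(".//dc:title/rdf:Alt/rdf:li", NS):
        titles[li.get(XML_LANG, "x-default")] = (li.text or "").strip()
    return titles


def creators(xmp_packet: str) -> List[str]:
    """Return the dc:creator entries in their stored order."""
    root = ET.fromstring(xmp_packet)
    items = root.findall(".//dc:creator/rdf:Seq/rdf:li", NS)
    return [(li.text or "").strip() for li in items]


# Sample packet for illustration only.
SAMPLE_PACKET = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Example title</rdf:li>
          <rdf:li xml:lang="de">Beispieltitel</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:creator>
        <rdf:Seq><rdf:li>Example Author</rdf:li></rdf:Seq>
      </dc:creator>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

print(titles_by_language(SAMPLE_PACKET))
print(creators(SAMPLE_PACKET))
```

The same approach applies to metadata attached to individual page objects, such as a single image, since those packets use the same XMP serialization.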

Conclusion

PDF is here to stay, and tagged PDF offers tremendous advantages for both search and re-use applications. It’s high time that search engine, browser and other application developers decided to think again about the crusty old format users have loved for over 20 years.

For developers who need to interface with PDF, a 2-day technical education conference, October 19-20, San Jose, CA

 

