Mimotek Structuriser is a suite of tools that extracts content at an article level from newspaper or magazine PDF files.
The workflow involves three stages:
- An automatic segmentation tool processes the PDF page. It uses visual cues to make a best guess at how the page is divided into articles. The results are added to the PDF file to produce a structured PDF file.
- Since the first stage may not have resulted in totally correct segmentation, the structured PDF page is displayed in an editor for manual correction. The corrected structure is written back into the structured PDF file.
- The content is exported from the structured PDF file in whatever format is required: for example, text as XML, images as JPEG and complete articles as individual PDF files.
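As a rough illustration of the export stage, the sketch below models one article and writes its text out as XML. The `Article` class, the `article_to_xml` function and the element names are all hypothetical, chosen for this example only; they do not describe the structured PDF format or the XML that Structuriser actually produces.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Article:
    """One article on a page (hypothetical model, not the structured PDF format)."""
    article_id: str
    headline: str
    body: str
    # Bounding boxes (x0, y0, x1, y1) in page units; an article may span several.
    regions: list = field(default_factory=list)


def article_to_xml(article: Article) -> str:
    """Serialise an article to an XML string using made-up element names."""
    root = ET.Element("article", id=article.article_id)
    ET.SubElement(root, "headline").text = article.headline
    ET.SubElement(root, "body").text = article.body
    for x0, y0, x1, y1 in article.regions:
        # One <region> element per rectangle that the article occupies.
        ET.SubElement(root, "region", x0=str(x0), y0=str(y0), x1=str(x1), y1=str(y1))
    return ET.tostring(root, encoding="unicode")
```

Keeping the regions alongside the text reflects the idea that an article is a set of areas on the page as well as its content, which is what manual correction in an editor would adjust.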
There are two main software components:
Mimotek Structuriser Server
This is an automatic process that monitors a hot folder and processes any PDF file that it finds there. Its primary purpose is to segment PDF pages into articles, to produce a structured PDF file.
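The hot-folder pattern described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the server's implementation: `process_hot_folder`, the `done_folder` argument and the `handler` callback are hypothetical names, and moving each file once it has been handled is one possible way to avoid processing it twice.

```python
from pathlib import Path
import shutil


def process_hot_folder(hot_folder: Path, done_folder: Path, handler) -> list:
    """Process every PDF currently in hot_folder, then move it to done_folder.

    `handler` is a hypothetical callback standing in for the real work
    (for example page segmentation); it receives the path of each PDF.
    """
    done_folder.mkdir(parents=True, exist_ok=True)
    processed = []
    # Sort for a deterministic order; match the extension case-insensitively.
    pdfs = sorted(p for p in hot_folder.iterdir()
                  if p.is_file() and p.suffix.lower() == ".pdf")
    for pdf in pdfs:
        handler(pdf)
        # Moving the file out of the hot folder stops it being picked up again.
        shutil.move(str(pdf), str(done_folder / pdf.name))
        processed.append(pdf.name)
    return processed
```

A long-running monitor would call this function in a loop with a short pause between passes, and would also need to check that a file has finished being copied before handling it.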
In addition to its main use for page segmentation, the server can also be configured to perform one or more of the following actions on the PDF file:
- Split multi-page PDFs into single page files.
- Convert between single newspaper pages, readers’ spreads and printers’ spreads.
- Optimise the size of a PDF file by down-sampling images and applying appropriate compression techniques.
- Normalise a page by removing any inbuilt rotation.
- Extract content from a structured PDF file.
- Rasterise a page to generate a bitmap (for example, for use as a page thumbnail).
- Adjust page margins.
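To make the difference between readers' spreads and printers' spreads concrete, the sketch below computes the page pairings for each. It assumes a folded, nested publication (as with a typical newspaper section) whose page count is a multiple of four; the function names are hypothetical and this is not a description of how the server performs the conversion.

```python
from typing import Optional


def readers_spreads(page_count: int) -> list:
    """Pair pages as a reader sees them: (None, 1), (2, 3), (4, 5), ..."""
    spreads = [(None, 1)]  # the front page stands alone on the right
    for left in range(2, page_count + 1, 2):
        right: Optional[int] = left + 1 if left + 1 <= page_count else None
        spreads.append((left, right))
    return spreads


def printers_spreads(page_count: int) -> list:
    """Pair pages as they sit side by side on each printed sheet side."""
    if page_count <= 0 or page_count % 4 != 0:
        raise ValueError("page_count must be a positive multiple of 4")
    spreads = []
    low, high = 1, page_count
    while low < high:
        # Outer side of the sheet: highest remaining page left, lowest right.
        spreads.append((high, low))
        # Inner side of the same sheet: next page left, next-highest right.
        spreads.append((low + 1, high - 1))
        low += 2
        high -= 2
    return spreads
```

For an eight-page publication, `readers_spreads(8)` gives `(None, 1), (2, 3), (4, 5), (6, 7), (8, None)`, while `printers_spreads(8)` gives `(8, 1), (2, 7), (6, 3), (4, 5)`. Only the centre spread is the same in both arrangements.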
Mimotek Structuriser ClipEdit
The accuracy of the segmentation calculated automatically by Structuriser Server depends on the details of the page design. For a regular page, the segmentation is generally correct, but on more complex pages it may be difficult to identify articles accurately without manual intervention.
Structuriser ClipEdit allows an operator to view, and if necessary edit, the result of the segmentation applied by Structuriser Server. ClipEdit reads the structured PDF files that have been created by Structuriser Server, and saves any edits applied by the operator back into the structured PDF file.
Learn more at: http://www.mimotek.com/index.php/products/