High-level PDF logical structure extraction using PDFNet.

Aaron_Gravesdale · December 28, 2006, 8:09pm

Q:

I am working with your PDFNet component, a very impressive piece of
work indeed. I have one question: if I need to extract text from a PDF
into another format (Rich Text, ASCII, etc.), can your component
preserve the structure of the tables in the document, rather than just
extracting delimited text?
---
A:

Thank you for your compliment. PDFNet SDK (www.pdftron.com/net) can be
used to extract any information present in the document. If the PDF
document contains structure information (i.e. if it is 'tagged'),
PDFNet can also be used to extract the logical structure.
Unfortunately, PDF documents generated by most third-party tools lack
logical structure, and for those the only approach is to reconstruct
the logical structure using a document analysis technique (see
www.pdftron.com/net/faq.html#struct_01,
www.pdftron.com/net/faq.html#text_00).

Also, a new add-on module for PDFNet for document analysis is currently
in beta testing, and we will offer it as part of the SDK in the near
future. Because no document analysis approach is 'perfect', PDFNet
users will still be able to use their own implementations.

Aaron_Gravesdale · December 28, 2006, 8:59pm

If you only need to extract/write 'tagged' PDF documents we can provide
you with sample code.

Basically, you would use ElementReader
(http://www.pdftron.com/net/samplecode.html#ElementReader; or
ElementWriter for PDF creation) to extract the various PDF Elements
from the page. In particular, you will be interested in the following
element types:

- e_marked_content_begin - marks the beginning of a marked-content
sequence (BMC, BDC)
- e_marked_content_end - marks the end of a marked-content sequence (EMC)
- e_marked_content_point - designates a marked-content point (MP, DP)

If you encounter an e_marked_content_begin element, you can obtain the
marked-content property dictionary (the BDC operand) using
element.GetMCPropertyDict(). There is also an element.GetMCTag() method
for reading the tag, in case you encounter a marked-content point.
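
To make the three element types concrete, here is a minimal Python sketch that does not use PDFNet at all. It scans a simplified, already-decoded page content stream for the BMC, BDC, EMC, MP and DP operators and reports them under the element type names listed above. The function name marked_content_events, the SAMPLE stream and the toy tokenizer are assumptions made purely for illustration. A real content stream needs a proper PDF parser (which is what ElementReader provides), and this sketch only handles flat dictionaries and simple strings.

```python
import re

# Toy tokenizer (an assumption for this sketch, not a full PDF lexer):
# flat << ... >> dictionaries, (...) strings, [...] arrays, then any
# other whitespace-delimited token such as names, numbers and operators.
TOKEN = re.compile(r"<<.*?>>|\((?:\\.|[^\\)])*\)|\[.*?\]|[^\s\[\]()<>]+", re.S)

# Content-stream operators mapped to the element types named in the post.
MARKED_CONTENT_OPS = {
    "BMC": "e_marked_content_begin",
    "BDC": "e_marked_content_begin",
    "EMC": "e_marked_content_end",
    "MP": "e_marked_content_point",
    "DP": "e_marked_content_point",
}


def is_operand(token):
    """Names, strings, dictionaries, arrays and numbers are operands."""
    return token[0] in "/(<[+-." or token[0].isdigit()


def marked_content_events(stream):
    """Return (element_type, tag, properties, depth) for each marked-content operator."""
    events = []
    operands = []   # operands collected since the previous operator
    open_tags = []  # tags of marked-content sequences that are still open
    for token in TOKEN.findall(stream):
        if is_operand(token):
            operands.append(token)
            continue
        kind = MARKED_CONTENT_OPS.get(token)
        if kind == "e_marked_content_end":
            tag = open_tags.pop() if open_tags else None
            events.append((kind, tag, None, len(open_tags)))
        elif kind is not None:
            # BMC and MP carry only a tag; BDC and DP also carry properties.
            tag = operands[0] if operands else None
            props = operands[1] if len(operands) > 1 else None
            events.append((kind, tag, props, len(open_tags)))
            if kind == "e_marked_content_begin":
                open_tags.append(tag)
        operands = []  # every operator consumes the operands before it
    return events


# Hypothetical content stream fragment, written by hand for this example.
SAMPLE = """
/P << /MCID 0 >> BDC
  BT /F1 12 Tf (Hello) Tj ET
  /Marker MP
EMC
/Artifact BMC
  (footer) Tj
EMC
"""

if __name__ == "__main__":
    for element_type, tag, props, depth in marked_content_events(SAMPLE):
        print("  " * depth + element_type, tag, props)
```

The depth value shows how begin and end elements nest, which is the information you would use to rebuild the logical grouping of the page content. With PDFNet you would read the same tag and property dictionary from the element itself rather than from raw tokens.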