How do I extract text from PDF in the "reading order"?

Aaron_Gravesdale · March 4, 2009, 2:49am

Q: It sometimes appears that the method TextExtractor.GetAsText()
returns text in the order in which it is stored in the internal
physical structure of the PDF document, not in the order in which it
appears when the document is printed. I need to be able to extract the
text in "reading order" instead of PDF "layout order". Is there a way
for me to do this?
------
A: TextExtractor.GetAsText() does not return text as it is stored in
the internal physical structure of the PDF document. Instead, this
method attempts to reconstruct the "reading order". Unfortunately, this
is an inexact, error-prone process. For many PDF documents the method
returns the correct reading order; however, there will always be some
files (especially those with multi-column or scattered text) for which
the reconstructed reading order is incorrect. If you send us a sample
file (to support at pdftron), we will take a look at it and try to
improve the text recognition algorithm.

Please keep in mind that using TextExtractor you can also access text
flows, blocks, lines, and words (along with their positioning and
styling information). You can use this information to build your own
text reading order.
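
As a rough illustration of building your own reading order from word
positions, here is a minimal Python sketch. It does not call the PDFNet
API: the Word class and the reading_order() function are hypothetical
names introduced only for this example, and you would fill the list
from whatever words and bounding boxes TextExtractor gives you. It
assumes a single column and PDF-style coordinates (y grows upward), so
multi-column pages would first need to be split into columns.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical container for one extracted word and its bounding box."""
    text: str
    x1: float  # left
    y1: float  # bottom
    x2: float  # right
    y2: float  # top


def _mid_y(w: Word) -> float:
    return (w.y1 + w.y2) / 2


def reading_order(words: List[Word], line_tol: float = 0.5) -> str:
    """Group words into lines by vertical position, then read the lines
    top to bottom and the words in each line left to right.

    line_tol is the allowed distance between vertical midpoints, as a
    fraction of the taller of the two words being compared.
    """
    lines: List[List[Word]] = []
    # Visit words from the top of the page downward (y grows upward).
    for w in sorted(words, key=lambda w: -w.y2):
        for line in lines:
            ref = line[0]  # first word placed on this line
            limit = line_tol * max(w.y2 - w.y1, ref.y2 - ref.y1)
            if abs(_mid_y(w) - _mid_y(ref)) <= limit:
                line.append(w)
                break
        else:
            # No existing line is close enough vertically: start a new one.
            lines.append([w])

    # Order lines top to bottom, and words within a line left to right.
    lines.sort(key=lambda line: -_mid_y(line[0]))
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x1))
        for line in lines
    )


if __name__ == "__main__":
    sample = [
        Word("world", 60, 700, 100, 712),
        Word("Hello", 10, 700, 50, 712),
        Word("Second", 10, 680, 60, 692),
        Word("line", 65, 681, 90, 693),
    ]
    print(reading_order(sample))
```

The tolerance is relative to word height so that the same setting
works for small and large fonts; words whose baselines differ
slightly (as "Second" and "line" do in the sample) still end up on the
same line.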