Text extraction from PDF with a correct reading order

Ivanho · September 20, 2012, 8:41pm

Q:

We are looking for a solution for parsing PDF files in a .NET 4 (C#) application. The idea is to convert each PDF page to text first. What we need is text that follows the order of the elements in the PDF page. As far as I can see, PDFNet SDK has a TextExtractor class for getting text from a PDF. However, it produces text in the order of “flows” rather than in the native order of the elements. For example, the attached PDF file is converted into the following text:

Enclosure 4.1

Sequence Startup of the Air System

(Reference Use)

1. Limits and Precautions

2. Initial Conditions

End of Enclosure

ASOP/4/1200/05

Page 1 of 1

Unit 4

The text above does not fit our parsing algorithm. To get the proper text, we can read the same file using the low-level API (the ElementReader class) and then combine the elements ourselves, which gives text similar to the following (the only difference is the order of the lines, but that order is important to us):