Text extraction from PDF with a correct reading order

Q:

We are looking for a solution for parsing PDF files in a .NET 4 (C#) application. The idea is to convert each PDF page to text first. What we need is text that follows the order of the elements on the PDF page. As far as I can see, the PDFNet SDK has a TextExtractor class for getting text from a PDF. However, it produces text in the order of “flows” rather than in the native order of the elements. For example, the attached PDF file gets converted into the following text:

Enclosure 4.1

Sequence Startup of the Air System

(Reference Use)

1. Limits and Precautions

2. Initial Conditions

End of Enclosure

ASOP/4/1200/05

Page 1 of 1

Unit 4

The text above does not fit our parsing algorithm. To get the proper text, we can read the same file using the low-level API (the ElementReader class) and then combine the elements ourselves, which gives text like the following (the only difference is the order of the lines, but that order is important to us):

Enclosure 4.1

ASOP/4/1200/05

Sequence Startup of the Air System

(Reference Use)

Page 1 of 1

Unit 4

1. Limits and Precautions

2. Initial Conditions

End of Enclosure

This text is fine for us. However, we find the low-level API quite complicated to use (in particular, we sometimes lose whitespace between elements with this approach). Moreover, the iTextSharp and Aspose.PDF libraries already include text extractors that produce text in the order we need.
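
For illustration, the combining step looks roughly like the sketch below. It is a simplified model rather than our actual code: `TextRun` and `RunJoiner` are hypothetical names (not PDFNet types), each text element is assumed to have already been read and reduced to its string, position, width, and font size, and the two thresholds are guesses. The runs are kept in the order they were read; the code only decides where to put line breaks and spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical container for one text element read from the page (not a PDFNet type).
public sealed class TextRun
{
    public readonly string Text;
    public readonly double X;        // left edge of the run
    public readonly double Y;        // baseline of the run
    public readonly double Width;    // horizontal extent of the run
    public readonly double FontSize; // used to scale the thresholds

    public TextRun(string text, double x, double y, double width, double fontSize)
    {
        Text = text ?? string.Empty;
        X = x;
        Y = y;
        Width = width;
        FontSize = fontSize;
    }
}

public static class RunJoiner
{
    // Assumed thresholds, expressed as fractions of the font size.
    private const double LineTolerance = 0.5;
    private const double SpaceThreshold = 0.2;

    // Joins runs in the order given (no sorting), inserting a line break when
    // the baseline changes and a space when the horizontal gap is large enough.
    public static string Join(IEnumerable<TextRun> runs)
    {
        var sb = new StringBuilder();
        TextRun prev = null;

        foreach (TextRun run in runs)
        {
            if (prev != null)
            {
                bool sameLine = Math.Abs(run.Y - prev.Y) < LineTolerance * prev.FontSize;
                if (!sameLine)
                {
                    sb.Append('\n');
                }
                else
                {
                    double gap = run.X - (prev.X + prev.Width);
                    bool alreadySpaced =
                        prev.Text.EndsWith(" ", StringComparison.Ordinal) ||
                        run.Text.StartsWith(" ", StringComparison.Ordinal);

                    // A visible gap with no space character in either run is
                    // exactly the case where whitespace gets lost.
                    if (gap > SpaceThreshold * prev.FontSize && !alreadySpaced)
                    {
                        sb.Append(' ');
                    }
                }
            }

            sb.Append(run.Text);
            prev = run;
        }

        return sb.ToString();
    }
}
```

With this model, two runs "Page" and "1 of 1" on the same baseline separated by a visible gap come out as "Page 1 of 1", while two adjacent runs with no gap are concatenated directly. Tuning the thresholds for every font and layout is the part we find fragile.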

So the question is: is there a simple way to get the second text (the one that keeps the native order of the text elements) with the PDFNet SDK, without using the low-level API (i.e. ElementReader)?

A: