Text Extraction from PDF that maintains the PDF text layout

Aaron_Gravesdale · November 29, 2010, 5:25pm

Q: Our company is looking for a PDF engine to process PDF files in
our software product.
The features we need:

1. Unicode text extraction with the same layout as in the original PDF;
2. Adding a Unicode line at a particular position in the PDF
(usually the bottom) as a “signature”;
3. Sending the “signed” document to a specified printer.

As far as I can see, the 2nd and 3rd features are present in your SDK,
but I still cannot figure out how to implement the 1st.

Can you please let me know if your PDF text extractor can give the
same layout as in the original PDF?

A: A plain text document (e.g. in Notepad) does not have a concept of
variable font size, relative placement, or overlapping text.

As a result, it is in general impossible to preserve the layout
exactly when converting PDF to plain text. Approximation methods can
work for some classes of documents.

The simplest approach is to use the TextExtractor class, as shown in
the TextExtract sample project:
(http://www.pdftron.com/pdfnet/samplecode.html#TextExtract)

TextExtractor reconstructs PDF text and can be used to traverse all
content on a page block-by-block, line-by-line, and word-by-word.
TextExtractor also provides positioning and style information (e.g.
font, color) for each line, word, or glyph. You can use this
information directly if your output format supports text positioning
(e.g. as shown in the XML-related code section of the TextExtract
sample). Alternatively, if you are exporting to a plain text file and
need to approximate the layout, you could use the word positions to
insert extra space characters between words.
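
As a rough illustration of the space-padding idea, here is a minimal Python sketch. It does not call the SDK: it assumes you have already collected each word as an (x, y, text) tuple from the extractor's per-word positions, with x measured from the left edge and y from the top of the page, both in points (PDF coordinates normally start at the bottom, so y would need flipping first). The function name layout_words and the char_width and line_height parameters are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def layout_words(words, char_width=6.0, line_height=12.0):
    """Approximate a page layout on a fixed-width character grid.

    words: iterable of (x, y, text) tuples; x is the distance from the
    left edge and y the distance from the top of the page, in points.
    char_width and line_height are assumed average sizes, not values
    read from the PDF.
    """
    # Group words into text rows by snapping y to the nearest grid line.
    rows = defaultdict(list)
    for x, y, text in words:
        rows[round(y / line_height)].append((x, text))
    if not rows:
        return ""

    lines = []
    for row in range(min(rows), max(rows) + 1):
        line = ""
        # Place words left to right; rows with no words become blank lines.
        for x, text in sorted(rows.get(row, [])):
            col = int(x / char_width)
            if line and col <= len(line):
                # Words would collide: keep one separating space instead.
                col = len(line) + 1
            line += " " * (col - len(line)) + text
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [(0, 0, "Name"), (60, 0, "Qty"), (0, 12, "Widget"), (60, 12, "3")]
    print(layout_words(sample))
```

The result only lines up when viewed in a monospaced font, and documents with mixed font sizes or overlapping text will still need tuning of the two grid parameters.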