Text Extraction using PDFNet

Aaron_Gravesdale · March 20, 2008, 9:25pm

Q: We want to select a particular piece of text within a PDF page and
identify the elements that make it up. After writing the code and
running it, we observed the following:

There appears to be no logic to how text elements are defined in a PDF.
For example, in a test PDF the string 'Financial Consultation' comes
back as two elements, 'F' and 'inancial Consultation'. Likewise, we
could not discern any pattern in how PDFNet forms its elements.
-----
A: Based on your description, I assume you are using the ElementReader
class to extract 'Element' objects from the page. In that case,
ElementReader returns the content exactly as it is defined in the PDF
page content stream (i.e. each e_text element corresponds directly to a
Tj operator). In the PDF format, text objects are usually _not_ cleanly
organized into words, sentences, paragraphs, etc. Instead, an e_text
element represents a 'text run': a sequence of glyphs drawn with the
same font and graphics attributes. For example, a single word may
consist of letters in several fonts and styles, in which case each
letter would be a separate text run. You may also encounter text runs
that contain multiple words separated by spaces. The PDF format also
does not guarantee that text is stored in reading order, so you may
encounter cases where text is drawn from right to left or even in an
arbitrary order.
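
To make the point about text runs concrete, here is a minimal sketch
(in Python, not PDFNet code) of the kind of work needed to turn loose
runs back into readable lines. The TextRun type, its fields, and the
two tolerance values are assumptions made up for this illustration;
real extraction has to handle rotation, font metrics, and much more.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """Hypothetical stand-in for one text run (not a PDFNet type)."""
    text: str
    x: float      # left edge of the run
    y: float      # baseline position (PDF y grows upward)
    width: float  # horizontal extent of the run


def runs_to_lines(runs, y_tol=2.0, gap_tol=1.0):
    """Group runs by baseline, order them left to right, and join them."""
    lines = []  # each entry is [baseline_y, runs_on_that_line]
    # Visit runs from the top of the page down, regardless of draw order.
    for run in sorted(runs, key=lambda r: (-r.y, r.x)):
        if lines and abs(lines[-1][0] - run.y) <= y_tol:
            lines[-1][1].append(run)
        else:
            lines.append([run.y, [run]])

    result = []
    for _, line_runs in lines:
        line_runs.sort(key=lambda r: r.x)
        text = line_runs[0].text
        for prev, cur in zip(line_runs, line_runs[1:]):
            gap = cur.x - (prev.x + prev.width)
            # Only a gap wider than gap_tol is treated as a word break.
            text += (" " if gap > gap_tol else "") + cur.text
        result.append(text)
    return result


# Runs listed out of reading order, as a content stream might store them.
runs = [
    TextRun("inancial Consultation", 16.0, 700.0, 120.0),
    TextRun("Summary", 10.0, 680.0, 50.0),
    TextRun("F", 10.0, 700.0, 6.0),
]
print(runs_to_lines(runs))  # ['Financial Consultation', 'Summary']
```

The 'F' and 'inancial Consultation' runs touch each other, so they are
joined without a space. This is only a toy version of the reassembly
that a dedicated text extractor performs for you.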

The most straightforward way to extract words and text from text runs
is to use the pdftron.PDF.TextExtractor class (as shown in the
TextExtract sample project:
http://www.pdftron.com/net/samplecode.html#TextExtract).
TextExtractor will assemble words, lines, and paragraphs, remove
duplicate strings, reconstruct the reading order, etc. Using
TextExtractor you can also obtain the bounding box of each word, line,
or paragraph (along with style information such as font, color, etc).
You can then use these bounding boxes to locate the corresponding text
elements with ElementReader.
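
The bounding-box lookup described above can be sketched as a simple
rectangle overlap test. The Box type, the elements_covering helper,
and the 0.5 threshold below are hypothetical names and values chosen
for this illustration (in Python); they are not part of the PDFNet
API, and you would fill the boxes from whatever your extraction code
returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned rectangle (not a PDFNet type)."""
    x1: float
    y1: float
    x2: float
    y2: float


def area(box):
    return (box.x2 - box.x1) * (box.y2 - box.y1)


def overlap_area(a, b):
    """Area of the intersection of two boxes, or 0.0 if they are disjoint."""
    w = min(a.x2, b.x2) - max(a.x1, b.x1)
    h = min(a.y2, b.y2) - max(a.y1, b.y1)
    return w * h if w > 0 and h > 0 else 0.0


def elements_covering(word_box, element_boxes, min_fraction=0.5):
    """Indices of elements that overlap the word box substantially.

    The overlap is measured against the smaller of the two boxes, so a
    one-letter run inside the word and a long run that merely contains
    the word can both match.
    """
    hits = []
    for i, elem in enumerate(element_boxes):
        smaller = min(area(word_box), area(elem))
        if smaller <= 0:
            continue  # skip degenerate boxes
        if overlap_area(word_box, elem) / smaller >= min_fraction:
            hits.append(i)
    return hits


# Word box for 'Financial', and boxes for three text elements on the page.
word = Box(10, 695, 70, 710)
elements = [
    Box(10, 695, 16, 710),   # 'F'
    Box(16, 695, 136, 710),  # 'inancial Consultation'
    Box(10, 675, 60, 690),   # 'Summary' on the next line
]
print(elements_covering(word, elements))  # [0, 1]
```

Measuring against the smaller box is a design choice for this sketch;
a stricter or looser rule may suit your documents better.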

If the input PDF document is a 'tagged PDF' (i.e. it contains logical
structure), you could extract the content using the PDFNet API for
logical structure (in the pdftron.PDF.Struct namespace). As an example,
you may want to take a look at the LogicalStructure sample
(http://www.pdftron.com/net/samplecode.html#LogicalStructure).
Unfortunately, most PDFs out there are not 'tagged' or do not contain
useful logical structure.