How do I determine what clipping to apply to TextExtractor lines or words?

We have built a converter that uses PDFTron to extract information from a PDF and then uses that information to call our proprietary rendering library.

I originally did this by processing the elements of each page. That approach had shortcomings for some features related to kerning, spacing, and placement of text. We rewrote the converter to use TextExtractor for only the text parts of the page, and we still use the element processor for everything else.

I am now faced with an input PDF, produced by third-party software belonging to the client, that clips a column of text in a tabular layout to prevent that column from overwriting the adjacent column to its right.

When working with the text lines and words returned by TextExtractor, how can I determine the clipping that should be applied to each of them when I render output with our converter?

  • Lee Gillie, CCP - Online Data Processing, Inc. - Spokane, WA

The TextExtractor class is not really appropriate for rendering. For example, it simplifies the text styles and sometimes reports only the dominant style, especially at higher levels of abstraction such as the Line class. For accurate rendering you should use the ElementReader interface.
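
If you do keep TextExtractor for the text, one possible workaround is to record the clip rectangle that was active for each text run during your element pass, and then intersect each extracted word box with the clip of the run it overlaps most. The sketch below shows only that geometry, in plain Python. It does not call PDFNet at all: `Rect`, `visible_part`, and the `runs` list are hypothetical names for illustration, and it assumes rectangular clips with all boxes already in the same page coordinate space.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box with (x1, y1) lower-left and (x2, y2) upper-right."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        return (self.x2 - self.x1) * (self.y2 - self.y1)

    def intersect(self, other: "Rect") -> Optional["Rect"]:
        """Return the overlapping box, or None if the boxes do not overlap."""
        x1, y1 = max(self.x1, other.x1), max(self.y1, other.y1)
        x2, y2 = min(self.x2, other.x2), min(self.y2, other.y2)
        if x1 >= x2 or y1 >= y2:
            return None
        return Rect(x1, y1, x2, y2)


# Hypothetical record from the element pass: (text run box, clip or None).
Run = Tuple[Rect, Optional[Rect]]


def visible_part(word_box: Rect, runs: List[Run]) -> Optional[Rect]:
    """Return the part of word_box left visible by the clip of the best-matching run.

    The best match is the run whose box overlaps the word the most.
    Returns None when the word is clipped away entirely. If no run
    overlaps the word, the word is assumed to be unclipped.
    """
    best_clip: Optional[Rect] = None
    best_area = 0.0
    matched = False
    for run_box, clip in runs:
        overlap = word_box.intersect(run_box)
        if overlap is not None and overlap.area() > best_area:
            best_clip, best_area, matched = clip, overlap.area(), True
    if not matched or best_clip is None:
        return word_box  # assumption: no matching run or no clip means unclipped
    return word_box.intersect(best_clip)


# A column clipped at x = 200; the text run itself extends to x = 260.
runs = [(Rect(100, 700, 260, 712), Rect(100, 690, 200, 720))]
print(visible_part(Rect(180, 700, 230, 712), runs))
```

In the usage lines, the word straddles the right edge of the column, so only the part to the left of x = 200 survives. Non-rectangular clip paths, rotated text, and nested clips would need more than this.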