How can I read the tables or tabular structure of text stored in PDF?

Aaron_Gravesdale · November 30, 2007, 11:56pm

Q: How can I read the tables or tabular structure of text stored in
PDF?
-------
A:
If you are dealing with 'tagged PDF' (i.e. PDF with explicit logical
structure), you can use the high-level logical structure API to extract
this information.

For an example of how to use this API, please see the LogicalStructure
sample project (http://www.pdftron.com/net/samplecode.html#LogicalStructure).
Samples for C# and VB.NET will be part of the next PDFNet update. A Java
version is included in the attachment.
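
To illustrate what reading a table out of a logical structure tree
involves, here is a minimal sketch. It does not use the PDFNet API: the
Node class is a hypothetical stand-in for a structure element, and the
names StructSketch and findTables are made up for this example. It only
assumes that the tree uses the standard tagged PDF structure types Table,
TR, TH and TD, with rows optionally grouped under THead, TBody or TFoot.

```java
import java.util.ArrayList;
import java.util.List;

class StructSketch {
    // Hypothetical structure node (NOT a PDFNet class).
    static final class Node {
        final String type;  // structure type, e.g. "Table", "TR", "TD", "P"
        final String text;  // text directly under this node, may be empty
        final List<Node> kids = new ArrayList<>();

        Node(String type, String text, Node... children) {
            this.type = type;
            this.text = text;
            for (Node c : children) kids.add(c);
        }
    }

    // Returns every table found under root as a list of rows of cell strings.
    static List<List<List<String>>> findTables(Node root) {
        List<List<List<String>>> tables = new ArrayList<>();
        collectTables(root, tables);
        return tables;
    }

    private static void collectTables(Node n, List<List<List<String>>> out) {
        if (n.type.equals("Table")) {
            List<List<String>> rows = new ArrayList<>();
            collectRows(n, rows);
            out.add(rows);
            return; // tables nested inside cells are not reported separately
        }
        for (Node k : n.kids) collectTables(k, out);
    }

    private static void collectRows(Node n, List<List<String>> rows) {
        for (Node k : n.kids) {
            if (k.type.equals("TR")) {
                List<String> cells = new ArrayList<>();
                for (Node c : k.kids) {
                    if (c.type.equals("TH") || c.type.equals("TD")) {
                        cells.add(textOf(c));
                    }
                }
                rows.add(cells);
            } else {
                collectRows(k, rows); // descend through THead / TBody / TFoot
            }
        }
    }

    // Concatenates the text of a node and all of its descendants.
    private static String textOf(Node n) {
        StringBuilder sb = new StringBuilder(n.text);
        for (Node k : n.kids) {
            String t = textOf(k);
            if (sb.length() > 0 && !t.isEmpty()) sb.append(' ');
            sb.append(t);
        }
        return sb.toString();
    }
}
```

With a real library, the same traversal would run over the library's own
structure element objects instead of Node.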

If your PDFs are missing logical structure, you would need to detect
which text belongs to a table yourself. For this purpose you could use the
pdftron.PDF.TextExtractor class (please see www.pdftron.com/net/samplecode.html#TextExtract)
to extract words and their positioning information. TextExtractor can
also recognize paragraphs, lines, and associated text styles.
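
Once words and their bounding boxes are available, one simple heuristic
is to group words with similar vertical positions into rows and then split
each row into cells wherever the horizontal gap between neighbouring words
is large. The sketch below shows only that grouping step. It does not call
PDFNet: the Word class is a hypothetical stand-in for an extracted word,
and TableSketch, toTable, yTolerance and cellGap are names and parameters
invented for this example. The thresholds are assumptions that would need
tuning per document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TableSketch {
    // Hypothetical extracted word with its bounding box (NOT a PDFNet class).
    // Coordinates follow the usual PDF convention: y grows upward.
    static final class Word {
        final String text;
        final double x1, y1, x2, y2;

        Word(String text, double x1, double y1, double x2, double y2) {
            this.text = text;
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    static List<List<String>> toTable(List<Word> words, double yTolerance, double cellGap) {
        // Sort from the top of the page to the bottom.
        List<Word> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingDouble((Word w) -> -w.y1));

        // Start a new row whenever a word's bottom edge is more than
        // yTolerance away from the first word of the current row.
        List<List<Word>> rows = new ArrayList<>();
        double rowY = 0;
        for (Word w : sorted) {
            if (rows.isEmpty() || Math.abs(w.y1 - rowY) > yTolerance) {
                rows.add(new ArrayList<>());
                rowY = w.y1;
            }
            rows.get(rows.size() - 1).add(w);
        }

        // Within each row, read left to right and start a new cell
        // whenever the gap to the previous word exceeds cellGap.
        List<List<String>> table = new ArrayList<>();
        for (List<Word> row : rows) {
            row.sort(Comparator.comparingDouble((Word w) -> w.x1));
            List<String> cells = new ArrayList<>();
            StringBuilder cell = new StringBuilder();
            double prevRight = 0;
            for (Word w : row) {
                if (cell.length() > 0 && w.x1 - prevRight > cellGap) {
                    cells.add(cell.toString());
                    cell.setLength(0);
                }
                if (cell.length() > 0) cell.append(' ');
                cell.append(w.text);
                prevRight = w.x2;
            }
            if (cell.length() > 0) cells.add(cell.toString());
            table.add(cells);
        }
        return table;
    }

    public static void main(String[] args) {
        List<Word> words = List.of(
            new Word("Item", 50, 700, 75, 710),
            new Word("Qty", 200, 700, 220, 710),
            new Word("Blue", 50, 680, 72, 690),
            new Word("widget", 75, 680.5, 110, 690.5),
            new Word("3", 200, 680, 206, 690));
        System.out.println(toTable(words, 2.0, 15.0));
        // prints [[Item, Qty], [Blue widget, 3]]
    }
}
```

This only recovers rows and cells. Deciding whether a region of the page
is a table at all (for example, by checking that several consecutive rows
share the same column positions) would be a separate step.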