Implementing a PDF semantic engine

Aaron_Gravesdale · June 13, 2008, 6:55pm

Q: I'm working on a semantic engine and I'd like to add a
PDF document manager that can recognize the tables in a PDF document.

Is the PDFNet library currently able to extract this information?
If not, is it possible to extract spatial information about words,
images and graphical shapes?
----
A: You can use the PDFNet SDK (www.pdftron.com/net) to extract any
information from PDF documents (including text, images, positioning
information, graphics state attributes, etc.).

As a starting point for your project you may want to take a look at
the following samples:

ElementReader: http://www.pdftron.com/net/samplecode.html#ElementReader
ElementReaderAdv: http://www.pdftron.com/net/samplecode.html#ElementReaderAdv
TextExtract: http://www.pdftron.com/net/samplecode.html#TextExtract
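
Once word positions have been extracted, table recognition is left to
your own code. The following is only a rough sketch of one possible
heuristic, not part of PDFNet: it makes no PDFNet calls, and the Word
class, the function names and the tolerance values are all hypothetical.
It assumes each word comes with a bounding box in PDF page coordinates
(y grows upward) and that table columns are left-aligned.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical container for one extracted word and its bounding box."""
    text: str
    x1: float  # left
    y1: float  # bottom
    x2: float  # right
    y2: float  # top


def group_rows(words: List[Word], y_tol: float = 3.0) -> List[List[Word]]:
    """Cluster words into rows by bottom edge, topmost row first."""
    rows: List[List[Word]] = []
    for w in sorted(words, key=lambda w: -w.y1):
        if rows and abs(rows[-1][0].y1 - w.y1) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    for row in rows:
        row.sort(key=lambda w: w.x1)  # left to right within a row
    return rows


def column_starts(row: List[Word], gap: float = 15.0) -> List[float]:
    """Left x of each cell; a new cell starts after a gap wider than `gap`."""
    starts = [row[0].x1]
    for prev, cur in zip(row, row[1:]):
        if cur.x1 - prev.x2 > gap:
            starts.append(cur.x1)
    return starts


def find_tables(words: List[Word], x_tol: float = 5.0,
                min_cols: int = 2, min_rows: int = 2) -> List[List[List[Word]]]:
    """Return runs of consecutive rows whose cell starts line up."""
    tables, run, ref = [], [], None
    for row in group_rows(words):
        starts = column_starts(row)
        aligned = (ref is not None and len(starts) == len(ref)
                   and all(abs(a - b) <= x_tol for a, b in zip(starts, ref)))
        if run and aligned:
            run.append(row)
            continue
        if len(run) >= min_rows:
            tables.append(run)
        # Start a new candidate run only if the row has several cells.
        if len(starts) >= min_cols:
            run, ref = [row], starts
        else:
            run, ref = [], None
    if len(run) >= min_rows:
        tables.append(run)
    return tables
```

A heuristic like this will miss tables with centered or right-aligned
columns and cells that wrap onto several lines; ruling lines extracted
as path elements could be used as an additional cue.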

For PDF documents that contain logical structure (i.e. explicit
semantic information) you can also use a high-level logical structure
API: see LogicalStructure
(http://www.pdftron.com/net/samplecode.html#LogicalStructure).