Q: What is the difference between PDF2Text and PDFNet SDK?
A: The main difference is that PDF2Text is a simple-to-use command-line
application, whereas PDFNet SDK (http://www.pdftron.com/net) is a
general-purpose Software Development Kit (SDK). The advantage of
PDF2Text is that you don’t need to be a developer to use it. However,
it does not have all of the features that are available in PDFNet SDK.
PDF2Text itself is built using the PDFNet SDK API.
When it comes to text extraction from PDF, there are several things to
keep in mind. Most PDF documents do not store logical structure.
Logical information is the meta-information that groups graphical page
elements into a hierarchical structure. For example, a document is a
collection of text flows, a flow is a list of paragraphs, a paragraph
is a list of lines, a line is a list of words, a word is a list of
text runs, etc. In order to properly extract text, a text extractor
must reconstruct parts of the missing logical structure. Because this
information is not explicitly specified, the reconstruction is an
error-prone process (similar to the concept of OCR -
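
To make the reconstruction problem concrete, here is a minimal sketch in
Python of how positioned text runs might be regrouped into lines and
words. It is only an illustration and not how PDF2Text or PDFNet SDK work
internally: the Run type, the function names, and the tolerance values are
all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical positioned text fragment, roughly what a page stores."""
    text: str
    x: float      # left edge of the run
    y: float      # baseline position (PDF y axis points up)
    width: float  # horizontal extent of the run


def group_into_lines(runs, y_tolerance=2.0):
    """Cluster runs with nearly equal baselines into lines, top to bottom."""
    lines = []
    for run in sorted(runs, key=lambda r: (-r.y, r.x)):
        # Compare against the first run of the most recent line.
        if lines and abs(lines[-1][0].y - run.y) <= y_tolerance:
            lines[-1].append(run)
        else:
            lines.append([run])
    # Within each line, restore left-to-right reading order.
    return [sorted(line, key=lambda r: r.x) for line in lines]


def line_to_text(line, gap_threshold=1.5):
    """Join runs, inserting a space where the horizontal gap is large."""
    parts = [line[0].text]
    for prev, cur in zip(line, line[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > gap_threshold:
            parts.append(" ")
        parts.append(cur.text)
    return "".join(parts)


def extract_text(runs):
    """Rebuild plain text from an unordered list of runs."""
    return "\n".join(line_to_text(line) for line in group_into_lines(runs))
```

Both thresholds are guesses. A gap that separates words in one font may be
ordinary letter spacing in another, which is one reason this kind of
reconstruction is error-prone.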
Another thing to keep in mind is that many PDF documents are generated
by buggy PDF creators and may contain custom-encoded text and broken
Unicode mapping tables. Such files may cause problems for text
extraction engines even though the document appears completely fine
on-screen.
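
One rough way to notice such files after extraction is to check how much
of the output consists of characters that rarely appear in real text. The
sketch below is an assumed heuristic in Python, not a feature of PDF2Text
or PDFNet SDK, and the function name suspicious_ratio is hypothetical.

```python
import unicodedata


def suspicious_ratio(text):
    """Return the fraction of characters that suggest a broken mapping.

    Counted as suspicious: U+FFFD (replacement character), private-use
    code points (category Co), unassigned code points (Cn), and control
    characters (Cc) other than newline, carriage return, and tab.
    """
    if not text:
        return 0.0
    bad = 0
    for ch in text:
        if ch in "\n\r\t":
            continue
        if ch == "\ufffd" or unicodedata.category(ch) in ("Co", "Cn", "Cc"):
            bad += 1
    return bad / len(text)
```

A high ratio for a page that looks normal on-screen hints that the font
encoding could not be mapped back to Unicode. A low ratio proves nothing,
since a wrong mapping can still produce ordinary-looking characters.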
Having said this, both PDF2Text and PDFNet SDK employ state-of-the-art
techniques to get the best possible text extraction results. Besides
high quality text extraction, additional attributes that set apart
these products from other offerings are efficiency, and robustness and