Extending PDF to Flash/Image conversion with text extraction (to implement searchable images).

Aaron_Gravesdale · March 11, 2008, 6:33pm

Q: First, I want to thank you for the PDFTron (PDF2Image) product. We
love it. I have a question for you. We want to be able to search the
PDF from our Flash application and get the coordinates of the
keyword. I would think that we could get the coords for that word,
transfer them to the Flash app, and highlight the searched word.
Maybe we have to use your PDF2SVG converter to get those coords?
----
A: Thanks for your compliments regarding PDF2Image. Regarding text
extraction, there are a couple of options.

- You may want to use PDF2Text (Windows: http://www.pdftron.com/downloads/pdf2text.zip;
Linux: http://www.pdftron.com/downloads/pdf2text.tar.gz) which is a
command-line application similar to PDF2Image. The primary purpose of
PDF2Text is to provide simple-to-use text extraction from PDF
documents. Besides plain text extraction, PDF2Text can also extract
text positioning information, text style info, XML, PDF text runs,
etc.

- If you are looking for a programmatic way to extract text (and other
content) from PDF documents you may want to take a look at PDFNet SDK
(http://www.pdftron.com/net) which is available as a .NET component,
as a Java library, and as a cross-platform C/C++ library. PDFNet SDK
includes all of the functionality of PDF2Text and PDF2Image, as well
as low-level content extraction APIs that can be used to implement
custom conversions and content extraction (among many other features -
www.pdftron.com/net/features.html). As a starting point you may want
to take a look at the TextExtract sample project:
http://www.pdftron.com/net/samplecode.html#TextExtract.
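
Whichever option you use to obtain word positions, the coordinates still
have to be mapped onto the rendered image before the Flash app can draw
a highlight. PDF user space by default has its origin at the bottom-left
of the page and uses 72 units per inch, while an image has its origin at
the top-left and is measured in pixels. The following is only a rough
sketch of that mapping in Python. It does not call PDF2Text or PDFNet:
the WordBox type and both function names are hypothetical, and it
assumes an unrotated page whose visible area starts at (0, 0) and which
was rendered in full at a uniform DPI.

```python
from typing import List, NamedTuple, Tuple


class WordBox(NamedTuple):
    """Hypothetical record for one extracted word, in PDF points."""
    text: str
    x1: float  # left edge
    y1: float  # bottom edge (PDF y axis points up)
    x2: float  # right edge
    y2: float  # top edge


def pdf_rect_to_pixels(box: WordBox, page_height_pt: float,
                       dpi: float) -> Tuple[float, float, float, float]:
    """Return (left, top, width, height) in image pixels.

    Assumes an unrotated page with its origin at (0, 0), rendered
    in full at `dpi` dots per inch.
    """
    scale = dpi / 72.0  # default PDF user space: 72 units per inch
    left = box.x1 * scale
    top = (page_height_pt - box.y2) * scale  # flip the y axis
    width = (box.x2 - box.x1) * scale
    height = (box.y2 - box.y1) * scale
    return (left, top, width, height)


def find_keyword(words: List[WordBox], keyword: str,
                 page_height_pt: float,
                 dpi: float) -> List[Tuple[float, float, float, float]]:
    """Case-insensitive exact-word match; returns pixel rectangles."""
    needle = keyword.lower()
    return [pdf_rect_to_pixels(w, page_height_pt, dpi)
            for w in words if w.text.lower() == needle]
```

The resulting rectangles could then be passed to the Flash application
in whatever form suits it (for example as XML) and drawn over the page
image. Rotated pages or a crop box that does not start at the origin
would need an extra transform before this step.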