How do I extract text and other info from web sites that use PDF format?

Q:

I would like to be able to get information from a web site which uses
PDF format, extract specific info from that site, and insert it into a
corresponding database table. For example, my program should be able
to browse through a specific web site, pick out specific info, and
insert it into the DB. Of course, my program should do all this
automatically; the user should only provide a URL. Can the PDFNet SDK
help me develop this kind of project using Visual Studio 2005 with C#
or VB .NET, and if so, is there a sample that demonstrates this?


A:

You can use PDFNet SDK (www.pdftron.com/net) to extract textual and
graphics information from PDF documents. PDFNet SDK is available as
a .NET component, and a free evaluation version is available for
download (www.pdftron.com/downloads/PDFNetDemo.zip).

As a starting point for your project you may want to take a look at
the TextExtract sample project (www.pdftron.com/net/
samplecode.html#TextExtract; specifically the use of the TextExtractor
class). You may also want to take a look at the ImageExtract,
ElementReader, and ElementReaderAdv sample projects.
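To give an idea of what the TextExtractor workflow looks like, here is a minimal sketch; the input file name and the choice of page 1 are placeholders, so consult the TextExtract sample for the full API:

```csharp
using pdftron;
using pdftron.PDF;

class ExtractText
{
    static void Main()
    {
        PDFNet.Initialize();

        // Hypothetical input file; replace with your downloaded PDF.
        using (PDFDoc doc = new PDFDoc("input.pdf"))
        {
            doc.InitSecurityHandler();

            // Extract the text content of the first page.
            TextExtractor txt = new TextExtractor();
            txt.Begin(doc.GetPage(1));
            System.Console.WriteLine(txt.GetAsText());
        }
    }
}
```

From the extracted string you can then parse out the fields you need and insert them into your database with ADO.NET.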

To fetch a PDF from the web, you can use the standard .NET Framework
APIs (or a C/C++ networking library, if you are using PDFNet for C++)
to download the remote file into memory or to a temporary file.

Something along the following lines:

FileStream sourceFile = new FileStream(@"…", FileMode.Open);
long sz = sourceFile.Length;
byte[] getContent = new byte[(int)sz];
sourceFile.Read(getContent, 0, (int)sz);
sourceFile.Close();
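If you want to fetch the file directly from a URL rather than from disk, the .NET 2.0 System.Net.WebClient class can download it straight into a byte array; a minimal sketch (the URL here is a placeholder):

```csharp
using System;
using System.Net;

class DownloadPdf
{
    static void Main()
    {
        // Placeholder URL; substitute the address the user provides.
        string url = "http://www.example.com/sample.pdf";

        using (WebClient client = new WebClient())
        {
            // DownloadData returns the response body as a byte array,
            // ready to be passed to PDFNet without touching the disk.
            byte[] pdfData = client.DownloadData(url);
            Console.WriteLine("Downloaded {0} bytes", pdfData.Length);
        }
    }
}
```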

You can then open the PDF from the memory buffer as illustrated in the
PDFDocMemory sample project
(http://www.pdftron.com/net/samplecode.html#PDFDocMemory).
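As a rough sketch of that last step, assuming PDFDoc accepts a buffer-plus-length constructor as shown in the PDFDocMemory sample (the input file name is a placeholder):

```csharp
using pdftron;
using pdftron.PDF;

class OpenFromMemory
{
    static void Main()
    {
        PDFNet.Initialize();

        // buf holds the raw PDF bytes, e.g. read from disk as above
        // or downloaded from the web.
        byte[] buf = System.IO.File.ReadAllBytes("input.pdf");

        // Construct the document directly from the in-memory buffer.
        using (PDFDoc doc = new PDFDoc(buf, buf.Length))
        {
            doc.InitSecurityHandler();
            System.Console.WriteLine("Pages: {0}", doc.GetPageCount());
        }
    }
}
```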