How can I convert PDF to HTML?

Aaron_Gravesdale · September 25, 2008, 7:37pm

Q: I would like to know if it is possible to generate HTML files from
PDF files using the PDFTron PDF library.
-----
A: PDFNet is a general-purpose PDF toolkit, so it doesn't have a
built-in PDF to HTML utility function (at the time of this writing).
However, it should be fairly simple to implement such a conversion.

The implementation of a PDF to HTML converter can be broken into two
parts:

1) Text extraction and HTML generation.
2) Background image generation.

The first part can be implemented along the lines of code sample #3 in
the TextExtract sample project
(http://www.pdftron.com/net/samplecode.html#TextExtract). This sample
project converts PDF pages to a custom XML syntax which is in many
respects similar to HTML. With minor alterations to this code you can
get basic HTML output.
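
As a rough illustration of the HTML generation step, here is a minimal
Python sketch. It does not call PDFNet at all: the Word class and the
page_to_html function are hypothetical names, and the sketch assumes you
have already extracted each word together with its bounding box in PDF
points (1/72 inch, origin at the bottom-left of the page). It places
each word as an absolutely positioned span on top of a background image.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Word:
    """Hypothetical container for one extracted word (not a PDFNet type)."""
    text: str
    x1: float  # left edge, PDF points
    y1: float  # bottom edge, PDF points
    x2: float  # right edge, PDF points
    y2: float  # top edge, PDF points


def page_to_html(words, page_width, page_height, background_src, dpi=96):
    """Build one HTML fragment for a page: background image plus text spans."""
    scale = dpi / 72.0  # PDF points -> CSS pixels at the chosen resolution
    parts = [
        '<div style="position:relative;'
        f'width:{page_width * scale:.1f}px;height:{page_height * scale:.1f}px">',
        f'<img src="{escape(background_src, quote=True)}" alt="" '
        'style="position:absolute;left:0;top:0;width:100%;height:100%">',
    ]
    for w in words:
        left = w.x1 * scale
        # PDF y grows upward, CSS y grows downward, so flip the axis.
        top = (page_height - w.y2) * scale
        # Approximate the font size from the height of the bounding box.
        font_px = (w.y2 - w.y1) * scale
        parts.append(
            f'<span style="position:absolute;left:{left:.1f}px;top:{top:.1f}px;'
            f'font-size:{font_px:.1f}px;white-space:pre">{escape(w.text)}</span>'
        )
    parts.append("</div>")
    return "\n".join(parts)
```

The background image should be rendered at the same dpi value that is
passed to page_to_html, otherwise the text will not line up with the
graphics underneath it.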

Because HTML does not support a complex graphics model, you would need
to normalize all images and vector art into a single background image.
To generate the background image you can use the PDFDraw class (see
http://www.pdftron.com/net/samplecode.html#PDFDraw). The only
difficulty is that the background image should not contain text. Using
PDFNet you can remove all text from a source PDF page and pass this
temporary page to the PDF rasterizer. For some ideas on how this can be
implemented, please take a look at the ElementEdit sample project
(http://www.pdftron.com/net/samplecode.html#ElementEdit) as well as
"How do I separate PDF page into text and image layers?" in this forum.
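
The text removal idea can be sketched in plain Python as a filter over
the elements of a page: copy every element to the temporary page except
the text ones. The PageElement class, the kind names and the
without_text function below are hypothetical stand-ins used only to show
the logic; they are not the PDFNet API, so consult the ElementEdit
sample for the real reader and writer calls.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Hypothetical element kinds that make up a text run on a page.
TEXT_KINDS = {"text_begin", "text", "text_end"}


@dataclass
class PageElement:
    """Hypothetical stand-in for one element of a page's content."""
    kind: str          # e.g. "path", "image", "text"
    payload: str = ""  # placeholder for the element's data


def without_text(elements: Iterable[PageElement]) -> Iterator[PageElement]:
    """Yield every element except text, preserving the original order."""
    for element in elements:
        if element.kind in TEXT_KINDS:
            continue  # skip text so it never reaches the background image
        yield element
```

The filtered sequence would then be written to the temporary page that
is passed to the rasterizer, while the skipped text is emitted as HTML.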