Converting PDF to HTML and then back to PDF from HTML

Ivanho · October 19, 2012, 9:29pm

Q:
We are using PDFNet to convert PDF to HTML (http://www.pdftron.com/pdfnet/samplecode/Pdf2Html.cs) and then back from HTML to PDF (http://www.pdftron.com/pdfnet/samplecode.html#Html2Pdf). When converting from PDF to HTML, some words overlap (please check the attached file). Moreover, hyperlinks in the PDF are not carried over into the generated HTML. Will these issues persist in your licensed version?

Also, we want to purchase a licensed version of PDFNet. Do we also need to purchase one of the extra add-ons (http://www.pdftron.com/pdfnet/features.html#Convert) to be licensed for PDF to HTML conversion?

A:

The currently recommended approach for PDF viewing / annotation / collaboration in a web browser is to use the WebViewer SDK (http://www.pdftron.com/pdfnet/webviewer/demo.html). For a quick test, use http://s84786.gridserver.com/website/demo/bookstore/details.php or Cloud API (http://www.pdftron.com/pdfnet/cloud/started.html).

The cool thing about WebViewer is that it works on all platforms, browsers, and various devices (including iPad/iPhone, Android, Windows 8 Surface, old non-HTML5 browsers, etc.). Unlike other HTML solutions (e.g. Google Docs), the content is not rasterized and the system does not rely on server-side rasterization or proprietary systems/APIs.

The reason for recommending WebViewer instead of straight PDF → HTML → PDF conversion is that the latter is impossible to implement without significant loss of information (i.e. HTML without Canvas support doesn’t support most PDF features that are required for accurate document reproduction).

Regarding the Pdf2Html sample, its only intent is to show how to use the core PDFNet API to implement a very basic PDF to HTML converter (e.g. for PDFNet users who want to implement a custom import filter for their apps). For the reasons outlined above, the sample was not designed to be a bullet-proof solution, nor is it meant to be used in production. Any behavior you see in trial mode is what you’ll see after licensing (except, of course, for the trial mode watermarking).

For example, one of the limitations is due to font substitution. In PDF, fonts are typically embedded, which guarantees accurate text reproduction. In the Pdf2Html sample the text locations are correct; however, in some cases (where a font match is not found) the substituted font has larger advance widths, so words can grow and start overlapping each other. You could verify this by adjusting the font size in the converter (e.g. scaling it down 30% or more). You could also extract the embedded fonts and normalize them to WOFF (a format compatible with most browsers; for more info please see https://groups.google.com/d/topic/pdfnet-sdk/weHNRhmlvn4/discussion), then use these ‘web fonts’ instead of the default fonts. But there are many other issues with plain PDF to HTML conversion that simply can’t be worked around, unless you are OK with a totally rasterized page. The goal of the WebViewer Development Platform is to solve this problem.
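To illustrate the font size adjustment mentioned above, here is a minimal sketch of one way to pick the scale per text run instead of using a fixed 30%. It is not part of the Pdf2Html sample or the PDFNet API: FontFit, FitFontSize and all three parameters are hypothetical names, and it assumes you can obtain both the width of the run as laid out in the PDF and the width of the same run measured with the substituted font.

```csharp
using System;

public static class FontFit
{
    // Hypothetical helper: returns a font size that keeps a text run within
    // the width it occupied in the PDF.
    //   fontSize            - size the converter would otherwise emit
    //   pdfRunWidth         - width of the run as laid out in the PDF
    //   substitutedRunWidth - width of the same run measured with the
    //                         substituted font at fontSize
    public static double FitFontSize(double fontSize, double pdfRunWidth, double substitutedRunWidth)
    {
        if (fontSize <= 0)
            throw new ArgumentOutOfRangeException("fontSize");

        // Without usable measurements there is nothing to compare against.
        if (pdfRunWidth <= 0 || substitutedRunWidth <= 0)
            return fontSize;

        // Only shrink: a narrower substitute leaves gaps but does not overlap.
        double scale = Math.Min(1.0, pdfRunWidth / substitutedRunWidth);
        return fontSize * scale;
    }
}
```

For instance, a run that was 100 units wide in the PDF but measures 125 units with the substituted font would be scaled by 0.8. This only hides the symptom; using the extracted WOFF fonts addresses the cause.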

Having said this, you could extend the Pdf2Html sample with extra features. For example, if you would like to preserve PDF links in HTML, you would use the PDFNet annotation API (as shown in the Annotation sample - http://www.pdftron.com/pdfnet/samplecode.html#Annotation) to extract the link regions (annot.GetRect() → Rect) and to add the href URL to an HTML DIV floating on top of the content underneath:

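if (annot.GetType() == Annot.Type.e_Link) {
    // Wrap the generic annotation as a Link to reach its action.
    pdftron.PDF.Annots.Link lk = new pdftron.PDF.Annots.Link(annot);
    Action action = lk.GetAction();
    if (action.GetType() == Action.Type.e_GoTo) {
        // Internal link: resolve the target page number.
        Destination dest = action.GetDest();
        if (dest.IsValid()) {
            int page_num = dest.GetPage().GetIndex();
            System.Console.WriteLine(" Links to: page number {0:d} in this document", page_num);
        }
    }
    else if (action.GetType() == Action.Type.e_URI) {
        // External link: read the target URL to use as the href.
        string uri = action.GetSDFObj().Get("URI").Value().GetAsPDFText();
    }
}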
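
Once you have the link rectangle and the URI, the floating element itself could be generated along these lines. This is only a sketch and not code from the Pdf2Html sample: LinkOverlay and ToHtml are hypothetical names, and it assumes one PDF unit maps to one CSS pixel, an unrotated page, and a page box whose origin is at (0, 0). It uses an absolutely positioned anchor rather than a DIV wrapping an anchor, which is just one possible choice.

```csharp
using System;
using System.Globalization;
using System.Net;

public static class LinkOverlay
{
    // Hypothetical helper: builds an absolutely positioned <a> element for a
    // link rectangle given in PDF page coordinates (origin at bottom-left).
    // pageHeight is the page height in the same units.
    public static string ToHtml(double x1, double y1, double x2, double y2,
                                double pageHeight, string href)
    {
        // Normalize the rectangle in case the corners are not ordered.
        double left = Math.Min(x1, x2);
        double width = Math.Abs(x2 - x1);
        double height = Math.Abs(y2 - y1);

        // HTML measures from the top edge, PDF from the bottom edge.
        double top = pageHeight - Math.Max(y1, y2);

        // Invariant culture keeps '.' as the decimal separator in CSS;
        // HtmlEncode escapes '&', '"', '<' and '>' inside the attribute.
        return string.Format(CultureInfo.InvariantCulture,
            "<a href=\"{0}\" style=\"position:absolute;left:{1:0.##}px;top:{2:0.##}px;width:{3:0.##}px;height:{4:0.##}px\"></a>",
            WebUtility.HtmlEncode(href), left, top, width, height);
    }
}
```

The containing page element would need position:relative for the offsets to be relative to the page. For e_GoTo links you could pass an in-document anchor as href instead, assuming your converter emits one per page.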

"Do we also need to purchase one of the extra add-ons?"

If you are happy with the Pdf2Html sample and you are not using anything from pdftron.PDF.Convert (http://www.pdftron.com/pdfnet/html/classpdftron_1_1PDF_1_1Convert.html), the Core PDFNet API will suffice (i.e. you do not need to purchase any extra add-ons).
If you are planning to use the WebViewer API in PDFNet (i.e. pdftron.PDF.Convert.ToXod()), you would need to obtain a WebViewer Publisher Add-on license. Alternatively, as a potentially more cost-effective option, you could use the Cloud API (http://www.pdftron.com/pdfnet/cloud/started.html) instead of using/hosting PDFNet on your own servers.