pdf to html performance

I tested the PDF to HTML functionality (via Convert::ToHtml) on 64-bit Linux and compared it to pdftohtml (poppler) on the same set of documents; on average poppler was 10 times faster. Is there any reasonable explanation for this discrepancy?

Thanks
Radu

When you compare the performance of PDF to HTML conversion, you need to make sure that you are using exactly the same parameters (otherwise you may be comparing apples and oranges).

For example: what are the exact parameters that you pass in the call to DocPub (http://www.pdftron.com/docpub/downloads.html), and what options do you pass to the other solution?

PDFNet, which is used in DocPub CLI (or Convert::ToHtml), is significantly faster than poppler on all counts; however, things such as resolution/DPI, flattening, JPEG vs PNG output, text optimization parameters, etc. all have significant performance implications.

Another thing to bear in mind is the conversion quality. There are many ways to convert PDF to HTML (http://blog.pdftron.com/2013/08/08/how-to-integrate-a-pdf-viewer-in-html5-apps/). For example, you could just rasterize the PDF to PNG or SVG and wrap it in HTML, or you could use a quick and dirty text/graphics separator, or something that produces an accurate replica for most files.

DocPub CLI (or pdftron.PDF.Convert::ToHtml) is unique in that it fits in the last category (taking care of blending and transparency, overlapping text content, optimizing text runs, etc.) and can produce accurate output for any PDF (rather than working on a PDF subset). For a bit more info about our conversion process see http://blog.pdftron.com/2013/11/15/high-quality-epub-html-from-pdf/.

ref: PDFNet which is used in DocPub CLI ( or Convert::ToHtml) is significantly faster than poppler on all counts

pdf2htmlEX ( based on poppler ) renders
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
in 56 seconds (144DPI, png)
pdftron sdk does it in 1:30 (72DPI, png, simplifytext ON)

http://www.openoffice.org/sc/excelfileformat.pdf
pdf2htmlex: 13sec
sdk: 32

One example I found to be in favour of the sdk (http://dl.fullcirclemagazine.org/issue81_en.pdf) using the above settings:

pdf2htmlex: 48
sdk: 15

but after reducing the resolution to the same level as the one used by the sdk (72), the difference is no longer so impressive:

pdf2htmlex: 18
sdk: 15

I used publicly available PDFs so the tests can be easily reproduced. I am trying to find out whether I am making a systematic error that prevents me from attaining the full potential of the product.

It looks like your numbers are off due to different conversion settings. For example, if you use PNG as the default for image backgrounds, it will be much slower compared to JPEG.

I re-tested your test file (pdf_reference_1-7.pdf) on our end and found that DocPub/PDFNet is actually significantly faster:

Test environment: Windows 7, 64-bit, 16 GB RAM, CPU i7 3.4 GHz

Download DocPub (http://www.pdftron.com/docpub/downloads.html). By the way, the performance of pdftron.PDF.Convert.ToHtml() should be identical (if you use all the same options). The command line was:

docpub64 -f html --time --dpi 144 --flatten off --prefer_jpg pdf_reference_1-7.pdf

It took me 79.74 seconds.

Note: the undocumented ‘--time’ option reports the conversion time. Given that the GPL solution doesn’t do any flattening, the ‘--flatten’ option should be disabled, though for your test file it would not make a significant difference.

Running:

pdf2htmlEX -o EX --split-pages 1 pdf_reference_1-7.pdf

took 119.5 sec in the best run (out of 5). The default resolution is 144 DPI.
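
For anyone who wants to repeat the comparison, here is a minimal sketch of a best-of-N timing harness. It assumes both tools are on the PATH and simply reuses the two command lines quoted above; the helper names `best_of`, `build_commands`, and `main` are hypothetical and not part of either product. It measures wall-clock time for the whole process, so it includes start-up overhead that the ‘--time’ option may not count.

```python
import subprocess
import sys
import time


def best_of(cmd, runs=5):
    """Run `cmd` `runs` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True makes a failed conversion raise instead of being timed.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


def build_commands(pdf_path):
    """Command lines copied from this thread (assumed to be installed locally)."""
    return {
        "docpub": ["docpub64", "-f", "html", "--dpi", "144",
                   "--flatten", "off", "--prefer_jpg", pdf_path],
        "pdf2htmlEX": ["pdf2htmlEX", "-o", "EX", "--split-pages", "1", pdf_path],
    }


def main(pdf_path):
    for name, cmd in build_commands(pdf_path).items():
        print(f"{name}: best of 5 = {best_of(cmd):.2f} s")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Taking the best of several runs, as in the test above, reduces the effect of disk caching and other background activity on the result.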


So based on the above test, DocPub/PDFNet finishes in about 67% of the time the other solution needs (roughly 1.5 times faster)! For anything a bit more serious (e.g. big images, shadings, etc.) DocPub/PDFNet will be even faster.

For example for a typical magazine (http://goo.gl/UHACFz):

docpub64 -f html --time --dpi 144 --flatten off SRD0512.pdf

takes 52.491 seconds.

pdf2htmlEX SRD0512.pdf

takes 450 seconds

so docpub is about 8.6 times faster (450 / 52.491 ≈ 8.57)!
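
As a quick check of the arithmetic, the sketch below turns the timings quoted in this post into a speedup factor and a percentage of time saved. The function names are hypothetical helpers; only the four timings come from the tests above.

```python
def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast run is than the slow run."""
    return slow_seconds / fast_seconds


def time_saved_percent(slow_seconds, fast_seconds):
    """Percentage of the slow run's time that the fast run avoids."""
    return 100.0 * (1.0 - fast_seconds / slow_seconds)


# (file, pdf2htmlEX seconds, docpub seconds) as reported in this post
results = [
    ("pdf_reference_1-7.pdf", 119.5, 79.74),
    ("SRD0512.pdf", 450.0, 52.491),
]

for name, slow, fast in results:
    print(f"{name}: {speedup(slow, fast):.2f}x faster, "
          f"{time_saved_percent(slow, fast):.0f}% less time")
# pdf_reference_1-7.pdf: 1.50x faster, 33% less time
# SRD0512.pdf: 8.57x faster, 88% less time
```

Reporting the ratio avoids the ambiguity between "takes 67% of the time" and "is 67% faster", which are different claims.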


Now that we have established that DocPub/PDFNet is faster, it is worth asking why we should concern ourselves so much with performance.

After all if the output is not accurate or reliable, it does not matter how long the conversion takes. From this perspective the two solutions can’t be compared. To give you an idea, take a look at page 1142 in your test file:

The correct output produced by PDFNet/DocPub is :

the other output is: (note that colors are off, the pattern is not clipped, etc)

Cases such as partially clipped text, transparency (e.g. text covered with semi-transparent graphics), overprint, font substitution, and many other special cases are not handled properly (http://blog.pdftron.com/2013/11/15/high-quality-epub-html-from-pdf/). The conversion may seem to work on some files, but it is ultimately not suitable for production use.

As another example, see the attached INITIAL.PDF. Text in the DocPub/PDFNet HTML output can be selected and searched, and it displays with the correct font. pdf2htmlEX displays text as images with incorrect fonts, etc. These samples are just the tip of the iceberg. You may need to run extensive, time-consuming tests (hopefully automated) on a functional test suite in order to detect these kinds of issues. Unfortunately it is not as simple as running a perf test :( but we do this for every product release … :)

Ouch… if you look at the pdf2htmlEX output from SRD0512.pdf you will see more of what I mean about reliability and quality:

Here are some screenshots:

Page 1: Text is off.

Page 2: Superfluous vertical lines at beginning of each line

Random white lines:

Page 24: Text is incorrectly positioned and overflows columns…

etc …

First thank you for your help! These are indeed valid points to consider.
Having your benchmark numbers and after rerunning the tests (with the same results), I am almost sure that the discrepancy is due to the pdf2htmlEX binary you are using. Windows is not an officially supported platform; AFAIK the binaries are produced on Cygwin using a compiler far from production quality, etc. When running on Linux you’ll definitely see the difference (it should be between 1.7 and 2 times faster).
Now there might be a performance problem with pdftron too, in the sense that the Windows version seems to be more optimized than Linux/Mac. I see a difference of 60-70% between the results obtained on my machine (2.5 GHz i7-2860QM) and yours (i7 3.4 GHz) that is not fully justified by the difference in processors (even for a high-end one I would have expected something like a 40% difference). Note: my assumption was confirmed by running a Windows version of docpub in a VM on the same machine; the result was 108 sec, in line with the processor difference (35%). This means that the 20% difference (130 vs 108) between Linux and Windows is pure optimization.
These considerations put the performance (but not the quality of the transformation :) ) in the case of the SRD doc in a different perspective: pdf2htmlEX: 200 sec, docpub: 100 sec (for pdf2htmlEX the resolution has a big impact; at 72 DPI the time is 100 sec too).
To conclude, it seems that the more complex the document, the better the docpub performance (plus the quality of the rendering), but for simple docs the performance penalty introduced by the superior rendering quality is significant.
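
The reasoning about hardware versus platform can be written out as a short calculation. This is only a sketch: it assumes the 130 sec (Linux) and 108 sec (Windows in a VM) figures refer to the same pdf_reference_1-7.pdf conversion as the 79.74 sec Windows result reported earlier in the thread, and the variable names are made up for illustration.

```python
# DocPub timings in seconds, taken from the posts in this thread.
windows_fast_cpu = 79.74   # Windows 7, i7 3.4 GHz
windows_vm_slow_cpu = 108.0  # Windows in a VM on the 2.5 GHz i7-2860QM
linux_slow_cpu = 130.0     # Linux on the same 2.5 GHz i7-2860QM (assumed same test)

# Same OS, different machine: attributed to the processor (and VM overhead).
hardware_factor = windows_vm_slow_cpu / windows_fast_cpu
# Same machine, different OS: attributed to platform optimization.
platform_factor = linux_slow_cpu / windows_vm_slow_cpu
# Overall gap between the two reported setups.
total_factor = linux_slow_cpu / windows_fast_cpu

print(f"hardware: +{(hardware_factor - 1) * 100:.0f}%")
print(f"platform: +{(platform_factor - 1) * 100:.0f}%")
print(f"total:    +{(total_factor - 1) * 100:.0f}%")
# hardware: +35%
# platform: +20%
# total:    +63%
```

The two factors multiply rather than add (1.35 × 1.20 ≈ 1.63), which is why the overall gap lands in the 60-70% range.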

Coming back to our use case: the documents are legal documents (less complex in nature but with possible positioning issues) having form fields (including signature fields) that have to be displayed and extracted (type and position) so that external HTML widgets can be layered on top. The transformation is part of a more complex server-side activity running with performance guarantees. The main problem with such a transformation is that it is totally CPU-bound, and reducing the time spent in this step is of paramount importance.

As far as I understand, docpub has no support for forms. Is it possible to do this using the sdk, or is there any plan to support them in the near future? I can provide documents, or we can discuss this matter further offline.

Thank you again for your help!
Radu

Docpub to HTML has no support for forms. However, based on your description of your situation, you should definitely take a look at our WebViewer technology, which uses HTML5 to render the file, and it includes lots of support for annotations and forms. You can create a WebViewer compatible file by calling ‘docpub -f xod <pdf_file>’

http://www.pdftron.com/webviewer/demo/samples.html
http://www.pdftron.com/webviewer/showcase
http://www.pdftron.com/webviewer/index.html

Xod conversion also supports streaming (see the WebViewerStreaming sample), so you can display documents as they are converted (i.e. you do not need to wait for the whole conversion to finish before viewing the doc).