Speeding up PDF processing using PDFNet SDK.

Aaron_Gravesdale · January 30, 2009, 12:03am

Q: We are producing consolidated PDF files from TIF scanned images and
XML files containing the OCR data. These are old newspapers.

Technically, we have been able to insert the text into the page using
ElementBuilder and ElementWriter as per the advice given in your help
pages. The idea is to make the papers text-searchable by adding the
text behind the words in the image. There are still a few issues to
solve there.

However, the main issue at present is performance. There is a lot of
text to insert, sometimes up to 15,000 words per page, and this takes
some time. We are using C# 2.0. Do we have to create an ElementBuilder
and ElementWriter for each word we insert? Obviously, creating and
disposing of all these instances will take up significant time. Is
there a better/quicker way?

If we want to parallelize the operation in some form, is it preferable
to create the individual pages separately on different
machines/processors and then join them together afterwards, or should
we just process several newspapers simultaneously?

Appreciate any advice to optimise the throughput.
-----
A: You can follow the advice from the following PDFNet KB article
(http://groups.google.com/group/pdfnet-sdk/browse_thread/thread/aed5b48a107f193a).

You can also reuse the same ElementBuilder & ElementWriter to generate
all text within a document. Creating a new ElementBuilder &
ElementWriter for every page, let alone for every word, may lead to
performance issues. In fact, a single ElementBuilder & ElementWriter
can be used to process any number of PDF documents, but this is
probably not worth the trouble, since the cost of creating one pair per
document is negligible. To speed up processing further, you may also
want to reuse fonts (as suggested in the above article).
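
As a rough illustration of the font-reuse advice, here is a minimal
sketch written in Python rather than against the actual PDFNet API.
FontCache, create_font and write_pages are hypothetical names, and the
tuples it returns merely stand in for what a single reused
ElementWriter would emit. The only point is that the font object is
created once per document and then shared by every word on every page.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Word = Tuple[str, float, float]  # (text, x, y) as read from the OCR XML


class FontCache:
    """Creates each font once and returns the same object afterwards."""

    def __init__(self, create_font: Callable[[str], object]) -> None:
        # create_font is a stand-in for the SDK's font-creation call.
        self._create_font = create_font
        self._fonts: Dict[str, object] = {}
        self.created = 0  # how many fonts were really created

    def get(self, name: str) -> object:
        if name not in self._fonts:
            self._fonts[name] = self._create_font(name)
            self.created += 1
        return self._fonts[name]


def write_pages(
    pages: Iterable[List[Word]],
    fonts: FontCache,
    font_name: str = "Times-Roman",
) -> List[List[Tuple[object, str, float, float]]]:
    """Place every word of every page using one shared font object."""
    output = []
    for words in pages:
        font = fonts.get(font_name)  # cache hit on every page after the first
        output.append([(font, text, x, y) for text, x, y in words])
    return output
```

With three pages of 15,000 words each, fonts.created ends up at 1,
because the factory is called only for the first page.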

> If we want to parallelize the operation in some form, is it
> preferable to create the individual pages separately ...

You may want to run several threads, each of which generates/processes
its own document. Processing the pages of one document in parallel is
not as good, because those pages may have lots of things in common
(e.g. fonts & other resources).
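
The one-document-per-thread arrangement could look roughly like the
sketch below. It is written in Python and does not call PDFNet:
process_documents and build_document are hypothetical names, and the
default of four workers is an arbitrary assumption to tune for your
hardware. Each call to build_document is assumed to create, fill and
save one complete newspaper, so nothing is shared between threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def process_documents(
    jobs: Sequence[str],
    build_document: Callable[[str], str],
    max_workers: int = 4,
) -> List[str]:
    """Run build_document once per job, several jobs at a time.

    Every job owns its document and the resources created for it,
    so the worker threads never touch each other's objects.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input jobs.
        return list(pool.map(build_document, jobs))
```

In C# the same shape can be had by starting one worker thread per
newspaper and giving each thread its own document, ElementBuilder and
ElementWriter.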