Detect broken PDF files before or after PDF Optimization

Q:

Using the ‘PDF Optimizer Add-on’ in PDFNet, we implemented a solution for our customer that shrinks and optimizes images in PDF files.
It generally works fine, but over the past few days our customer ran into a strange situation. The PDF Optimizer library generated an optimized PDF file (the original PDF file was normal, with no errors) containing some blank pages and other pages that appeared to be corrupted (or at least displayed as blank pages in Acrobat Reader).

It’s very important for us to have a way to check the validity / quality of the generated optimized PDF files in order to avoid similar cases in the future. What are our options for this using PDFNet SDK?

A:

This type of issue typically occurs when you are dealing with corrupt or damaged PDF files. We are constantly improving PDFNet’s ability to deal with malformed files. You can try the latest unofficial version to find out whether it helps with the files you are dealing with.

Unfortunately, it is technically impossible to support every corrupt file. If the above build does not help, could you please share a sample input/output file so that we can take a look at it?

By the way, there are a couple of things you can do to make sure that the optimization process does not expose any errors:

a) If doc.IsModified() returns true right after the original file is opened, it means that the file is corrupt. In this case you may want to flag the file and skip the optimization (https://groups.google.com/d/msg/pdfnet-sdk/Bscvecad6As/Xu0F4bSh1hoJ). Unfortunately, there are corrupt files for which IsModified() does not return true, so it is not a foolproof ‘validation’ method. Believe it or not, to this day there is no way to say whether a PDF file is completely valid and free from defects (http://blog.pdftron.com/2013/09/05/all-about-pdfa/).

b) After optimizing the file you can perform some checks:

  • Verify that the original and final documents have the same number of pages.

  • Use PDFDraw (http://www.pdftron.com/pdfnet/samplecode.html#PDFDraw) to render a random sample of pages from the source and destination files, and use an image diff tool (e.g. ImageMagick compare, imagemagick.org/Usage/compare) to automatically identify any mismatches.
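
The checks in (a) and (b) could be wired together roughly as follows. This is only a minimal sketch in Python, not PDFNet sample code. It assumes the document objects expose IsModified() and GetPageCount() as PDFDoc does. All function names are made up for illustration, and render_src / render_dst are hypothetical callables that you would implement yourself (for example on top of PDFDraw) to return the raw pixel bytes of a page rendered at a fixed size. The pixel comparison is a simple stand-in for an image diff tool. Because the optimizer recompresses images, a small per-sample tolerance is used so that ordinary compression noise is not reported, while a page that turned blank still is.

```python
import random
from typing import Callable, List, Optional, Sequence


def is_suspect_input(doc) -> bool:
    """Check (a): True if a freshly opened document reports itself as modified.

    A False result does NOT prove the file is valid; it only means this
    particular warning sign is absent.
    """
    return bool(doc.IsModified())


def same_page_count(src_doc, dst_doc) -> bool:
    """Check (b), first bullet: both documents have the same number of pages."""
    return src_doc.GetPageCount() == dst_doc.GetPageCount()


def pick_sample_pages(page_count: int, sample_size: int,
                      seed: Optional[int] = None) -> List[int]:
    """Pick distinct 1-based page numbers at random, returned in sorted order."""
    rng = random.Random(seed)
    k = min(sample_size, page_count)
    return sorted(rng.sample(range(1, page_count + 1), k))


def mismatch_ratio(a: Sequence[int], b: Sequence[int], tolerance: int = 8) -> float:
    """Fraction of samples whose values differ by more than `tolerance`.

    Buffers of different length (e.g. different raster sizes) count as a
    complete mismatch.
    """
    if len(a) != len(b):
        return 1.0
    if not a:
        return 0.0
    differing = sum(1 for x, y in zip(a, b) if abs(x - y) > tolerance)
    return differing / len(a)


def find_mismatched_pages(render_src: Callable[[int], bytes],
                          render_dst: Callable[[int], bytes],
                          page_count: int,
                          sample_size: int = 5,
                          max_ratio: float = 0.01,
                          seed: Optional[int] = None) -> List[int]:
    """Check (b), second bullet: render sampled pages from both files and
    return the page numbers whose renderings differ too much."""
    bad = []
    for page in pick_sample_pages(page_count, sample_size, seed):
        if mismatch_ratio(render_src(page), render_dst(page)) > max_ratio:
            bad.append(page)
    return bad
```

A typical flow would be: open the source file and skip (or flag) it if is_suspect_input() returns True; otherwise optimize, reopen the output, and reject it if same_page_count() is False or find_mismatched_pages() returns a non-empty list. The tolerance and max_ratio values are arbitrary starting points and would need tuning against your own files.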