Using PDFNet to Optimize Scanned PDFs for deskew, background/shadow/halo removal, despeckle, descreen ...

Q: We scan all of our internal documents and want to optimize the resulting scanned PDFs: DPI, deskew, background removal, shadow removal, despeckle, descreen, and halo removal. These operations are specific to scanned documents.

Our scanned documents are stored on a Linux server. All are black & white only.

Currently, we access the files via a Windows computer and use Acrobat to periodically and manually run the “Optimize Scanned PDF” function in Batch Processing mode.

We would like to have a program that could run periodically, maybe via a cron job, that would check a directory, automatically “optimize” any PDF documents in that directory, and move them to another directory.

Can you help? Does your SDK have this “Optimize Scanned PDF” functionality?

A:

PDFNet Optimizer (http://www.pdftron.com/pdfnet/samplecode.html#Optimizer) can remove duplicate resources (e.g. fonts, color spaces, images), and can recompress and down-sample images, but it does not offer built-in options to enhance scanned documents (e.g. deskew, background removal, shadow removal, despeckle, descreen, halo removal).

That said, it would be fairly simple to use PDFNet to implement this type of image enhancement. For example, you could use code along the lines of the ImageExtract sample to extract the embedded images, pass each image to an image enhancement tool (such as ImageMagick), and then replace the original image in the PDF with the enhanced version. See the JBIG2 sample for a concrete example (http://www.pdftron.com/pdfnet/samplecode.html#JBIG2), as well as https://groups.google.com/d/topic/pdfnet-sdk/oWlA10uDOdk/discussion and https://groups.google.com/d/topic/pdfnet-sdk/aVDOEhmuH68/discussion.
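
As a rough illustration of how the periodic job could be organized, here is a minimal Python sketch. It is not PDFNet code: `process_directory`, `imagemagick_cleanup_command`, and `enhance_image` are hypothetical names made up for this example, and the `enhance_pdf` callable stands in for the PDFNet-specific part (extract each embedded image, enhance it, swap it back into the PDF), which you would write along the lines of the samples mentioned above. The ImageMagick step assumes the `convert` command is on the PATH (newer ImageMagick 7 installs use `magick` instead) and uses only the `-deskew` and `-despeckle` options; background, shadow, and halo removal would need further tuning.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, List


def imagemagick_cleanup_command(src: Path, dst: Path) -> List[str]:
    """Build an ImageMagick command line that deskews and despeckles one image."""
    # Assumption: 40% is the deskew threshold suggested in the ImageMagick
    # documentation as working for most images; adjust it for your scans.
    return ["convert", str(src), "-deskew", "40%", "-despeckle", str(dst)]


def enhance_image(src: Path, dst: Path) -> None:
    """Run ImageMagick on a single extracted image (raises if the command fails)."""
    subprocess.run(imagemagick_cleanup_command(src, dst), check=True)


def process_directory(
    inbox: Path,
    outbox: Path,
    enhance_pdf: Callable[[Path, Path], None],
) -> List[Path]:
    """Enhance every *.pdf in `inbox` and move the result to `outbox`.

    `enhance_pdf(src, dst)` is a hypothetical hook: it should read the PDF at
    `src` and write the enhanced PDF to `dst` (e.g. using PDFNet plus
    `enhance_image`). Files that fail are left in `inbox` for a later run.
    """
    outbox.mkdir(parents=True, exist_ok=True)
    done: List[Path] = []
    for pdf in sorted(inbox.glob("*.pdf")):  # note: case-sensitive on Linux
        target = outbox / pdf.name
        tmp = target.with_name(target.name + ".tmp")
        try:
            enhance_pdf(pdf, tmp)
        except Exception as exc:
            print(f"skipping {pdf.name}: {exc}")
            tmp.unlink(missing_ok=True)  # remove any partial output
            continue
        tmp.replace(target)  # rename only after the output is complete
        pdf.unlink()         # the original has now been "moved"
        done.append(target)
    return done


if __name__ == "__main__":
    # Placeholder hook that just copies the file; replace it with real
    # PDFNet-based processing. The directory names are examples only.
    process_directory(Path("incoming"), Path("optimized"), shutil.copyfile)
```

Writing to a temporary name and renaming at the end means a run that is interrupted never leaves a half-written PDF in the output directory, and a failed file stays in the input directory so the next cron run can retry it.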