How can I detect whether a PDF is a scanned image?

agravesdale · October 9, 2014, 6:42pm

Q:

How can I determine whether or not a PDF is a scanned image, and thus contains no selectable text?

A:

In general, scanned PDFs contain a single large image. Sometimes PDF creators will run scanned images through an OCR reader and overlay invisible, selectable text over the image.

The easiest way to determine whether a PDF page contains any selectable text is to run TextExtractor (https://www.pdftron.com/pdfnet/samplecode.html#TextExtract) over the page and see what it finds. Since it sounds like you are only interested in whether the page contains selectable text, this is probably the simplest solution for your use case.
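As a rough sketch of that check in Python: this assumes the PDFNet Python bindings are importable as PDFNetPython3 and that PDFNet.Initialize() has already been called elsewhere. The helper names looks_like_text and page_has_text are made up for illustration and are not part of the SDK.

```python
def looks_like_text(extracted: str, min_chars: int = 1) -> bool:
    """Hypothetical helper: True if the extracted string contains at least
    min_chars non-whitespace characters."""
    visible = sum(1 for ch in extracted if not ch.isspace())
    return visible >= min_chars


def page_has_text(pdf_path: str, page_number: int = 1) -> bool:
    """Run TextExtractor over one page and report whether it found any text."""
    # Imported lazily so looks_like_text can be used without PDFNet installed.
    # The module name PDFNetPython3 is an assumption about your installation.
    from PDFNetPython3 import PDFDoc, TextExtractor

    doc = PDFDoc(pdf_path)
    doc.InitSecurityHandler()
    page = doc.GetPage(page_number)  # PDFNet page numbers start at 1

    extractor = TextExtractor()
    extractor.Begin(page)
    return looks_like_text(extractor.GetAsText())
```

Note that a scan which has been run through OCR carries an invisible text layer, so a check like this would report it as having text.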

An alternative method, one that would let you be more confident that this is a scanned PDF and not just a page without text, would be to use ElementReader (https://www.pdftron.com/pdfnet/samplecode.html#ElementReaderAdv). With ElementReader you could check for text elements and also check whether the page contains a single image. You could further check whether that image is monochrome, which is common for scanned PDFs.
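The ElementReader approach could be sketched roughly as follows. This is only an illustration under a few assumptions: the bindings are importable as PDFNetPython3, PDFNet has been initialized, and "monochrome" is taken to mean 1 bit per component. PageStats, looks_scanned, collect_stats and page_stats are hypothetical names, not SDK functions.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page counters filled in while walking the elements."""
    text_elements: int = 0
    images: int = 0
    monochrome_images: int = 0


def looks_scanned(stats: PageStats) -> bool:
    """Heuristic from the answer: no text elements and exactly one image."""
    return stats.text_elements == 0 and stats.images == 1


def collect_stats(reader, stats: PageStats) -> None:
    """Walk the page's elements, recursing into form XObjects."""
    from PDFNetPython3 import Element  # assumed module name

    element = reader.Next()
    while element is not None:
        kind = element.GetType()
        if kind == Element.e_text:
            stats.text_elements += 1
        elif kind in (Element.e_image, Element.e_inline_image):
            stats.images += 1
            # Assumption: 1 bit per component is treated as monochrome.
            if element.GetBitsPerComponent() == 1:
                stats.monochrome_images += 1
        elif kind == Element.e_form:
            reader.FormBegin()
            collect_stats(reader, stats)
            reader.End()
        element = reader.Next()


def page_stats(page) -> PageStats:
    """Collect text and image counts for a single PDFNet Page object."""
    from PDFNetPython3 import ElementReader  # assumed module name

    reader = ElementReader()
    reader.Begin(page)
    stats = PageStats()
    collect_stats(reader, stats)
    reader.End()
    return stats
```

A blank page gives zero text and zero images, so looks_scanned returns False for it, which is the distinction the plain text check cannot make. The monochrome count is left as an extra signal rather than a hard requirement, since color and grayscale scans are also common.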

(See also: https://groups.google.com/d/msg/pdfnet-sdk/Wq_aDhzRYQw/qk8-7EgI2ZIJ).