PDF Image extraction.

Aaron_Gravesdale · April 7, 2009, 7:46pm

Q: So far we have been quite impressed with what we can get out of
your SDK. It is easy to work with and manages to simplify the PDF
process quite a bit. Comparison against some of the alternatives is
also very favourable.

I have run into a problem working with the image extraction. Sometimes
images come out flipped or mirrored along one axis or another.
Sometimes they only contain a thin strip from the edge of the image as
displayed by Adobe Reader. Sometimes we find hundreds of images of a
single black pixel. Do you think these are limitations with the SDK or
just internal problems with the PDF files we have to deal with?

I should note that many of the files we have are badly scanned from
old paper manuals in various languages. Sometimes the images are
partially OCR'd or divided into strips by certain kinds of scanners.
Basically, the only thing we can guarantee about the files is that
they are going to be broken in one way or another.

Do you have any advice about how we might try to acquire the full image
in the correct orientation?
-----
A: The ImageExtract sample
(http://www.pdftron.com/pdfnet/samplecode.html#ImageExtract) extracts
images as they are stored in the PDF, which may not match what you see
when viewing the rendered PDF page. When drawing a PDF page, a PDF
viewer needs to apply an affine transformation matrix in order to
properly place the image on the page. The transformation matrix may
scale, rotate, shear, and translate the image to a specific location on
the page. If you are using ElementReader to enumerate PDF content, you
can use element.GetCTM() to obtain this transformation matrix.
Information about image placement, rotation, and dimensions
(resolution) can be deduced from the CTM (Current Transformation
Matrix). If you search the PDFNet Forum
(http://www.pdftron.com/pdfnet/forum.html), you will find more
information on this topic (e.g. you may want to use keywords such as
'image DPI', 'image rotation', 'image position', etc).
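
As a rough illustration of what "deduced from the CTM" can mean, here is
a minimal Python sketch. It does not call PDFNet; it assumes you have
already read the six matrix entries and the image's pixel dimensions
yourself. The function name and the [a b c d e f] parameter order (the
PDF specification's notation) are illustrative assumptions, not part of
the SDK.

```python
import math

def describe_image_ctm(a, b, c, d, e, f, px_width, px_height):
    """Summarise how an image is placed on a page, given its CTM [a b c d e f].

    Assumption: a PDF image occupies the unit square in its own space and
    the CTM maps that square onto the page, measured in points (1/72 inch).
    """
    width_pts = math.hypot(a, b)   # page length of the image's horizontal edge
    height_pts = math.hypot(c, d)  # page length of the image's vertical edge
    if width_pts == 0 or height_pts == 0:
        raise ValueError("degenerate CTM: image has no area on the page")
    return {
        "position": (e, f),                              # where the image origin lands
        "size_pts": (width_pts, height_pts),
        "rotation_deg": math.degrees(math.atan2(b, a)),  # direction of the horizontal edge
        "mirrored": (a * d - b * c) < 0,                 # negative determinant = flip
        "dpi": (px_width * 72.0 / width_pts, px_height * 72.0 / height_pts),
    }

# Hypothetical example: an 800x400 pixel image drawn 200x100 points large.
info = describe_image_ctm(200, 0, 0, 100, 50, 60, 800, 400)
flipped = describe_image_ctm(200, 0, 0, -100, 50, 160, 800, 400)
```

An extracted image that looks flipped or mirrored typically corresponds
to the "mirrored" case here: the stored pixels are fine, and the flip
lives in the matrix that was not applied.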

You may sometimes get a 1x1 pixel image when a single-pixel image is
stretched to fill a rectangular region (instead of drawing a rectangle
path). This is probably bad practice, but many PDF producers still rely
on this technique.
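
One possible way to deal with these single-pixel images is to set them
aside during extraction and record only the page area they cover. The
sketch below assumes each placement has already been collected as a
(pixel width, pixel height, CTM) tuple; that record layout and both
function names are hypothetical and do not come from PDFNet.

```python
def unit_square_bbox(a, b, c, d, e, f):
    """Page bounding box (x0, y0, x1, y1) of an image with CTM [a b c d e f]."""
    # The image fills the unit square; map its four corners onto the page.
    corners = [(e, f), (a + e, b + f), (c + e, d + f), (a + c + e, b + d + f)]
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))

def split_fills(placements):
    """Separate 1x1 'fill' images from real images.

    placements: iterable of (px_width, px_height, ctm) tuples (assumed layout).
    Returns (images, fills), where fills holds the covered page rectangles.
    """
    images, fills = [], []
    for px_w, px_h, ctm in placements:
        if px_w == 1 and px_h == 1:
            fills.append(unit_square_bbox(*ctm))
        else:
            images.append((px_w, px_h, ctm))
    return images, fills

# Hypothetical input: one stretched pixel used as a rule line, one real image.
placements = [
    (1, 1, (300, 0, 0, 2, 10, 20)),
    (800, 400, (200, 0, 0, 100, 50, 60)),
]
images, fills = split_fills(placements)
```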

Using PDFNet you should be able to get all the information required
for accurate reproduction of a PDF. We use the same PDF content
extraction APIs to implement PDF rasterizers and various kinds of PDF
converters, so the API is quite flexible and powerful.

Aaron_Gravesdale · April 7, 2009, 10:20pm

Q: We need to extract every image element from a PDF (or batch
thereof), apply all necessary transforms on them and output to image
files.

The only thing I can see provided in the API is the Matrix2D you told
me about before, which does not seem to have any documentation for
creating an image from an element in the PDF. Unfortunately, it is
obvious that a level of understanding of the mathematics involved is
required to use the class to do anything.

If your tool already does this, then there will be no need for me to
learn the fine art of matrix manipulation and affine transforms.
-----
A: The PDF format is designed to be device independent. This means
that a PDF page can be printed on any device regardless of its
resolution (i.e. a PDF page can be 'zoomed'). Creating an output image
by applying the transformation matrix (element.GetCTM()) to the base
image can be done, but I am guessing that this is not what you want to
do (actually, I am not completely clear about your requirements).
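
Because the page itself has no fixed resolution, applying the matrix to
the base image means picking an output resolution first. The following
Python sketch shows one way that step could be planned: it computes the
output canvas size for a chosen DPI and the matrix that maps the
image's unit square to pixel coordinates. The function name and return
layout are assumptions for illustration; the actual resampling would be
left to whatever imaging library you use.

```python
import math

def raster_plan(ctm, dpi):
    """Plan the rasterisation of one image placed with CTM [a b c d e f].

    Returns (width_px, height_px, to_pixels), where to_pixels is a matrix
    [A B C D E F] taking unit-square coordinates (x, y) to pixel
    coordinates (A*x + C*y + E, B*x + D*y + F).
    """
    a, b, c, d, e, f = ctm
    # Page bounding box of the transformed unit square, in points.
    xs = [e, a + e, c + e, a + c + e]
    ys = [f, b + f, d + f, b + d + f]
    x0, x1, y1 = min(xs), max(xs), max(ys)
    y0 = min(ys)
    s = dpi / 72.0  # pixels per point
    width_px = max(1, math.ceil((x1 - x0) * s))
    height_px = max(1, math.ceil((y1 - y0) * s))
    # Shift the box to the origin, scale to pixels, and flip y because
    # raster rows run top-down while PDF page coordinates run bottom-up.
    to_pixels = (a * s, -b * s, c * s, -d * s, (e - x0) * s, (y1 - f) * s)
    return width_px, height_px, to_pixels

# Hypothetical example: a 200x100 point placement rendered at 144 DPI.
width_px, height_px, to_pixels = raster_plan((200, 0, 0, 100, 50, 60), 144)
```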

You could also copy each image element to a separate (temporary) PDF
page [as in ElementEdit sample] and use PDFDraw [as in PDFDraw sample]
to convert the page with a single image element to an image. This will
result in a properly transformed image at a specific resolution.