Q: So far we have been quite impressed with what we can get out of
your SDK. It is easy to work with and manages to simplify the PDF
process quite a bit. Comparison against some of the alternatives is
also very favourable.
I have run into a problem working with the image extraction. Sometimes
images come out flipped or mirrored along one axis or another.
Sometimes they only contain a thin strip from the edge of the image as
displayed by Adobe Reader. Sometimes we find hundreds of images of a
single black pixel. Do you think these are limitations with the SDK or
just internal problems with the PDF files we have to deal with?
I should note that many of the files we have are badly scanned from
old paper manuals in various languages. Sometimes the images are
partially OCR'd or divided into strips by certain kinds of scanners.
Basically, the only thing we can guarantee about the files is that
they are going to be broken in one way or another.
Do you have any advice about how we might try to acquire the full image
in the correct orientation?
A: The ImageExtract sample
(http://www.pdftron.com/pdfnet/samplecode.html#ImageExtract) extracts
images as they are stored in the PDF, which may not match how the image
appears on the rendered page. When drawing a PDF page, a PDF viewer
applies an affine transformation matrix to place the image on the
page. The transformation matrix may scale, rotate, shear, and translate
the image to a specific location on the page, so an image that looks
flipped or mirrored after extraction is usually stored that way and
corrected by this matrix at display time. If you are
using ElementReader to enumerate PDF content, you can use
element.GetCTM() to obtain this transformation matrix. Information
about image placement, rotation, and dimensions (resolution) can be
deduced from the CTM (Current Transformation Matrix). If you search the
PDFNet Forum (http://www.pdftron.com/pdfnet/forum.html), you will find
more information on this topic (e.g. try keywords such as
'image DPI', 'image rotation', 'image position', etc).
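As a rough illustration of what can be deduced from the CTM, here is a minimal Python sketch. It does not call PDFNet at all: it assumes you have already read the six matrix values (a, b, c, d, h, v) from the matrix returned by element.GetCTM() and the pixel dimensions from the image element. The function name describe_image_placement and the returned dictionary keys are made up for this example. It relies only on the standard PDF convention that an image occupies the unit square, which the matrix then maps onto the page.

```python
import math

def describe_image_placement(a, b, c, d, h, v, pixel_width, pixel_height):
    """Summarize how a PDF image is placed on the page.

    (a, b, c, d, h, v) are the six CTM values; the image's unit square
    is mapped so that (1, 0) -> (a, b) and (0, 1) -> (c, d), offset by (h, v).
    Hypothetical helper for illustration, not part of PDFNet.
    """
    # Lengths of the transformed unit vectors give the size on the page
    # in PDF user-space units (1/72 inch by default).
    width_pt = math.hypot(a, b)
    height_pt = math.hypot(c, d)

    # Angle of the image's horizontal edge relative to the page x-axis.
    rotation_deg = math.degrees(math.atan2(b, a))

    # A negative determinant means the matrix includes a reflection,
    # i.e. the stored image is mirrored relative to how it is displayed.
    mirrored = (a * d - b * c) < 0

    # Effective resolution: pixels divided by size in inches.
    dpi_x = pixel_width * 72.0 / width_pt if width_pt else None
    dpi_y = pixel_height * 72.0 / height_pt if height_pt else None

    return {
        "origin": (h, v),
        "width_pt": width_pt,
        "height_pt": height_pt,
        "rotation_deg": rotation_deg,
        "mirrored": mirrored,
        "dpi_x": dpi_x,
        "dpi_y": dpi_y,
    }

# Example: an 800x400 pixel image drawn 200x100 points, flipped vertically.
info = describe_image_placement(200, 0, 0, -100, 50, 700, 800, 400)
print(info)
```

If mirrored is true or rotation_deg is non-zero, the extracted bitmap needs the corresponding flip or rotation applied before it will match what a viewer shows. Note that shear is ignored here, so treat the result as an approximation for skewed images.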
You may sometimes get a 1x1 pixel image when the image is stretched to
fill a rectangular region (instead of drawing a rectangle path). This
is probably bad practice, but many PDF producers still rely on this
technique.
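If the single-pixel images are just noise for your purposes, one option is to skip them during extraction. The sketch below is plain Python with no PDFNet calls; it assumes you already have each image's pixel width and height, and the names is_probable_fill_image and max_pixels are invented for this example. The threshold is a guess that you may want to tune for your files.

```python
def is_probable_fill_image(pixel_width, pixel_height, max_pixels=1):
    """Return True for images so small they are likely used as a
    stretched color fill rather than as real picture content.

    Hypothetical helper; max_pixels is an assumed threshold on the
    total pixel count (1 matches the 1x1 case described above).
    """
    return pixel_width * pixel_height <= max_pixels

# Keep only images that look like real content.
sizes = [(1, 1), (800, 400), (1, 1), (1700, 32)]
real_images = [s for s in sizes if not is_probable_fill_image(*s)]
print(real_images)
```

Thin strips such as the 1700x32 entry are deliberately kept, since they may be slices of a larger scanned page rather than fills.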
Using PDFNet, you should be able to get all the information required
for accurate reproduction of the PDF. We use the same PDF content
extraction APIs to implement PDF rasterizers and various kinds of PDF
converters, so the API is quite flexible and powerful.