How do I process/extract inline-images from PDF?

Aaron_Gravesdale · October 28, 2008, 8:53pm

Q: When parsing PDF documents, we have some PDFs from which we cannot
extract images. Here is the code we are using to extract images:

pdftron.PDF.Image image = new pdftron.PDF.Image(imageElement.GetXObject());

We have also tried imageElement.GetBitmap();

In either case, an error is thrown when extracting the image this
way. Why are we getting this error, and what can we do to extract
these images?
---------
A: The most likely problem is that you have encountered an inline
image object (instead of an XObject image). You can tell the two apart
by checking the element type:

Element.Type type = element.GetType();
if (type == Element.Type.e_image) {
  // XObject image: wrap it in pdftron.PDF.Image and export it.
  pdftron.PDF.Image image = new pdftron.PDF.Image(element.GetXObject());
  image.Export(fname); // or ExportAsPng() or ExportAsTiff() ...

  // ...or convert PDF bitmap to GDI+ Bitmap...
  //Bitmap bmp = element.GetBitmap();
  //bmp.Save(fname, ImageFormat.Png);
  //bmp.Dispose();
}
else if (type == Element.Type.e_inline_image) {
  // ... see below ...
}

For an inline image object, the only way to extract the data is
element.GetImageData().

GetImageData() returns a filter object (i.e. a raw decompressed image
stream). The simplest way to access this data is with a FilterReader,
as shown below:

FilterReader reader = new FilterReader(element.GetImageData());
// A buffer sized to hold the image data.
byte[] image_data_out = new byte[element.GetImageDataSize()];
reader.Read(image_data_out); // image_data_out now contains RAW image data.

Because the raw image data may use different color spaces and pixel
formats, you can normalize all data to RGB format using the
pdftron.PDF.Image2RGB filter. For example:

// Extract the inline image and convert it to RGB 8-bpc format.
Image2RGB img_conv = new Image2RGB(element);
FilterReader reader = new FilterReader(img_conv);
// The RGB output uses 3 bytes per pixel.
byte[] image_data_out =
  new byte[element.GetImageWidth() * element.GetImageHeight() * 3];
reader.Read(image_data_out); // image_data_out now contains RAW RGB image data.

You can also read an image one chunk at a time by repeatedly calling
reader.Read(buf, buf_sz) until the function returns 0.
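
The read-until-zero loop described above could look roughly like the sketch below. To keep it independent of the PDFNet binding, the read call is passed in as a delegate; the names ChunkedRead and ReadAll are made up for this illustration and are not part of PDFNet.

```csharp
using System;
using System.IO;

public static class ChunkedRead
{
    // Calls readChunk repeatedly until it returns 0 and concatenates
    // everything it produced. readChunk is expected to fill the buffer
    // it is given and return the number of bytes it wrote.
    public static byte[] ReadAll(Func<byte[], int> readChunk, int chunkSize)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException("chunkSize");

        byte[] buf = new byte[chunkSize];
        using (MemoryStream collected = new MemoryStream())
        {
            int n;
            while ((n = readChunk(buf)) > 0)
            {
                // Only the first n bytes of buf are valid for this chunk.
                collected.Write(buf, 0, n);
            }
            return collected.ToArray();
        }
    }
}
```

With a FilterReader this would presumably be called as ChunkedRead.ReadAll(buf => reader.Read(buf), 4096), assuming the .NET Read overload returns the number of bytes read as an int; check the signature in your PDFNet version.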

Aaron_Gravesdale · October 29, 2008, 9:44pm

Q: I had found that code in ElementReaderAdvTest, and I have confirmed
that this is an inline image I am dealing with. The trouble I have is
in converting the image data into some kind of image, either a
pdftron.PDF.Image or a System.Drawing.Image (or Bitmap). I get an
error when I try to convert the image data returned from
GetImageData() into an image, for instance:
    Image2RGB img_conv = new Image2RGB(imageElement);
    FilterReader reader = new FilterReader(img_conv);
    byte[] image_data_out = new byte[imageElement.GetImageDataSize()];
    reader.Read(image_data_out);
    Bitmap bm = new Bitmap(image_data_out);

Returns the following error:
Parameter is not valid.
at System.Drawing.Bitmap..ctor(Stream stream)

How can the image data from Element.GetImageData() be used to
construct a correct System.Drawing.Bitmap?
------------
A: Unfortunately, the System.Drawing.Bitmap constructor does not
accept raw image data directly. You need to create a new Bitmap with
the image's dimensions (i.e. element.GetImageWidth(),
element.GetImageHeight()) and a pixel format (e.g. 24-bit RGB), then
copy the RGB data into it through the BitmapData returned by
bmp.LockBits(new Rectangle(0, 0, width, height),
ImageLockMode.WriteOnly, PixelFormat.Format24bppRgb). You can find
more on this in the MSDN documentation or GDI+/.NET related forums
(e.g. www.bobpowell.net/faqmain.htm).
While copying the image data you may need to skip some padding bytes
(stride - 3*width) at the end of each line. This is necessary because
GDI+/.NET stores bitmap scan lines aligned on a 4-byte boundary,
whereas Image2RGB returns a data stream without any padding bytes at
the end of each line.
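
As a rough sketch of the copy described above: the helper below repacks tightly packed RGB rows into stride-aligned rows and then copies them into a locked Bitmap. RgbToBitmap, ToPaddedBgr and FromRgb are hypothetical names for this example. Two assumptions are baked in: the Bitmap reports a positive stride (as a Bitmap created with this constructor normally does), and the bytes are swapped to blue, green, red order because that is how GDI+ lays out Format24bppRgb in memory.

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.InteropServices;

public static class RgbToBitmap
{
    // Repacks tightly packed R,G,B rows into rows of `stride` bytes
    // stored in B,G,R order. Padding bytes are left as zero.
    public static byte[] ToPaddedBgr(byte[] rgb, int width, int height, int stride)
    {
        if (stride < width * 3) throw new ArgumentException("stride is too small");
        if (rgb.Length < width * height * 3) throw new ArgumentException("rgb buffer is too small");

        byte[] padded = new byte[stride * height];
        for (int y = 0; y < height; y++)
        {
            int src = y * width * 3; // source rows have no padding
            int dst = y * stride;    // destination rows are stride bytes apart
            for (int x = 0; x < width; x++, src += 3, dst += 3)
            {
                padded[dst]     = rgb[src + 2]; // blue
                padded[dst + 1] = rgb[src + 1]; // green
                padded[dst + 2] = rgb[src];     // red
            }
        }
        return padded;
    }

    // Builds a 24-bit Bitmap from raw RGB data (3 bytes per pixel, no padding).
    public static Bitmap FromRgb(byte[] rgb, int width, int height)
    {
        Bitmap bmp = new Bitmap(width, height, PixelFormat.Format24bppRgb);
        BitmapData data = bmp.LockBits(new Rectangle(0, 0, width, height),
                                       ImageLockMode.WriteOnly,
                                       PixelFormat.Format24bppRgb);
        try
        {
            // Assumes data.Stride is positive (top-down bitmap).
            byte[] padded = ToPaddedBgr(rgb, width, height, data.Stride);
            Marshal.Copy(padded, 0, data.Scan0, padded.Length);
        }
        finally
        {
            bmp.UnlockBits(data);
        }
        return bmp;
    }
}
```

With the buffer filled through Image2RGB, the call would be something like RgbToBitmap.FromRgb(image_data_out, element.GetImageWidth(), element.GetImageHeight()), after which bmp.Save(fname, ImageFormat.Png) works as for any other Bitmap.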

Aaron_Gravesdale · October 29, 2008, 11:06pm

Q: Thanks. One last question then. So assume:

element.GetBitsPerComponent() == 8 and element.GetComponentNum() == 1,
then this would be an 8bpp image

element.GetBitsPerComponent() == 8 and element.GetComponentNum() == 2,
then this would be a 16bpp image

etc.

Am I correct in this?
------
A: Correct, except that if you are using the Image2RGB filter you
don't need to worry about different pixel formats, since all image
data is normalized to 24-bpp RGB.
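
For the raw (non-Image2RGB) case, the rule in the question amounts to multiplying the two values. A minimal sketch of that arithmetic follows; PixelLayout and its methods are hypothetical helpers, and the row-size calculation assumes that each row of raw samples starts on a byte boundary, as the PDF format specifies for image sample data.

```csharp
public static class PixelLayout
{
    // Bits per pixel = bits per component * number of components,
    // e.g. the values from GetBitsPerComponent() and GetComponentNum().
    public static int BitsPerPixel(int bitsPerComponent, int componentCount)
    {
        return bitsPerComponent * componentCount;
    }

    // Bytes in one row of raw samples, rounded up to a whole byte.
    public static int RawRowBytes(int width, int bitsPerComponent, int componentCount)
    {
        int bitsPerRow = width * BitsPerPixel(bitsPerComponent, componentCount);
        return (bitsPerRow + 7) / 8;
    }
}
```

The rounding matters for images with fewer than 8 bits per pixel: a 1-bit image that is 10 pixels wide still occupies 2 bytes per row.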