Q: When parsing PDF documents, we have some PDFs from which we cannot
extract images. Here is the code we are using to extract images:
pdftron.PDF.Image image = new pdftron.PDF.Image(imageElement.GetXObject());
We have also tried imageElement.GetBitmap();
In either case, an error is thrown when extracting the image this way.
Why are we getting this error, and what can we do to extract these
images?
---------
A: The most likely cause is that you have encountered an inline image
object (instead of an XObject image). You can tell the two apart by
checking the element type:
Element.Type type = element.GetType();
if (type == Element.Type.e_image) {
    pdftron.PDF.Image image = new pdftron.PDF.Image(element.GetXObject());
    image.Export(fname); // or ExportAsPng() or ExportAsTiff() ...
    // ...or convert the PDF bitmap to a GDI+ Bitmap...
    //Bitmap bmp = element.GetBitmap();
    //bmp.Save(fname, ImageFormat.Png);
    //bmp.Dispose();
}
else if (type == Element.Type.e_inline_image) {
    // ... see below ...
}
In the case of an inline image object, the only way to extract the data
is to use element.GetImageData().
GetImageData() returns a filter object (i.e. a stream of raw,
decompressed image data). The simplest way to access this data is with a
FilterReader, as shown below:
FilterReader reader = new FilterReader(element.GetImageData());
// A buffer used to keep image data; it must be large enough to hold
// the whole decoded image.
byte[] image_data_out = new byte[element.GetImageDataSize()];
reader.Read(image_data_out); // image_data_out contains RAW image data.
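The buffer passed to Read() has to be big enough for the decoded samples. As a rough sketch of how that size relates to the image parameters, the helper below assumes the usual PDF sample layout, in which every row of samples is padded to a whole number of bytes. The class and method names (ImageBufferSize, RawImageSize) are hypothetical and are not part of the PDFTron API.

```csharp
using System;

static class ImageBufferSize
{
    // Hypothetical helper: number of bytes needed for the raw samples of an
    // image, assuming each row is padded up to a byte boundary.
    public static int RawImageSize(int width, int height,
                                   int components, int bitsPerComponent)
    {
        if (width < 0 || height < 0 || components <= 0 || bitsPerComponent <= 0)
            throw new ArgumentOutOfRangeException("Invalid image parameters.");

        // Use 64-bit arithmetic for the intermediate values.
        long bitsPerRow = (long)width * components * bitsPerComponent;
        long bytesPerRow = (bitsPerRow + 7) / 8; // round up to whole bytes

        // Throws OverflowException if the result does not fit in an int.
        return checked((int)(bytesPerRow * height));
    }
}
```

With PDFTron, the arguments would presumably come from element.GetImageWidth(), element.GetImageHeight(), element.GetComponentNum() and element.GetBitsPerComponent(); treat that mapping as an assumption and compare the result against element.GetImageDataSize() before relying on it.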
Because the raw image data may be represented using different color
spaces and pixel formats, you can normalize all data to RGB format
using the pdftron.PDF.Image2RGB filter. For example:
// Extract and convert the inline image to RGB 8-bpc format.
Image2RGB img_conv = new Image2RGB(element);
FilterReader reader = new FilterReader(img_conv);
// A buffer used to keep image data: 3 bytes (R, G, B) per pixel.
byte[] image_data_out = new byte[element.GetImageWidth() * element.GetImageHeight() * 3];
reader.Read(image_data_out); // image_data_out contains RAW RGB image data.
You can also read an image one chunk at a time by repeatedly calling
reader.Read(buf, buf_sz) until the function returns 0.
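A minimal sketch of such a chunked read loop is shown below. To keep it self-contained it does not reference PDFTron directly: the read step is passed in as a delegate that fills a buffer and returns the number of bytes read, with 0 meaning end of data. The names ChunkedReader and ReadAll are hypothetical.

```csharp
using System;
using System.IO;

static class ChunkedReader
{
    // Calls 'read' repeatedly with a fixed-size buffer and collects the
    // results until 'read' reports that no more bytes are available.
    public static byte[] ReadAll(Func<byte[], int> read, int chunkSize = 4096)
    {
        if (read == null) throw new ArgumentNullException(nameof(read));
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        byte[] buf = new byte[chunkSize];
        using (MemoryStream collected = new MemoryStream())
        {
            int n;
            while ((n = read(buf)) > 0)
            {
                // Only the first n bytes of the buffer are valid.
                collected.Write(buf, 0, n);
            }
            return collected.ToArray();
        }
    }
}
```

Assuming the .NET FilterReader.Read overload takes a byte[] and returns the number of bytes read, a call might look like ChunkedReader.ReadAll(buf => reader.Read(buf)). Processing each chunk inside the loop instead of collecting everything avoids holding the whole image in memory.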