How can I detect if an image is used repeatedly within a PDF?

Aaron_Gravesdale · December 15, 2008, 11:55pm

Q: When parsing PDF documents containing images, I find that many
documents reuse images. So I need to extract an image only the first
time it is encountered, and then refer to that instance of the image
every other time. How can I check for this when parsing?

I was planning on something such as:

Obj sdf = imageElement.GetXObject();
int num = sdf.GetObjNum(); // HERE IS WHERE I NEED HELP
if (!Parsed(num)) // Parsed() is my own lookup, not a PDFTron call
{
    pdftron.PDF.Image image = new pdftron.PDF.Image(sdf);
}

Would this be the basic approach to take when using PDFTron? How can I
get an ID such that I don’t parse the same image twice?

To further illustrate, see Example 4.28 from the PDF spec below. The
image XObject is object 22. I want to have to parse it only once.
Every time it is referenced, as in the "/Im1 Do" line in object 23, I
want to get some kind of ID rather than the image object.

Example 4.28

20 0 obj % Page object
<< /Type /Page /Parent 1 0 R /Resources 21 0 R
   /MediaBox [ 0 0 612 792 ] /Contents 23 0 R >>
endobj
21 0 obj % Resource dictionary for page
<< /ProcSet [ /PDF /ImageB ] /XObject << /Im1 22 0 R >> >>
endobj
22 0 obj % Image XObject
<< /Type /XObject /Subtype /Image /Width 256 /Height 256
   /ColorSpace /DeviceGray /BitsPerComponent 8
   /Length 83183 /Filter /ASCII85Decode >>
stream
9LhZI9h\GY9i+bb;,p:e;G9SP92/)X9MJ>^:f14d;,U(X8P;cO;G9e];c$=k9Mn\]
… Image data representing 65,536 samples …
8P;cO;G9e];c$=k9Mn\]~>
endstream
endobj
23 0 obj % Contents of page
<< /Length 56 >>
stream
q % Save graphics state
132 0 0 132 45 140 cm % Translate to (45,140) and scale by 132
/Im1 Do % Paint image
Q % Restore graphics state
endstream
endobj
-----
A: There are a couple of options:

a) Maintain a map of the object numbers of the image objects you have
already visited. Look up sdf.GetObjNum() in this map and, if the image
is not found, insert [sdf.GetObjNum(), sdf] into the map.

b) Mark the SDF object as visited using sdf.SetMark(true). To check
whether an object has been visited, use sdf.IsMarked(). To clear all
marks, use sdfdoc.ClearMarks();
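
For illustration, option (b) could look roughly like the sketch below.
It is written in Java and does not call PDFTron at all: MarkableObj is
a hypothetical stand-in that models only an object number and a mark
flag, with method names chosen to mirror the ones mentioned above.
Treat it as a sketch of the mark-and-check pattern, not as the
library's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an SDF object; only the mark flag is modeled.
final class MarkableObj {
    private final long objNum;
    private boolean marked;

    MarkableObj(long objNum) { this.objNum = objNum; }

    long getObjNum() { return objNum; }
    boolean isMarked() { return marked; }
    void setMark(boolean value) { marked = value; }
}

final class MarkSketch {
    // Returns the object numbers that were extracted, in first-seen order.
    static List<Long> extractOnce(List<MarkableObj> references) {
        List<Long> extracted = new ArrayList<>();
        for (MarkableObj obj : references) {
            if (obj.isMarked()) {
                continue; // already handled through an earlier reference
            }
            obj.setMark(true);
            extracted.add(obj.getObjNum()); // real code would build the image here
        }
        return extracted;
    }

    // Resets every mark so that a later pass starts from a clean state.
    static void clearMarks(List<MarkableObj> references) {
        for (MarkableObj obj : references) {
            obj.setMark(false);
        }
    }

    public static void main(String[] args) {
        MarkableObj logo = new MarkableObj(22);
        MarkableObj photo = new MarkableObj(31);
        // The same object is reached through several references.
        List<MarkableObj> refs = Arrays.asList(logo, photo, logo, logo);
        System.out.println(extractOnce(refs)); // [22, 31]
    }
}
```

Because the mark lives on the shared object itself, no separate lookup
table is needed, but the marks have to be cleared before the document
is traversed a second time.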

Aaron_Gravesdale · December 16, 2008, 7:55pm

Q: What I wanted to make sure of was that GetObjNum() is the correct
method to call. Is GetObjNum() supposed to return the same value each
time the same image is referenced? For instance, suppose I have a PDF
document with 1 image and 5 pages, and I paint the image with the 'Do'
operator on each page. When parsing with PDFTron, I will get an element
on each page where element.getType() == Type.e_image. On each page I
will then call imageElement.GetXObject().GetObjNum() (5 separate calls
in total). This should return the same value each of the 5 times, is
that correct?
-----
A: You are right. You can identify shared objects by comparing their
object numbers (i.e. using the GetObjNum() method).
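
To make the five-page scenario concrete, here is a minimal Java sketch
of a cache keyed by object number. It assumes the object number has
already been read from the image element; ObjNumCacheSketch,
ParsedImage, imageFor and parseCount are hypothetical names made up
for this example, and nothing here calls PDFTron. Object numbers
identify objects within a single PDF file only, so the sketch assumes
one cache per document.

```java
import java.util.HashMap;
import java.util.Map;

final class ObjNumCacheSketch {
    // Hypothetical stand-in for a decoded image; records its source object.
    static final class ParsedImage {
        final long objNum;
        ParsedImage(long objNum) { this.objNum = objNum; }
    }

    private final Map<Long, ParsedImage> byObjNum = new HashMap<>();
    private int parseCount = 0;

    // Returns the cached image for this object number, parsing on first sight only.
    ParsedImage imageFor(long objNum) {
        return byObjNum.computeIfAbsent(objNum, n -> {
            parseCount++; // counts how often the "expensive" parse runs
            return new ParsedImage(n);
        });
    }

    int parseCount() { return parseCount; }

    public static void main(String[] args) {
        ObjNumCacheSketch cache = new ObjNumCacheSketch();
        // Page 1 paints object 22; pages 2 to 5 paint the same object.
        ParsedImage first = cache.imageFor(22);
        for (int page = 2; page <= 5; page++) {
            ParsedImage again = cache.imageFor(22);
            if (again != first) {
                throw new IllegalStateException("expected the cached instance");
            }
        }
        System.out.println(cache.parseCount()); // 1
    }
}
```

All five lookups use the same key, so the image is parsed once and the
four later pages receive the instance created for page 1.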