Q: When parsing PDF documents that contain images, I find that many
documents reuse the same image. I need to extract each image only the
first time it is encountered, and refer to that instance every other
time. How can I check for this when parsing?
I was planning on something like:
Obj sdf = imageElement.GetXObject();
int num = sdf.GetObjNum(); // HERE IS WHERE I NEED HELP
if (!Parsed(num))
{
    pdftron.PDF.Image image = new pdftron.PDF.Image(sdf);
}
Would this be the basic approach to take when using PDFTron? How can I
get an ID so that I don’t parse the same image twice?
To illustrate further, see Example 4.28 from the PDF spec below. The
XObject is in blue. I want to parse it only once. Every time it is
referenced, as in the red code below, I want to get some kind of ID
rather than the image object.
Example 4.28
20 0 obj % Page object
<< /Type /Page /Parent 1 0 R /Resources 21 0 R
   /MediaBox [ 0 0 612 792 ] /Contents 23 0 R >>
endobj
21 0 obj % Resource dictionary for page
<< /ProcSet [ /PDF /ImageB ] /XObject << /Im1 22 0 R >> >>
endobj
22 0 obj % Image XObject
<< /Type /XObject /Subtype /Image /Width 256 /Height 256
   /ColorSpace /DeviceGray /BitsPerComponent 8
   /Length 83183 /Filter /ASCII85Decode >>
stream
9LhZI9h\GY9i+bb;,p:e;G9SP92/)X9MJ>^:f14d;,U(X8P;cO;G9e];c$=k9Mn\]
… Image data representing 65,536 samples …
8P;cO;G9e];c$=k9Mn\]~>
endstream
endobj
23 0 obj % Contents of page
<< /Length 56 >>
stream
q % Save graphics state
132 0 0 132 45 140 cm % Translate to (45,140) and scale by 132
/Im1 Do % Paint image
Q % Restore graphics state
endstream
endobj
-----
A: There are a couple of options:
a) Maintain a map of the object numbers of the image objects you have
visited. Look up sdf.GetObjNum() in this map and, if the image is not
found, insert [sdf.GetObjNum(), sdf].
b) Mark the SDF object as visited using sdf.SetMark(true). To check
whether an object has been visited, use sdf.IsMarked(). To clear all
marks, use sdfdoc.ClearMarks().