Low level PDF Parser

vasudha.sahu · April 22, 2021, 12:09pm

Hi Team,

I need to write a C++ COS-layer PDF parser for Windows that can parse all the direct and indirect objects in a given PDF document. We will be doing further analysis on each object.

I came across APIs to create indirect objects and to read an object given its object number, but could not find an API to parse a complete document and fetch all the direct and indirect objects.

I am doing this tryout for my organization. If these capabilities are present in the SDK, I can do a POC and recommend the SDK for licensing.

Thanks!
Vasudha Sahu

Ryan · April 22, 2021, 4:03pm

Yes, you can definitely do this with PDFTron SDK.

Please see the code in this forum post, which shows how to parse all indirect and direct objects in a PDF.

vasudha.sahu · April 23, 2021, 11:45am

Thanks Ryan.
So basically XRefSize gives us the size of the cross-reference table, and enumerating from object number 1 up to XRefSize will give us all the indirect objects.
Does this method give us all the indirect objects even in case of incremental updates and multiple cross-reference tables?
And what about the direct objects? Are there any enumeration callbacks or an object parser class for SDFDoc that can enumerate the complete document for us?

Just to give you a brief overview, we are trying to extract all the COS layer objects.
We will then analyze these objects further. Broadly, we will mainly:

Extract all the Javascript. - this is very much possible with PDFTron.
Extract all /ObjStm objects.
Extract the decoded embedded file streams and other object streams. - It does have APIs like GetDecodedStream and GetRawStream.

Thanks & Regards,
Vasudha Sahu

Ryan · April 23, 2021, 3:53pm

Yes, the xref table covers all the indirect objects, and the code referenced in the other post shows how to then iterate over all the direct objects (including Arrays, Streams and Dictionaries) recursively. So in the end you have iterated over every object, both indirect and direct.

“even in case of incremental updates and multiple cross-reference tables.”

Yes, our SDK handles incremental updates, but you see the final versions of the objects. So if an object was modified or deleted, you would not see the original object. If that is not what you are looking for then please elaborate.
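
To illustrate the "final versions" behaviour described above, here is a minimal C++ sketch of how later cross-reference sections override earlier ones. It does not use the PDFTron API: `XrefEntry`, `XrefSection` and `ResolveLatest` are hypothetical names introduced only for this example, and a real parser has to deal with much more (generation numbers, xref streams and so on).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of one cross-reference entry; not an SDK type.
struct XrefEntry {
    bool in_use;       // false marks the object as free (deleted)
    std::string body;  // stand-in for the object's contents
};

// One xref section per save, keyed by object number.
using XrefSection = std::map<int, XrefEntry>;

// Applies the sections oldest-first so that a later save overrides an
// earlier one. Objects freed by a later save disappear from the result.
std::map<int, std::string> ResolveLatest(const std::vector<XrefSection>& sections) {
    std::map<int, std::string> live;
    for (const XrefSection& section : sections) {
        for (const auto& kv : section) {
            assert(kv.first >= 0 && "object numbers are non-negative");
            if (kv.second.in_use) {
                live[kv.first] = kv.second.body;  // added or edited in this save
            } else {
                live.erase(kv.first);             // deleted in this save
            }
        }
    }
    return live;
}
```

With an original section holding objects 1 to 3 and an update that edits object 2 and frees object 3, the result contains object 1 unchanged and only the edited version of object 2.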

“Is there any enumeration callbacks and object parser class for SDFDoc which can enumerate the complete doc for us.”

Yes, the code in the other forum post does all the enumerating for you, using our APIs.

“Extract all the Javascript. - this is very much possible with PDFTron.”

Yes, the other forum post referenced deletes all the JavaScript, but you could extract it instead. See this post.

“Extract all /ObjStm objects.”

Yes, our SDK parses all objects, including those in a compressed object stream.

“Extract the decoded embedded file streams and other object streams. - It does have APIs like GetDecodedStream and GetRawStream.”

Yes, exactly. There are APIs to access the stream as it is stored in the PDF, and PDFTron can also decode the streams for you so you get the actual data.
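
As a small illustration of the difference between raw and decoded stream data, here is a C++ sketch that decodes one of the simplest standard PDF filters, ASCIIHexDecode. This is not how the SDK is implemented and it does not call the SDK; `DecodeAsciiHex` is a hypothetical helper written for this example, and real streams usually use other filters (such as FlateDecode) or a chain of filters.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Decodes ASCIIHexDecode data: pairs of hex digits form one byte,
// whitespace is ignored, '>' marks end of data, and a trailing single
// digit is treated as if it were followed by 0.
std::string DecodeAsciiHex(const std::string& raw) {
    std::string out;
    int high = -1;  // pending high nibble, or -1 when none is pending
    for (unsigned char c : raw) {
        if (c == '>') break;            // end-of-data marker
        if (std::isspace(c)) continue;  // whitespace carries no data
        assert(std::isxdigit(c) && "unexpected character in ASCIIHex data");
        int value = std::isdigit(c) ? c - '0' : std::tolower(c) - 'a' + 10;
        if (high < 0) {
            high = value;
        } else {
            out.push_back(static_cast<char>(high * 16 + value));
            high = -1;
        }
    }
    if (high >= 0) out.push_back(static_cast<char>(high * 16));
    return out;
}
```

Here the raw bytes would be the hex text exactly as stored in the file, while the decoded data is what the analysis would normally inspect.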

vasudha.sahu · April 23, 2021, 9:26pm

Hi Ryan,

“Yes, the Xref is all the indirect objects, and the code referenced in the other post shows how to then iterate all the direct objects (including Arrays, Streams and Dictionaries) recursively. So in the end you have iterated every object, both indirect and direct.”

Yeah, I can see that it iterates through the trailer dictionary. As per the Adobe PDF 1.7 documentation, the trailer dictionary provides access to the cross-reference table and to special objects like the catalog dictionary. But I am not sure whether it also provides access to all the direct COS objects, for example /JS, /Launch, /OpenAction, /EmbeddedFiles, /JavaScript, /AA etc. Basically we will be categorizing the objects and maintaining a counter of all the objects in each category.
Pardon me, I am a novice here.

“Yes, our SDK handles incremental updates, but you see the final versions of the objects. So if an object was modified or deleted, you would not see the original object. If that is not what you are looking for then please elaborate.”

That sounds good for now.

Do we get a clear segregation of free objects and /ObjStm objects?

We are supposed to keep a count of /ObjStm objects as well.
Earlier we tried open-source libraries, which lose track of /ObjStm objects and treat them as free objects in a few cases, thereby generating incorrect counters.

Can you please share the cost and licensing schemes for the C++ SDK on Windows (x86 and x64), so that I can propose the library for a POC?

Thanks & Regards,
Vasudha Sahu

Ryan · April 26, 2021, 7:40pm

I decided to write you clearer code that parses all COS objects in a PDF. The one exception is an object that was deleted by an earlier incremental save, which you will not see. Similarly, for objects edited by an earlier incremental save you only see the latest version of the object, which is in accordance with the PDF ISO standard.

This code also includes ObjStm objects. Just look for Stream objects that have a key of “Type” with the value “ObjStm”. You can check the output of the code below to see what I mean.

See the C# code below. Currently it prints out the keys and values, but it is up to you what it should do with them.

// Requires: using System; using pdftron.PDF; using pdftron.SDF;
static void HandleString(Obj obj)
{
	Console.Write("(" + obj.GetAsPDFText() + ")");
}

static void HandleName(Obj obj)
{
	Console.Write("/" + obj.GetName());
}

static void HandleBool(Obj obj)
{
	if (obj.GetBool()) Console.Write("true");
	else Console.Write("false");
}

static void HandleNumber(Obj obj)
{
	Console.Write(String.Format("{0}", obj.GetNumber()));
}

// Dispatches on the COS object type.
// Does not check for C# null references or free objects; do that before passing here.
static void HandleObj(Obj obj)
{
	if (obj.IsNull()) Console.Write("NULL");
	else if (obj.IsBool()) HandleBool(obj);
	else if (obj.IsString()) HandleString(obj);
	else if (obj.IsName()) HandleName(obj);
	else if (obj.IsNumber()) HandleNumber(obj);
	else if (obj.IsStream() || obj.IsDict()) HandleDictOrStream(obj);
	else if (obj.IsArray()) HandleArray(obj);
}

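// Prints the dictionary entries; for a stream, this is the stream dictionary only.
static void HandleDictOrStream(Obj dict)
{
	Console.WriteLine("{");
	var diter = dict.GetDictIterator();
	for (; diter.HasNext(); diter.Next())
	{
		string keyName = diter.Key().GetName();
		Console.Write("\n\"" + keyName + "\": ");
		Obj current = diter.Value();
		if (current == null) continue;
		if (current.IsIndirect())
		{
			// Indirect objects are printed by the xref loop in PrintDocument,
			// so only the reference is printed here.
			Console.Write(current.GetObjNum() + " " + current.GetGenNum() + " R");
			continue;
		}
		HandleObj(current);
	}
	Console.WriteLine("\n}");
}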

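// Prints the array elements; indirect elements are shown as references only.
static void HandleArray(Obj arr)
{
	Console.WriteLine("[");
	for (int i = 0; i < arr.Size(); i++)
	{
		Obj item = arr.GetAt(i);
		if (item == null) continue;
		if (item.IsIndirect())
		{
			// Printed in full by the xref loop in PrintDocument.
			Console.Write(item.GetObjNum() + " " + item.GetGenNum() + " R, ");
			continue;
		}
		HandleObj(item);
		Console.Write(", ");
	}
	Console.WriteLine("\n]");
}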

static void PrintDocument(PDFDoc pdfdoc)
{
	SDFDoc sdfdoc = pdfdoc.GetSDFDoc();
	Console.WriteLine(sdfdoc.GetHeader());
	Obj trailer = pdfdoc.GetTrailer();
	if (trailer == null) return;
	Console.WriteLine("Trailer");
	HandleObj(trailer);
	Console.WriteLine("xref");
	// Object number 0 is always the head of the free list, so start at 1.
	for (int num = 1; num < sdfdoc.XRefSize(); num++)
	{
		Console.Write(String.Format("obj_{0}: ", num));
		Obj current = sdfdoc.GetObj(num);

		if (current == null)
		{
			Console.WriteLine("null");
			continue;
		}
		if (current.IsFree())
		{
			Console.WriteLine("free");
			continue;
		}
		HandleObj(current);
		Console.WriteLine("");
	}
}
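
For the categorization and counting discussed earlier in the thread, the same recursive traversal can be sketched in C++ against a toy object model. Everything here is an assumption made for illustration: `CosObj`, `Counters`, `kWatched`, `Find`, `Visit` and `CountSlot` are hypothetical names, not PDFTron types, and with the real SDK the checks would go through the Obj methods used in the C# code above. In this toy model every child is a direct object, so there is no indirect-reference case to skip.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical miniature COS object model; not an SDK type.
struct CosObj {
    enum Kind { Null, Name, Dict, Array, Stream, Free };
    Kind kind = Null;
    std::string name;               // value when kind == Name
    std::vector<std::string> keys;  // Dict/Stream keys, parallel to values
    std::vector<CosObj> values;     // Dict/Stream values, or Array elements
};

struct Counters {
    int free_objects = 0;
    int object_streams = 0;
    std::map<std::string, int> keys;  // occurrences of each watched key
};

// Keys named in the thread as categories of interest.
const std::set<std::string> kWatched = {
    "JS", "JavaScript", "Launch", "OpenAction", "EmbeddedFiles", "AA"};

// Looks up a key in a dictionary or stream; returns nullptr when absent.
const CosObj* Find(const CosObj& obj, const std::string& key) {
    for (std::size_t i = 0; i < obj.keys.size(); ++i)
        if (obj.keys[i] == key) return &obj.values[i];
    return nullptr;
}

// Recursively visits an object and its direct children, tallying as it goes.
void Visit(const CosObj& obj, Counters& c) {
    if (obj.kind == CosObj::Stream) {
        // An object stream is a Stream whose /Type entry is the name ObjStm.
        const CosObj* type = Find(obj, "Type");
        if (type && type->kind == CosObj::Name && type->name == "ObjStm")
            ++c.object_streams;
    }
    if (obj.kind == CosObj::Dict || obj.kind == CosObj::Stream) {
        assert(obj.keys.size() == obj.values.size());
        for (std::size_t i = 0; i < obj.keys.size(); ++i) {
            if (kWatched.count(obj.keys[i])) ++c.keys[obj.keys[i]];
            Visit(obj.values[i], c);
        }
    } else if (obj.kind == CosObj::Array) {
        for (const CosObj& item : obj.values) Visit(item, c);
    }
}

// Tallies one xref slot. Free entries are counted on their own and never
// traversed, which keeps them separate from object streams.
void CountSlot(const CosObj& obj, Counters& c) {
    if (obj.kind == CosObj::Free) {
        ++c.free_objects;
        return;
    }
    Visit(obj, c);
}
```

Calling `CountSlot` once per object number mirrors the xref loop in `PrintDocument`, with the free check done before the traversal so that a free entry can never be mistaken for an object stream.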

vasudha.sahu · April 27, 2021, 8:20pm

Thanks Ryan!

This is really helpful. I will try this out on my end with C++ and get back to you in case of any doubts.

Can you please help me with the appropriate point of contact to get the details of licensing and cost? I need to present it in my organization to get the POC approvals for the SDK.

Thanks & Regards,
Vasudha Sahu

Ryan · April 27, 2021, 9:00pm

Great.

“Can you please help me with the appropriate point of contact to the get details of licensing and cost.”

If you have not done so already, please fill in this form.

jlucas · October 13, 2023, 4:14am

Found this code snippet to be very helpful. This should be in the official samples for .NET.