How do I Flate compress/optimize objects in existing PDF documents?

Aaron_Gravesdale · August 26, 2008, 11:43pm

Q: How can I get PDFNet to compress fonts and other objects in
existing PDF documents?

I’m reading in a PDF and extracting X number of pages (currently 12)
to a separate PDF. The resulting PDF (3,299KB) is much larger than
the output from Acrobat Professional (637KB). I am using
ImportPages(ArrayList) followed by PagePushBack(Page), so the issue is
not a failure to reuse resources.

Using Acrobat Pro’s PDF Optimizer\Audit space usage, I see that
Acrobat’s output file uses 530KB for Fonts. When looking at the
PDFNet output file, Acrobat shows over 3MB for Fonts. Using File,
Properties, I’ve confirmed that both files have the same number of
fonts.

Using CosEdit, I see that Acrobat’s output has a filter of
“FlateDecode” on all the font character streams. In the PDFNet
output, there is no Filter key.

My first approach was to iterate through the document objects to find
the Font objects, then go through each font's CharProcs and read in
each stream, then set up a FlateEncode filter to write the stream back
out. Reading appears to be working, but writing the stream has no
effect.

Here’s my code snippet:

itr = obj.Find("CharProcs").Value.GetDictIterator
While itr.HasNext

    'No Encryption/Compression is used, so "Raw" stream should work
    f = itr.Value.GetRawStream(True)
    fr = New PDFTRON.Filters.FilterReader(f)

    'Create and Fill Memory Filter (Buffer)
    fm = New PDFTRON.Filters.MemoryFilter(itr.Value.GetRawStreamLength, False)
    fw = New PDFTRON.Filters.FilterWriter(fm)
    fw.WriteFilter(fr)

    'Seek memory filter back to beginning and allow to be used as Input Filter
    fm.Seek(0, PDFTRON.Filters.Filter.ReferencePos.e_begin)
    fm.SetAsInputFilter()

    fe = New PDFTRON.Filters.FlateEncode(fm, 9, 256)
    fw2 = New PDFTRON.Filters.FilterWriter(fe)

    itr.Value.Write(fw2)

    itr.Next()
End While

A: The problem is that the original PDF document contains data
streams that are not compressed. By default, PDFNet compresses all
new data streams, but it does not force re-compression of objects
that are already present in existing PDF documents. Acrobat Pro
always forces object re-compression, which explains the difference
in file size.
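
To get a feel for how much an uncompressed stream can cost, here is a minimal sketch that is independent of PDFNet. It assumes .NET 6 or later (for System.IO.Compression.ZLibStream, which writes the zlib format that FlateDecode expects). The class name and the sample drawing operators are made up for illustration and only stand in for a repetitive, uncompressed glyph stream.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class FlateSizeDemo
{
    // Compress a buffer into the zlib format (assumption: .NET 6+ ZLibStream).
    public static byte[] Flate(byte[] raw)
    {
        using (var output = new MemoryStream())
        {
            using (var z = new ZLibStream(output, CompressionLevel.Optimal, leaveOpen: true))
            {
                z.Write(raw, 0, raw.Length);
            } // disposing the ZLibStream flushes the final compressed block
            return output.ToArray();
        }
    }

    // Hypothetical stand-in for an uncompressed glyph procedure stream.
    public static byte[] SampleStream()
    {
        var sb = new StringBuilder();
        for (int i = 0; i < 500; ++i)
            sb.Append("0 0 m 100 0 l 100 100 l 0 100 l h f\n");
        return Encoding.ASCII.GetBytes(sb.ToString());
    }

    static void Main()
    {
        byte[] raw = SampleStream();
        byte[] packed = Flate(raw);
        Console.WriteLine("raw: {0} bytes, Flate: {1} bytes", raw.Length, packed.Length);
    }
}
```

Real font streams will not compress as well as this artificial sample, but the direction of the effect is the same as the Fonts figures reported in the question.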

Nevertheless, the PDFNet API lets you implement a small utility
function that compresses any streams that are not already
compressed. As an illustration, we have modified the JBIG2 sample
project for this purpose (please see the attachment).

static void Recompress(PDFDoc doc)
{
    SDFDoc cos_doc = doc.GetSDFDoc();
    int num_objs = cos_doc.XRefSize();
    for (int i = 1; i < num_objs; ++i) {
        Obj obj = cos_doc.GetObj(i);
        if (obj != null && !obj.IsFree() && obj.IsStream()) {
            Obj flt = obj.FindObj("Filter");
            if (flt != null) {
                // A filter array is only handled when it holds exactly one filter.
                if (flt.IsArray()) {
                    if (flt.Size() == 1) flt = flt.GetAt(0);
                    else continue;
                }

                // ASCII filters only re-encode the data; any other filter
                // means the stream is treated as already compressed.
                string flt_name = flt.GetName();
                if (flt_name != "ASCIIHexDecode" && flt_name != "ASCII85Decode")
                    continue;
            }

            // The stream is not compressed...
            FilterReader reader = new FilterReader(obj.GetDecodedStream());
            Obj new_stm = cos_doc.CreateIndirectStream(reader, new FlateEncode(null));

            // Copy any entries from the old stream dictionary
            for (DictIterator itr = obj.GetDictIterator(); itr.HasNext(); itr.Next()) {
                string key = itr.Key().GetName();
                if (key == "Filter" || key == "Length") continue;
                new_stm.Put(key, itr.Value());
            }

            // Give the compressed stream the original object number
            cos_doc.Swap(i, new_stm.GetObjNum());
        }
    }
}
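
The filter test inside Recompress can be read as a small decision rule. The sketch below restates it as a standalone function on plain strings so it can be tried without PDFNet. FilterRule and NeedsFlate are hypothetical names used only for this illustration, and a null array stands for a stream that has no Filter entry.

```csharp
using System;

static class FilterRule
{
    // Returns true when a stream with the given Filter names would be
    // recompressed by Recompress:
    //   no Filter entry (null)         -> compress
    //   exactly one ASCII-only filter  -> compress
    //   anything else                  -> leave alone
    public static bool NeedsFlate(string[] filters)
    {
        if (filters == null) return true;
        if (filters.Length != 1) return false;
        return filters[0] == "ASCIIHexDecode" || filters[0] == "ASCII85Decode";
    }

    static void Main()
    {
        Console.WriteLine(NeedsFlate(null));                                     // True
        Console.WriteLine(NeedsFlate(new[] { "ASCII85Decode" }));                // True
        Console.WriteLine(NeedsFlate(new[] { "FlateDecode" }));                  // False
        Console.WriteLine(NeedsFlate(new[] { "ASCII85Decode", "FlateDecode" })); // False
    }
}
```

Streams with longer filter chains are skipped on purpose: they usually already include a real compression filter.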

The above utility can be called to compress any uncompressed streams
just before extracting pages to a new document or saving the existing
PDF.

You can also download the full sample code from the ‘Files’ section
of this forum:
http://groups.google.com/group/pdfnet-sdk/web/FlateCompressTest.zip