How do I maintain small file size after PDF processing and page editing using PDFNet SDK?

Q: I’m noticing that my PDF files are bigger after I apply some PDF
transformations using PDFNet SDK (http://www.pdftron.com/pdfnet).
Basically, I’m concatenating documents, modifying text, changing text
colors, scaling pages and overlaying pages. Every time I need to
modify a page I insert a new page, copy/modify elements to the new
page from the old then delete the old page. Is there an issue with
doing that frequently? I am using ImportPages, sub-setting fonts,
caching fonts for re-use, and saving documents with the ‘remove
unused’ option. The code below shows what I’m doing before and after
making page changes:

// Create new page
Page page = m_doc.PageCreate(m_page.GetMediaBox());
page.SetRotation(m_page.GetRotation());

// Do stuff like modifying text, changing text color, scaling and
// overlaying page

// Replace current page with new
PageIterator iterPage = m_doc.GetPageIterator(iPage);
m_doc.PageInsert(iterPage, page);
m_doc.GetSDFDoc().Swap(m_page.GetSDFObj().GetObjNum(),
    page.GetSDFObj().GetObjNum());
-----------------
A: How are you saving the document (i.e. what flags do you pass to
Save())? Also, could it be that there are dead references (possibly
from some annotations or bookmarks) that point to old pages and keep
them alive in the saved file [for more info please see
http://groups.google.com/group/pdfnet-sdk/browse_thread/thread/5282fd17371ee806#
- "How do I detect broken links and dead object references in my PDF
files? "]?

Q: I use the e_remove_unused and e_linearized flags when saving. The
code I included in my last email was incomplete. It also copies
annotations from the old page to the new as shown below. Could that
explain the “dead references” from annotations you mention? What else
could I be doing to create dead references?

Also, I’ve discovered that my “Transform” and “Overlay” logic seems to
account for much of the file size increase. I’ve attached
the code for both. The “Transform” function scales and positions a
page and the “Overlay” function places a page from another document
onto the current page. Is there a problem with my approach or a better
way to do these?

public void Overlay(Page page, double dblX, double dblY, double dblScale)
{
    try
    {
        ElementBuilder builder = new ElementBuilder();
        ElementWriter writer = new ElementWriter();

        writer.Begin(m_page);

        builder.Reset();

        // Create form that contains overlay page contents
        Element element = builder.CreateForm(page, m_doc);

        // Get current coordinate space for page (accounts for page rotation)
        Matrix2D mtxPage = m_page.GetDefaultMatrix();

        Matrix2D matrix = mtxPage *
            new Matrix2D(dblScale, 0, 0, dblScale, dblX, dblY);
        element.GetGState().SetTransform(matrix);
        writer.WritePlacedElement(element);

        // Finish writing to page
        writer.End();
        writer.Dispose();
    }
    catch (Exception ex)
    {
        throw PDFManager.Exception("Unable to overlay page.", ex);
    }
}

… Transform() is similar …


A: Could the problem be that pages are transformed and overlaid one at
a time? To keep the file size low, you could import all overlay/
transform pages into the target document (without inserting them into
the document page sequence) and then call
builder.CreateForm(imported_page).

In case the suggested change to your “Transform” and “Overlay”
functions does not help with the file size, could you please send us a
(relatively small) test file generated with your application. Using
CosEdit (http://www.pdftron.com/cosedit) we could inspect the file and
determine a possible source of error.

Regarding a ‘dead reference’, I mean a link/connection from an object
in the current page sequence (such as an annotation, bookmark, etc) to
a page (or some other object) which is no longer in page sequence.
This is similar in concept to garbage collection in .NET. If there are
still active references to a given object it will not be garbage
collected until the reference is removed.
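
The garbage collection analogy can be made concrete with a small toy
model. This is only a sketch and does not use the PDFNet API: PdfObj,
DeadRefDemo and Reachable are hypothetical names, and the object
numbers are made up. It assumes that saving with 'remove unused' keeps
exactly the objects that are reachable from the document root.

```csharp
using System;
using System.Collections.Generic;

// Toy model of a PDF object graph (NOT the PDFNet API): each object
// has a number and a list of outgoing references.
class PdfObj
{
    public int Num;
    public List<PdfObj> Refs = new List<PdfObj>();
    public PdfObj(int num) { Num = num; }
}

static class DeadRefDemo
{
    // Collect the numbers of every object reachable from root.
    public static HashSet<int> Reachable(PdfObj root)
    {
        var seen = new HashSet<int>();
        var stack = new Stack<PdfObj>();
        stack.Push(root);
        while (stack.Count > 0)
        {
            PdfObj cur = stack.Pop();
            if (!seen.Add(cur.Num)) continue;   // already visited
            foreach (PdfObj r in cur.Refs) stack.Push(r);
        }
        return seen;
    }

    public static void Main()
    {
        var catalog = new PdfObj(1);
        var oldPage = new PdfObj(2);
        var newPage = new PdfObj(3);
        var annot = new PdfObj(4);

        catalog.Refs.Add(newPage);  // only the new page is in the page sequence
        newPage.Refs.Add(annot);    // the annotation was copied to the new page
        annot.Refs.Add(oldPage);    // ...but it still points back to the old page

        // The old page is still reachable, so it would be kept on save.
        Console.WriteLine(Reachable(catalog).Contains(oldPage.Num)); // True

        annot.Refs.Remove(oldPage); // drop the stale reference

        // Now nothing points to the old page and it can be discarded.
        Console.WriteLine(Reachable(catalog).Contains(oldPage.Num)); // False
    }
}
```

In this model the old page, together with everything it references
(content streams, fonts, images), stays in the file for as long as a
single object in the live page sequence still points to it.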

I assume that you will place the overlay content (i.e. header/footer)
on many pages in the target document.
In order to share the overlay between all page instances you should
import the source page into the target document only once
(builder.CreateForm(page, m_doc)).

In addition, if you need to import more than one page from the source
document, you should first import the entire page set in one go
using pdfdoc.ImportPages(). This guarantees that resources shared
across the overlay page set (such as fonts) stay shared instead of
being duplicated.
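
To illustrate why a single batch import matters, here is a toy size
model. It is a sketch only and not the PDFNet API: ToyTargetDoc and
ImportDemo are hypothetical names, and a "page" is reduced to the list
of resource names it uses. It assumes that each separate import call
copies the resources it needs afresh, while one batch call copies each
shared resource once.

```csharp
using System;
using System.Collections.Generic;

// Toy model (NOT the PDFNet API): a source "page" is just the list of
// resource names (for example fonts) that it uses.
class ToyTargetDoc
{
    public int ObjectCount;  // number of resource copies stored in the target

    // Import a set of pages in one call. The set of already copied
    // resources lives for the whole call, so a resource shared by
    // several pages is copied only once.
    public void ImportPages(List<string[]> pages)
    {
        var copied = new HashSet<string>();
        foreach (string[] page in pages)
            foreach (string res in page)
                if (copied.Add(res)) ObjectCount++;
    }
}

static class ImportDemo
{
    // All pages imported with a single call.
    public static int BatchCount(List<string[]> pages)
    {
        var doc = new ToyTargetDoc();
        doc.ImportPages(pages);
        return doc.ObjectCount;
    }

    // One call per page: shared resources are copied again each time.
    public static int OneByOneCount(List<string[]> pages)
    {
        var doc = new ToyTargetDoc();
        foreach (string[] page in pages)
            doc.ImportPages(new List<string[]> { page });
        return doc.ObjectCount;
    }

    public static void Main()
    {
        var pages = new List<string[]>
        {
            new[] { "Helvetica", "Logo" },
            new[] { "Helvetica", "Logo" },
            new[] { "Helvetica" },
        };
        Console.WriteLine(BatchCount(pages));     // 2
        Console.WriteLine(OneByOneCount(pages));  // 5
    }
}
```

Under these assumptions the batch import stores two resource copies
and the page-by-page import stores five, which is the kind of
duplication that inflates the output file.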

The following pseudo-code assumes that you are importing only a single
page overlay:

static ElementBuilder builder = null;
static ElementWriter writer = null;
static Obj overlay_form_xobject = null;

void MyMain(Page overlay_page /* , ... */)
{
   if (builder == null) { builder = new ElementBuilder(); }
   if (writer == null) { writer = new ElementWriter(); }

   // Cache form XObject for repeated use across different pages.
   if (overlay_form_xobject == null) {
      overlay_form_xobject =
         builder.CreateForm(overlay_page, m_doc).GetXObject();
   }

...

   Overlay(m_doc.GetPage(1), overlay_form_xobject, X, Y, Scale);
   Overlay(m_doc.GetPage(2), overlay_form_xobject, X, Y, Scale);
}

public void Overlay(Page dest_page, Obj form_xobject, double dblX,
                    double dblY, double dblScale)
{
  try
  {
    writer.Begin(dest_page);
    builder.Reset();

    // Create a form element that reuses the cached form XObject
    Element element = builder.CreateForm(form_xobject);

    // Get current coordinate space for the destination page
    // (accounts for page rotation)
    Matrix2D mtxPage = dest_page.GetDefaultMatrix();

    Matrix2D matrix = mtxPage *
        new Matrix2D(dblScale, 0, 0, dblScale, dblX, dblY);
    element.GetGState().SetTransform(matrix);
    writer.WritePlacedElement(element);

    // Finish writing to page
    writer.End();
  }
  catch(Exception ex) {
    throw PDFManager.Exception("Unable to overlay page.", ex);
  }
}

Q: Reusing form XObjects helps somewhat, however the transformed PDFs
are still much larger than the original documents.
I use the following function to place a scaled and translated PDF page
on a new page. Can you please suggest any other ways to decrease the
file size of generated documents?

public void Transform(double dblX, double dblY, double dblScale)
{
    try
    {
        // Create new page
        Page page = m_doc.PageCreate(m_page.GetMediaBox());
        page.SetRotation(m_page.GetRotation());

        // Get current page index
        int iPage = m_page.GetIndex();

        ElementBuilder builder = new ElementBuilder();
        ElementWriter writer = new ElementWriter();

        writer.Begin(page);

        builder.Reset();

        // Create form that contains page contents
        Element element = builder.CreateForm(m_page, m_doc);

        Matrix2D matrix = new Matrix2D(dblScale, 0, 0, dblScale, dblX, dblY);

        element.GetGState().SetTransform(matrix);
        writer.WritePlacedElement(element);

        // Finish writing to page
        writer.End();
        writer.Dispose();

        // Copy annotations to new page and position based on scaling
        int iCount = m_page.GetNumAnnots();
        for (int i = 0; i < iCount; i++)
        {
            Annot annotOld = m_page.GetAnnot(i);
            Annot annot = new Annot(annotOld.GetSDFObj());

            Rect rect = annot.GetRect();

            double dblWidth = rect.Width();
            double dblHeight = rect.Height();

            // Shift and scale rect x-coordinate
            rect.x1 = dblX + (rect.x1 * dblScale);
            rect.x2 = rect.x1 + (dblWidth * dblScale);

            // Shift and scale rect y-coordinate
            rect.y1 = dblY + (rect.y1 * dblScale);
            rect.y2 = rect.y1 + (dblHeight * dblScale);

            // See if changing scale
            if (dblScale != 1.0)
            {
                Annot.BorderStyle style = annot.GetBorderStyle();

                if (style.width > 0)
                {
                    // Scale border width (but minimum is one)
                    dblWidth = style.width * dblScale;
                    dblWidth = ((dblWidth - Math.Floor(dblWidth) > 0.5)
                        ? Math.Ceiling(dblWidth) : Math.Floor(dblWidth));
                    // Math.Max (not Math.Min) enforces the minimum of one
                    style.width = Math.Max(1, (int) dblWidth);
                    annot.SetBorderStyle(style);
                }
            }

            // Change rectangle
            annot.SetRect(rect);

            page.AnnotPushBack(annot);
        }

        // Replace current page with transformed page
        PageIterator iterPage = m_doc.GetPageIterator(iPage);
        m_doc.PageInsert(iterPage, page);
        m_doc.GetSDFDoc().Swap(m_page.GetSDFObj().GetObjNum(),
            page.GetSDFObj().GetObjNum());

        m_doc.PageRemove(m_doc.GetPageIterator(iPage));

        // Use new page
        m_page = m_doc.GetPage(iPage);
    }
    catch (Exception ex)
    {
        throw PDFManager.Exception("Unable to transform page.", ex);
    }
}


A: The increase in file size from ‘Transformed.pdf’ to
‘TransformedOverlayed.pdf’ is solely due to repeated form xobjects (as
discussed in the previous email).

The increase in file size from ‘Original.pdf’ to ‘Transformed.pdf’ is
due to dead references (as expected).

To detect this we wrote a utility function (attached) that goes
through the entire document and reports any references to page objects
that are no longer in the main page sequence (i.e. dead references).

Running this function on ‘Transformed.pdf’ produces the following
output:

Dead reference: 303 in 25
Dead reference: 275 in 111
Dead reference: 313 in 322
Dead reference: 313 in 323
Dead reference: 313 in 324
Dead reference: 313 in 325
Dead reference: 313 in 326
Dead reference: 313 in 327
Dead reference: 313 in 328
Dead reference: 313 in 329

Using CosEdit (http://www.pdftron.com/pdfcosedit) we found that the
culprit is that some annotation dictionaries still have a reference to
the page through its “P” entry.

To fix this problem you could either erase the optional “P” entry in
all annotations (i.e. annot.GetSDFObj().Erase("P")) or update it to
point to the new page (annot.GetSDFObj().Put("P", new_page.GetSDFObj())).

Please let me know if this helps.

Q: Maybe I misunderstand but I’m not comfortable with this solution.
It seems to me that the question shouldn’t be how to deal with dead
references after the fact but how to prevent them in the first place.
What is it about my Transform logic (in particular, the call to Swap)
that causes the dead references and the large file size? It seems that
grinding through the document and locating/fixing dead references
would be expensive. Every time I modify a page by doing the usual
insert/remove page routine I would have to check the document for
annotations that need to be updated to point to the new page. Am I
correct?


A: Dead references are caused because you are cloning annotations from
an old page to a new page. Some of the cloned annotations are still
pointing to the old page via the optional “P” (for ‘parent page’)
entry.

> how to prevent them in the first place?

Instead of cloning annotations, you could create them from scratch and
set all the properties based on the values from the old annotations.
Of course, this is a lot of work. Your current approach needs far less
code, so it is simpler to keep cloning and erase the “P” entry right
after adding each annotation:

page.AnnotPushBack(new Annot(annot.GetSDFObj()));
page.GetAnnot(i).GetSDFObj().Erase("P");   // <---- the added step

We will probably add this step within AnnotPushBack() in a future
PDFNet version.

> It seems that grinding through the document and locating/fixing dead
> references would be expensive.
> Every time I modify a page by doing the usual insert/remove page
> routine I would have to check the document for annotations that need
> to be updated to point to the new page. Am I correct?

The provided sample code is intended for troubleshooting and debugging
this type of issue. You would probably not use it in production
code.