Optimizing PDF operations for page templating and forms manipulation.

Aaron_Gravesdale · December 16, 2008, 9:00pm

Q: Our PDF usage is relatively simple. Common operations include
replacing form text fields, copying a template many (up to 2000) times
and filling out its text fields, and merging PDF documents. Our primary
concerns are the RAM used during document construction and the final
file size. Below is a quick and dirty snippet for copying a PDF
"template" a number of times. Setting aside the repeated template file
load, what recommendations can you give for optimizing this operation
and the final file size?

PDFNet.Initialize();

// Read a PDF document in a memory buffer.
//FileStream fileStream1 = new FileStream(@"C:\Testing\Console\GeneralTest\PDFTemplates\Sample.pdf", FileMode.Open, FileAccess.Read);
//BinaryReader reader1 = new BinaryReader(fileStream1);
//byte[] file_buff = reader1.ReadBytes((int)reader1.BaseStream.Length);
//PDFDoc doc1 = new PDFDoc(file_buff, file_buff.Length);

PDFDoc doc2 = new PDFDoc();
doc2.InitSecurityHandler();

for (int i = 1; i <= 500; i++)
{
    PDFDoc doc1 = new PDFDoc(@"C:\Testing\Console\GeneralTest\PDFTemplates\Sample.pdf");
    doc1.InitSecurityHandler();

    doc1.GetField("month").SetValue("January");
    doc1.GetField("month").RefreshAppearance();
    doc1.GetField("year").SetValue("2008");
    doc1.GetField("year").RefreshAppearance();
    doc1.GetField("address").SetValue("123 MoxBerry Lane Boulder, CO 80301");
    doc1.GetField("address").RefreshAppearance();
    doc1.GetField("permit no").SetValue("5432 - ZS");
    doc1.GetField("permit no").RefreshAppearance();
    doc1.GetField("city").SetValue("boulder");
    doc1.GetField("city").RefreshAppearance();
    doc1.GetField("state").SetValue("Colorado");
    doc1.GetField("state").RefreshAppearance();
    doc1.GetField("zip").SetValue("80301");
    doc1.GetField("zip").RefreshAppearance();
    doc1.GetField("fein").SetValue("839084503948");
    doc1.GetField("fein").RefreshAppearance();
    doc1.GetField("areacode").SetValue("303");
    doc1.GetField("areacode").RefreshAppearance();
    doc1.GetField("phoneno").SetValue("555-1234");
    doc1.GetField("phoneno").RefreshAppearance();

    pdftron.PDF.Page pg = doc1.GetPage(1);
    doc2.PageInsert(doc2.GetPageIterator(1), pg);

    doc1.Close();

    if (i % 10 == 0)
        WriteMemoryUsage("After " + i.ToString() + " pages");
}

WriteMemoryUsage("Before file write");

string fileName = @"C:\temp\" + Guid.NewGuid().ToString() + ".pdf";
doc2.FlattenFields();
doc2.Save(fileName, SDFDoc.SaveOptions.e_linearized);
doc2.Close();
WriteMemoryUsage("After file write");

FileInfo fi = new FileInfo(fileName);
log.Info("Final filesize: " + (fi.Length / 1048576).ToString());

PDFNet.Terminate();
-------
A: One way to optimize the process is to open the template (i.e. doc1)
only once (instead of 500 times).

You can then place the template page in the destination document
using:

doc2.PagePushBack(doc1.GetPage(1));

Actually you may want to call this method only once and then replicate
the page within the destination document.

doc2.PagePushBack(doc2.GetPage(1));

This way all pages will share the common resources (such as fonts,
color spaces, etc) instead of replicating them for each page.
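
Putting the two calls above together, a minimal sketch of the "open once, replicate inside the destination" approach might look like the following. The file paths and the copy count are hypothetical placeholders, the field-filling step is left out, and the code has not been run against any particular PDFNet version, so treat it as an outline rather than a drop-in replacement.

```csharp
using pdftron;
using pdftron.PDF;
using pdftron.SDF;

class ReplicateTemplatePage
{
    static void Main()
    {
        PDFNet.Initialize();

        // Hypothetical paths and count; substitute your own.
        string templatePath = @"C:\temp\Sample.pdf";
        string outputPath = @"C:\temp\replicated.pdf";
        int copies = 500;

        // Open the template once, outside any loop.
        PDFDoc doc1 = new PDFDoc(templatePath);
        doc1.InitSecurityHandler();

        PDFDoc doc2 = new PDFDoc();

        // Bring the template page into the destination a single time.
        doc2.PagePushBack(doc1.GetPage(1));

        // Replicate the page that already lives in doc2, so the copies
        // can share the fonts and other resources imported with page 1.
        for (int i = 1; i < copies; i++)
        {
            doc2.PagePushBack(doc2.GetPage(1));
        }

        doc2.Save(outputPath, SDFDoc.SaveOptions.e_linearized);
        doc2.Close();
        doc1.Close();

        PDFNet.Terminate();
    }
}
```

The template document is kept open until after the save purely as a precaution; whether it can be closed earlier is not something this sketch relies on.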

After importing the template page you can set the field values on the
destination page. Something along the following lines:

// page_num is the 1-based index of the copied page to fill in.
Page dest_pg = doc2.GetPage(page_num);
int num_annots = dest_pg.GetNumAnnots();
for (int i=0; i<num_annots; ++i) {
  Annot annot = dest_pg.GetAnnot(i);
  // Only widget annotations are backed by form fields.
  if (annot.IsValid() && annot.GetType()==Annot.Type.e_Widget) {
    Field fld = annot.GetWidgetField();
    String field_name = fld.GetName();
    if (field_name.StartsWith("month")) {
       fld.SetValue("January");
       fld.RefreshAppearance();
    }
  }
}

In case you are planning to process large documents and you don't care
about PDF linearization (web-optimization), you may notice a small
speed increase when the second parameter in the call to pdfdoc.Save()
is set to 0.

doc2.Save(fileName, 0);

For merging/copying multiple pages from one PDF document to another,
the key optimization tip is to call pdfdoc.ImportPages(pagelist) before
placing the pages in the document's page sequence (see
http://www.pdftron.com/net/usermanual.html#copy_pg and code sample 6
in the PDFPage sample project - http://www.pdftron.com/net/samplecode.html#PDFPage).
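
As a rough illustration of the ImportPages tip, the sketch below collects all pages of a source document, imports them in one call, and only then appends them to the destination. It assumes the ArrayList-based form of ImportPages; the exact parameter and return types may differ between PDFNet releases, so check the API reference for your version. The file names are hypothetical.

```csharp
using System.Collections;
using pdftron;
using pdftron.PDF;
using pdftron.SDF;

class MergeWithImportPages
{
    static void Main()
    {
        PDFNet.Initialize();

        // Hypothetical input and output paths.
        PDFDoc src = new PDFDoc(@"C:\temp\source.pdf");
        src.InitSecurityHandler();
        PDFDoc dest = new PDFDoc();

        // Collect every page of the source document into a list.
        ArrayList copyPages = new ArrayList();
        for (PageIterator itr = src.GetPageIterator(); itr.HasNext(); itr.Next())
        {
            copyPages.Add(itr.Current());
        }

        // Assumption: ImportPages accepts the page list and returns the
        // imported pages as a collection. Importing as one batch lets
        // resources shared between the pages be copied only once.
        var imported = dest.ImportPages(copyPages);

        // Place the imported pages in the destination's page sequence.
        foreach (Page pg in imported)
        {
            dest.PagePushBack(pg);
        }

        dest.Save(@"C:\temp\merged.pdf", SDFDoc.SaveOptions.e_linearized);
        dest.Close();
        src.Close();

        PDFNet.Terminate();
    }
}
```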

To further decrease the RAM usage you may want to read/write PDF from
a file on disk (a temp file?) instead of using a memory buffer.

Aaron_Gravesdale · December 19, 2008, 10:17pm

Fantastic recommendation on copying pages internally. I did
realize that we shouldn't load the template a bunch of times, but we
were performance testing against a similar setup we have in production
for a different API that we currently use. Obviously not optimal.
Anyway, your platform is enormously more performant than our current
product. Thanks for your help.