How do I process form XObjects in PDF?

Q: I am writing a PDF processing application that takes any
existing PDF and performs color conversion and other operations on
the document. For example, I would like to take a color PDF and
convert it to grayscale. So far the results I get using PDFNet SDK are
excellent; however, I am not exactly clear on how to deal with form
XObjects. In some cases, color-converted content in form XObjects is
not showing properly. The sketch of my conversion class was originally
based on your ElementReaderAdv sample
(http://www.pdftron.com/pdfnet/samplecode.html#ElementReaderAdv).
The current code is along the following lines:

void PDFConverter::convert(PDFDoc& doc, ElementReader& reader,
                           ElementWriter& writer, bool save, int& elenum)
{
    Element e;
    bool form = false;
    Obj softMask;
    while (e = reader.Next())
    {
        int type = e.GetType();
        switch (type)
        {
        case Element::e_image:
        case Element::e_inline_image:
            if (!e.IsImageMask())
                convertImage(doc, e);
            else
                convertImageMask(e);
            break;
        case Element::e_text:
            convertElementText(doc, e, elenum);
            break;
        case Element::e_form: // Process form XObjects
            reader.FormBegin();
            convert(doc, reader, writer, false, elenum);
            reader.End();
            // Commented out
            CheckColorSpaceForm(doc, e);
            form = true;
            break;
        case Element::e_path:
        case Element::e_shading:
            convertElement(doc, e, elenum);
            break;
        }

        softMask = e.GetGState().GetSoftMask();
        if (softMask)
            CheckSoftMask(softMask);
        if (save)
            writer.WriteElement(e);
        if (form && !save)
            writer.WriteElement(e);

        form = false;

        ++elenum;
    }
}

A: The problem is that the form XObject is not processed properly.
When ElementWriter outputs a form XObject
(element_writer.WriteElement(element)), it simply writes
‘/form_id Do’ in the output content stream.

Based on your code, the converted PDF will contain duplicated content
(i.e. converted form XObject elements copied into the content stream
of a new page, followed by a reference to the old form XObject).

There are a couple of ways to deal with form XObjects.

Algorithm A)

  • Maintain a set of form XObjects that still need to be converted.
  • When you encounter a form XObject, simply add it to that set
    (set.insert(element.GetXObject().GetObjNum())) and output the
    element (element_writer.WriteElement(element)). Do not process its
    child elements with reader.FormBegin()/End() at this point.
  • After processing all pages in the document, walk through the set
    of all referenced form XObjects and call your convert function on
    each object in it:

    ElementWriter w;
    w.Begin(doc);
    ElementReader r;
    Obj old_form = iterator.Current();
    r.Begin(old_form);
    convert(doc, r, w, form_set, …);
    r.End();
    Obj new_form = w.End();

    // Copy over entries from the form XObject dictionary (Matrix, BBox, etc.)
    new_form.Put("Subtype", old_form.FindObj("Subtype"));
    new_form.Put("Matrix", old_form.FindObj("Matrix"));
    new_form.Put("BBox", old_form.FindObj("BBox"));

  • Swap the new form with the old form XObject:
    doc.GetSDFDoc().Swap(new_form.GetObjNum(),old_form.GetObjNum());
  • Because a form XObject may reference other form XObjects, you may
    need to loop until the set of form XObjects to process is empty.
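
The "loop until the set is empty" step can be sketched without any PDFNet calls. The snippet below is only an illustration of the worklist idea: FormGraph, convertForm and convertAllForms are hypothetical names, and the map stands in for "which nested forms does this form's content stream invoke". In real code, the nested object numbers would be collected while convert() runs over each form.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in for the document: maps a form XObject's object
// number to the object numbers of the forms its content stream invokes.
using FormGraph = std::map<int, std::vector<int>>;

// Stand-in for convert(): "converts" one form and reports the nested
// forms it encountered (unknown object numbers have no nested forms).
static std::vector<int> convertForm(const FormGraph& g, int obj_num) {
    auto it = g.find(obj_num);
    return it == g.end() ? std::vector<int>{} : it->second;
}

// Converts every form reachable from the pages exactly once and returns
// the object numbers in the order they were converted.
std::vector<int> convertAllForms(const FormGraph& g,
                                 const std::set<int>& from_pages) {
    std::set<int> pending = from_pages;  // forms still to convert
    std::set<int> done;                  // forms already converted
    std::vector<int> order;
    while (!pending.empty()) {
        int obj_num = *pending.begin();
        pending.erase(pending.begin());
        assert(done.count(obj_num) == 0);  // never convert a form twice
        done.insert(obj_num);
        order.push_back(obj_num);
        for (int nested : convertForm(g, obj_num)) {
            // Queue nested forms; skipping converted ones also keeps the
            // loop finite if a malformed file references itself.
            if (done.count(nested) == 0) pending.insert(nested);
        }
    }
    return order;
}
```

Keeping a separate "done" set means a form shared by many pages or nested in several other forms is still converted only once, which is the main advantage of this approach.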

Algorithm B) Flatten all form XObjects into the page content stream.
This approach is similar to your current approach, except that you
would always skip writing the form element itself
(element_writer.WriteElement(element)). Because you are writing the
display list of the form XObject directly into the new page content
stream, there is no need to reference the old form XObject. You would
also need to save the graphics state and insert an extra clipping path
before processing the child elements (using the BBox info from the
form). For example:

case Element::e_form:
{
    // Save the graphics state …
    writer.WriteElement(element_builder.CreateGroupBegin());

    Obj bbox = element.GetXObject().FindObj("BBox");
    if (bbox) {
        // todo: may need to transform the clip using the form's Matrix …
        // Element cliprect = element_builder.CreateRect(Rect(bbox));
        // writer.WriteElement(cliprect);
    }

    reader.FormBegin();
    convert(doc, reader, writer, false, elenum);
    reader.End();

    // Restore the graphics state
    writer.WriteElement(element_builder.CreateGroupEnd());
}
}

if (element_type == Element::e_form) {
    // … don't write the element
}
else writer.WriteElement(element);
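
The todo about transforming the clip by the form's Matrix comes down to plain affine arithmetic, so it can be sketched independently of PDFNet. The snippet below assumes the BBox is given as [llx lly urx ury] and the Matrix in PDF order [a b c d e f]; Point, Matrix, transformPoint and clipQuad are hypothetical names introduced for illustration. Whether the transform is needed at all depends on the coordinate system in which the clip path is written.

```cpp
#include <array>

struct Point { double x, y; };
using Matrix = std::array<double, 6>;  // PDF order: [a b c d e f]

// PDF convention: x' = a*x + c*y + e,  y' = b*x + d*y + f
Point transformPoint(const Matrix& m, Point p) {
    return { m[0] * p.x + m[2] * p.y + m[4],
             m[1] * p.x + m[3] * p.y + m[5] };
}

// Maps the four BBox corners through the form matrix. The corners are
// returned starting at the lower-left and walking around the rectangle,
// so they can be emitted as a closed clipping path.
std::array<Point, 4> clipQuad(const std::array<double, 4>& bbox,
                              const Matrix& m) {
    return {{ transformPoint(m, {bbox[0], bbox[1]}),
              transformPoint(m, {bbox[2], bbox[1]}),
              transformPoint(m, {bbox[2], bbox[3]}),
              transformPoint(m, {bbox[0], bbox[3]}) }};
}
```

Returning the four corners rather than an axis-aligned rectangle keeps the clip exact when the Matrix contains a rotation or skew; a plain rectangle would only be correct for scale-and-translate matrices.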

The only problem with the second approach is that it may result in
bloated PDF files, because a shared XObject would be replicated at
every place it is used.