How do I fetch all text elements from a PDF document?

Aaron_Gravesdale · June 26, 2007, 7:53pm

Q:

While iterating over the elements of the attached PDF, I get only a few
text elements out of the document (the one-line address at the top left,
the address at the top right, and the footer printed in red).

What do I have to do to fetch all the text from this document?

The iteration is done in the following way:
page_reader.Begin(page);

while ((element = page_reader.Next()) != null)
{ // Read page contents

    switch (element.GetType())
    {
        case Element.ElementType.e_path:
        { // Process path data...
            break;
        }

        case Element.ElementType.e_text_begin:
        { // Process text strings...
            ProcessText(ref page_reader, mediabox, rectZone, ref chardesArray);
            break;
        }
        case Element.ElementType.e_form:
        { // Process form XObjects
            page_reader.FormBegin();
            break;
        }
        case Element.ElementType.e_image:
        {
            break;
        }
    }
}

where ProcessText looks like this:
private void ProcessText(ref ElementReader page_reader, Rect box,
    RectangleF rectZone, ref ArrayList charDescriptors)
{
    Element element;
    while ((element = page_reader.Next()) != null)
    {
        switch (element.GetType())
        {
            case Element.ElementType.e_text_end:
                return;
            case Element.ElementType.e_text:
                {
                    // do stuff
                    break;
                }
        }
    }
}
----

A:

The problem is that you open a child display list for each form XObject
element (i.e. e_form) by calling page_reader.FormBegin(), but you never
close that child list with page_reader.End(). Once the reader reaches the
end of the form's child list, Next() returns null and your loop exits, so
the rest of the page's elements are never visited.

For a canonical example of how to process e_form elements, please take
a look at the ElementReader or ElementReaderAdv sample
(http://www.pdftron.com/net/samplecode.html#ElementReader).

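if (element.GetType() == Element.Type.e_form) {
  reader.FormBegin();        // open the form XObject's child display list
  ProcessElements(reader);   // recursively process the elements inside the form
  reader.End();              // close the child list and return to the parent list
}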
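
To see why the missing End() call drops text, here is a minimal, self-contained sketch. It does not use PDFNet at all: MockReader, Node, NodeKind and Demo are hypothetical stand-ins written for this illustration, and they only assume that a reader keeps a stack of display lists, that FormBegin() descends into the current form, and that End() returns to the parent list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration only; these are NOT PDFNet classes.
enum NodeKind { Text, Form }

class Node
{
    public NodeKind Kind;
    public string Text;                             // used when Kind == Text
    public List<Node> Children = new List<Node>();  // used when Kind == Form
}

// Mimics a reader with nested display lists.
class MockReader
{
    private readonly Stack<(List<Node> list, int pos)> parents =
        new Stack<(List<Node> list, int pos)>();
    private List<Node> list;
    private int pos;
    private Node current;

    public void Begin(List<Node> root) { parents.Clear(); list = root; pos = 0; }

    // Returns the next element of the current list, or null at its end.
    public Node Next()
    {
        current = pos < list.Count ? list[pos++] : null;
        return current;
    }

    // Descend into the child list of the form that Next() just returned.
    public void FormBegin()
    {
        parents.Push((list, pos));
        list = current.Children;
        pos = 0;
    }

    // Return to the parent list, resuming right after the form.
    public void End()
    {
        (list, pos) = parents.Pop();
    }
}

static class Demo
{
    public static List<string> Collect(MockReader reader, bool closeForms)
    {
        var found = new List<string>();
        Walk(reader, closeForms, found);
        return found;
    }

    static void Walk(MockReader reader, bool closeForms, List<string> found)
    {
        Node n;
        while ((n = reader.Next()) != null)
        {
            if (n.Kind == NodeKind.Text) { found.Add(n.Text); continue; }

            reader.FormBegin();
            if (closeForms)
            {
                Walk(reader, closeForms, found); // process the form's own elements
                reader.End();                    // then resume the parent list
            }
            // Without End(), the loop keeps reading the child list and stops at its end.
        }
    }

    static void Main()
    {
        var page = new List<Node>
        {
            new Node { Kind = NodeKind.Text, Text = "header" },
            new Node { Kind = NodeKind.Form,
                       Children = { new Node { Kind = NodeKind.Text, Text = "inside form" } } },
            new Node { Kind = NodeKind.Text, Text = "body" },
        };

        var reader = new MockReader();
        reader.Begin(page);
        Console.WriteLine(string.Join(", ", Collect(reader, closeForms: false)));
        // prints "header, inside form"
        reader.Begin(page);
        Console.WriteLine(string.Join(", ", Collect(reader, closeForms: true)));
        // prints "header, inside form, body"
    }
}
```

In the first run, "body" is never reached because the loop ends when the form's child list runs out. With the recursive call followed by End(), traversal resumes in the parent list. This matches the pattern in the answer above, under the assumptions stated for the mock.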