Q:
While iterating over the elements of the attached PDF, I get only a few
text elements out of the document (the one-line address at the top left,
the address at the top right, and the footer printed in red).
What do I have to do to fetch all the text from this document?
The iteration is done in the following way:
page_reader.Begin(page);
while ((element = page_reader.Next()) != null)
{ // Read page contents
    switch (element.GetType())
    {
        case Element.ElementType.e_path:
        { // Process path data...
            break;
        }
        case Element.ElementType.e_text_begin:
        { // Process text strings...
            ProcessText(ref page_reader, mediabox, rectZone, ref chardesArray);
            break;
        }
        case Element.ElementType.e_form:
        { // Process form XObjects
            page_reader.FormBegin();
            break;
        }
        case Element.ElementType.e_image:
        {
            break;
        }
    }
}
where ProcessText looks like this:
private void ProcessText(ref ElementReader page_reader, Rect box,
                         RectangleF rectZone, ref ArrayList charDescriptors)
{
    Element element;
    while ((element = page_reader.Next()) != null)
    {
        switch (element.GetType())
        {
            case Element.ElementType.e_text_end:
                return;
            case Element.ElementType.e_text:
            {
                // do stuff
                break;
            }
        }
    }
}
----
A:
The problem is that you open a child display list for each form
XObject element (i.e. e_form), but it seems that you never close that
child list with page_reader.End(). Without the matching End(),
processing never returns to the parent display list, so the page
content that follows the form is not visited.
For a canonical example of how to process e_form elements, please take
a look at the ElementReader or ElementReaderAdv sample
(http://www.pdftron.com/net/samplecode.html#ElementReader):
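case Element.Type.e_form:
{ // Process form XObjects
    reader.FormBegin();       // open the form's child display list
    ProcessElements(reader);  // process the elements inside the form
    reader.End();             // close the child list, back to the parent
    break;
}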
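
For context, here is a minimal sketch of how the whole loop might look once the form branch is balanced. It assumes the pdftron.PDF namespace and the Element.Type enum and GetTextString() call as they appear in the ElementReader sample; enum and method names may differ in your PDFNet version (your code uses Element.ElementType). TextWalker and DumpPageText are hypothetical names used only for illustration.

```csharp
using System;
using pdftron.PDF;

static class TextWalker
{
    // Walks the reader's current display list and prints every text run,
    // descending into form XObjects along the way.
    static void ProcessElements(ElementReader reader)
    {
        Element element;
        while ((element = reader.Next()) != null)
        {
            switch (element.GetType())
            {
                case Element.Type.e_text:
                    // Assumed accessor, taken from the ElementReader sample.
                    Console.WriteLine(element.GetTextString());
                    break;

                case Element.Type.e_form:
                    reader.FormBegin();       // descend into the form's child list
                    ProcessElements(reader);  // same loop, one level down
                    reader.End();             // resume the parent list
                    break;
            }
        }
    }

    // Hypothetical entry point: dumps all text found on one page.
    public static void DumpPageText(Page page)
    {
        ElementReader reader = new ElementReader();
        reader.Begin(page);
        ProcessElements(reader);
        reader.End();  // closes the page-level display list
    }
}
```

Because ProcessElements calls itself for every form, nested forms are handled as well, and each FormBegin() is paired with exactly one End(). Your ProcessText helper could be called from an e_text_begin case in the same switch.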