How do I extract text from PDF forms?

Aaron_Gravesdale · September 29, 2009, 6:36pm

Q: I am using PDFNet SDK (http://www.pdftron.com/pdfnet) for PDF text
extraction, and it works very well on static PDFs. Recently I ran into
some PDFs that contain forms, and I would like to extract the text from
the form fields as well. Is this possible?
------------
A: At the moment, pdftron.PDF.TextExtractor
(http://www.pdftron.com/pdfnet/samplecode.html#TextExtract)
extracts text only from the page content. However, you can extract
text from form fields by iterating through all annotations on a given
page and reading the values of any text fields. For example:

// C# pseudocode based on the AnnotationTest sample
// (http://www.pdftron.com/pdfnet/samplecode.html#Annotation).
// Assumes 'page' is a valid pdftron.PDF.Page.

string text_from_forms = "";

int num_annots = page.GetNumAnnots();
for (int i = 0; i < num_annots; ++i) {
  Annot annot = page.GetAnnot(i);
  if (!annot.IsValid()) continue;
  // Form fields appear on the page as widget annotations.
  if (annot.GetType() == Annot.Type.e_Widget) {
    // Positioning info, if required: Rect bbox = annot.GetRect();
    pdftron.PDF.Annots.Widget w = new pdftron.PDF.Annots.Widget(annot);
    Field f = w.GetField();
    // Skip fields that have no value.
    if (f.GetValue() != null)
      text_from_forms += f.GetValueAsString() + "\n";
  }
}

Aaron_Gravesdale · September 29, 2009, 6:43pm

Another option is to flatten the PDF form fields before text
extraction. For example:

pdfdoc.FlattenFields();
// ... then use pdftron.PDF.TextExtractor as usual
// (http://www.pdftron.com/pdfnet/samplecode.html#TextExtract)

Aaron_Gravesdale · December 2, 2013, 10:00pm

Q: We need the text extractor to acquire the text from a certain area
(in the form of a rectangle) on the first page of the PDF, and the text
should be extracted exactly as seen on the page. Given these
requirements, it seems to me that flattening the fields is the better
option for us.

Since we only need to extract the text from the first page I tried
doing that and here’s my code:

Dim page As Page = doc.GetPage(1)

'Flatten form fields
Dim iAnnots As Integer = page.GetNumAnnots
Dim annot As Annot
Dim w As Annots.Widget
For i As Integer = 0 To iAnnots - 1
    annot = page.GetAnnot(i)
    If annot.IsValid _
        AndAlso annot.GetType = PDF.Annot.Type.e_Widget Then
        w = New Annots.Widget(annot)
        If w.GetField.GetType = Field.Type.e_text Then
            annot.Flatten(page)
        End If
    End If
Next

Please let me know:

1. Is this code sufficient to flatten any textual content on this
   page?
2. Performance-wise, is this a good way to do this? I ran some rough
   tests and it seems fast enough, but what happens if there are a lot
   of textual annotations, widgets, or text fields in the document?
3. I might want to flatten only the annotations that fall within our
   desired area. I’m thinking of checking the intersection of
   annot.GetRect and our rectangle and flattening if the result is not
   empty. Is this a good way?

A: Flattening annotations in PDFNet is very fast so I wouldn’t worry
about the performance.
The simplest way to flatten form fields is using doc.FlattenAnnotations().
(Pass FlattenAnnotations() an argument of “true” to flatten only fields
and no other annotations.)

If you want to flatten all annotation types you could use the
following snippet:

For i As Integer = 1 To doc.GetPageCount()
    Dim pg As Page = doc.GetPage(i)
    For j As Integer = pg.GetNumAnnots() - 1 To 0 Step -1
        Dim ann As Annot = pg.GetAnnot(j)
        ann.Flatten(pg)
    Next
Next

The C#, Java, C, and C++ versions follow the same pattern.

Please note that annotations must be flattened in the reverse order
(since Annot.Flatten() method removes the object from the annotation
array in the page dictionary).
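
The effect of removing entries while iterating can be seen without
PDFNet at all. The following is only a minimal C# sketch: it uses a
plain List<string> as a stand-in for the page's annotation array, and
the class and method names (ReverseRemovalDemo, RemoveForward,
RemoveBackward) are hypothetical and not part of the PDFNet API.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for a page's annotation array; not a PDFNet type.
public static class ReverseRemovalDemo
{
    // Removes the entry at each visited index while counting upward.
    // Every removal shifts the later entries down by one, so the entry
    // that slides into the current index is never visited.
    public static int RemoveForward(List<string> annots)
    {
        int removed = 0;
        for (int i = 0; i < annots.Count; ++i)
        {
            annots.RemoveAt(i);
            ++removed;
        }
        return removed;
    }

    // Removes the entry at each visited index while counting downward.
    // Removing the last entry never changes the indices of the entries
    // that are still to be visited.
    public static int RemoveBackward(List<string> annots)
    {
        int removed = 0;
        for (int i = annots.Count - 1; i >= 0; --i)
        {
            annots.RemoveAt(i);
            ++removed;
        }
        return removed;
    }
}
```

With a four-entry list, RemoveForward removes only two entries and
leaves the other two behind, while RemoveBackward removes all four.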

You could also extend the code above to skip some annotations (e.g.
based on their annot.GetRect()), but my guess is that the performance
gain would not be very significant.
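
If you do decide to filter by area, the overlap test itself is simple.
The following is a rough C# sketch under stated assumptions: Box is a
hypothetical plain struct (not a PDFNet type) that you would fill from
the corner coordinates of annot.GetRect() and of your own rectangle.
It normalizes the corners first, since a rectangle's corners are not
guaranteed to be stored in lower-left/upper-right order.

```csharp
using System;

// Hypothetical plain rectangle in page coordinates; not a PDFNet type.
public struct Box
{
    public readonly double X1, Y1, X2, Y2;

    public Box(double x1, double y1, double x2, double y2)
    {
        // Normalize so that (X1, Y1) is the lower-left corner and
        // (X2, Y2) is the upper-right corner.
        X1 = Math.Min(x1, x2);
        Y1 = Math.Min(y1, y2);
        X2 = Math.Max(x1, x2);
        Y2 = Math.Max(y1, y2);
    }

    // True when the two boxes share a region of non-zero area.
    // Boxes that only touch along an edge or corner do not count.
    public bool Overlaps(Box other)
    {
        return X1 < other.X2 && other.X1 < X2
            && Y1 < other.Y2 && other.Y1 < Y2;
    }
}
```

In the flattening loop you would then call ann.Flatten(pg) only when
the annotation's box overlaps the target box, still iterating in
reverse order.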