PDF text extraction using PDFNet.

Aaron_Gravesdale · March 10, 2009, 1:39am

Q: My requirement is to find the location on the page of individual
words. The units are real-world page units in inches. I have no
control over the form of the input file.

I have tried several approaches to getting this information. The
function Element.GetBBox() returns a box which is often (but not
always) _much_ larger than the text it is enclosing.

Additionally, the TextIterator and Element classes both seem to
occasionally return partial words as well as some strings containing
embedded spaces. The text "Hi guy" could show up as a single Element,
2 separate Elements such as "Hi" and "guy", 5 Elements like "H", "i",
"g", "u", and "y" or even a set like "H", "i gu", and "y".
(I will take care of merging adjacent word-fragments into a single
word in a later phase.) There have also been instances of Elements
containing leading and / or trailing spaces.
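
To make the fragment problem concrete, here is a minimal sketch of
splitting a single fragment on spaces while remembering which
characters start and end each word. It is plain C++ and does not use
PDFNet; WordSpan and SplitFragment are hypothetical names, and the
fragment is assumed to be an ordinary single-byte string.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: a word plus the indices of its first and last
// characters within the fragment it came from.
struct WordSpan {
    std::string text;
    std::size_t first;
    std::size_t last;
};

// Split one text fragment on spaces, ignoring leading, trailing, and
// repeated spaces.
std::vector<WordSpan> SplitFragment(const std::string& fragment) {
    std::vector<WordSpan> words;
    bool inWord = false;
    for (std::size_t i = 0; i < fragment.size(); ++i) {
        if (fragment[i] != ' ') {
            if (!inWord) {
                // First non-blank character of a new word.
                words.push_back(WordSpan{std::string(), i, i});
                inWord = true;
            }
            words.back().text += fragment[i];
            words.back().last = i;
        } else {
            inWord = false;
        }
    }
    return words;
}
```

The stored indices play the same role as the first and last non-blank
character positions needed to look up per-character coordinates later.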

Since I need to split my output words when whitespace is seen, I
iterate over the characters. In the following fragment, the (x1,y1)
coordinate (lower-left of 1st character) seems to be correct in all
cases but I have not been able to come up with a general method for
getting the (x4,y4) coordinate (upper-right of last non-space
character) for all font types.
This is likely because I do not completely understand the
different coordinate spaces involved. Methods that seemed to work broke
when the input used different font types.
Also, the only reference I found in the PDFTron groups referenced
functions like CharBegin() and CharEnd() which seem to be obsolete.

// At this point, "elem" is e_text.
void DumpTextElement(Element & elem)
{
  bool bInString = false;
  GState pgstate;
  Font font;
  pdftron::PDF::Font::Type fontType;

  Matrix2D text_mtx, mtx, ctm;
  CharIterator itr, firstNonBlank, lastNonBlank;

  CString sOut;
  unsigned int ch_code;

  pgstate = elem.GetGState();
  font = pgstate.GetFont();
  fontType = font.GetType();
  if ( !font.IsSimple() )
    return;
  if ( !elem.HasTextMatrix() )
    return;

  text_mtx = elem.GetTextMatrix();
  ctm = elem.GetCTM();
  mtx = ctm * text_mtx;

  bInString = false;
  for ( itr = elem.GetCharIterator(); itr.HasNext(); itr.Next() )
  {
    // Get the current character and position.
    ch_code = itr.Current().char_code;
    if ( ch_code != ' ' )
    { // In a word
      lastNonBlank = itr;
      sOut += ch_code;
      if ( !bInString ) {
        bInString = true;
        firstNonBlank = itr;
      }
    } else { // In whitespace
      if ( bInString ) {
        OutputWord(font, mtx, sOut, firstNonBlank, lastNonBlank);
        bInString = false;
        sOut = "";
      }
    }
  } // for ( ..itr.. )
  if ( bInString ) {
    OutputWord(font, mtx, sOut, firstNonBlank, lastNonBlank);
  }
  return;
}

void OutputWord(Font & rFont, Matrix2D & rMtx, CString & rsOut,
    CharIterator & itrChStart, CharIterator & itrChEnd) {
  pdftron::PDF::Font::Type fontType;
  Matrix2D type3FontMatrix = Common::Matrix2D::IdentityMatrix();
  Rect fontbbox;

  double x1, y1, x2, y2, x3, y3, x4, y4;
  double dAscent, dDescent, dLastCharWidth;
  UInt16 UnitsPerEM = -1;

  fontType = rFont.GetType();
  fontbbox = rFont.GetBBox();
  if (fontType == Font::Type::e_Type3) {
    type3FontMatrix = rFont.GetType3FontMatrix();
  } else {
    return; // I WANT THIS TO WORK FOR ANY FONT TYPE. *********
  }

  // Get lower-left of first non-blank character. (x1,y1)
  x1 = itrChStart.Current().x;
  y1 = itrChStart.Current().y;

  // Get lower-left of last non-blank character. (x2,y2)
  x2 = itrChEnd.Current().x;
  y2 = itrChEnd.Current().y;

  dLastCharWidth = rFont.GetWidth(itrChEnd.Current().char_code);
  dAscent = rFont.GetAscent();
  dDescent = rFont.GetDescent();
  { // THIS SECTION IS SUSPECT **********************************
    // Get lower-right of last non-blank character. (x3,y3)
    x3 = x2 + (dLastCharWidth * type3FontMatrix.m_a * 72.0);
    y3 = y2;

    // Get upper-right of last non-blank character. (x4,y4)
    x4 = x3;
    y4 = y3 + ((fontbbox.y2 - fontbbox.y1) * type3FontMatrix.m_d * 72.0);
  } //***********************************************************

  // Convert to "user space" (scale and / or rotate as necessary).
  rMtx.Mult(x1, y1);
  rMtx.Mult(x4, y4);
  // Convert to inches.
  x1 /= 72.0; y1 /= 72.0;
  x4 /= 72.0; y4 /= 72.0;

  cout << x1 << "," << y1 << "," << x4 << "," << y4 << "," << rsOut << endl;
}
-------------
A: Did you try using pdftron.PDF.TextExtractor, as shown in the first
couple of snippets of the TextExtract sample project
(http://www.pdftron.com/net/samplecode.html#TextExtract)?

TextExtractor is doing lots of dark magic required to reconstruct
meaningful text from PDF documents :). Besides text, TextExtractor
offers access to positioning information for every character, word,
line, ... as well as associated font and styling information.