How do I get the number of words on a PDF page?

Aaron_Gravesdale · July 31, 2007, 10:10pm

Q:

I want to obtain the total number of words in a PDF file. To do this
we need to work with the text layer of the PDF. Can you please point
me to any references for identifying words in the text layer? We get
text layer information from 'Element::e_text', but an e_text element
can represent a single word, a single line, a line with an arbitrary
number of words, etc., so this information alone is not sufficient to
identify words.
-----
A:

There are a couple of options.

You could use the PDFNet SDK content extraction API to recognize words
from text runs, taking into account positioning information etc. (see
http://www.pdftron.com/net/faq.html#text_00). This is not a trivial
task, though.
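
To give a rough idea of what "taking into account positioning
information" involves, here is a minimal sketch of one common
heuristic: start a new word at whitespace, at a change of baseline, or
when the horizontal gap between characters is large relative to the
character width. The Glyph type, the WordGrouper class, and the
threshold values are all hypothetical and are not part of PDFNet; the
sketch also assumes the characters are already in reading order, which
real PDF content does not guarantee.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical glyph record (NOT a PDFNet type): one character plus its
// horizontal extent and baseline, in page units.
public readonly struct Glyph
{
    public readonly char Ch;
    public readonly double X1, X2, Baseline;

    public Glyph(char ch, double x1, double x2, double baseline)
    {
        Ch = ch; X1 = x1; X2 = x2; Baseline = baseline;
    }
}

public static class WordGrouper
{
    // Groups glyphs (assumed to be in reading order) into words.
    // gapFactor and lineTolerance are illustrative defaults, not tuned values.
    public static List<string> GroupIntoWords(
        IEnumerable<Glyph> glyphs, double gapFactor = 0.3, double lineTolerance = 1.0)
    {
        var words = new List<string>();
        var current = new StringBuilder();
        Glyph? prev = null;

        foreach (Glyph g in glyphs)
        {
            bool breakBefore = false;
            if (prev.HasValue)
            {
                Glyph p = prev.Value;
                double gap = g.X1 - p.X2;      // horizontal space between glyphs
                double width = p.X2 - p.X1;    // width of the previous glyph
                bool newLine = Math.Abs(g.Baseline - p.Baseline) > lineTolerance;
                breakBefore = newLine || gap > gapFactor * width;
            }

            // Close the current word on a positional break or on whitespace.
            if (breakBefore || char.IsWhiteSpace(g.Ch)) Flush(words, current);
            if (!char.IsWhiteSpace(g.Ch)) current.Append(g.Ch);
            prev = g;
        }

        Flush(words, current);
        return words;
    }

    private static void Flush(List<string> words, StringBuilder current)
    {
        if (current.Length > 0)
        {
            words.Add(current.ToString());
            current.Clear();
        }
    }
}
```

Even this simple version ignores rotated text, kerning, hyphenation,
and multi-column layouts, which is part of why the task is not
trivial.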

A simpler way is to use the pdftron.PDF.TextExtractor API, which is
available in PDFNet SDK v.3.7 and above (to download a preview version,
please use the following link: www.pdftron.com/downloads/PDFNetPreviewDemo.zip).

The TextExtractor class can be used to reconstruct words and to obtain
a word count for a PDF page.
The following sample code illustrates how to extract words and
positioning information from a PDF page and how to obtain the word
count for the page. You can find updated code in the TextExtract
sample project.

...
PDFDoc doc = new PDFDoc("my.pdf");
doc.InitSecurityHandler();

TextExtractor txt = new TextExtractor();
PageIterator itr = doc.PageFind(1);
Rect word_bbox = new Rect();
if (itr != doc.PageEnd()) {
  txt.Begin(itr.Current());

  // Example 1. Get the word count.
  Console.WriteLine("Word Count: {0}\n\n", txt.GetWordCount());

  // Example 2. Extract words one by one.
  String word;
  while ((word = txt.GetNextWord(word_bbox)) != null) {
    Console.WriteLine("{0} \t bbox: {1}, {2}, {3}, {4}\n",
        word, word_bbox.x1, word_bbox.y1, word_bbox.x2, word_bbox.y2);
  }

  // Example 3. Get all words as a single string.
  // Words will be separated with space (i.e. ' ') or new line
  // (i.e. '\n') characters.
  String text = txt.GetAsText();
  Console.WriteLine("\n\n- GetAsText --------------- \n{0}\n", text);

  // Example 4. Return the XML logical structure for the page.
  text = txt.GetAsXML(false, true);
  Console.WriteLine("\n\n- GetAsXML -------------------- \n{0}\n", text);
}
else Console.WriteLine("Page not found.");
doc.Close();
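
Since GetAsText() returns words separated by space or new line
characters, the count can also be cross-checked by splitting that
string on whitespace. The following is only a sketch: WordCounter and
CountWords are hypothetical helper names, not part of PDFNet, and the
result is not guaranteed to match GetWordCount() in every case.

```csharp
using System;

// Hypothetical helper (not part of PDFNet): counts whitespace-separated
// tokens in a string such as the one returned by TextExtractor.GetAsText().
public static class WordCounter
{
    private static readonly char[] Separators = { ' ', '\n', '\r', '\t' };

    public static int CountWords(string text)
    {
        if (string.IsNullOrEmpty(text)) return 0;

        // RemoveEmptyEntries drops the empty strings produced by
        // consecutive separators, so runs of spaces count once.
        return text.Split(Separators, StringSplitOptions.RemoveEmptyEntries).Length;
    }
}
```

With the sample above, this would be called as
WordCounter.CountWords(txt.GetAsText()). To get a total for the whole
document rather than a single page, the per-page counts would need to
be summed while iterating over all pages.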