Extracting word coordinates from PDF.

Aaron_Gravesdale · November 14, 2011, 8:57pm

Q:

I am trying to extract the coordinates of the words in a PDF document, but
the coordinates are larger than the page size.
What am I doing wrong? Is this a bug?

The PDF is attached, along with the simple code I use to get X, Y, Width and Height.

A:

First of all, I am not sure about pageBitmap, since I cannot see from your code how it is constructed. Instead, I wrote the following test code:

TextExtractor te = new TextExtractor();
te.Begin(page);

// Crop box of the page, used here as the page bounding box.
Rect br = page.GetBox(Page.Box.e_crop);

TextExtractor.Line line;
TextExtractor.Word word;
for (line = te.GetFirstLine(); line.IsValid(); line = line.GetNextLine()) {
    for (word = line.GetFirstWord(); word.IsValid(); word = word.GetNextWord()) {
        if (word.GetStringLen() == 0) continue;  // skip empty words

        Rect r = word.GetBBox();
        // Report any word whose bounding box extends past the crop box.
        if (r.x1 < br.x1 || r.x2 > br.x2 ||
            r.y1 < br.y1 || r.y2 > br.y2)
        {
            Console.WriteLine("exceeds page bounding box.");
        }
    }
}

Running this test code on the first page, a word's bounding box exceeded the page's bounding box once; the values can be seen in the attached file. It looks to me as if the x1 value is slightly outside the page bounding box.

Note that an element on a page can lie outside the page's crop box; the region outside the box is simply cropped. As a test, if you use Rect br = page.GetBox(Page.Box.e_media) instead, you will find that all words are within it.
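
To make the crop box versus media box point concrete, here is a minimal sketch that does not use the PDF library at all. The Box struct is a hypothetical stand-in for the library's Rect, and the numbers in the usage note are made up for illustration; they are not taken from the attached file.

```csharp
using System;

// Hypothetical stand-in for a PDF rectangle (not the PDFNet Rect class).
// (X1, Y1) is the lower-left corner and (X2, Y2) the upper-right corner.
public readonly struct Box
{
    public readonly double X1, Y1, X2, Y2;

    public Box(double x1, double y1, double x2, double y2)
    {
        X1 = x1; Y1 = y1; X2 = x2; Y2 = y2;
    }

    // True when 'inner' lies entirely inside this box (edges included).
    public bool Contains(Box inner)
    {
        return inner.X1 >= X1 && inner.X2 <= X2 &&
               inner.Y1 >= Y1 && inner.Y2 <= Y2;
    }
}
```

For example, with an assumed media box new Box(0, 0, 612, 792), an assumed crop box new Box(10, 10, 602, 782) and a word box new Box(8.5, 700, 40, 712), crop.Contains(word) is false because the word starts to the left of the crop box, while media.Contains(word) is true. That is the same situation as in the test above: the word is on the page, but part of it falls in the cropped region.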