Q: I need to extract text from a PDF at specific locations. I specify
the locations through rectangles whose corner coordinates I pass to
PDFView.SelectByRect() or SelectByStruct().
A general question:
- When I need to extract text from a PDF without displaying it, I hate
instantiating a PDFView control just for that. Is there a way to
extract the text at a lower level? Ideally, those functions would
reside in pdftron::PDF::Page. My concern is performance (I have to
loop through a couple million pages and have to do it quickly) as well
as memory usage.
Specific questions about PDFView.SelectByRect():
What are the four parameters x1, y1, x2, y2 exactly? The apiref.html
says briefly "PDF coordinates". But my experiments show that the
origin for those values has to be the TOP LEFT corner of the PDF -
which is different from the usual bottom left corner. Please mention
in your reply what x2, y2 are supposed to be, too.
I need to select text EXACTLY by a rectangle. No characters outside of
the rectangle may be returned. Unfortunately, SelectByRect() always
returns whole words. How can I set the granularity to character level,
so that only characters intersecting with my coordinates are returned?
SelectByStruct() seemed promising, but has the (for me unwanted) side
effect of selecting whole horizontal lines.
A: Regarding text extraction from a rectangle, the right approach is to
use the TextExtractor class (as shown in the TextExtract sample project).
You can either pass an optional clipping rectangle as the second
parameter to the text_extractor.Begin(page, box) method, or you can iterate
through all words on the page and test for intersection between each word's
bounding box (word.GetBBox()) and the selection rectangle. Either of
these will be very fast and more memory-efficient than using PDFView.
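The second option (testing each word's bounding box against the selection rectangle) comes down to a plain rectangle-overlap check. The following is only a minimal sketch of that check in Python. It does not call the PDFTron API: `Rect`, `intersects`, `words_in_rect` and the sample word list are hypothetical names and data made up for illustration, and in real code the boxes would come from word.GetBBox().

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle given by two opposite corners (hypothetical helper)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def normalized(self) -> "Rect":
        # Reorder the corners so that (x1, y1) is the minimum corner.
        return Rect(min(self.x1, self.x2), min(self.y1, self.y2),
                    max(self.x1, self.x2), max(self.y1, self.y2))


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share some area (touching edges do not count)."""
    a, b = a.normalized(), b.normalized()
    return a.x1 < b.x2 and b.x1 < a.x2 and a.y1 < b.y2 and b.y1 < a.y2


def words_in_rect(words: Iterable[Tuple[str, Rect]], selection: Rect) -> List[str]:
    """Keep the words whose bounding box overlaps the selection rectangle."""
    return [text for text, bbox in words if intersects(bbox, selection)]


# Made-up sample data: (word text, bounding box in page coordinates).
words = [
    ("Invoice", Rect(72, 700, 120, 712)),
    ("Total", Rect(300, 700, 330, 712)),
    ("Footer", Rect(72, 40, 110, 52)),
]
selection = Rect(60, 690, 200, 720)
selected = words_in_rect(words, selection)
print(selected)  # prints ['Invoice']
```

The same overlap test could be applied to per-character boxes instead of word boxes, assuming such boxes are available from the extractor.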
What are the four parameters x1, y1, x2, y2 in pdfview.SelectByRect() exactly?
These are the coordinates of the selection rectangle, expressed in
screen coordinates (not PDF coordinates; thanks for pointing this out).
The screen coordinate system has its origin at the top left corner and
is measured in pixels.
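To illustrate the difference between the two coordinate systems, here is a rough Python sketch of mapping a screen point (top-left origin, pixels) to a page point (bottom-left origin, points). It rests on several assumptions that are not stated in the answer above: the page's top left corner sits exactly at the screen origin, the page is not rotated or scrolled, the page box starts at (0, 0), and a zoom of 1.0 corresponds to 96 pixels per inch. `screen_to_page` and `screen_rect_to_page` are hypothetical helpers, not PDFView functions.

```python
from typing import Tuple


def screen_to_page(x_px: float, y_px: float, page_height_pt: float,
                   zoom: float = 1.0, dpi: float = 96.0) -> Tuple[float, float]:
    """Map a screen pixel to PDF page coordinates under the stated assumptions."""
    points_per_pixel = 72.0 / (dpi * zoom)  # 72 points per inch
    x_pt = x_px * points_per_pixel
    # Flip the vertical axis: screen y grows downward, page y grows upward.
    y_pt = page_height_pt - y_px * points_per_pixel
    return x_pt, y_pt


def screen_rect_to_page(x1: float, y1: float, x2: float, y2: float,
                        page_height_pt: float, zoom: float = 1.0,
                        dpi: float = 96.0) -> Tuple[float, float, float, float]:
    """Convert a screen rectangle and return it as (left, bottom, right, top)."""
    ax, ay = screen_to_page(x1, y1, page_height_pt, zoom, dpi)
    bx, by = screen_to_page(x2, y2, page_height_pt, zoom, dpi)
    return min(ax, bx), min(ay, by), max(ax, bx), max(ay, by)


# A US Letter page is 792 points tall; one inch in from the top left corner:
print(screen_to_page(96, 96, 792))  # prints (72.0, 720.0)
```

A real viewer also has to account for scroll position, page rotation and the page's position inside the window, so this is only meant to show why the origin appears to be at the top left.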
I need to select text EXACTLY by a rectangle. No characters
outside of the rectangle may be returned.
To achieve this, use the TextExtractor class to extract text from the PDF,
pass the selection rectangle as the second parameter, and
TextExtractor.ProcessingFlags.e_remove_hidden_text as the third
parameter in the call to text_extractor.Begin(page, select,