Convert highlights stored using char offsets in the content stream to rectangles/quads in PDF coords

Q:

We need a way to convert highlights stored as char offsets in the content stream to rectangles/quads so that we can display them on screen. Here is more background info:

  • We are currently using a third-party product to handle annotations. It stores strikethrough and highlight annotations as a series of character offsets on a page. So if a user has highlighted the word “Bo” in “Hello, my name is Bo”, the annotation metadata would be something like “18,20”: the highlight starts at character 18 and ends at character 20.

  • We are trying to use these same offsets to programmatically annotate the same document on Android using your library. That is why it is important for us to be able to calculate the number of characters on a page (including spaces) and the bounding box of each character.
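To make the offset convention concrete, here is a minimal sketch that treats a highlight as a half-open character range [start, end) over the page text, spaces included. `CharRange`, `FindRange` and `TextInRange` are hypothetical helper names and not part of PDFNet. A plain ASCII `std::string` stands in for the page text, which is an assumption: the real offsets index into whatever character sequence the third-party product extracts.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper type: a highlight as a half-open range [start, end)
// of character offsets into the page text (spaces count as characters).
struct CharRange {
    std::size_t start;
    std::size_t end;
};

// Finds the first occurrence of `needle` in `page_text` and writes its
// range to `out`. Returns false (leaving `out` untouched) if not found.
// Offsets are byte offsets, so this only matches character offsets for ASCII.
bool FindRange(const std::string& page_text, const std::string& needle, CharRange& out) {
    const std::size_t pos = page_text.find(needle);
    if (pos == std::string::npos) return false;
    out = CharRange{pos, pos + needle.size()};
    return true;
}

// Returns the text covered by the range `r`.
std::string TextInRange(const std::string& page_text, CharRange r) {
    assert(r.start <= r.end && r.end <= page_text.size());
    return page_text.substr(r.start, r.end - r.start);
}
```

With the example sentence, looking up "Bo" yields start 18 and end 20, and `TextInRange` on that range returns "Bo" again.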

We would also like to be able to convert from PDF highlight annotations to char offsets. Is this possible with PDFNet?

A:

You may want to take a look at pdftron::PDF::Highlights: https://www.pdftron.com/pdfnet/docs/PDFNetC/dc/d85/classpdftron_1_1_p_d_f_1_1_highlights.html

The issue with this format (and the reason it is no longer supported by many apps) is that it is tied to a specific content stream (for char offsets) or even to a specific text extraction algorithm (for word offsets; this one is especially bad since Adobe’s text extraction algorithms are completely undocumented).

You can use pdftron.PDF.Highlights.Load("my.xml") to load the file and then read the quads. You can call PDFViewCtrl.select(Highlights) to tell the viewer to select the text, or use the positioning info to create Highlight annotations that you can export to XFDF.

The following is a relevant code snippet from the TextSearch sample:

    Highlights hlts = …

    hlts.Begin(doc);
    while ( hlts.HasNext() ) {
        Page cur_page = doc.GetPage(hlts.GetCurrentPageNumber());

        // quads points at quad_count * 8 doubles (four x,y corners per quad)
        const double *quads;
        int quad_count = hlts.GetCurrentQuads(quads);

        for ( int i = 0; i < quad_count; ++i ) {
            // assume each quad is an axis-aligned rectangle
            const double *q = &quads[8*i];
            double x1 = min(min(min(q[0], q[2]), q[4]), q[6]);
            double x2 = max(max(max(q[0], q[2]), q[4]), q[6]);
            double y1 = min(min(min(q[1], q[3]), q[5]), q[7]);
            double y2 = max(max(max(q[1], q[3]), q[5]), q[7]);

            Annots::Link hyper_link = Annots::Link::Create(doc, Rect(x1, y1, x2, y2), Action::CreateURI(doc, "http://www.pdftron.com"));
            cur_page.AnnotPushBack(hyper_link);
        }
        hlts.Next();
    }
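The min/max step in the loop above can be pulled out into a small library-independent helper, sketched below. `BBox`, `QuadToBBox` and `NthQuadBBox` are hypothetical names introduced for illustration and are not PDFNet types. The sketch assumes the same flat layout the loop uses: eight doubles per quad, as four (x, y) corner pairs.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical plain struct standing in for a PDF rectangle:
// (x1, y1) is the lower-left corner, (x2, y2) the upper-right corner.
struct BBox {
    double x1, y1, x2, y2;
};

// Computes the axis-aligned bounding box of one quad.
// `q` must point at 8 doubles laid out as x,y pairs for the four corners.
BBox QuadToBBox(const double* q) {
    assert(q != nullptr);
    BBox b{q[0], q[1], q[0], q[1]};
    for (int k = 1; k < 4; ++k) {
        b.x1 = std::min(b.x1, q[2 * k]);
        b.x2 = std::max(b.x2, q[2 * k]);
        b.y1 = std::min(b.y1, q[2 * k + 1]);
        b.y2 = std::max(b.y2, q[2 * k + 1]);
    }
    return b;
}

// Returns the bounding box of the i-th quad in a flat array holding
// quad_count * 8 doubles.
BBox NthQuadBBox(const double* quads, int quad_count, int i) {
    assert(i >= 0 && i < quad_count);
    return QuadToBBox(quads + 8 * i);
}
```

The result does not depend on the order in which the corners are listed. For rotated text the box is larger than the quad itself, so in that case it may be better to keep the quad points rather than collapse them to a rectangle.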

For other language variants see: https://www.pdftron.com/pdfnet/samplecode.html#TextSearch

As for going the other way, from quads to offset XML, it seems the only public way to produce a pdftron.PDF.Highlights object that you can serialize via Highlights.save() is through TextSearch. Unfortunately, this may not be very helpful if you are starting from rectangles.

To solve this, perhaps you can get the text under the annotation using pdftron.PDF.TextExtractor.GetTextUnderAnnot(Rect):

https://www.pdftron.com/pdfnet/docs/PDFNetC/d3/d88/classpdftron_1_1_p_d_f_1_1_text_extractor.html#a6da304d82307150a5eff1b596e3b9c73

Then use the resulting string in TextSearch to get the relevant ‘Highlights’ (you should check that the bounding boxes of the matching text overlap your original highlight region), and from those get the char offsets in Adobe Highlights format.
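The overlap check mentioned above could look roughly like the following sketch. It is plain C++ with no PDFNet calls. `Box`, `IntersectionArea` and `MatchesHighlight` are hypothetical names, and the default threshold of 0.5 is an arbitrary assumption that you would tune for your documents. Both rectangles are assumed to be normalized (x1 <= x2, y1 <= y2) and in the same page coordinate space.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical rectangle type: lower-left (x1, y1), upper-right (x2, y2).
struct Box {
    double x1, y1, x2, y2;
};

double Area(const Box& b) {
    return (b.x2 - b.x1) * (b.y2 - b.y1);
}

// Area shared by two normalized rectangles; 0 if they do not overlap.
double IntersectionArea(const Box& a, const Box& b) {
    assert(a.x1 <= a.x2 && a.y1 <= a.y2);
    assert(b.x1 <= b.x2 && b.y1 <= b.y2);
    const double w = std::min(a.x2, b.x2) - std::max(a.x1, b.x1);
    const double h = std::min(a.y2, b.y2) - std::max(a.y1, b.y1);
    return (w > 0.0 && h > 0.0) ? w * h : 0.0;
}

// True when at least `min_fraction` of the search match's box lies inside
// the original highlight region. The 0.5 default is an assumed threshold.
bool MatchesHighlight(const Box& match, const Box& highlight, double min_fraction = 0.5) {
    const double area = Area(match);
    if (area <= 0.0) return false;
    return IntersectionArea(match, highlight) / area >= min_fraction;
}
```

This matters when the extracted string occurs more than once on the page: only the search hit whose box falls inside the original highlight should be kept.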