How to programmatically extract text within a given rectangle (x, y coordinates)?

Question:

How can we programmatically extract all text within a given rectangle (defined by the coordinates of its top-left and bottom-right corners)?

Answer:

Text within a given rectangle can be extracted by filtering the array of character positions against the rectangle's x, y coordinates and then concatenating the matching characters into a string. WebViewer stores the x and y coordinates of each character in an array, and it also stores all of the page's text as a single string.
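To make the filter-and-concatenate idea concrete, here is a minimal sketch that runs on mock data, without WebViewer. The function name textInRect, the rect object and the sample values are hypothetical. It assumes that the quad at index i describes the character text[i], that (x4, y4) is a character's top-left corner and (x2, y2) its bottom-right corner, and that y grows downward, which is what the filter condition in this answer implies.

```javascript
// Hypothetical helper, not part of the WebViewer API.
// Assumption: quads[i] describes text[i]; (x4, y4) is the top-left corner
// and (x2, y2) the bottom-right corner of the character, with y growing downward.
function textInRect(text, quads, rect) {
  let result = '';
  quads.forEach((q, i) => {
    // keep a character only if its quad lies fully inside the rectangle
    const inside =
      q.x4 >= rect.left && q.y4 >= rect.top &&
      q.x2 <= rect.right && q.y2 <= rect.bottom;
    if (inside) result += text[i];
  });
  return result;
}

// Mock data: three characters laid out left to right, each 10 units wide.
const sampleText = 'abc';
const sampleQuads = [
  { x4: 0, y4: 0, x2: 10, y2: 10 },
  { x4: 10, y4: 0, x2: 20, y2: 10 },
  { x4: 20, y4: 0, x2: 30, y2: 10 },
];

// The rectangle starts at x = 10, so only "b" and "c" fit inside it.
console.log(textInRect(sampleText, sampleQuads, { left: 10, top: 0, right: 30, bottom: 10 })); // prints "bc"
```

Because the check requires the whole quad to be inside the rectangle, characters that only partially overlap the rectangle are dropped.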

A custom function can be built using the PDFTron SDK's low-level API methods, for example the loadPageText and getTextPosition methods of the document instance. Here is one possible solution: you pass the page index and your coordinates within the PDF page, and the function returns the text.

The following code extracts the text between a given top-left and bottom-right corner:

viewerElement.addEventListener('documentLoaded', async () => {
  const { docViewer } = viewer.getInstance();
  const doc = docViewer.getDocument();

  // top-left and bottom-right corners of the rectangle; the filter in
  // extractText only matches when top_x <= bottom_x and top_y <= bottom_y
  const top_x = 310, top_y = 320;
  const bottom_x = 560, bottom_y = 470;
  const pageIndex = 0;

  const text = await extractText(doc, pageIndex, top_x, top_y, bottom_x, bottom_y);
  console.log(text);
});

const extractText = (doc, pageIndex, top_x, top_y, bottom_x, bottom_y) => {
  return new Promise(resolve => {
    doc.loadPageText(pageIndex, text => {
      doc.getTextPosition(pageIndex, 0, text.length, (arr) => {

        // temp array to store the indices of the matching characters
        const indices = [];

        // keep only the characters whose quad lies inside the rectangle:
        // (x4, y4) is compared with the top-left corner and
        // (x2, y2) with the bottom-right corner
        arr = arr.filter((item, index) => {
          if (item.x4 >= top_x && item.y4 >= top_y && item.x2 <= bottom_x && item.y2 <= bottom_y) {
            indices.push(index);
            return true;
          }
          return false;
        });

        // concatenate chars into string
        let str = '';
        for (let i = 0, len = indices.length; i < len; i++) {
          str += text[indices[i]];
        }

        // filtered arr can be used for other purposes, e.g. debugging

        // return/resolve concatenated string
        resolve(str);
      });
    });
  });
};

Here is the screenshot, showing the result
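The filter in extractText can only match characters when top_x <= bottom_x and top_y <= bottom_y, so corners passed in the wrong order silently produce an empty string. One way to guard against that is a small helper that orders the two corners first. This is only a sketch: normalizeRect is a hypothetical name and not part of the WebViewer API.

```javascript
// Hypothetical helper, not part of the WebViewer API: takes two opposite
// corners in any order and returns them as top-left / bottom-right,
// assuming smaller x is "left" and smaller y is "top".
function normalizeRect(xA, yA, xB, yB) {
  return {
    top_x: Math.min(xA, xB),
    top_y: Math.min(yA, yB),
    bottom_x: Math.max(xA, xB),
    bottom_y: Math.max(yA, yB),
  };
}

// Corners given in "reverse" order are swapped into place.
const corners = normalizeRect(560, 470, 310, 320);
console.log(corners); // { top_x: 310, top_y: 320, bottom_x: 560, bottom_y: 470 }
```

The result can then be passed on as extractText(doc, pageIndex, corners.top_x, corners.top_y, corners.bottom_x, corners.bottom_y).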
