What is the logic of loadPageText when the text in a PDF is laid out in tables or columns?

nuttakit · May 29, 2021, 3:00pm

for example the text in this pdf

Matt_Parizeau · June 1, 2021, 9:40pm

You can read more about text extraction and PDFs here: https://www.pdftron.com/documentation/web/guides/extraction/text-extract/

Copied from the top of that page:

Text extraction is based on an in-house heuristic algorithm which attempts to find the human-readable reading order in a document. The reading order is determined by a number of factors such as spacing, font size, font type, and more. What makes text extraction challenging is that the PDF specification has no clear definition of semantic information or logical structure.

The reading order for text extraction is not defined in the ISO PDF standard. In fact, there is no concept of sentences, paragraphs, tables, or anything similar in a typical PDF file. This means each PDF vendor is left to its own design and will extract text with some differences. Therefore, the reading order is not guaranteed to match the order that a typical user reading the document would follow.

The reading orders of a magazine, a newspaper article, and an academic article are all quite different, both because a PDF lacks semantic information and because of how the text is placed and ordered in the document. Different users may also have different expectations of the correct reading order.
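
To illustrate why tables and columns are ambiguous, here is a minimal TypeScript sketch. It is not PDFTron's algorithm, which the documentation describes only as an in-house heuristic. The `Fragment` shape, the `columnGap` threshold, and both function names are hypothetical and exist only for this example. It shows the same four text fragments producing two different, equally defensible reading orders.

```typescript
// Hypothetical shape: a run of text plus the top-left corner of its box.
interface Fragment {
  text: string;
  x: number; // left edge, in page units
  y: number; // top edge, growing downward
}

// Row-first order: top to bottom, then left to right within a row.
// This is what a reader usually expects from a table.
function rowMajor(fragments: Fragment[]): string[] {
  return [...fragments]
    .sort((a, b) => a.y - b.y || a.x - b.x)
    .map((f) => f.text);
}

// Column-first order: split into columns wherever the horizontal jump
// between neighbouring fragments exceeds `columnGap` (an assumed threshold),
// then read each column top to bottom. This suits a two-column article.
function columnMajor(fragments: Fragment[], columnGap = 50): string[] {
  const byX = [...fragments].sort((a, b) => a.x - b.x);
  const columns: Fragment[][] = [];
  let current: Fragment[] = [];
  let lastX = -Infinity;
  for (const frag of byX) {
    if (current.length === 0 || frag.x - lastX > columnGap) {
      current = [frag];
      columns.push(current);
    } else {
      current.push(frag);
    }
    lastX = frag.x;
  }
  const out: string[] = [];
  for (const col of columns) {
    const sorted = [...col].sort((a, b) => a.y - b.y);
    for (const f of sorted) out.push(f.text);
  }
  return out;
}

// Made-up sample data: two rows, two columns.
const sample: Fragment[] = [
  { text: "A1", x: 50, y: 100 },
  { text: "B1", x: 320, y: 100 },
  { text: "A2", x: 50, y: 120 },
  { text: "B2", x: 320, y: 120 },
];

console.log(rowMajor(sample));    // [ 'A1', 'B1', 'A2', 'B2' ]
console.log(columnMajor(sample)); // [ 'A1', 'A2', 'B1', 'B2' ]
```

Nothing in the fragment positions says whether this is a table (row-first is right) or a two-column article (column-first is right), so any extractor has to guess from cues such as spacing and font.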