for example the text in this pdf
Hello, I’m Ron, an automated tech support bot
While you wait for one of our customer support representatives to get back to you, please check out some of these documentation pages:
Guides:
- Get text position in PDF documents - Getting text position
- Annotations.WidgetFlags - new WidgetFlags(options)
- PDFNet.Flattener - new Flattener()
- PDFNet.ContentReplacer - new ContentReplacer()
You can read more about text extraction and PDFs here: https://www.pdftron.com/documentation/web/guides/extraction/text-extract/
Copied from the top of that page:
Text extraction is based on an in-house heuristic algorithm which attempts to find the human-readable reading order in a document. The reading order is determined by a number of factors such as spacing, font size, font type, and more. What makes text extraction challenging is that there is no clear definition in the PDF specification that describes semantic information or logical structures.
The reading order for text extraction is not defined in the ISO PDF standard. In fact, there is no concept of sentences, paragraphs, tables, or anything similar in a typical PDF file. This means each PDF vendor is left to their own design/solution and will extract text with some differences. Therefore, reading order is not guaranteed to match the order that a typical user reading the document would follow.
The reading orders of a magazine, a newspaper article, and an academic article can all be quite different, both because a PDF lacks semantic information and because of how text is placed and ordered in the document. Different users may also have different expectations of what the correct reading order is.
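To make the point about reading order concrete, here is a minimal Python sketch. It is not PDFTron's algorithm and uses no PDFTron API; the Fragment type, the coordinates, and both ordering functions are hypothetical and exist only for illustration. It shows how the same positioned text on a two-column page comes out in two different orders depending on which heuristic the extractor applies.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """A piece of text with a position (hypothetical type for this sketch)."""
    text: str
    x: float  # left edge, in points
    y: float  # distance from the top of the page, in points


def row_major(fragments: List[Fragment]) -> List[str]:
    """Read across the whole page width, top to bottom."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_first(fragments: List[Fragment], column_boundary: float) -> List[str]:
    """Read everything left of the boundary first, then everything right of it."""
    ordered = sorted(fragments, key=lambda f: (f.x >= column_boundary, f.y))
    return [f.text for f in ordered]


# A made-up two-column page: the file stores only text and positions,
# with nothing that says "these two fragments belong to the same column".
page = [
    Fragment("Left column, line 1", 50, 100),
    Fragment("Right column, line 1", 320, 100),
    Fragment("Left column, line 2", 50, 115),
    Fragment("Right column, line 2", 320, 115),
]

print(row_major(page))
print(column_first(page, column_boundary=300))
```

Both results are defensible readings of the same positions, which is why two extractors (or two users) can disagree about the "correct" text order for the same PDF.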