Web viewer can't find search if search term is across 2 pages

mark.aziz · August 25, 2022, 1:33am

WebViewer Version: 8.7.0

Please give a brief summary of your issue:
WebViewer can’t find a search term that spans two pages.

Please describe your issue and provide steps to reproduce it:
I’m using Web Viewer for React.
If I search for a sentence in WebViewer and that sentence spans two pages (e.g., it starts near the end of page 1 and ends on page 2), the search does not find it. I tried both the UI search bar and a programmatic search by calling “textSearchInit” on the documentViewer. Neither approach worked. Is there a way to do this?

Here is the code I used:

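// PAGE_STOP makes the search report back at the end of each page;
// HIGHLIGHT asks for highlight information with each result.
const mode = Search.Mode.PAGE_STOP | Search.Mode.HIGHLIGHT;
const searchOptions = {
	// Search the whole document rather than stopping at the first hit.
	fullSearch: true,
	// Called for each search result.
	onResult: (result) => {
		if (result.resultCode === Search.ResultCode.FOUND) {
			documentViewer.displaySearchResult(result);
		}
	},
};

// searchTerm is the sentence that spans two pages.
documentViewer.textSearchInit(searchTerm, mode, searchOptions);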

tgordon · August 25, 2022, 6:35pm

Hello mark.aziz,

This behavior is expected, and Adobe and Chrome have the same limitation. However, there are potential solutions.

Text extraction/ordering is not defined at all in the ISO PDF standard. In fact, there is no concept of sentence, paragraph, tables, or anything similar, in a typical PDF file. This means each PDF vendor is left to their own design/implementation, and will extract text differently.

Background:
This is a difficult problem because PDFs don’t have a concept of reading order, so the order in which PDFTron extracts the text may not always match what a user expects.

For example, suppose the document has a header, content, and footer on each page. A user might expect search results to extend across pages based only on the content, ignoring the header and footer. The extracted text of one page, however, ends with its footer, and the next page starts with its header, so the sentence is not contiguous in the extracted text.

How this can be handled:
With WebViewer, you may be able to do this by using the lower-level text APIs to get the text from each page, appending the page texts together, and then doing a string search through the combined text.

Our text position sample shows how to use the loadPageText and getTextPosition functions: PDFTron Systems Inc. | Documentation
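
A minimal sketch of the combined-text approach described above, not a definitive implementation. The helper names findAcrossPages and loadAllPageTexts are hypothetical. The sketch assumes that doc is the document returned by documentViewer.getDocument(), that doc.loadPageText(pageNumber) resolves to a string, and that pages can be joined with a single space. The search is an exact, case-sensitive match on the extracted text, so headers, footers, and line breaks in the extracted text will still prevent a match.

```javascript
// Assumption: `doc` comes from documentViewer.getDocument() and
// loadPageText(pageNumber) resolves to that page's text (pages are 1-based).
async function loadAllPageTexts(doc) {
  const texts = [];
  for (let page = 1; page <= doc.getPageCount(); page++) {
    texts.push(await doc.loadPageText(page));
  }
  return texts;
}

// Search `term` in the concatenation of all page texts and report, for each
// match, which slice of which page it covers. Offsets are per-page string
// indices (end is exclusive).
function findAcrossPages(pageTexts, term, separator = " ") {
  const results = [];
  if (term.length === 0) return results;

  // Build the combined string and remember where each page starts in it.
  const pageStarts = [];
  let combined = "";
  pageTexts.forEach((text, i) => {
    if (i > 0) combined += separator;
    pageStarts.push(combined.length);
    combined += text;
  });

  let from = 0;
  while (true) {
    const start = combined.indexOf(term, from);
    if (start === -1) break;
    const end = start + term.length;

    // Intersect the match range with each page's range.
    const segments = [];
    for (let p = 0; p < pageTexts.length; p++) {
      const pageStart = pageStarts[p];
      const pageEnd = pageStart + pageTexts[p].length;
      const s = Math.max(start, pageStart);
      const e = Math.min(end, pageEnd);
      if (s < e) {
        segments.push({ pageNumber: p + 1, start: s - pageStart, end: e - pageStart });
      }
    }
    // Skip matches that fall entirely inside a separator.
    if (segments.length > 0) results.push({ start, end, segments });
    from = start + 1;
  }
  return results;
}
```

Each segment's per-page start and end offsets are the kind of character indices that could then be passed to getTextPosition to obtain quads for highlighting on each page. See the text position sample for the exact call.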

Another option is to extract all the text from the PDF and pass it to a dedicated search engine, which is usually done server side. From there you could find more complicated results and then try to map them back to the bounding boxes in the PDF for highlighting. This mapping back is non-trivial and possibly error prone, especially for complicated text with diacritics, such as Arabic or Thai.
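
To illustrate why mapping back takes care, here is a small sketch that searches whitespace-normalized text while keeping a map from each normalized character to its index in the original extracted text. The function names normalizeWithMap and findNormalized are hypothetical, and the sketch works on UTF-16 code units only. It makes no attempt to handle diacritics, ligatures, or right-to-left text, which is where most of the difficulty lies.

```javascript
// Collapse every run of whitespace to one space. For each character of the
// normalized string, map[i] is its index in the original string.
function normalizeWithMap(text) {
  let normalized = "";
  const map = [];
  let inWhitespace = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (/\s/.test(ch)) {
      if (!inWhitespace) {
        normalized += " ";
        map.push(i);
      }
      inWhitespace = true;
    } else {
      normalized += ch;
      map.push(i);
      inWhitespace = false;
    }
  }
  return { normalized, map };
}

// Find the first occurrence of `term`, ignoring differences in whitespace,
// and return its range in the ORIGINAL text (end is exclusive), or null.
function findNormalized(text, term) {
  const { normalized, map } = normalizeWithMap(text);
  const needle = normalizeWithMap(term).normalized;
  if (needle.length === 0) return null;
  const hit = normalized.indexOf(needle);
  if (hit === -1) return null;
  return { start: map[hit], end: map[hit + needle.length - 1] + 1 };
}
```

The returned range refers to the original extracted text, so it can be used with the same per-page offsets that the extraction produced.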

Best regards,
Tyler Gordon
Web Development Support Engineer
PDFTron