Searching a table inside a PDF is different (and worse) than in PDFexpress

WebViewer Version: 8.6.0

Do you have an issue with a specific file(s)? No
Can you reproduce using one of our samples or online demos? Yes
Are you using the WebViewer server? Yes (assuming the demo is using it)
Does the issue only happen on certain browsers? No
Is your issue related to a front-end framework? No
Is your issue related to annotations? No

Please give a brief summary of your issue:

Searching for content in a table inside a PDF is barely usable because the text is not parsed in the order in which it is displayed: the extracted text jumps around instead of going row by row.

Please describe your issue and provide steps to reproduce it:

  1. Obtain a PDF file that contains a table, e.g. a sample from DataTables example - PDF - image (just click the “PDF” button)
  2. Open it in the PDFTron demo
  3. Search for a term that spans more than one column
  4. Observe: no results
  5. Search for a term that sits within a single column
  6. Observe: the search results show that PDFTron has parsed the table in a very jagged way, jumping between columns within a row, etc.
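
To illustrate why steps 3 and 4 can come back empty, here is a toy sketch. It is not PDFTron's actual extraction logic: the cell data, the column-by-column "stream order", and the helper names `extractText` and `inReadingOrder` are all assumptions made up for illustration. It only shows that a phrase spanning two columns can be found only if the extracted text follows the visual row order.

```javascript
// Toy model: each table cell has a row, a column, and its text.
// The array order mimics a PDF whose cells were written column by
// column; this ordering is an assumption made for illustration.
const cells = [
  { row: 0, col: 0, text: 'Tiger Nixon' },
  { row: 1, col: 0, text: 'Garrett Winters' },
  { row: 0, col: 1, text: 'System Architect' },
  { row: 1, col: 1, text: 'Accountant' },
];

// Join the cell texts in the order given, as a naive extractor would.
function extractText(orderedCells) {
  return orderedCells.map((cell) => cell.text).join(' ');
}

// Re-order the cells the way a reader sees them:
// top to bottom, then left to right.
function inReadingOrder(unorderedCells) {
  return [...unorderedCells].sort((a, b) => a.row - b.row || a.col - b.col);
}

// A search term that spans two columns of the first row.
const query = 'Nixon System';
const streamText = extractText(cells);
const readingText = extractText(inReadingOrder(cells));

console.log(streamText.includes(query));  // false
console.log(readingText.includes(query)); // true
```

In the stream-ordered text, "Nixon" is followed by the next cell of the same column, so the two-column phrase never appears as a contiguous string.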

Now compare this to the PDFjsexpress demo. Here the parsing/searching works as expected.

Unfortunately, this currently blocks us from upgrading to PDFTron. Any ideas on how this can be worked around?


Hello @andre

I’ve tried to reproduce the issue following the steps you wrote but I could not. WebViewer search works as expected. Please see the images below:


Am I missing something or doing anything different from what you are doing?

Hi @dfelix, thanks for looking into this!
It seems you were incredibly lucky. :wink: Take almost any other example and it fails as described:


@dfelix any news on this issue? Thanks!

Please download the latest stable version of the SDK:

https://nightly-pdftron.s3-us-west-2.amazonaws.com/stable/2022-08-08/webviewer/WebViewer-8.7.0_2022-08-08_stable.zip

The latest official builds and the release channel are ready for production use; developer channel builds, however, do not get the same amount of testing and may still be changing.

Then add the following code, which uses an alternative method of extracting and parsing text from a PDF:

const { documentViewer, TextExtractorProcessingFlags } = instance.Core;

// Wait until a document is loaded before changing how its text is extracted.
documentViewer.addEventListener('documentLoaded', () => {
  const doc = documentViewer.getDocument();
  // Switch to z-order extraction, then refresh the document's text data.
  doc.setTextExtractorProcessingFlags([TextExtractorProcessingFlags.EXTRACT_USING_ZORDER]);
  doc.refreshTextData();
});
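
If it helps, the same workaround can be wrapped in a small function so it can be applied to whichever instance your app creates. This is only a sketch: `enableZOrderTextExtraction` is a hypothetical helper name, not part of the WebViewer API, and it assumes `instance` is the object your WebViewer setup resolves with. It uses only the calls shown above.

```javascript
// Hypothetical helper (not part of the WebViewer API): applies the
// workaround from this thread to an already-created WebViewer instance.
function enableZOrderTextExtraction(instance) {
  const { documentViewer, TextExtractorProcessingFlags } = instance.Core;

  // Re-apply the flags every time a new document finishes loading.
  documentViewer.addEventListener('documentLoaded', () => {
    const doc = documentViewer.getDocument();
    doc.setTextExtractorProcessingFlags([
      TextExtractorProcessingFlags.EXTRACT_USING_ZORDER,
    ]);
    doc.refreshTextData();
  });
}
```

You would call it once, right after the viewer is created, for example from the callback where you already receive `instance`.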