Searching a table inside a PDF is different (and worse) than in PDF.js Express

WebViewer Version: 8.6.0

Do you have an issue with a specific file(s)? No
Can you reproduce using one of our samples or online demos? Yes
Are you using the WebViewer server? Yes (assuming the demo is using it)
Does the issue only happen on certain browsers? No
Is your issue related to a front-end framework? No
Is your issue related to annotations? No

Please give a brief summary of your issue:

Searching content in a table inside a PDF is hardly usable, because the text is not parsed in the order in which it is displayed (the extracted text jumps around instead of going row by row).

Please describe your issue and provide steps to reproduce it:

  1. Obtain a PDF file with a table inside, e.g. a sample from DataTables example - PDF - image (just click the “PDF” button).
  2. Open it in the PDFTron demo.
  3. Search for a term that spans more than one column.
  4. Observe: no results.
  5. Search for a term that lies within a single column.
  6. Observe: the search results show that PDFTron has parsed the table in a very jagged way, jumping between columns within a row, etc.

Now compare this to the PDF.js Express demo. There the parsing/searching works as expected.

Unfortunately, this blocks us from upgrading to PDFTron at the moment. Any ideas on how to work around this?
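To illustrate the symptom described in steps 3 to 6, here is a toy model of why a phrase spanning two columns can go unmatched. It is only a sketch: the cell data is made up, and the two orderings are illustrative assumptions, not PDFTron's actual extraction logic.

```javascript
// Toy table: every cell knows its row, its column, and its text.
const cells = [
  { row: 0, col: 0, text: 'Tiger Nixon' },
  { row: 0, col: 1, text: 'System Architect' },
  { row: 1, col: 0, text: 'Garrett Winters' },
  { row: 1, col: 1, text: 'Accountant' },
];

// Join the cell texts in the order defined by the given comparator.
function extractText(table, compare) {
  return [...table].sort(compare).map((cell) => cell.text).join(' ');
}

// Reading order a user expects: left to right, then top to bottom.
const rowByRow = (a, b) => a.row - b.row || a.col - b.col;
// An order that walks down each column before moving to the next one.
const columnByColumn = (a, b) => a.col - b.col || a.row - b.row;

// A phrase that crosses the boundary between column 0 and column 1.
const phrase = 'Nixon System';

console.log(extractText(cells, rowByRow).includes(phrase));       // true
console.log(extractText(cells, columnByColumn).includes(phrase)); // false
```

A plain substring search can only match text that is adjacent in the extracted string, so any extraction order other than row by row breaks phrases that cross a column boundary.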

Hello @andre

I’ve tried to reproduce the issue following the steps you wrote but I could not. WebViewer search works as expected. Please see the images below:


Am I missing something or doing anything different from what you are doing?

Hi @dfelix, thanks for looking into this!
It seems you were incredibly lucky. :wink: Take almost any other example and it fails as described:


@dfelix any news on this issue? Thanks!

Please download this latest Stable version of the SDK

https://nightly-pdftron.s3-us-west-2.amazonaws.com/stable/2022-08-08/webviewer/WebViewer-8.7.0_2022-08-08_stable.zip

The latest official builds and the Release channel are ready for production use. The developer channel builds, however, do not get the same amount of testing and can be in a state of change.

Then add the following code to use an alternative way of extracting/parsing text from a PDF:

const { documentViewer, TextExtractorProcessingFlags } = instance.Core;

// Once a document has loaded, switch text extraction to z-order
// and rebuild the text data that search relies on.
documentViewer.addEventListener('documentLoaded', () => {
  const doc = documentViewer.getDocument();
  doc.setTextExtractorProcessingFlags([TextExtractorProcessingFlags.EXTRACT_USING_ZORDER]);
  doc.refreshTextData();
});
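One possible way to check whether the flag changes the extraction order is to read a page's extracted text and test it for a phrase that crosses a column boundary. The sketch below assumes that the document object exposes loadPageText(pageNumber), and that it either returns a promise or accepts a completion callback; pageContainsPhrase is a hypothetical helper, not part of WebViewer.

```javascript
// Resolve with the extracted text of one page. Both a returned promise and
// a completion callback are handled, because which style applies to your
// WebViewer version is an assumption here.
function getPageText(doc, pageNumber) {
  return new Promise((resolve, reject) => {
    const result = doc.loadPageText(pageNumber, resolve);
    if (result && typeof result.then === 'function') {
      result.then(resolve, reject);
    }
  });
}

// Hypothetical helper: true if `phrase` occurs in the page text after
// whitespace is collapsed and case is ignored.
async function pageContainsPhrase(doc, pageNumber, phrase) {
  const normalize = (s) => s.replace(/\s+/g, ' ').trim().toLowerCase();
  const text = await getPageText(doc, pageNumber);
  return normalize(text).includes(normalize(phrase));
}

// Possible usage inside the documentLoaded handler (search phrase is made up):
//   pageContainsPhrase(documentViewer.getDocument(), 1, 'Tiger Nixon System Architect')
//     .then((found) => console.log('row-wise phrase found:', found));
```

If the phrase is found only with the flag set, the flag is taking effect. Whether refreshTextData() has finished by the time the text is read is worth checking separately.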

Thanks @Ryan, that sounds like just the right way to do it. Will give it a try over the next days.

Hi @Ryan,

I finally got around to testing your suggestion. Instead of the nightly build, I used the 8.7.0 release version.
Unfortunately, it seems that this setting has no impact on the problem at hand.
The search results are appearing in the exact same order for me.

Is there any way to verify if the flag has been activated properly? Do you maybe have a test case I can try to make sure that I got it implemented correctly?

Thank you!

I followed the steps I described earlier and it worked fine for me.

Do you maybe have a test case I can try to make sure that I got it implemented correctly?

Yes, do the following.

  1. Download the WebViewer SDK from the link I provided earlier.
  2. Unzip it into a folder that your localhost serves.
  3. Replace samples/viewing/viewing/viewing.js with the attached file.
  4. Open the sample above and try text searching.

viewing.js.txt (1.1 KB)

Thanks @Ryan for the full example! I can indeed verify your code.

However, replacing the library in your example with the release version of 8.7.0, it no longer works…
I tried both installing via npm and downloading the SDK.
Did this feature maybe not make it to the release?
Strangely, I’m not getting any TypeScript linter issues – EXTRACT_USING_ZORDER is properly defined.

Hi @andre ,

Which build of WebViewer are you using (can I get a link to it)? I tested the latest nightly and the 8.7 build on npm, and both seem to work with the “EXTRACT_USING_ZORDER” option.

PDFTron nightly builds

npm install @pdftron/webviewer@8.7-nightly

^ When using npm, make sure the public static folder that WebViewer points to is up to date.
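As a rough sketch of keeping that static folder in sync after an npm update, a small Node script can replace the served copy with the assets from the installed package. It assumes the package ships its static assets under node_modules/@pdftron/webviewer/public and that fs.cpSync is available (Node 16.7 or later); copyWebViewerAssets and the destination path are hypothetical.

```javascript
const fs = require('fs');

// Replace the served WebViewer assets with the ones from the installed
// package. Both paths are parameters because the destination depends on
// the project layout.
function copyWebViewerAssets(sourceDir, targetDir) {
  // Delete the old copy first so files from a previous version cannot linger.
  fs.rmSync(targetDir, { recursive: true, force: true });
  fs.cpSync(sourceDir, targetDir, { recursive: true });
  // Return the top-level entries so the caller can log what was copied.
  return fs.readdirSync(targetDir);
}

// Example call (the destination path is hypothetical):
//   copyWebViewerAssets('node_modules/@pdftron/webviewer/public', 'public/webviewer');
```

Note that the target folder is deleted before copying, so it should contain nothing except the WebViewer assets.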

Searching with EXTRACT_USING_ZORDER enabled wasn’t working in the official 8.7 release but has been fixed in the nightly builds. The fix should also be in the upcoming 8.8 release, which should come out in the next few weeks. It is also possible that your browser is caching an older version of WebViewer without the fix.

Hi @Andrew_Yip,

yes, with the nightly build (using WebViewer-8.7.0_2022-08-24_stable.zip), the zorder is now working as advertised.

The one I tried before was the release version, installed via npm install @pdftron/webviewer@8.7.0. Thanks for pointing out that that was incomplete.

Looking forward to the new release!

Hi,

Thank you for the update, and I’m glad that worked for you. We are currently in the middle of testing and bug fixing for our next release, 8.8. We hope to release it by the end of next week, but we could still find something that delays it.

Best Regards