Searching a table inside PDF is different (and worse) than in PDFexpress

andre · June 28, 2022, 4:59pm

WebViewer Version: 8.6.0

Do you have an issue with a specific file(s)? No
Can you reproduce using one of our samples or online demos? Yes
Are you using the WebViewer server? Yes (assuming the demo is using it)
Does the issue only happen on certain browsers? No
Is your issue related to a front-end framework? No
Is your issue related to annotations? No

Please give a brief summary of your issue:

Searching content in a table inside a PDF is hardly usable because the text is not parsed in the order it is displayed: the extraction jumps around instead of going row by row.

Please describe your issue and provide steps to reproduce it:

Obtain a PDF file with a table inside, e.g. a sample from DataTables example - PDF - image (just click the “PDF” button)
Open in PDFtron demo
Search for a term that spans across more than one column
Observe: no results
Search for a term that is just within one column
Observe: the search results show that pdftron has parsed the table in a very jagged way, jumping between columns within a row, etc.
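
To illustrate why extraction order matters for a phrase that spans columns, here is a minimal sketch. It is not WebViewer's actual extraction logic; the cell data, coordinates, and helper names (rowOrder, columnOrder, flatten) are all made up for illustration. It only shows that the same fragments, flattened in a different order, can make a cross-column query unmatchable.

```javascript
// Hypothetical text fragments from a two-row, two-column table.
// x/y are illustrative page coordinates (y grows downward).
const cells = [
  { text: 'Airi Satou', x: 0, y: 0 },
  { text: 'Accountant', x: 100, y: 0 },
  { text: 'Ashton Cox', x: 0, y: 20 },
  { text: 'Junior Technical Author', x: 100, y: 20 },
];

// Row-by-row reading order: top to bottom, then left to right.
function rowOrder(fragments) {
  return [...fragments].sort((a, b) => a.y - b.y || a.x - b.x);
}

// Column-by-column order, standing in for an extraction that does not follow rows.
function columnOrder(fragments) {
  return [...fragments].sort((a, b) => a.x - b.x || a.y - b.y);
}

// Join fragments into the flat string a plain text search would scan.
function flatten(fragments) {
  return fragments.map((f) => f.text).join(' ');
}

const query = 'Satou Accountant'; // spans two columns of the first row
console.log(flatten(rowOrder(cells)).includes(query));    // true
console.log(flatten(columnOrder(cells)).includes(query)); // false
```

In row order the two cells of the first row end up adjacent, so the query matches; in column order another cell sits between them and the same query finds nothing.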

Now compare this to the PDFjsexpress demo. Here the parsing/searching works as expected.

Unfortunately, this blocks us from upgrading to PDFtron at the moment. Any ideas how this can be worked around?

dfelix · June 29, 2022, 7:58pm

Hello @andre

I’ve tried to reproduce the issue following the steps you wrote but I could not. WebViewer search works as expected. Please see the images below:

Am I missing something or doing anything different from what you are doing?

andre · June 30, 2022, 8:42am

Hi @dfelix, thanks for looking into this!
It seems you were incredibly lucky. Take almost any other example and it fails as described:

andre · August 9, 2022, 3:00pm

@dfelix any news on this issue? Thanks!

Ryan · August 9, 2022, 8:42pm

Please download the latest stable build of the SDK:

https://nightly-pdftron.s3-us-west-2.amazonaws.com/stable/2022-08-08/webviewer/WebViewer-8.7.0_2022-08-08_stable.zip

The latest official builds and the Release channel are ready for production use. Developer channel builds do not receive the same amount of testing and may still be changing.

And then add the following code to use an alternate means of extracting/parsing text from a PDF.

const { documentViewer, TextExtractorProcessingFlags } = instance.Core;

// Once a document has loaded, switch text extraction to z-order and rebuild the text data.
documentViewer.addEventListener('documentLoaded', () => {
  const doc = documentViewer.getDocument();
  doc.setTextExtractorProcessingFlags([TextExtractorProcessingFlags.EXTRACT_USING_ZORDER]);
  doc.refreshTextData();
});
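
One possible way to check whether the flag has any effect is to compare the extracted page text before and after switching. The following is only a sketch under assumptions: it assumes that loadPageText(pageNumber) on the document resolves to the page's extracted text as a string, and that it is safe to await refreshTextData() (awaiting a non-promise is harmless, but the refresh may complete later than this code expects). The helper name enableZOrderAndDump is hypothetical.

```javascript
// Hypothetical helper: switches a loaded document to z-order extraction and
// returns the text of one page before and after, so the two can be compared.
// Assumption: doc.loadPageText(pageNumber) resolves to a string.
async function enableZOrderAndDump(instance, pageNumber = 1) {
  const { documentViewer, TextExtractorProcessingFlags } = instance.Core;
  const doc = documentViewer.getDocument();

  // Text as extracted with the default settings.
  const before = await doc.loadPageText(pageNumber);

  // Same two calls as in the snippet above.
  doc.setTextExtractorProcessingFlags([
    TextExtractorProcessingFlags.EXTRACT_USING_ZORDER,
  ]);
  await doc.refreshTextData();

  // Text as extracted after the flag was set.
  const after = await doc.loadPageText(pageNumber);

  return { before, after, changed: before !== after };
}
```

It would be called from inside the documentLoaded listener, for example enableZOrderAndDump(instance).then(console.log). If changed is false on a page containing a table, the flag may not be taking effect in the build being served, although identical text can also simply mean that page is ordered the same way either way.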

andre · August 11, 2022, 10:58am

Thanks @Ryan, that sounds like just the right way to do it. Will give it a try over the next days.

andre · August 19, 2022, 1:17pm

Hi @Ryan,

I finally got around to testing your suggestion. Instead of the nightly build, I used the 8.7.0 release version.
Unfortunately, it seems that this setting has no impact on the problem at hand.
The search results are appearing in the exact same order for me.

Is there any way to verify if the flag has been activated properly? Do you maybe have a test case I can try to make sure that I got it implemented correctly?

Thank you!

Ryan · August 19, 2022, 4:57pm

I followed the steps I described earlier and it worked fine for me.

“Do you maybe have a test case I can try to make sure that I got it implemented correctly?”

Yes, do the following.

Download the WebViewer SDK link I provided earlier.
Unzip in a folder your localhost serves.
Replace the samples/viewing/viewing/viewing.js with the one attached.
Open the sample above and try text searching.

viewing.js.txt (1.1 KB)

andre · August 21, 2022, 7:59pm

Thanks @Ryan for the full example! I can indeed verify your code.

However, replacing the library in your example with the release version of 8.7.0, it no longer works…
I tried both installing via npm and downloading the SDK.
Did this feature maybe not make it to the release?
Strangely, I’m not getting any typescript linter issues – EXTRACT_USING_ZORDER is properly defined.

Andrew_Yip · August 24, 2022, 9:50pm

Hi @andre ,

Which build of WebViewer are you using (can I get a link to it)? I tested the latest nightly and the 8.7 build on npm, and both seem to be working with the “EXTRACT_USING_ZORDER” option.

PDFTron nightly builds

npm install @pdftron/webviewer@8.7-nightly

Note: when using npm, make sure the public static folder that WebViewer points to is up to date.

Searching with EXTRACT_USING_ZORDER enabled wasn’t working in the official 8.7 release but has been fixed in the nightly builds. The fix should also be in the upcoming 8.8 release, which should come out in the next few weeks. It is also possible that your browser is caching an older version of WebViewer without the fix.

andre · August 25, 2022, 11:20am

Hi @Andrew_Yip,

yes, with the nightly build (using WebViewer-8.7.0_2022-08-24_stable.zip), the zorder is now working as advertised.

The one I tried before was the release version, installed via npm install @pdftron/webviewer@8.7.0. Thanks for pointing out that that was incomplete.

Looking forward to the new release!

Andrew_Yip · August 26, 2022, 12:57am

Hi,

Thank you for the update and I’m glad that worked for you. We are in the middle of testing and bug fixing for our next 8.8 release currently. Hopefully, it should be released by the end of next week but we could find something that could delay the release.

Best Regards