Bad UTF16 - leading low surrogate

Antony_Ducommun · April 25, 2020, 11:42am

We get this error while using TextExtractor.GetAsText on some documents.

Is there a safe way to bypass or ignore this problem? We would be happy if a page that cannot be extracted simply returned an empty string, for instance. Ideally, whatever can be extracted would still be returned.

Here, the program terminates abruptly and we cannot catch the exception at all:

terminate called after throwing an instance of 'pdftron::Common::Exception'
what(): Exception:
Message: Bad UTF16 - leading low surrogate
Conditional expression: hiUnit <= 0xDBFF
Version : 7.1.0.74119
Platform : Linux
Architecture : AMD64
Filename : UnicodeUtils.cpp
Function : CodePoint_from_UTF16Nat_Surrogate
Linenumber : 1305

Aborted (core dumped)
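
For readers unfamiliar with the check in the trace: in UTF-16, a character outside the Basic Multilingual Plane is stored as a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF). The failed condition `hiUnit <= 0xDBFF` suggests the decoder met a low surrogate where a high one was expected. The sketch below is only an illustration of that rule and of a lenient "extract what can be extracted" mode. It is not PDFNet's code, and the names `decode_utf16_units` and `strict` are made up for the example.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def decode_utf16_units(units, strict=True):
    """Decode a list of 16-bit code units into a list of code points.

    strict=True  raises ValueError on an unpaired surrogate.
    strict=False substitutes U+FFFD and keeps going.
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                lo = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00))
                i += 2
                continue
            if strict:
                raise ValueError("Bad UTF16 - unpaired high surrogate")
            out.append(REPLACEMENT)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate in leading position: the case from the trace.
            if strict:
                raise ValueError("Bad UTF16 - leading low surrogate")
            out.append(REPLACEMENT)
        else:
            # Ordinary BMP code unit maps directly to a code point.
            out.append(u)
        i += 1
    return out
```

With `strict=False`, an input such as `[0xDE00, 0x41]` decodes to U+FFFD followed by "A" instead of failing, which is the kind of behaviour asked for above.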

Ryan · April 28, 2020, 5:37pm

This issue should already be resolved.

Please try running our latest stable production build.
http://nightly.pdftron.com.s3.amazonaws.com/stable/2020-04-27/7.1/PDFNetC64_2020-04-27_stable_rev74739.tar.gz

Antony_Ducommun · April 29, 2020, 8:42am

But that’s not yet released on the main website, correct?

https://www.pdftron.com/documentation/linux/download/linux/

What are the implications of installing a ‘nightly’ build? Shouldn’t we wait until you release it on the main website before putting it into production?

Ryan · April 29, 2020, 4:56pm

No, it is fine. The one on the site is literally one of the nightly builds. We just update the website link when it’s deemed important, either for new feature releases or for an important fix.

Antony_Ducommun · May 12, 2020, 11:06am

We tried the nightly build, but we are still getting this error:

terminate called after throwing an instance of 'pdftron::Common::Exception'
  what():  Exception: 
	 Message: Bad UTF16 - leading low surrogate
	 Conditional expression: hiUnit <= 0xDBFF
	 Version      : 7.1.1.74739
	 Platform     : Linux
	 Architecture : AMD64
	 Filename     : UnicodeUtils.cpp
	 Function     : CodePoint_from_UTF16Nat_Surrogate
	 Linenumber   : 1305

Ryan · May 13, 2020, 4:02pm

We would need access to the file(s) then.

Can you post them here, or submit them confidentially here: https://www.pdftron.com/form/request/

ben · December 9, 2022, 9:21pm

I’m facing this issue as well. I realize this was over two years ago, but is there a record of a resolution?

libc++abi: terminating with uncaught exception of type pdftron::Common::Exception: Exception: 
         Message: Bad UTF16 - leading low surrogate
         Conditional expression: hiUnit <= 0xDBFF
         Version      : 9.4.0-fd07a7defc
         Platform     : OSX
         Architecture : AMD64
         Filename     : UnicodeUtils.cpp
         Function     : CodePoint_from_UTF16Nat_Surrogate
         Linenumber   : 1623

SIGABRT: abort
PC=0x7ff80406430e m=16 sigcode=0
signal arrived during cgo execution

Edit: I’m using the Go package, if that makes a difference.

Thanks,
Ben
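
One possible stopgap while the root cause is investigated, offered as an assumption rather than a documented PDFTron recommendation: since the abort kills the whole process (and a SIGABRT raised during cgo execution cannot be intercepted with Go's `recover`), the extraction can be run in a child process so that a crash only costs that one page. A minimal sketch of the pattern, where `run_isolated` is a hypothetical helper name and the worker command is whatever program performs the extraction:

```python
import subprocess
import sys


def run_isolated(argv, timeout=60):
    """Run argv in a child process and return its stdout.

    Returns "" if the child exits non-zero, is killed by a signal
    (for example SIGABRT from an uncaught C++ exception), or times out.
    """
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return ""
    if proc.returncode != 0:
        # A negative return code on POSIX means the child died from a signal.
        return ""
    return proc.stdout


if __name__ == "__main__":
    # Stand-in for a real worker: a child that aborts, as in the trace above.
    crashed = run_isolated([sys.executable, "-c", "import os; os.abort()"])
    print(repr(crashed))
```

In practice `argv` would point at a small worker program (hypothetical here) that extracts a single page and prints the text, so the parent can fall back to an empty string for pages that trigger the abort.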

Ryan · December 10, 2022, 12:51am

Similar to above, we would need to see the input file itself, either here, or if confidential, submit a ticket here: