PDF text extraction for text encoded using PUA codes from Adobe, Apple, and IBM.

Q:

We have downloaded your PDFNet SDK and it meets our needs. The
application we are writing extracts some text from the document and
adds a URL annotation obtained by querying a CSV or XML file. It works
well, but in some PDF documents parts of the text are encoded in a way
that we cannot read as strings. Attached is an example of such a
document, where the part numbers cannot be read.
----

A:

The problem is that some PDF documents use vendor-specific encodings
that map character codes to the Unicode Private Use Area (PUA), so the
extracted strings contain code points with no standard meaning. In a
more recent build of PDFNet (for .NET) we added the ability to
recognize the commonly used corporate private-use areas from Adobe,
Apple, and IBM. Although this type of glyph naming is now deprecated by
Adobe, many fonts and PDF documents still use this convention. Please
let us know how the latest build of PDFNet
(http://www.pdftron.com/downloads/PDFNetDemo.zip) works for you.
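
To check whether a given document is affected, it can help to scan the
extracted string for private-use code points. The following is a
minimal Python sketch, not PDFNet code: the function name
find_private_use is hypothetical, and it assumes you already have the
extracted text as a Python string. It relies only on the standard
unicodedata module, which reports private-use characters with the
general category "Co".

```python
import unicodedata


def find_private_use(text):
    """Return (index, code point) pairs for private-use characters in text.

    Hypothetical helper for illustration; not part of PDFNet.
    """
    hits = []
    for index, ch in enumerate(text):
        # General category "Co" covers the BMP Private Use Area
        # (U+E000..U+F8FF) and the two supplementary private-use planes.
        if unicodedata.category(ch) == "Co":
            hits.append((index, f"U+{ord(ch):04X}"))
    return hits


if __name__ == "__main__":
    # Made-up sample string with one private-use code point in it.
    sample = "Part no. \uf6e1123"
    for index, code_point in find_private_use(sample):
        print(f"private-use character {code_point} at index {index}")
```

If this reports hits in the region where the part numbers should be,
the text is most likely stored with one of the vendor-specific
private-use mappings described above.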