Why is MapToUnicode returning very high values?


I'm having trouble getting the unicode value of a character in an
embedded font. Here's my code:

ElementReader page_reader;
page_reader.Begin( *page_it );

Element* el;
while( el = page_reader.Next() ) {
if( el->GetType() == Element::e_text ) {
   PDF::Rect rc;
   if( el->GetBBox( rc ) ) {
     PDF::Font font =
     PDF::CharIterator ch_end = el->CharEnd();
     for( PDF::CharIterator ch_it = el->CharBegin(); ch_it != ch_end; ch_it++ ) {
        PDF::UnicodeArray converted;
        font.MapToUnicode((*ch_it).char_code, converted );
            // !!! converted contains very high values !!!

We looked at the test file, and it seems that the text is represented
using a custom font encoding, which prevents correct Unicode text
extraction (copying and pasting from Acrobat also produces garbage).

It is possible that the book's author is using a custom font encoding
as a way to prevent content/text extraction from the book.


Is it possible to detect whether a PDF font's encoding is custom (i.e.
there is no Unicode mapping for the text)?


You can check the value returned by font.MapToUnicode(...). If the
function returns false, the Unicode mapping failed and the font is most
likely using a custom encoding (or might have a corrupt ToUnicode CMap).

Also, the font is likely to use a custom encoding if font.MapToUnicode()
maps to a Unicode value such as 0, 0xFFFD (the replacement character),
or a value that lies in one of the Unicode Private Use Areas (for
example, U+E000 to U+F8FF).

Unicode uni[1];
int uni_sz;
bool unicode_available = font.MapToUnicode(char_code, uni, uni_sz, 1);
if (unicode_available) { // check if the Unicode value is reasonable.
  if (uni[0] == 0 || uni[0] == 0xFFFD ||
      IsInPrivateUnicodeRange(uni[0])) {
        ... most likely custom or corrupt encoding ...
  }
}
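
IsInPrivateUnicodeRange is not shown in the snippet above, so here is one possible way to write it. This is only a sketch: the helper and the LooksLikeCustomEncoding wrapper are illustrative names, not part of the PDFNet API. The three ranges are the Private Use Areas defined by the Unicode Standard. If the library's Unicode type is a 16-bit code unit, a single value can only ever fall into the first range.

```cpp
#include <cstdint>

// Illustrative helper (not a library function): true if cp lies in one of
// the three Unicode Private Use Areas.
bool IsInPrivateUnicodeRange(std::uint32_t cp) {
    return (cp >= 0xE000 && cp <= 0xF8FF)       // Private Use Area (BMP)
        || (cp >= 0xF0000 && cp <= 0xFFFFD)     // Supplementary Private Use Area-A
        || (cp >= 0x100000 && cp <= 0x10FFFD);  // Supplementary Private Use Area-B
}

// Illustrative wrapper for the three checks described above: a mapped value
// of 0, U+FFFD, or a private-use code point suggests a custom or corrupt
// encoding.
bool LooksLikeCustomEncoding(std::uint32_t cp) {
    return cp == 0 || cp == 0xFFFD || IsInPrivateUnicodeRange(cp);
}
```

A result of true is a hint rather than proof; some legitimate documents do map symbols into private-use code points.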

Some APIs only have a simpler version of MapToUnicode that returns a UString object. In these cases you can do the following.

UString ustring = font.MapToUnicode(code);
if (!ustring.Empty()) {
  pdftron::Unicode firstChar = ustring.GetAt(0);
  if (firstChar >= 0xE000 && firstChar <= 0xF8FF) {
    // character is in the Private Use Area.
  }
}
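
The check on the first character only covers the Private Use Area in the Basic Multilingual Plane. If the returned string holds UTF-16 code units, private-use characters in planes 15 and 16 show up as surrogate pairs whose first unit lies in 0xDB80 to 0xDBFF. The sketch below uses std::u16string as a stand-in for the library's string type, and StartsWithPrivateUse is an illustrative name, not an API function.

```cpp
#include <string>

// Illustrative helper: true if a UTF-16 string starts with a private-use
// character, judged from its raw code units.
bool StartsWithPrivateUse(const std::u16string& s) {
    if (s.empty()) return false;
    const char16_t first = s[0];
    // Private Use Area in the Basic Multilingual Plane.
    if (first >= 0xE000 && first <= 0xF8FF) return true;
    // High surrogates 0xDB80..0xDBFF lead pairs that encode planes 15 and 16
    // (this also admits the last two code points of each plane, which are
    // noncharacters rather than private use).
    if (first >= 0xDB80 && first <= 0xDBFF) {
        return s.size() > 1 && s[1] >= 0xDC00 && s[1] <= 0xDFFF;
    }
    return false;
}
```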