The relationship between char_code in CharIterator and Unicode values

Question:

We’ve been using pdftron to read a PDF file and generate a Windows
Metafile with great success! Pdftron has been very helpful, enabling
us to replace a tired old Delphi program that never quite worked
right. Kudos.
However, we’ve got a bit of a problem. Perhaps I’m not using pdftron
right, or perhaps there’s a bug in pdftron. Most likely the former.
Here’s what I’m seeing.

The attached PDF repeats the same text in several distinct typefaces.
The code retrieves the typeface name via
pdftron::PDF::Font::GetEmbeddedFontName(). Sometimes this returns the
empty string. Here’s where it gets interesting.

    string faceName = font.GetEmbeddedFontName();
    if (faceName.empty())
    {
        faceName = font.GetName();
        if (faceName.find("Symbol") != string::npos)
        {
            bool gotcha = true;
        }
    }

When GetEmbeddedFontName() returns the empty string and when GetName()
returns a string that includes “Symbol” (or “Wingdings”) I see
character substitution. Instead of getting expected characters, I get
different ones coming from the character iterator.

    for (pdftron::PDF::CharIterator i = element.GetCharIterator();
         i.HasNext(); i.Next())
    {
        charCode = i.Current().char_code;
        // ... (rest of the loop body omitted from this excerpt)
    }

The attached PDF was created by Chad Werdon, a CSi Quality Engineer.
He repeats the same text in distinct typefaces, yet when I examine the
pdftron::PDF::CharIterator::char_code in the context excerpted above,
I get the following:
38 “&” (expected “C”)
75 68 “KD” (expected “ha”)
71 3 “G” followed by control character 3 (expected “d ”)
58 “:” (expected “W”)
72 “H” (expected “e”)
85 “U” (expected “r”)
71 82 “GR” (expected “do”)
81 “Q” (expected “n”)
It appears that a monoalphabetic substitution has been applied to the
values returned from char_code: each value is exactly 29 less than the
ASCII code of the expected character (for example, 75 + 29 = 104, “h”).
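
The offset can be checked mechanically. The following is only a minimal
sketch: MatchesWithOffset and kObservedCodes are names made up for this
illustration (they are not part of pdftron), and the expected text is
assumed to be the author name "Chad Werdon" mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a pdftron API): returns true when every observed
// char_code plus `offset` equals the corresponding byte of the expected text.
bool MatchesWithOffset(const std::vector<int>& observed,
                       const std::string& expected,
                       int offset) {
    assert(observed.size() == expected.size());  // one code per character
    for (std::size_t i = 0; i < observed.size(); ++i) {
        const int want = static_cast<unsigned char>(expected[i]);
        if (observed[i] + offset != want) return false;
    }
    return true;
}

// The char_code values listed above, in reading order.
const std::vector<int> kObservedCodes = {38, 75, 68, 71, 3, 58,
                                         72, 85, 71, 82, 81};
```

Calling MatchesWithOffset(kObservedCodes, "Chad Werdon", 29) returns true,
while any other offset returns false, which is consistent with a single
constant shift rather than an arbitrary substitution table.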

Can you explain what is going on? Is my analysis essentially correct?
Is there an offset that is sometimes in play and must then be applied
to char_code? What criteria bring the offset into play? Where is the
offset value, in this case 29, stored?


Answer:

The char_code in CharIterator does not need to have any relation to its
ASCII or Unicode value.

If you need the Unicode value for a given char_code, you can use
font.MapToUnicode(char_code,…) (for example,
element.GetGState().GetFont().MapToUnicode(itr.Current().char_code,…)).

Internally, GetTextString() also uses MapToUnicode to translate the
char_codes in the current text run to a Unicode string (if possible).
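
To illustrate why a char_code is only meaningful together with its font,
here is a toy model. It is a sketch under stated assumptions, not PDFNet
code: ToyFont, DecodeRun and MakeShiftedFont are hypothetical names, the
ToyFont::MapToUnicode signature is invented for the example and does not
reproduce the real Font::MapToUnicode arguments, and the "shifted by 29"
table simply mimics the values reported in the question.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a font. This is NOT the PDFNet Font class.
struct ToyFont {
    std::map<std::uint32_t, char32_t> to_unicode;  // char_code -> code point

    // Returns false when the font has no mapping for char_code.
    bool MapToUnicode(std::uint32_t char_code, char32_t& out) const {
        const auto it = to_unicode.find(char_code);
        if (it == to_unicode.end()) return false;
        out = it->second;
        return true;
    }
};

// Decodes a run of char codes; unmapped codes become U+FFFD.
std::u32string DecodeRun(const ToyFont& font,
                         const std::vector<std::uint32_t>& codes) {
    std::u32string text;
    for (std::uint32_t code : codes) {
        char32_t cp = U'\uFFFD';
        font.MapToUnicode(code, cp);  // cp stays U+FFFD if the lookup fails
        text.push_back(cp);
    }
    return text;
}

// Builds a toy font whose codes are the printable ASCII code points minus
// `offset`, mimicking the shift observed in the question.
ToyFont MakeShiftedFont(std::uint32_t offset) {
    assert(offset <= 0x20);  // keep cp - offset from wrapping below zero
    ToyFont font;
    for (std::uint32_t cp = 0x20; cp <= 0x7E; ++cp) {
        font.to_unicode[cp - offset] = static_cast<char32_t>(cp);
    }
    return font;
}
```

With MakeShiftedFont(29), DecodeRun maps the codes 38 75 68 71 3 to
"Chad ". The point of the sketch is that the table belongs to the font:
a different font may use a different table, or none that looks like a
constant shift, so the lookup should go through the font rather than
through a hard-coded offset.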