Reading Type0 font values

arunsrinivaasrs · February 25, 2015, 8:38am

Hi,

My font is SansSerifBold, a Type 0 font. In the PDFTron API documentation I see the following: for simple fonts, char_code := char_data[0]; for composite fonts, char_code is the numeric value of the data stored in the char_data buffer.

for ( pdftron::PDF::CharIterator itr = element.GetCharIterator(); itr.HasNext(); itr.Next() ) {

…
…
unsigned char * data = itr.Current().char_data; // char_code is the numeric value of data stored in char_data buffer.
}

I see that the data is NULL. When I take the value from char_code (char_code = itr.Current().char_code;), the value is incorrect.

Please advise how I can handle Type0 fonts.

Thanks!

Ryan · February 26, 2015, 11:46pm

Is the number of bytes greater than 1? That is, is itr.Current().bytes > 1?

If so, then only the first byte is zero.

Alternatively, it could be a malformed PDF that contains char code zero. One should be the lowest possible char code.
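To illustrate why a zero first byte does not mean the data is empty, here is a minimal sketch of how a multi-byte char_data buffer can be combined into one numeric char code. It assumes the bytes are read high-order byte first, which is how PDF interprets multi-byte character codes. CharCodeFromBytes is a hypothetical helper written for this example, not part of the PDFTron API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a PDFTron function): combine the raw bytes of one
// character into a single numeric code, high-order byte first.
std::uint32_t CharCodeFromBytes(const unsigned char* data, std::size_t num_bytes) {
    // Precondition for this sketch: a non-null buffer holding 1 to 4 bytes.
    assert(data != nullptr && num_bytes >= 1 && num_bytes <= 4);
    std::uint32_t code = 0;
    for (std::size_t i = 0; i < num_bytes; ++i) {
        code = (code << 8) | data[i];  // shift earlier bytes up, append the next
    }
    return code;
}
```

With a two-byte buffer such as {0x00, 0x41}, the first byte contributes nothing to the value and the result is simply 0x41. Note that a buffer whose first byte is zero also looks like an empty string if it is treated as a C string, which may be why the data appears to be NULL.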

arunsrinivaasrs · March 3, 2015, 8:35am

Yes, the number of bytes is greater than 1 (char_code = itr.Current().char_code;). The first byte is 0, and the value of the second byte does not correspond to the actual character. For example, the character code that I get is 43 (the character ‘+’), which does not appear in the PDF.

I am attaching the PDF.

T-01 TITLE SHEET3.pdf (66.8 KB)

arunsrinivaasrs · March 4, 2015, 7:53am

In short, I want to know how to handle multibyte characters.

const unsigned char* pmbcs = itr.Current().char_data;

MultiByteToWideChar(CP_ACP, MB_COMPOSITE, ((LPCSTR) pmbcs), 2, buffer, 4);

Ryan · March 5, 2015, 1:19am

Hi, we need to clarify exactly what is in the PDF, so that you can see that the output you are getting is expected.

In the PDF, the first text content is the following:

<002b0030003600270034003000230036002b003100300023002e> Tj

The data between the ‘<’ and ‘>’ is hex-encoded binary data, where each pair of hex digits is one byte. However, since the binary data is to be used as a text string, it is read in 2-byte units, each giving a char code between 0 and 65535. So the first char code is <002b>, or 43. You are therefore getting the correct output from the iterator: 2 bytes, with char_data equal to [0, 43].
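As a rough illustration of that reading, the sketch below splits the hex string from the Tj operator into 2-byte char codes. It assumes every code in the string is exactly 2 bytes (4 hex digits), as in this PDF. HexDigitValue and DecodeTwoByteCodes are hypothetical helpers written for this example; they are not PDFTron functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: numeric value of one hex digit, or -1 if invalid.
static int HexDigitValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: split a hex string (without the enclosing < >) into
// 2-byte char codes, high-order byte first. Assumes fixed 2-byte codes.
std::vector<std::uint16_t> DecodeTwoByteCodes(const std::string& hex) {
    assert(hex.size() % 4 == 0);  // 4 hex digits per 2-byte code
    std::vector<std::uint16_t> codes;
    for (std::size_t i = 0; i < hex.size(); i += 4) {
        std::uint16_t code = 0;
        for (std::size_t j = 0; j < 4; ++j) {
            const int digit = HexDigitValue(hex[i + j]);
            assert(digit >= 0);  // only valid hex digits are expected here
            code = static_cast<std::uint16_t>((code << 4) | digit);
        }
        codes.push_back(code);
    }
    return codes;
}
```

Calling DecodeTwoByteCodes("002b0030003600270034003000230036002b003100300023002e") yields 13 codes, the first of which is 0x002b, or 43, matching the value reported by the iterator.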

I think the other source of confusion is that the char code is an arbitrary number (it usually matches the Unicode value, but that is not required). The font maps the char code to the glyph, which in this case is the glyph for the uppercase Latin letter I. That mapping is of course correct; otherwise you would see garbage graphics.

However, if you open the PDF in any PDF viewer (Acrobat, Foxit, etc.), select the text, and copy and paste it into another app such as Notepad, you will get garbage text. This is because the font's Unicode map is bad.

So the confusion comes from the fact that the char codes do not match the rendered glyph or the Unicode value.

If you explain what your objective is (why you are parsing the char codes), we can help you find a solution.