Extracting text from PDF as ASCII instead of Unicode [C++ specific]

Q: When trying to extract text from a PDF, the PDFNet SDK extracts Unicode
text. That would be fine if I could just convert it to ASCII later using
one of UString's member functions.

The string converted to ASCII contains escaped Unicode characters. The
document contains lots of characters in Brazilian Portuguese, which seems
to be the problem.

Please let me know if I'm doing something wrong and whether there is
support for actual ASCII text.

Here is an example of the output I'm getting:

DI\U00C1RIO ELETR\U00D4NICO DA JUSTI\U00C7A FEDERAL DA 3\U00AA REGI
\U00C3O Edi\U00E7\U00E3o n\U00BA 41/2009 \U2013 S\U00E3o Paulo

which should be:

DIÁRIO ELETRÔNICO DA JUSTIÇA FEDERAL DA 3ª REGIÃO Edição nº 41/2009 –
São Paulo

One detail: that escaped string is what was returned by this piece of
code:

// First call sizes the buffer; second call fills it.
int size = text.ConvertToAscii(NULL, 0, true);
char *szText = (char *) malloc(size + 1);
text.ConvertToAscii(szText, size + 1, true);

I also tried szText = text.ConvertToAscii(); that got me exactly the same
result.
-----
A: The problem is that the actual (official) ASCII character set does not
include the extended set (character codes above 127), so PDFNet seems to
be doing the right thing. For example, Á is U+00C1 (decimal 193), which is
outside the 7-bit range, so ConvertToAscii escapes it as \U00C1.

You may want to convert the string to UTF-8 (which is ASCII-compatible for
code points below 128 and is the usual encoding on Linux systems) using
text.ConvertToUtf8().
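
A minimal sketch of that route (assuming text is the UString returned by
the extractor and the console expects UTF-8; check your SDK headers for
the exact method name):

#include <string>  // std::string
#include <cstdio>  // printf

std::string utf8 = text.ConvertToUtf8(); // ASCII bytes pass through unchanged
printf("%s\n", utf8.c_str());            // prints Á, Ô, Ç, etc. on a UTF-8 terminal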
Alternatively, you can access the Unicode characters one by one and take
the low byte, which covers the extended characters in the Latin-1 range
(code points up to 255):

for (int i = 0; i < text.GetLength(); ++i) my_str += char(text.GetAt(i));
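
A fleshed-out version of that loop (a sketch: my_str is assumed to be a
std::string, and anything above the Latin-1 range is substituted rather
than silently mangled):

#include <string>

std::string my_str;
for (int i = 0; i < text.GetLength(); ++i) {
    unsigned int c = text.GetAt(i);   // Unicode code unit at position i
    // Only code points up to 0xFF map to a single Latin-1 byte; e.g. the
    // U+2013 dash in your sample output would otherwise come out wrong.
    my_str += (c <= 0xFF) ? char(c) : '?';
}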

Please let me know if this helps.