Issues with text extraction

Djonatah · March 10, 2011, 7:40pm

Hi,

We developed a tool for text extraction, and it has been working very well.
But today we tried to get some text from a PDF document and it's
returning garbage (this also happens if we try to copy/paste from
Acrobat Reader). It seems the PDF is using a custom encoding. Is there
any other way to retrieve the Unicode text?

        // Only text elements carry character data.
        if ( element.getType() == pdftron.PDF.Element.e_text ) {
                GState gstate = element.getGState();
                Font font = gstate.getFont();
                String tempResult = "";
                long char_code = 0;
                CharIterator itr = element.getCharIterator();
                while( itr.hasNext() ){
                        CharData data=( CharData )( itr.next() );
                        char_code = data.getCharCode();
                        // Map the font-specific character code to Unicode.
                        char[] temp = font.mapToUnicode( char_code );
                        tempResult = tempResult + String.valueOf( temp );
                }
        }

Thanks