Including additional substitute fonts (Extracting text from PDFs with CID fonts)

Martin · August 18, 2011, 8:34pm

Hi.

Another thread [1] talked about how to enable PDFNet to extract text
from PDFs with CID fonts.
The answer included that "You could include additional substitute
fonts as part of your application."

How do I accomplish that?

We use PDFNet for Java in a Linux environment.
Thanks for any pointers!

[1] https://groups.google.com/d/msg/pdfnet-sdk/zS7axyf0dtE/yEc2ax90A3MJ

Aaron_Gravesdale · August 18, 2011, 8:56pm

Hello,

Could you please clarify what you are trying to accomplish? Thank you!

The article (http://groups.google.com/group/pdfnet-sdk/t/cd2edac727f476d1)
talks about how to provide font substitutes when
fonts are not embedded. That discussion relates only to PDF
rendering (conversion to image) and should not affect text extraction
(which works the same way whether or not the font is embedded).

Martin · August 19, 2011, 6:39am

Thanks for replying and for explaining what I misunderstood about the other thread.

We use the PDFNet library to extract text from PDFs. Now we have encountered a type of PDF that we cannot extract anything from. My suspicion is that the fonts are the problem, since these PDFs contain CID fonts with the encoding “Identity-H”.
I have attached a pdf that PDFNet - with our current setup - fails to extract text from.

Thanks for any comments!

/Martin

text_not_extractable.pdf (23.3 KB)

Aaron_Gravesdale · August 22, 2011, 7:41pm

We looked at the file and it seems that the problem is due to a corrupt ToUnicode encoding. Text extraction using Acrobat Pro and other PDF consumers also results in garbled output. There is no simple, generic solution for this type of file. A possible workaround would be to use OCR.
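
If OCR is used as a fallback, something has to decide which files need it. Below is a minimal sketch of one possible heuristic in plain Java. It is not part of PDFNet, and the class name, method names, and the 0.3 threshold are all assumptions made for illustration. It measures how much of the extracted text consists of characters that typically show up when a character code has no usable Unicode mapping (U+FFFD, control characters, private-use or unassigned code points). Note that a corrupt ToUnicode map can also produce valid-looking but wrong characters, which this check would not catch.

```java
// Hypothetical helper, not a PDFNet API: flags extracted text that
// looks like the result of a failed code-to-Unicode mapping.
final class GarbledTextCheck {

    /** Fraction of non-whitespace code points that look like mapping failures. */
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // whitespace says nothing about mapping quality
            }
            total++;
            int type = Character.getType(cp);
            if (cp == 0xFFFD                          // replacement character
                    || type == Character.CONTROL      // stray control codes
                    || type == Character.PRIVATE_USE  // private-use area
                    || type == Character.SURROGATE    // unpaired surrogate
                    || type == Character.UNASSIGNED) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    /** True if the suspicious share exceeds the (assumed) threshold. */
    static boolean looksGarbled(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }

    public static void main(String[] args) {
        String clean = "Invoice 2011-08-18, total 42 EUR";
        String broken = "\uFFFD\uFFFD\uE001\uE002\u0001\u0002 ab";
        System.out.println(looksGarbled(clean, 0.3));
        System.out.println(looksGarbled(broken, 0.3));
    }
}
```

The string passed in would be whatever the text extraction step returns for a page or document. Empty output has a ratio of 0.0 here, so a file that yields no text at all needs a separate check before deciding whether to send it to OCR.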

Martin · August 23, 2011, 6:59am

Thanks for the reply, it was very informative. Too bad there is no obvious solution…

I’ll throw out some more facts about how this file was created. Maybe someone can think of a way around the problem.

The file is created within a legacy process/lifecycle that I can control parts of, but not all. It looks like this:

1. PDF1 is created by the iText library.
2. PDF1 is opened by an end user, using Adobe Reader.
3. From within Adobe Reader, PDF1 is printed to a new file, PDF2, using virtual printer software.
4. Finally, text is extracted from PDF2 using PDFNet.

(Strange process, I know. There are reasons for it, and not only historical ones.)

The bad guy here is Adobe Reader, because I can get the entire process to work if I use another PDF reader, e.g. Nuance PDF Reader.
However, I cannot get all users to switch readers.

Q: Is there anything that can be done to PDF1 in order to help Adobe Reader print a ToUnicode encoding that PDFNet can handle? I have some control of the creation process (iText).

Thanks in advance for any hints!

/Martin

Aaron_Gravesdale · August 23, 2011, 5:48pm

The issue may be related to the virtual printer driver, iText, or some combination of the two. It is unlikely to be caused by Reader. I suggest trying a different printer driver or PDF generation library.