Q: I am using TextExtractor/pdf2text and its output contains garbled, unreadable text. What does that mean?
A: Some PDF files have garbled encoding (built-in or in the PDF font dictionary) and others have an incorrect ToUnicode mapping. In general, there is no ‘perfect’ solution, and trusting either the encoding or ToUnicode can be error-prone. In v5.9.2, based on requests from some users (asking for text output that is more consistent with Acrobat), we switched to using the font encoding first during Unicode mapping. The downside is that some files that PDFNet had previously processed without ‘issues’ became garbled. Since v18.104.22.168 we have made further progress and can now extract correct text from even more documents (without running OCR). Unfortunately, not all documents can be recovered.
Q: Is there anything else I can do to extract garbled text?
A: If you simply want to extract textual data from the document, you can integrate the pdf2image tool or the PDFDraw class with any OCR solution.
If you want to recreate a document with correct text information, you can integrate PDFNet with any OCR output (e.g. Tesseract, ABBYY, etc.) by creating ‘Searchable PDF Images’ from the scanned PDF.
Q: Can I automatically detect documents with missing Unicode mapping?
A: Unfortunately, there is no simple PDF property that can be checked to identify which files are garbled.
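Since no PDF property flags these files, one possible workaround is to inspect the extracted text itself. The following is only a rough sketch and not a PDFNet feature: the function names `suspicious_ratio` and `looks_garbled` and the 10% threshold are assumptions made for illustration. It counts characters that rarely appear in legitimate text (the U+FFFD replacement character, private-use and unassigned code points, and control characters) and flags the text when they exceed the threshold.

```python
import unicodedata

# Unicode general categories that rarely occur in legitimate extracted text:
# Co = private use, Cn = unassigned, Cs = surrogate, Cc = control.
# (Hypothetical choice for this sketch; tune it for your documents.)
SUSPICIOUS_CATEGORIES = {"Co", "Cn", "Cs", "Cc"}


def suspicious_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look suspicious."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    bad = sum(
        1
        for ch in visible
        if ch == "\ufffd" or unicodedata.category(ch) in SUSPICIOUS_CATEGORIES
    )
    return bad / len(visible)


def looks_garbled(text: str, threshold: float = 0.1) -> bool:
    """Flag text whose suspicious-character ratio reaches the assumed threshold."""
    return suspicious_ratio(text) >= threshold


if __name__ == "__main__":
    # Text would normally come from TextExtractor or pdf2text output.
    print(looks_garbled("Plain readable sentence."))
    print(looks_garbled("\ufffd\ufffd\ue001\x02 ab"))
```

Pages flagged this way could then be routed to the OCR path described above. Note that this heuristic cannot catch files whose glyphs map to valid but wrong letters, so it will miss some garbled documents.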