Trying to print PDF to a virtual printer for text extraction

Aaron_Gravesdale · October 27, 2011, 1:03am

Q:

We have a problem printing a specific PDF file to the BlackIce printer driver (http://www.blackice.com/Printer%20Drivers/Tiff%20Printer%20Drivers.htm).

The driver has an option to extract text from the printed document; however, the extracted text comes out garbled or blank.

After some testing, we realised we get the same issue when printing directly from Acrobat and other viewers.

We also tried another PDF printing package, and the way it prints seems to get around this issue. It has an option to embed fonts or to use system fonts. From my limited understanding of PDF structure, I was under the impression that this means the fonts used in the PDF document can either be stored in the PDF itself or, when system fonts are used, always be looked up on the local system.

Currently our printing code looks like this. It comes mostly from your PDFPrint sample:

private void PrintPdf(MemoryStream pdfFS, int rawKind, PrinterItem pi)
{
    _pdfdoc = new PDFDoc(pdfFS);
    _pdfdoc.InitSecurityHandler();

    PrinterMode printerMode = new PrinterMode();
    printerMode.SetAutoCenter(true);
    printerMode.SetAutoRotate(true);
    printerMode.SetCollation(true);
    printerMode.SetCopyCount(1);
    // printerMode.SetUseRleImageCompression(true);
    // printerMode.SetDPI(300); // regardless of ordering, an explicit DPI setting overrides the OutputQuality setting
    printerMode.SetDuplexing(PrinterMode.DuplexMode.e_Duplex_Auto);
    printerMode.SetNUp(PrinterMode.NUp.e_NUp_1_1, PrinterMode.NUpPageOrder.e_PageOrder_LeftToRightThenTopToBottom);
    printerMode.SetOrientation(PrinterMode.Orientation.e_Orientation_Portrait);
    printerMode.SetOutputAnnot(PrinterMode.PrintContentTypes.e_PrintContent_DocumentAndAnnotations);

    // If the XPS print path is being used, then the printer spooler file will
    // ignore the grayscale option and be in full color
    printerMode.SetOutputColor(PrinterMode.OutputColor.e_OutputColor_Color);
    printerMode.SetOutputPageBorder(false);
    printerMode.SetOutputQuality(PrinterMode.OutputQuality.e_OutputQuality_High);
    printerMode.SetPaperSize(PrinterMode.PaperSize.e_a4);

    PageSet pagesToPrint = new PageSet(1, _pdfdoc.GetPageCount(), PageSet.Filter.e_all);
    Print.StartPrintJob(_pdfdoc, _Factory.PrintConfigSetting.printerName, _pdfdoc.GetFileName(), "", pagesToPrint, printerMode);
}

A: Is there a specific reason why you are taking such a convoluted route to extract text from a PDF? A far simpler and more reliable option would be to use 'pdftron.PDF.TextExtractor', as shown in the TextExtractor sample (http://www.pdftron.com/pdfnet/samplecode.html#TextExtractor).

Printing a PDF file via a printer driver is not guaranteed to maintain the Unicode encoding of the text. Printing a PDF with system fonts may preserve the Unicode encoding, but it will result in incorrect print output for many/most documents.
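
For illustration, a minimal sketch of the TextExtractor approach might look like the following. It is modelled loosely on the TextExtractor sample rather than copied from it. The class name and the "input.pdf" path are placeholders that do not come from the question, and it assumes a PDFNet build where PDFNet.Initialize() can be called without arguments (some releases expect a license key instead).

```csharp
using System;
using System.Text;
using pdftron;
using pdftron.PDF;

class ExtractTextSketch
{
    static void Main(string[] args)
    {
        // Hypothetical input path; replace with your own document.
        string inputPath = args.Length > 0 ? args[0] : "input.pdf";

        // Assumption: demo-mode initialization. Some PDFNet releases
        // expect a license key string to be passed here.
        PDFNet.Initialize();

        StringBuilder allText = new StringBuilder();

        using (PDFDoc doc = new PDFDoc(inputPath))
        {
            doc.InitSecurityHandler();

            using (TextExtractor txt = new TextExtractor())
            {
                // PDFNet page numbers are 1-based.
                for (int i = 1; i <= doc.GetPageCount(); ++i)
                {
                    Page page = doc.GetPage(i);
                    txt.Begin(page);                      // read the page content
                    allText.AppendLine(txt.GetAsText());  // text in reading order
                }
            }
        }

        Console.WriteLine(allText.ToString());
    }
}
```

Since your code already holds the document in a MemoryStream, you could presumably construct the PDFDoc from that stream, as your PrintPdf method does, instead of from a file path. This reads the text directly from the PDF, so no printer driver is involved.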