Macros and embedded content in PDF-to-PDF conversion

Pavi_De_Alwis · October 7, 2014, 12:48pm

If I use the SDK as below to convert from PDF to PDF, would the original document's embedded macros/files (e.g. JavaScript) be copied to the resulting PDF too?

      pdfdoc = PDFDoc.new()
      Convert.ToPdf(pdfdoc, input_file_path)
      pdfdoc.Save(output_file_path, SDFDoc::E_compatibility)

Ryan · October 7, 2014, 6:09pm

Yes, PDFNet should not remove those sorts of entries, so they will be in the output. The E_compatibility save option mainly restricts file compression.

Pavi_De_Alwis · October 9, 2014, 6:20am

Is it possible to exclude these entries from the output? In particular, the JS code?

Ivanho · October 9, 2014, 6:07pm

You could run the PDF/A converter (pdftron.PDF.PDFA.PDFACompliance, as shown in the PDF/A sample: https://www.pdftron.com/pdfnet/samplecode.html#PDFA). PDF/A conversion will automatically remove JavaScript and embedded files (for PDF/A-1 and PDF/A-2 compliance).

Alternatively, you would need to write code that strips out JavaScript, embedded files, etc.

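      // Look up the document-level name dictionary in the catalog.
      Obj names = doc.GetRoot().FindObj("Names");
      if (names != null)
      {
          // Drop the embedded-file and document-level JavaScript name trees.
          names.Erase("EmbeddedFiles");
          names.Erase("JavaScript");
      }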

You would also need to traverse all actions in the doc (e.g. associated with Annotations) removing any JavaScript actions etc.
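To give a rough idea of what that traversal involves, here is a minimal sketch in plain Ruby. It does not use the PDFNet API at all: the PDF object tree is modelled as nested Hashes and Arrays, and the helper names (javascript_action?, strip_name_trees!, strip_javascript!) are hypothetical. It assumes a JavaScript action is any dictionary whose S entry is JavaScript, and it removes the whole action, including anything chained after it via Next.

```ruby
# Stand-in model only: PDF dictionaries are Ruby Hashes, PDF arrays are Ruby
# Arrays. None of these helpers are PDFNet calls; the names are hypothetical.

# True when obj looks like an action dictionary of type JavaScript.
def javascript_action?(obj)
  obj.is_a?(Hash) && obj["S"] == "JavaScript"
end

# Mirror of the Names erase step: drop the two document-level name trees.
def strip_name_trees!(catalog)
  names = catalog["Names"]
  return unless names.is_a?(Hash)
  names.delete("EmbeddedFiles")
  names.delete("JavaScript")
end

# Walk the whole tree and delete every JavaScript action in place.
# `seen` tracks visited containers by identity so back-references such as
# Parent do not cause infinite recursion. Returns the number removed.
def strip_javascript!(obj, seen = {}.compare_by_identity)
  return 0 unless obj.is_a?(Hash) || obj.is_a?(Array)
  return 0 if seen.key?(obj)
  seen[obj] = true

  removed = 0
  if obj.is_a?(Hash)
    obj.keys.each do |key|
      if javascript_action?(obj[key])
        obj.delete(key)
        removed += 1
      else
        removed += strip_javascript!(obj[key], seen)
      end
    end
  else
    before = obj.length
    obj.reject! { |item| javascript_action?(item) }
    removed += before - obj.length
    obj.each { |item| removed += strip_javascript!(item, seen) }
  end
  removed
end

# Made-up sample catalog with an OpenAction, a link and a form widget.
catalog = {
  "Names" => {
    "EmbeddedFiles" => { "Names" => ["a.txt", {}] },
    "JavaScript" => { "Names" => ["init", { "S" => "JavaScript", "JS" => "app.alert(1)" }] },
    "Dests" => {}
  },
  "OpenAction" => { "S" => "JavaScript", "JS" => "app.alert('hi')" },
  "Pages" => [
    { "Annots" => [
      { "Subtype" => "Link", "A" => { "S" => "URI", "URI" => "https://example.com" } },
      { "Subtype" => "Widget", "AA" => { "K" => { "S" => "JavaScript", "JS" => "check()" } } }
    ] }
  ]
}

strip_name_trees!(catalog)
removed = strip_javascript!(catalog)
puts removed  # => 2 (the OpenAction and the widget's K action)
```

In a real document the same walk would have to go through the SDK's object model and cover every place an action can live (OpenAction, annotation A and AA entries, page and catalog AA entries, outline items, form fields), which is why the full job may not be trivial.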

Depending on the full list of requirements, this may or may not be simple. Since PDF/A is designed to take care of these things, PDFACompliance would be the simplest option.

Ivanho · October 9, 2014, 6:11pm

Btw, some of our clients who do not consider PDF/A ‘secure enough’ or ‘good enough’ for archiving have used the following approach to convert a generic PDF to a raster PDF (i.e. a PDF made of page images):

https://groups.google.com/d/msg/pdfnet-sdk/5eYUsT6BPNQ/XwNri3BQkxMJ

It is also possible to make a rasterized PDF searchable by using TextExtractor and adding hidden text on top of the images.