Using PDFNet to search and replace variable text in PDF (static content and not forms).

Aaron_Gravesdale · January 30, 2008, 10:32pm

Q: I understand that you may be able to help with a PDF problem of
mine. Let me give you a brief description of my process.

Our analysts create reports (PDF documents) once they finish their
research. These documents are technically complete, but they are
missing some information that is not available until the report is
reviewed by a second person. In the report, I want to create
placeholders for the missing information: "{rpt_code}" and "{draft}".

Once the report is reviewed, I want to:
* replace the "{rpt_code}" placeholder with the new report
code;
* replace all occurrences of "{draft}" with a space;
* add a password to protect the document, allowing view and print only.

The PDFs are generated via Crystal Reports export. If you look at my
samples, I will start with "samplereport0001.pdf" and end up with
"samplereport0001_done.pdf" (ignore the infix watermark on
done.pdf).
I process 100-200 reports a day, so I need a way to automate this,
e.g. by running a script from the command line.
----
A: You can implement your solution using PDFNet SDK
(www.pdftron.com/net). The process can be broken down into several
steps or stages:

1) First you would search for all occurrences of the placeholders on
the page (e.g. "{rpt_code}" and "{draft}"). There are several ways to
implement this, but probably the simplest one is to use
pdftron.PDF.TextExtractor, as illustrated in the TextExtract sample
project (http://www.pdftron.com/net/samplecode.html#TextExtract). This
step would give you the positioning information for each placeholder
on the page (i.e. the word bounding box).
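
To make the output of step 1 concrete, here is a minimal Python sketch of the matching logic only. It does not call PDFNet: the `Word` type, the sample coordinates and `find_placeholders` are hypothetical stand-ins for whatever your text extraction returns. It also assumes that each placeholder comes back as a single word, which may not hold for every PDF producer.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# (x1, y1, x2, y2) in page units; lower-left and upper-right corners.
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Word:
    """Hypothetical stand-in for one extracted word and its bounding box."""
    text: str
    bbox: BBox


def find_placeholders(words: Iterable[Word],
                      placeholders: Iterable[str]) -> List[Word]:
    """Return the words whose text equals a placeholder, ignoring case."""
    wanted = {p.lower() for p in placeholders}
    return [w for w in words if w.text.lower() in wanted]


# Made-up sample data standing in for one page of extracted words.
words = [
    Word("Report", (72.0, 700.0, 110.0, 712.0)),
    Word("{rpt_code}", (115.0, 700.0, 170.0, 712.0)),
    Word("{Draft}", (72.0, 680.0, 105.0, 692.0)),
]
hits = find_placeholders(words, ["{rpt_code}", "{draft}"])
```

The comparison ignores case so that both "{draft}" and "{Draft}" are found. The boxes in `hits` are what steps 2) and 3) would work from.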

2) In the second step you would edit the existing page (e.g. as
illustrated in the ElementEdit sample:
www.pdftron.com/net/samplecode.html#ElementEdit). You could use the
bounding boxes of the placeholders identified in 1) to detect whether
a given text run should be deleted (i.e. skipped). This step would
essentially remove all placeholders from the page.
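
The skip decision in step 2 comes down to a rectangle overlap test. The following Python sketch shows only that test; `boxes_overlap`, `should_skip` and the `tol` tolerance are hypothetical names and values, and in a real implementation the run bounding box would come from the page element being processed.

```python
from typing import Iterable, Tuple

# (x1, y1, x2, y2) in page units; lower-left and upper-right corners.
BBox = Tuple[float, float, float, float]


def boxes_overlap(a: BBox, b: BBox, tol: float = 0.5) -> bool:
    """True if a and b overlap by more than tol units in both directions."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    overlap_w = min(ax2, bx2) - max(ax1, bx1)
    overlap_h = min(ay2, by2) - max(ay1, by1)
    return overlap_w > tol and overlap_h > tol


def should_skip(run_bbox: BBox, placeholder_boxes: Iterable[BBox]) -> bool:
    """True if the text run overlaps any placeholder box and should be dropped."""
    return any(boxes_overlap(run_bbox, p) for p in placeholder_boxes)


placeholder_boxes = [(115.0, 700.0, 170.0, 712.0)]
skip_placeholder = should_skip((115.0, 700.0, 170.0, 712.0), placeholder_boxes)
skip_neighbour = should_skip((170.0, 700.0, 200.0, 712.0), placeholder_boxes)
```

The tolerance keeps a run that merely touches the edge of a placeholder box from being removed. Note that if a single text run contains a placeholder together with other text, skipping the whole run would also drop that other text, so such runs would need to be split or rewritten instead.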

3) Finally you can add new content in place of the old placeholders
(e.g. see www.pdftron.com/net/faq.html#how_watermark). For this step
you would also use the positioning information identified in 1).
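
As a rough Python sketch of how the positions from 1) could drive step 3, the function below turns the placeholder hits into a list of "draw this text here" instructions. `plan_replacements` is a hypothetical helper and "RPT-0001" is a made-up report code. Using the lower-left corner of the bounding box as the text origin is an assumption; the real baseline may sit slightly higher.

```python
from typing import Dict, List, Tuple

# (x1, y1, x2, y2) in page units; lower-left and upper-right corners.
BBox = Tuple[float, float, float, float]


def plan_replacements(hits: List[Tuple[str, BBox]],
                      values: Dict[str, str]) -> List[Tuple[str, float, float]]:
    """Map each placeholder hit to (new_text, x, y); blank replacements draw nothing."""
    plan = []
    for text, (x1, y1, _x2, _y2) in hits:
        new_text = values.get(text.lower(), "")
        if new_text.strip():
            # Assumption: start the new text at the box's lower-left corner.
            plan.append((new_text, x1, y1))
    return plan


hits = [
    ("{rpt_code}", (115.0, 700.0, 170.0, 712.0)),
    ("{Draft}", (72.0, 680.0, 105.0, 692.0)),
]
plan = plan_replacements(hits, {"{rpt_code}": "RPT-0001", "{draft}": " "})
```

Because "{draft}" is replaced with a space, it produces no drawing instruction; removing the placeholder in step 2) is enough.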

The solution can be implemented as a command-line application running
in unattended mode.
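
A command-line wrapper for unattended use could look roughly like the Python sketch below. Everything here is an assumption for illustration: the argument layout (input file, then report code), the `_done` output naming taken from the sample file names in the question, and `process_report`, which is a placeholder for the three PDF-editing steps and is not implemented.

```python
import argparse
from pathlib import Path
from typing import Callable, List


def output_path(src: Path) -> Path:
    """samplereport0001.pdf -> samplereport0001_done.pdf"""
    return src.with_name(f"{src.stem}_done{src.suffix}")


def process_report(src: Path, dst: Path, rpt_code: str) -> None:
    """Placeholder for steps 1) to 3); the actual PDF editing goes here."""
    raise NotImplementedError("PDF editing is not part of this sketch")


def main(argv: List[str],
         process: Callable[[Path, Path, str], None] = process_report) -> int:
    """Parse the arguments, derive the output name and process one report."""
    parser = argparse.ArgumentParser(
        description="Fill in report placeholders in a PDF.")
    parser.add_argument("pdf", type=Path, help="input report")
    parser.add_argument("rpt_code", help="report code to insert")
    args = parser.parse_args(argv)

    dst = output_path(args.pdf)
    process(args.pdf, dst, args.rpt_code)
    print(dst)  # report where the finished file was written
    return 0
```

In a script you would call `main(sys.argv[1:])` under the usual `if __name__ == "__main__":` guard, and a scheduler or shell loop could then invoke it once per reviewed report.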

Although the approach is conceptually simple, it can be tricky to
implement in a way that handles generic (arbitrary) PDF documents,
because PDFs can come from all kinds of producers and generators. If
you need help with the implementation, we can also assist your
development as part of a consulting or custom engineering project.