Creating a searchable PDF from an image-only PDF.

Aaron_Gravesdale · March 5, 2007, 10:34pm

Q:

We need to take non-searchable (image-only, no text) PDFs and create
searchable PDFs. We have used the PDFNet SDK to get an image of each
page and then run OCR on it to extract the text.

We now need to rebuild the PDF with the original image and underlying
text (from OCR). Is this possible using PDFNet?
-----

A:

Yes. Using PDFNet SDK you can 'rebuild' the PDF, i.e. add hidden text
to the original PDF document that contains the scanned images.
Basically, you open the original document and add the text recognized
via OCR as hidden text. The new content can be added either as a top
layer or as a background layer (usually you would write to the top
layer with the text rendering mode set to e_invisible_text; see below).
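
For the hidden text to line up with the scanned image, each OCR result
has to be placed in PDF page coordinates. The following is a minimal
sketch of that conversion in plain Python (it is not PDFNet code). It
assumes the OCR engine reports top-left pixel boxes, that the page image
covers the whole page, and that the page is unrotated with its origin at
the bottom-left corner. OcrWord and pixel_box_to_pdf are hypothetical
names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """A word reported by a (hypothetical) OCR engine, in image pixels."""
    text: str
    left: int    # pixels from the left edge of the page image
    top: int     # pixels from the top edge of the page image
    width: int
    height: int


def pixel_box_to_pdf(word, dpi, page_height_pt):
    """Return (x, y, font_size) in PDF points for placing the word.

    PDF user space uses 1/72 inch units with the origin at the bottom-left,
    while the OCR box is measured in pixels from the top-left.
    """
    x = word.left * 72.0 / dpi
    # Use the bottom edge of the box as an approximate baseline and flip
    # the y axis so it is measured from the bottom of the page.
    y = page_height_pt - (word.top + word.height) * 72.0 / dpi
    # Approximate the font size by the height of the box.
    font_size = word.height * 72.0 / dpi
    return x, y, font_size


word = OcrWord("Invoice", left=300, top=150, width=420, height=60)
print(pixel_box_to_pdf(word, dpi=300, page_height_pt=792.0))
```

The resulting x and y would be used as the translation part of the text
matrix when the hidden text element is written to the page.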

You can use ElementWriter.Begin(page) to start writing to a page and
ElementBuilder to create the new text and graphics. Please see the
ElementBuilder sample project
(http://www.pdftron.com/net/samplecode.html#ElementBuilder)
for an example of how to use ElementBuilder and ElementWriter.

The following FAQ entries are also relevant:
http://www.pdftron.com/net/faq.html#how_watermark
http://www.pdftron.com/net/faq.html#searchable_images

To make invisible text that can still be highlighted or searched,
you need to set the text rendering mode in the graphics state of the
text element, i.e.
element.GetGState().SetTextRenderMode(GState.TextRenderingMode.e_invisible_text).
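
At the PDF content-stream level, invisible text corresponds to text
rendering mode 3 (the `3 Tr` operator: neither fill nor stroke the
glyphs). The sketch below, in plain Python rather than PDFNet, builds
the operators for one hidden text run to show what ends up on the page.
It assumes ASCII text and a simple font; invisible_text_ops and the font
resource name F1 are hypothetical and used only for illustration.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def invisible_text_ops(text, x, y, font_size, font_name="F1"):
    """Build content-stream operators that place `text` invisibly at (x, y).

    `font_name` is a hypothetical resource name; it must match a font
    entry in the page's /Resources dictionary.
    """
    return "\n".join([
        "BT",                               # begin text object
        f"/{font_name} {font_size:g} Tf",   # select font and size
        "3 Tr",                             # rendering mode 3: invisible
        f"1 0 0 1 {x:g} {y:g} Tm",          # text matrix: translate to (x, y)
        f"({escape_pdf_string(text)}) Tj",  # show the string
        "ET",                               # end text object
    ])


print(invisible_text_ops("Invoice (copy)", 72, 741.6, 14.4))
```

Because the glyphs are still part of the page content, viewers can
select and search the text even though nothing is painted over the
scanned image.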