How do I prevent a user from searching text or extracting text from PDF?

Aaron_Gravesdale · October 8, 2008, 12:48am

Q: What is the best way to disable text searching within a PDF?
i.e. some way to ensure that the user cannot access the text search
facility, or that searches return no results?
----
A: Unfortunately, the standard PDF security handler does not have an
option to disable the text search feature in a PDF viewer. The closest
option (pdftron.PDF.SecurityHandler.Permission.e_extract_content) will
disable content extraction (including copy & paste), but will not
disable text search.

One possible way to prevent text search in PDF documents is to use a
custom security handler and a custom PDF viewer. Essentially you would
encrypt all PDFs and store some custom data and permissions in the
document. The main disadvantage of this approach is that your users
would need to use a custom PDF viewer (e.g. based on PDFView -
http://www.pdftron.com/net/samplecode.html#PDFView) instead of
Acrobat.

Another possibility is to convert PDF text to paths (or to a raster
image, using pdftron.PDF.PDFDraw).
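
The cost of the raster route is file size, which grows with the square
of the resolution. The following is only a back-of-the-envelope sketch,
not PDFNet code: it assumes a US Letter page (8.5 x 11 in) stored as
uncompressed 24-bit RGB, and raster_page_bytes is a hypothetical helper
name. Real files will be smaller once image compression is applied.

```python
def raster_page_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Uncompressed size in bytes of one page rendered at the given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# Assumed page size: US Letter, 8.5 x 11 inches.
for dpi in (72, 150, 300):
    size = raster_page_bytes(8.5, 11, dpi)
    print(f"{dpi} DPI: {size / 1_000_000:.1f} MB uncompressed")
```

Doubling the DPI quadruples the pixel count, so a resolution high
enough to look sharp under magnification quickly becomes expensive.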

Yet another approach is to scramble the encoding in embedded PDF fonts
so that PDF text extraction tools (such as pdftron.PDF.TextExtractor)
will return garbage. The main disadvantage of this approach is that it
requires some expertise with PDF fonts.

Aaron_Gravesdale · November 21, 2008, 1:21am

Q: We are still pursuing the issue of turning off text-searching
within PDF files.

This was the question and answer we received from your team some time
back.

‘What is the best way to disable text searching within a PDF?’
Forum reference:
http://groups.google.com/group/pdfnet-sdk/browse_thread/thread/96e5909b28510d3c/89b9bdeb1a08b5cf?lnk=gst&q=pdf+text+search#89b9bdeb1a08b5cf

Well, we have successfully rasterized the pages as images. I
was impressed with the ease with which PDFNet let us do this!
Unfortunately, rendering the output at a reasonable quality (DPI)
makes the files too large, and we really want to retain the excellent
text quality under magnification that we already have.

So we are once again investigating the font encoding options as
referred to in your answer. I would really be grateful for some
pointers as to how to make this work.
If we scramble the font – are we not scrambling the text as well? In
that case if we substitute the codes for the fonts, would we have to
make the same changes to the text? Would the spacing on proportional
fonts then be a problem? Some tips or links on this area would be very
helpful!!

A: You would need to scramble the font encoding. You would also need
to replace the text data with re-encoded text. The PDF will still
look exactly the same as before, but text extraction or copy &
paste would produce junk text. The only way to obtain text from this
type of scrambled PDF is to perform OCR on the rasterized pages.
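
The idea can be illustrated without any PDF library. The sketch below
is a toy model, not PDFNet code: it assumes a simple font with
single-byte character codes and no other source of text information
(for example, no ToUnicode map) in the file. The helper names
make_scramble_map, reencode, render and naive_extract are hypothetical.

```python
import random


def make_scramble_map(used_codes, seed=0):
    """Map each used character code to a new code.

    The result is a permutation of the used codes, so every glyph keeps
    exactly one code and no two glyphs collide.
    """
    originals = sorted(set(used_codes))
    shuffled = originals[:]
    rng = random.Random(seed)
    # Shuffle until at least one code actually moves.
    while len(originals) > 1 and shuffled == originals:
        rng.shuffle(shuffled)
    return dict(zip(originals, shuffled))


def reencode(text_bytes, code_map):
    """Rewrite a text string operand so it uses the scrambled codes."""
    return bytes(code_map[b] for b in text_bytes)


def render(text_bytes, code_to_glyph):
    """What a viewer draws: the glyph assigned to each code."""
    return [code_to_glyph[b] for b in text_bytes]


def naive_extract(text_bytes):
    """What an extractor sees if it trusts the codes as Latin-1 text."""
    return text_bytes.decode("latin-1")


original = b"Hello PDF"
# Original font: code N draws the glyph for character N.
old_font = {b: chr(b) for b in set(original)}

code_map = make_scramble_map(original, seed=1)
# Scrambled font: each glyph moves to its new code.
new_font = {code_map[b]: glyph for b, glyph in old_font.items()}
scrambled = reencode(original, code_map)

# The page looks the same: the same glyphs are drawn in the same order.
print(render(scrambled, new_font) == render(original, old_font))
# Extraction that trusts the codes no longer yields the original text.
print(naive_extract(scrambled))
```

Note that a plain permutation like this is a simple substitution
cipher, so it deters casual search and copy & paste rather than a
determined attacker.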

‘In that case if we substitute the codes for the fonts, would we have
to make the same changes to the text?’

Correct, you would most likely need to update the text as well. This
can be implemented along the lines of the EditText sample project
(http://www.pdftron.com/net/samplecode.html#EditText).
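
On the spacing question: in a simple PDF font the advance widths are
stored per character code (the Widths array, indexed from FirstChar),
so spacing on proportional fonts should be unaffected as long as each
width moves to the new code together with its glyph. A minimal sketch
of that bookkeeping follows. It is not PDFNet code; scramble_widths and
line_advance are hypothetical helper names, and the width values are
made up for illustration.

```python
def scramble_widths(first_char, widths, code_map):
    """Permute a Widths-style array so each width follows its glyph.

    widths[i] is the advance width for character code first_char + i.
    code_map maps old code -> new code and must be a permutation of the
    codes covered by the array.
    """
    new_widths = [0] * len(widths)
    for i, width in enumerate(widths):
        new_code = code_map[first_char + i]
        new_widths[new_code - first_char] = width
    return new_widths


def line_advance(codes, first_char, widths):
    """Total horizontal advance of a run of character codes."""
    return sum(widths[c - first_char] for c in codes)


first_char = 65
widths = [722, 667, 722, 278]  # illustrative widths for codes 65..68
code_map = {65: 67, 66: 68, 67: 66, 68: 65}

old_text = [65, 68, 66, 67]
new_text = [code_map[c] for c in old_text]
new_widths = scramble_widths(first_char, widths, code_map)

# Both runs advance by the same amount, glyph by glyph.
print(line_advance(old_text, first_char, widths))
print(line_advance(new_text, first_char, new_widths))
```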

One approach to implementing this font scrambler would be to extract
the glyph outline for each referenced glyph using
pdftron.PDF.Font.GetGlyphPath(). These outlines can be used to
construct a new font (with scrambled encoding). Probably the simplest
approach would be to dynamically build a Type 3 font (i.e. a font whose
glyphs are defined by PDF content streams) using the PDFNet API (i.e.
using ElementBuilder, ElementWriter, and the SDF API). The other
option would be to rebuild a TTF or Type 1 font, but this is probably
much more work.
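
To make the target concrete, the sketch below prints the kind of Type 3
font dictionary such a tool would need to produce. It only generates
PDF syntax as text and does not use PDFNet. The function name
type3_font_dict, the glyph names g0, g1, ... and the object numbers are
all hypothetical. The glyph names are deliberately meaningless so that
they give an extractor no hint about the characters they draw.

```python
def type3_font_dict(glyphs, bbox=(0, 0, 1000, 1000)):
    """Return PDF syntax for a Type 3 font dictionary.

    glyphs: list of (code, width, obj_num) tuples, where code is the
    scrambled character code, width is the advance in glyph space, and
    obj_num is the (assumed) object number of the content stream that
    draws the glyph outline.
    """
    glyphs = sorted(glyphs)
    first, last = glyphs[0][0], glyphs[-1][0]
    names = {code: f"g{i}" for i, (code, _, _) in enumerate(glyphs)}
    width_of = {code: width for code, width, _ in glyphs}

    differences = " ".join(f"{code} /{names[code]}" for code, _, _ in glyphs)
    char_procs = " ".join(f"/{names[code]} {obj} 0 R" for code, _, obj in glyphs)
    # Widths must cover every code from FirstChar to LastChar; unused codes get 0.
    widths = " ".join(str(width_of.get(c, 0)) for c in range(first, last + 1))

    return "\n".join([
        "<< /Type /Font /Subtype /Type3",
        f"   /FontBBox [{' '.join(str(v) for v in bbox)}]",
        "   /FontMatrix [0.001 0 0 0.001 0 0]",
        f"   /FirstChar {first} /LastChar {last}",
        f"   /Widths [{widths}]",
        f"   /Encoding << /Type /Encoding /Differences [{differences}] >>",
        f"   /CharProcs << {char_procs} >>",
        ">>",
    ])


font_dict = type3_font_dict([(66, 722, 12), (65, 278, 11), (68, 667, 13)])
print(font_dict)
```

Each CharProcs entry would point to a stream containing the path
operators for one extracted outline, and the page text would be
rewritten to use the codes listed in the Differences array.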