Create a text index to search through a large number of PDF documents

Ivanho · October 22, 2014, 7:29pm

Q:

We are using the PDFTron SDK to create a Windows 8.1 Store app.

We need to search a large number of PDF documents for some text and return the documents that contain it. We would also like to get the number of occurrences of the text inside each PDF.

To do this quickly, we think we need to build an index from the PDF contents. The PDFNet SDK homepage says that it supports "PDF Text extraction and indexing". I also checked the API documentation for the TextExtractor class, which says it can be used to create an index from large PDF contents, but there is no indication of how to do it. I also ran the TextExtract sample project but could not find anything about indexing.

Could you please give some directions/examples of how one can create a searchable index from pdf contents? Or is there a way to search quickly without indexing?

A:

The TextExtractor class can retrieve the text from a PDF, including the information required to populate a full-text indexing engine like Lucene. We do not provide an indexing engine ourselves, since there are many full-text indexing solutions out there and we want our users to have the flexibility to choose the best indexer for their needs (and budget).

Once the text of each page is indexed, you can use the engine to retrieve page hits for a given text search query. If you then run a (relatively quick) single-page text search using TextSearch on each hit page, it gives you positioning information for each matching search result on that page, so that you can draw highlights over the matching text.

Since you are in the .NET world, you could use Lucene.NET (http://lucenenet.apache.org/), SolrNet (https://github.com/mausch/SolrNet), Microsoft SQL Server (http://technet.microsoft.com/en-us/library/cc879306(v=sql.110).aspx), etc.
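
To make the index-then-search flow above more concrete, here is a minimal sketch of a per-page inverted index that also yields occurrence counts per document. It is written in Python purely for illustration and is not PDFNet or Lucene code: the `PageIndex` class, its methods, and the sample file names are all hypothetical, and it assumes the text of each page has already been extracted (for example with TextExtractor). A real engine such as Lucene adds stemming, phrase queries, ranking, and on-disk storage on top of this idea.

```python
import re
from collections import Counter, defaultdict

# Assumption: a "term" is a run of word characters, compared case-insensitively.
TOKEN = re.compile(r"\w+")


def tokenize(text):
    """Split page text into lowercase single-word terms."""
    return TOKEN.findall(text.lower())


class PageIndex:
    """Hypothetical in-memory inverted index: term -> {(doc, page): count}."""

    def __init__(self):
        self.postings = defaultdict(Counter)

    def add_page(self, doc, page, text):
        """Index the already-extracted text of one page of one document."""
        for term in tokenize(text):
            self.postings[term][(doc, page)] += 1

    def search(self, term):
        """Return {doc: {"total": n, "pages": {page: n}}} for a single-word term."""
        results = {}
        hits = self.postings.get(term.lower(), {})
        for (doc, page), count in hits.items():
            entry = results.setdefault(doc, {"total": 0, "pages": {}})
            entry["total"] += count
            entry["pages"][page] = count
        return results


# Sample data standing in for text extracted from real PDFs.
index = PageIndex()
index.add_page("a.pdf", 1, "Invoice total. Invoice date.")
index.add_page("a.pdf", 2, "Nothing relevant on this page.")
index.add_page("b.pdf", 1, "Second invoice.")

hits = index.search("invoice")
for doc, info in sorted(hits.items()):
    print(doc, info["total"], sorted(info["pages"]))
```

The `pages` entry for each document is the list of page hits you would then hand to a single-page TextSearch to get highlight positions, while `total` answers the "number of occurrences per PDF" part of the question.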