How can I convert all PDF text into Unicode?

Aaron_Gravesdale · January 18, 2008, 9:35pm

Q: Is there a way, using the library, to convert all of the text in a
PDF into its Unicode-mapped equivalents?
----
A: There are many ways to implement this with PDFNet, but the simplest
approach is probably pdftron.PDF.TextExtractor, which returns the text
of each page as a Unicode string.

C#/Java pseudocode would look as follows:

PDFNet.Initialize();  // Required once before any other PDFNet call.

PDFDoc doc = new PDFDoc(input_path + "newsletter.pdf");
doc.InitSecurityHandler();  // Needed for encrypted documents.

TextExtractor txt = new TextExtractor();
for (PageIterator itr = doc.GetPageIterator(); itr.HasNext(); itr.Next()) {
    Page page = itr.Current();
    txt.Begin(page);  // Read the page.

    // GetAsText() returns the page text as a Unicode string.
    String page_text = txt.GetAsText();
    Console.WriteLine("\n\n- GetAsText --------------------------\n{0}", page_text);
}

For concrete sample code, please take a look at the TextExtract sample
project:
http://www.pdftron.com/net/samplecode.html#TextExtract
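
If you want to check which Unicode values actually came out, or save the
text as UTF-8, something like the following plain-Java sketch may help.
It does not use PDFNet at all: the class name UnicodeDump, its two helper
methods, and the sample string are made up for illustration, and the
sample string simply stands in for the page_text value obtained from
TextExtractor above.

```java
import java.nio.charset.StandardCharsets;

public class UnicodeDump {
    // Hypothetical helper: list every code point in the text as U+XXXX,
    // separated by spaces.
    static String toCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Hypothetical helper: encode the text as UTF-8 bytes, e.g. before
    // writing it to a file.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by TextExtractor (assumption:
        // in real code this would be page_text from the loop above).
        String pageText = "caf\u00E9";
        System.out.println(toCodePoints(pageText));
        System.out.println(toUtf8(pageText).length + " UTF-8 bytes");
    }
}
```

Because the extracted text is already an ordinary string, no extra
conversion step is needed; the only choice left is which encoding to use
when writing it out.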