How do I remove duplicate text during PDF text extraction?

Aaron_Gravesdale · February 15, 2008, 2:46am

Q: I am trying to extract data from a PDF and am evaluating PDFNet for
that purpose. Some of the test files I have seem to have the text in
there twice. So if my document has the text "this is some text", when
I extract all of the text elements, I might get something like this:

thi, s, is, som, e, tex, t, this, is, so, me, tex, t

where the various text elements, if reconstructed using the text
element coordinates, encode the document twice. Have you ever heard of
anything like this? I want to throw out the duplicates. Is there some
built-in API mechanism for this? I wondered if maybe these duplicate
"versions" were on different layers or something and I could just
ignore the duplicate layers.
-----
A: You can use pdftron.PDF.TextExtractor to extract words and text
from PDF pages. TextExtractor can remove duplicated text, and this
option is enabled by default. As a starting point for your project,
you may want to take a look at the TextExtract sample project
(http://www.pdftron.com/net/samplecode.html#TextExtract).
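
If you process raw text elements yourself rather than going through
TextExtractor, one possible approach is to compare glyph positions and
drop any glyph that repeats the same character at (nearly) the same
spot. The following is only a minimal sketch of that idea in plain
Python. It does not call PDFNet and is not how TextExtractor works
internally. The Glyph type, the fixed 6-unit advance width, the 0.5-unit
tolerance, and the sample coordinates are all assumptions made up for
illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """Hypothetical per-character record: the character and its page position."""
    char: str
    x: float
    y: float


def glyphs_from_runs(runs, advance=6.0):
    """Expand (text, x, y) runs into glyphs, assuming a fixed advance width."""
    return [
        Glyph(ch, x + i * advance, y)
        for text, x, y in runs
        for i, ch in enumerate(text)
    ]


def remove_duplicate_glyphs(glyphs, tol=0.5):
    """Keep the first glyph at each spot; drop repeats of the same character
    whose x and y are both within `tol` units of a glyph already kept."""
    kept = []
    for g in glyphs:
        is_dup = any(
            k.char == g.char
            and abs(k.x - g.x) <= tol
            and abs(k.y - g.y) <= tol
            for k in kept
        )
        if not is_dup:
            kept.append(g)
    return kept


def to_text(glyphs, advance=6.0):
    """Rebuild a string in reading order, inserting a space on a line change
    or when the horizontal gap exceeds 1.5 advance widths."""
    out = []
    prev = None
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if prev is not None and (g.y != prev.y or g.x - prev.x > 1.5 * advance):
            out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


# Made-up sample: the same line drawn twice, split into different runs,
# with the second pass shifted by 0.2 units.
first_pass = [("thi", 0, 700), ("s", 18, 700), ("is", 30, 700),
              ("som", 48, 700), ("e", 66, 700), ("tex", 78, 700),
              ("t", 96, 700)]
second_pass = [("this", 0.2, 700), ("is", 30.2, 700), ("so", 48.2, 700),
               ("me", 60.2, 700), ("tex", 78.2, 700), ("t", 96.2, 700)]

all_glyphs = glyphs_from_runs(first_pass + second_pass)
print(to_text(remove_duplicate_glyphs(all_glyphs)))  # prints: this is some text
```

Comparing at the glyph level rather than the run level matters here
because the two copies are split into different fragments ("thi", "s"
versus "this"), so the runs themselves never match exactly. The
pairwise comparison is quadratic in the number of glyphs, which is fine
for a single page but would need a spatial index for very dense pages.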

If you are not getting the expected results, please let us know
(you can send any test files with dummy data to support@pdftron.com)
and we will look into the problem.