Q: I am trying to extract data from a PDF and am evaluating PDFNet for
that purpose. Some of the test files I have seem to have the text in
there twice. So if my document has the text "this is some text", when
I extract all of the text elements, I might get something like this:
thi, s, is, som, e, tex, t, this, is, so, me, tex, t
where the various text elements, if reconstructed using the text
element coordinates, encode the document twice. Have you ever heard of
anything like this? I want to throw out the duplicates. Is there some
built-in API mechanism for this? I wondered if maybe these duplicate
"versions" were on different layers or something and I could just
ignore the duplicate layers.
A: You can use pdftron.PDF.TextExtractor to extract words and text
from PDF pages. TextExtractor can remove duplicated text, and this
option is enabled by default. As a starting point for your project,
you may want to take a look at the TextExtract sample project
(http://www.pdftron.com/net/samplecode.html#TextExtract).
If you are not getting the expected results, please let us know
(you can send test files with dummy data to email@example.com)
and we will look into the problem.
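
If you ever need to filter the raw text elements yourself, the general
idea can be sketched as follows. This is only an illustration of
coordinate-based duplicate removal, not the algorithm TextExtractor
uses internally, and it does not call PDFNet at all. The Glyph class
and the explode, dedupe_glyphs and rebuild_line helpers are
hypothetical names, and the sketch assumes a single line of text with
a fixed character advance. Because the two copies in your example are
split into different chunks ("thi", "s" versus "this"), the sketch
compares individual characters rather than whole elements.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """One character and its position (hypothetical, not a PDFNet type)."""
    char: str
    x: float
    y: float


def explode(elements, advance):
    """Split (text, x, y) elements into glyphs, assuming a fixed advance."""
    return [Glyph(ch, x + i * advance, y)
            for text, x, y in elements
            for i, ch in enumerate(text)]


def dedupe_glyphs(glyphs, tol=0.5):
    """Keep the first glyph of each group that shares a character and
    lies within `tol` units of an already kept glyph on both axes."""
    grid = defaultdict(list)  # (char, cell_x, cell_y) -> kept glyphs
    kept = []
    for g in glyphs:
        cx, cy = math.floor(g.x / tol), math.floor(g.y / tol)
        # Two points within `tol` of each other fall in adjacent cells,
        # so checking the 3x3 neighbourhood is sufficient.
        duplicate = any(
            abs(g.x - k.x) <= tol and abs(g.y - k.y) <= tol
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            for k in grid.get((g.char, cx + dx, cy + dy), ())
        )
        if not duplicate:
            grid[(g.char, cx, cy)].append(g)
            kept.append(g)
    return kept


def rebuild_line(glyphs, advance):
    """Join glyphs left to right, adding a space where the gap is wide."""
    parts = []
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None and g.x - prev.x > 1.5 * advance:
            parts.append(" ")
        parts.append(g.char)
        prev = g
    return "".join(parts)


# Made-up coordinates: the same line drawn twice with different chunking,
# the second copy shifted by 0.1 units.
ADVANCE = 6
first = [("thi", 0, 100), ("s", 18, 100), ("is", 30, 100), ("som", 48, 100),
         ("e", 66, 100), ("tex", 78, 100), ("t", 96, 100)]
second = [("this", 0.1, 100), ("is", 30.1, 100), ("so", 48.1, 100),
          ("me", 60.1, 100), ("tex", 78.1, 100), ("t", 96.1, 100)]

unique = dedupe_glyphs(explode(first + second, ADVANCE))
print(rebuild_line(unique, ADVANCE))  # prints: this is some text
```

The tolerance is a judgment call: too small and slightly offset copies
survive, too large and legitimately repeated characters that sit close
together may be dropped. Real documents also need per-glyph widths and
multi-line handling, which this sketch leaves out.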