Using TextExtractor.GetAsXml vs other API to extract text from PDF

Aaron_Gravesdale · May 8, 2009, 5:16pm

Q: I solved the issue with the highlights that were not appearing.
It was a localization issue.
I'm extracting the text from the PDF file using the
TextExtractor.GetAsXml method
(http://www.pdftron.com/pdfnet/samplecode.html#TextExtract). This
returns an XML document in this format:

<?xml version="1.0" encoding="utf-8" ?>
<Page num="1" crop_box="0, 0, 345.827, 487.559" media_box="0, 0, 345.827, 487.559" rotate="0">
<Flow id="1">
<Para id="0">
<Line box="36.8504, 391.192, 70.4119, 9.2126" style="font-family:MyriadPro-Regular; font-size:9.2126; color: #000000;">
<Word box="36.8504, 391.192, 34.8973, 9.2126">DIGITALE</Word>
<Word box="73.7008, 391.192, 33.5615, 9.2126">CAMERA</Word>
</Line>
</Para>
</Flow>
<Flow id="2">
<Para id="0">
<Line box="36.8504, 307.333, 77.04, 15" style="font-family:MyriadPro-Regular; font-size:15; color: #000000;">
<Word box="36.8504, 307.333, 77.04, 15">Handleiding</Word>
</Line>
</Para>
</Flow>
<Flow id="3">
<Para id="0">
<Line box="297.796, 312.022, 16.2049, 15" style="font-family:MyriadPro-Regular; font-size:15.2314; color: #FFFFFF;">
<Word box="297.796, 312.022, 16.2049, 15">NL</Word>
</Line>
</Para>
</Flow>
<Flow id="4">
<Para id="1">
<Line box="150.236, 102.492, 153.65, 5" style="font-family:MyriadPro-Regular; font-size:5; color: #000000;">
<Word box="150.236, 102.492, 1.41, 5">•</Word>
<Word box="155.906, 102.492, 6.535, 5">Wij</Word>
<Word box="163.501, 102.492, 15.63, 5">danken</Word>

As you can see, each Word element has an attribute named box, which
holds the bounding box's upper-left coordinates, the width, and the
height:

<Word box="163.501, 102.492, 15.63, 5">.
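
For readers who do need to read the box attribute from the XML, here is a minimal Python sketch (the original post used .NET, so this is only an illustration). The names Box, parse_box, and word_boxes are hypothetical helpers, and the field order x, y, width, height is taken from the description above. Python's float() always expects "." as the decimal separator, independent of the machine's regional settings.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """Hypothetical container for one box attribute."""
    x: float
    y: float
    width: float
    height: float


def parse_box(attr: str) -> Box:
    """Split an 'x, y, width, height' string into four floats.

    float() ignores surrounding whitespace and always reads '.' as the
    decimal separator, so the result does not depend on the locale.
    """
    parts = [float(p) for p in attr.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 numbers in box attribute, got {len(parts)}")
    return Box(*parts)


def word_boxes(xml_text: str) -> List[Tuple[str, Box]]:
    """Return (text, Box) for every Word element in the given XML."""
    root = ET.fromstring(xml_text)
    return [(w.text, parse_box(w.get("box"))) for w in root.iter("Word")]


# A tiny well-formed fragment built from the sample above.
SAMPLE = '<Line><Word box="163.501, 102.492, 15.63, 5">danken</Word></Line>'

if __name__ == "__main__":
    for text, box in word_boxes(SAMPLE):
        print(text, box.x, box.y, box.width, box.height)
```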

To work with them I had to parse the coordinates as doubles. The
issue appeared when running the program on a machine with regional
settings other than English (Dutch, for example). There, the decimal
separator for doubles is "," and not ".". This is normal, but it took
me quite a long time to figure out why the highlights did not appear.
They were listed in the document's highlights summary, but they were
not visible because the coordinates were parsed incorrectly. All the
highlights were outside the visible area of page 1.
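
The failure can be reproduced in miniature. The Python sketch below is only a simulation under stated assumptions: parse_number is a hypothetical helper that mimics a culture-sensitive parser by dropping one character as the digit-group separator and reading another as the decimal separator. It is not the .NET parser itself.

```python
def parse_number(text: str, decimal_sep: str, group_sep: str) -> float:
    """Mimic culture-sensitive parsing (hypothetical helper).

    Group separators are dropped first, then decimal_sep is read as the
    decimal point.
    """
    cleaned = text.strip().replace(group_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


# Page width taken from the crop_box in the sample XML above.
PAGE_WIDTH = 345.827

# English-style conventions: '.' is the decimal separator.
x_invariant = parse_number("163.501", decimal_sep=".", group_sep=",")

# Dutch-style conventions: ',' is the decimal separator, '.' groups digits.
x_dutch = parse_number("163.501", decimal_sep=",", group_sep=".")

if __name__ == "__main__":
    print(x_invariant)           # 163.501
    print(x_dutch)               # 163501.0
    print(x_dutch > PAGE_WIDTH)  # True
```

Under the Dutch-style conventions the "." is treated as a digit-group separator, so the X coordinate ends up far to the right of the page, which matches the symptom of highlights that exist but are not visible.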
I would like to suggest something, if I may: in a future version,
could you add a method overload for TextExtractor.GetAsXml that
returns the XML in the following format?

<Word X="163.501" Y="102.492" Width="15.63" Height="5">danken</Word>
- The decimal separator should be the one from the Culture of the
process in which the library is loaded ("." for English, "," for
Dutch, and so on).

This would be nice because it frees developers from the burden of
handling localization-related issues. Please let me know what you
think about this.
--------
A: I am not sure why you are parsing the XML generated by GetAsXml().
Instead, you can iterate through lines and words programmatically (as
shown in the TextExtract sample) and access all required parameters
directly. This is probably much simpler than parsing XML and having to
deal with locale issues.

Aaron_Gravesdale · May 8, 2009, 5:17pm

Thanks for the suggestions. It worked.