Implementing a custom PDF to XML converter [obtaining positioning info].

Aaron_Gravesdale · January 7, 2009, 2:13am

Q: I would like to implement a custom PDF to XML converter based on
PDFNet SDK (http://www.pdftron.com/net), something similar to PDF2SVG
(http://www.pdftron.com/pdf2svg/).

I have two questions:
- I am currently trying to work out the x/y/width/height/rotation of a
text block. How would I go about doing that?
- For an e_quadto path segment, I couldn't find in the documentation
which point is the end point (x1/y1, x2/y2, or x3/y3).
----
A: You could use PDFNet to implement a custom PDF to XML converter
(along the lines of the ElementReaderAdv or TextExtract samples).
PDF2SVG itself is built using the PDFNet content extraction API.

The position of a text element is defined by the Current
Transformation Matrix (CTM) combined with the Text Matrix
(element.GetCTM() * element.GetTextMatrix()). You could also use
CharIterator (as shown in ElementReaderAdv) to obtain the positioning
information (in text space) for each character in the text element.

Alternatively, you can call element.GetBBox(ref rect) to obtain the
bounding box for a given element (in PDF user space), but this will
not give you the rotation, shear factor, etc. (which can be obtained
from the transformation matrix).
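
As a rough illustration of how rotation, scale and shear can be read
off a transformation matrix, here is a minimal Python sketch. It does
not use PDFNet; it assumes you have already copied the six matrix
entries out as plain numbers in the PDF order [a b c d e f], where
x' = a*x + c*y + e and y' = b*x + d*y + f. The function name
decompose_matrix and the returned keys are hypothetical.

```python
import math

def decompose_matrix(a, b, c, d, e, f):
    """Split a PDF-style affine matrix [a b c d e f] into readable parts.

    Assumed convention: x' = a*x + c*y + e, y' = b*x + d*y + f.
    The matrix is treated as rotation * (scale with a shear term).
    """
    scale_x = math.hypot(a, b)               # length of the transformed x axis
    rotation = math.degrees(math.atan2(b, a))  # angle of the transformed x axis
    if scale_x == 0:
        # Degenerate matrix: the x axis collapses to a point.
        return {"x": e, "y": f, "rotation": 0.0,
                "scale_x": 0.0, "scale_y": 0.0, "shear": 0.0}
    det = a * d - b * c
    scale_y = det / scale_x                  # negative if the matrix mirrors
    shear = (a * c + b * d) / scale_x        # y axis component along the x axis
    return {"x": e, "y": f, "rotation": rotation,
            "scale_x": scale_x, "scale_y": scale_y, "shear": shear}

# Example: 12 unit text rotated by 90 degrees and placed at (100, 200).
info = decompose_matrix(0, 12, -12, 0, 100, 200)
print(info["rotation"], info["scale_x"], info["scale_y"], info["shear"])
```

The translation part (e, f) gives the origin of the element in the
target space; whether that is user space or device space depends on
which matrices were multiplied together before decomposing.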

The positioning of other elements on the page (e.g. images, paths,
form XObjects) is completely defined by the CTM (element.GetCTM());
you can also use element.GetBBox() to obtain the bounding boxes for
these elements.
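
For images specifically, PDF draws the image into the unit square
(0,0)-(1,1) and lets the CTM map that square onto the page, so the
bounding box can also be worked out from the matrix alone. The sketch
below is plain Python, not PDFNet code; image_bbox is a hypothetical
helper and it assumes the CTM entries are given in the PDF order
[a b c d e f]. It does not apply to paths or form XObjects, whose
extent depends on their own content.

```python
def image_bbox(a, b, c, d, e, f):
    """Bounding box of an image placed by the CTM [a b c d e f].

    Assumed convention: x' = a*x + c*y + e, y' = b*x + d*y + f.
    Returns (x_min, y_min, x_max, y_max).
    """
    unit_square = [(0, 0), (1, 0), (0, 1), (1, 1)]
    # Map each corner of the unit square through the matrix.
    points = [(a * x + c * y + e, b * x + d * y + f) for x, y in unit_square]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

# Example: a 200 x 100 image whose lower-left corner sits at (50, 60).
print(image_bbox(200, 0, 0, 100, 50, 60))
```

For an unrotated image the width and height are simply a and d; for a
rotated or sheared image the box returned here is the axis-aligned
envelope of the transformed square.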

- On e_quadto pathsegment, I couldn't find in the documentation which point is the end point (x1/y1 or x2/y2 or x3/y3)

The end point is always the last point in the segment (x3,y3). The
other points are control points for the curve (the PDF format only
supports cubic Bezier curves).
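
To make the roles of the points concrete, here is a small Python
sketch that evaluates a cubic Bezier segment. It is independent of
PDFNet; cubic_bezier is a hypothetical helper, p0 stands for the
current point before the segment, p1 and p2 for the control points
(x1,y1) and (x2,y2), and p3 for the end point (x3,y3).

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Point on a cubic Bezier curve for a parameter t in [0, 1].

    p0 is the start point, p1 and p2 are control points, p3 is the end point.
    """
    u = 1 - t
    x = u**3 * p0[0] + 3 * u * u * t * p1[0] + 3 * u * t * t * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u * u * t * p1[1] + 3 * u * t * t * p2[1] + t**3 * p3[1]
    return (x, y)

start, ctrl1, ctrl2, end = (0, 0), (0, 10), (10, 10), (10, 0)
print(cubic_bezier(start, ctrl1, ctrl2, end, 0))    # the start point
print(cubic_bezier(start, ctrl1, ctrl2, end, 0.5))  # a point part-way along
print(cubic_bezier(start, ctrl1, ctrl2, end, 1))    # the end point (x3,y3)
```

The curve only passes through the first and last points; the two
control points pull the curve towards them without lying on it in
general.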