Q: I need to calculate the width of a space character for a given PDF
font. I get the font using element.GetGState().GetFont(). The problem
is that most fonts seem to be subsets.

a) How can I get the full font object from the document?

b) How can I get the width of a space character? (I need the default
size of a space to figure out whether words are contiguous or need to
be split.)

c) Is the code below the most efficient way to get the correct
dimensions of a text character? I have many cases where the BBox of a
char is completely incorrect.

d) In PDF, how do I even get a space character? In Unicode/ASCII a
space is 32; what is it in PDF? 3?

// Code to get char bbox (C# syntax; see the 'ref' arguments used below)
private Rect getGlyphBBox(double x, double y, int charCode,
                          pdftron.PDF.Font pdfFont,
                          double fontHorzScale, double fontSize,
                          double fontAscent, double fontDescent,
                          Matrix2D matrix) {
    // Simple fonts are looked up by char code; composite fonts by CID.
    int widthCode = pdfFont.IsSimple() ? charCode
                                       : pdfFont.MapToCID(charCode);

    // Glyph widths are given in 1/1000 of a unit, so scale them by the
    // font size and the horizontal scaling.
    double dx = pdfFont.GetWidth(widthCode) / 1000.0;
    dx *= fontHorzScale * fontSize;

    Rect rect = new Rect();
    rect.x1 = x;
    rect.x2 = x + dx;
    rect.y1 = y + fontDescent;
    rect.y2 = y + fontAscent;

    // Transform the box and return its axis-aligned bounds.
    return getBBoxTransRect(rect, matrix);
}

private Rect getBBoxTransRect(Rect rect, Matrix2D matrix) {
    // The four corners of the untransformed rectangle.
    double p1x = rect.x1, p1y = rect.y1;
    double p2x = rect.x2, p2y = rect.y1;
    double p3x = rect.x2, p3y = rect.y2;
    double p4x = rect.x1, p4y = rect.y2;

    // Map each corner through the matrix (coordinates are updated in place).
    matrix.Mult(ref p1x, ref p1y);
    matrix.Mult(ref p2x, ref p2y);
    matrix.Mult(ref p3x, ref p3y);
    matrix.Mult(ref p4x, ref p4y);

    // Axis-aligned bounding box of the four transformed corners.
    return new Rect(Math.Min(Math.Min(p1x, p2x), Math.Min(p3x, p4x)),
                    Math.Min(Math.Min(p1y, p2y), Math.Min(p3y, p4y)),
                    Math.Max(Math.Max(p1x, p2x), Math.Max(p3x, p4x)),
                    Math.Max(Math.Max(p1y, p2y), Math.Max(p3y, p4y)));
}

---------------

A: If the font is embedded, you can extract the font data stream using
font.GetEmbeddedFont(). Unfortunately, the embedded font is typically
subsetted and will probably not be very useful.

Your code for computing the bounding box of a character seems to be
correct. What is the matrix that you pass in the call to
getGlyphBBox()? It should be the concatenation of the Current
Transformation Matrix (CTM) and the current text matrix:
element.GetCTM() * element.GetTextMatrix(). Please keep in mind that
this computation will not work for Type 3 fonts. Another approach that
gives a more precise bounding box is to get the glyph outline and then
find the bbox of the path.
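
To make the arithmetic concrete, here is a minimal, library-free Java
sketch of the same computation. Affine and GlyphBoxSketch are
hypothetical stand-ins written for this illustration, not PDFNet
classes. The sketch assumes that ascent and descent are already in
text-space units (descent negative) and that the text matrix is applied
first and the CTM second.

```java
/** Minimal affine matrix with PDF-style components [a b c d h v]; not a PDFNet class. */
final class Affine {
    final double a, b, c, d, h, v;

    Affine(double a, double b, double c, double d, double h, double v) {
        this.a = a; this.b = b; this.c = c; this.d = d; this.h = h; this.v = v;
    }

    /** Maps a point: x' = a*x + c*y + h, y' = b*x + d*y + v. */
    double[] apply(double x, double y) {
        return new double[] { a * x + c * y + h, b * x + d * y + v };
    }

    /** Returns the matrix that applies 'first' and then this matrix. */
    Affine after(Affine first) {
        return new Affine(
            a * first.a + c * first.b,
            b * first.a + d * first.b,
            a * first.c + c * first.d,
            b * first.c + d * first.d,
            a * first.h + c * first.v + h,
            b * first.h + d * first.v + v);
    }
}

final class GlyphBoxSketch {
    /** Advance in text space; glyph widths are given in 1/1000 of the font size. */
    static double advance(double glyphWidth, double fontSize, double horzScale) {
        return glyphWidth / 1000.0 * fontSize * horzScale;
    }

    /** Returns {minX, minY, maxX, maxY} of the transformed glyph box. */
    static double[] bbox(double x, double y, double dx,
                         double ascent, double descent, Affine m) {
        double[][] corners = {
            m.apply(x, y + descent), m.apply(x + dx, y + descent),
            m.apply(x + dx, y + ascent), m.apply(x, y + ascent)
        };
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (double[] p : corners) {
            minX = Math.min(minX, p[0]);
            minY = Math.min(minY, p[1]);
            maxX = Math.max(maxX, p[0]);
            maxY = Math.max(maxY, p[1]);
        }
        return new double[] { minX, minY, maxX, maxY };
    }
}
```

For example, with a CTM that scales by 2 and a text matrix that
translates by (10, 20), the combined matrix is ctm.after(tm), and a
glyph of width 500 at font size 12 has an advance of 6 text-space units
before the transform is applied.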

The white space question is a tricky one. Some PDF documents and fonts
do not use the space character at all (instead, each piece of text is
positioned absolutely), so relying on the space character in the font
will not work. PDF can also use all kinds of text encodings. In most
cases you can use font.MapToUnicode() to map char codes (from
element.GetCharIterator()) in the page content stream to Unicode.
Another, possibly simpler, way to get the Unicode representation of
the entire text run is element.GetTextString(). Also, please keep in
mind that the Unicode standard contains several other white space
characters (besides code 32).
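
Once a text run has been mapped to Unicode, a check along the following
lines can catch the other white space characters as well. This is only
a sketch using the standard java.lang.Character methods; SpaceLike is a
hypothetical helper name, and whether tabs and line breaks should count
as word separators is an assumption you may want to change.

```java
final class SpaceLike {
    /**
     * True for ASCII space, tabs and line breaks, and Unicode separators.
     * Character.isWhitespace() excludes no-break spaces such as U+00A0,
     * so Character.isSpaceChar() is checked as well.
     */
    static boolean isSpaceLike(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    /** True if the decoded text run contains at least one space-like character. */
    static boolean containsSpaceLike(String text) {
        return text.codePoints().anyMatch(SpaceLike::isSpaceLike);
    }
}
```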

So there is really no clear-cut way to get the size of a white space
character in PDF. In practice, the calculation may take into account
many values and measures to come up with an estimate. For example, if
available, you may want to take into account the values 'AvgWidth',
'MaxWidth' and 'MissingWidth' from the font descriptor. These values
can be obtained as follows:

// In Java
Obj font_descriptor = font.getDescriptor();
if (font_descriptor != null) {
    double avg_width = 0;
    // 'AvgWidth' is an optional entry, so check the result for null.
    Obj avg = font_descriptor.findObj("AvgWidth");
    if (avg != null) avg_width = avg.getNumber();
}
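
One possible way to turn such values into a word-splitting decision is
sketched below. Everything in it is an assumption made for illustration:
the WordGapSketch name, the fallback order (space glyph width, then
AvgWidth, then MissingWidth), the last-resort value of 250, and the gap
factor are not taken from PDFNet or the PDF specification, and they
would need tuning against real documents.

```java
final class WordGapSketch {
    /** Arbitrary last-resort width in glyph units (1/1000 of the font size). */
    static final double FALLBACK_WIDTH = 250.0;

    /**
     * Picks a space-width estimate in glyph units.
     * Values <= 0 are treated as "not available".
     */
    static double estimateSpaceWidth(double spaceGlyphWidth,
                                     double avgWidth,
                                     double missingWidth) {
        if (spaceGlyphWidth > 0) return spaceGlyphWidth;
        if (avgWidth > 0) return avgWidth;
        if (missingWidth > 0) return missingWidth;
        return FALLBACK_WIDTH;
    }

    /**
     * Decides whether the gap between two adjacent glyph boxes is wide
     * enough to start a new word. prevRight and nextLeft are positions
     * along the baseline, in the same units as fontSize.
     */
    static boolean startsNewWord(double prevRight, double nextLeft,
                                 double spaceWidthUnits, double fontSize,
                                 double gapFactor) {
        double spaceWidth = spaceWidthUnits / 1000.0 * fontSize;
        return (nextLeft - prevRight) > gapFactor * spaceWidth;
    }
}
```

With a 12-point font, an estimated space width of 250 units, and a gap
factor of 0.5, the threshold is 1.5 units: a gap of 0.5 keeps two glyphs
in the same word, while a gap of 3 splits them.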