How do I calculate the width of a space character?

Aaron_Gravesdale · March 20, 2008, 10:41pm

Q: I need to calculate the width of a space character for a given PDF
Font. I get the font using element.GetGState().GetFont(). The problem
is that most fonts seem to be a subset.

a) How can I get the full font object from the document?
b) How can I get the width of a space character? (I need the default
size of a space to figure out if words are contiguous or need to be
split.)
c) Is the code below the most efficient way to get the correct
dimensions of a text character? I have many cases where the BBox of a
char is completely incorrect.
d) In PDF, how do I even get a space character? In UNICODE/ASCII a
space is 32, what is it in PDF? 3?

// Code to get char bbox in Java

private Rect getGlyphBBox(double x, double y, int charCode,
                          pdftron.PDF.Font pdfFont,
                          double fontHorzScale, double fontSize,
                          double fontAscent, double fontDescent,
                          Matrix2D matrix) {
   Rect rect = new Rect();
   double dx;

   rect.x1 = x;
   rect.y1 = y;

   if (pdfFont.IsSimple()) {
      dx = pdfFont.GetWidth(charCode) / 1000.0;
      dx *= fontHorzScale * fontSize;

      rect.x2 = rect.x1 + dx;
      rect.y1 = y + fontDescent;
      rect.y2 = y + fontAscent;
   }
   else {
      int cid = pdfFont.MapToCID(charCode);
      dx = pdfFont.GetWidth(cid) / 1000.0;
      dx *= fontHorzScale * fontSize;

      rect.x2 = rect.x1 + dx;
      rect.y1 = y + fontDescent;
      rect.y2 = y + fontAscent;
   }

   rect = getBBoxTransRect(rect, matrix);
   return rect;
}

private Rect getBBoxTransRect(Rect rect, Matrix2D matrix) {
   double p1x = rect.x1;
   double p1y = rect.y1;
   double p2x = rect.x2;
   double p2y = rect.y1;
   double p3x = rect.x2;
   double p3y = rect.y2;
   double p4x = rect.x1;
   double p4y = rect.y2;

   matrix.Mult(ref p1x, ref p1y);
   matrix.Mult(ref p2x, ref p2y);
   matrix.Mult(ref p3x, ref p3y);
   matrix.Mult(ref p4x, ref p4y);

   return new Rect(Math.Min(Math.Min(Math.Min(p1x, p2x), p3x), p4x),
                   Math.Min(Math.Min(Math.Min(p1y, p2y), p3y), p4y),
                   Math.Max(Math.Max(Math.Max(p1x, p2x), p3x), p4x),
                   Math.Max(Math.Max(Math.Max(p1y, p2y), p3y), p4y));
}
---------------
A: If the font is embedded, you can extract the font data stream using
font.GetEmbeddedFont(). Unfortunately, the embedded font is typically
subsetted, so it will probably not be very useful.

Your code for computing the bounding box for a character seems to be
correct. What is the matrix that you pass in the call to
getGlyphBBox()? It should be a concatenation of the Current Transform
Matrix (CTM) and the current text matrix: element.GetCTM() *
element.GetTextMatrix(). Please keep in mind that this computation
will not work for Type 3 fonts. Another approach that gives a more
precise bounding box is to get the glyph outline and then find the
bbox of that path.
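
The matrix concatenation described above can be sketched without the
PDFNet classes. This is only an illustration: it uses
java.awt.geom.AffineTransform as a stand-in for Matrix2D, the class and
method names (GlyphBoxSketch, textSpaceBox, toUserSpace) are
hypothetical, and it assumes the ascent and descent are already scaled
to text-space units and that the advance width is non-negative.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

final class GlyphBoxSketch {
    // Untransformed glyph box in text space. The advance width is given in
    // 1/1000 units and scaled by font size and horizontal scale; the
    // vertical extent runs from y + descent to y + ascent.
    static Rectangle2D textSpaceBox(double x, double y, double glyphWidth,
                                    double fontSize, double horzScale,
                                    double ascent, double descent) {
        double dx = glyphWidth / 1000.0 * fontSize * horzScale;
        return new Rectangle2D.Double(x, y + descent, dx, ascent - descent);
    }

    // Applies the text matrix first, then the CTM, and returns the
    // axis-aligned bounds of the transformed box.
    static Rectangle2D toUserSpace(Rectangle2D box, AffineTransform ctm,
                                   AffineTransform textMatrix) {
        AffineTransform combined = new AffineTransform(ctm);
        combined.concatenate(textMatrix); // combined(p) = ctm(textMatrix(p))
        return combined.createTransformedShape(box).getBounds2D();
    }
}
```

If only the CTM (or only the text matrix) is passed in, the box ends up
scaled or offset incorrectly, which is one possible cause of
"completely incorrect" bounding boxes.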

The white space question is a tricky one. Some PDF documents and fonts
do not use a space character at all (instead, they position each piece
of text absolutely), so relying on the space character in the font
will not work. The PDF format can also use all kinds of text
encodings. In most cases you can use font.MapToUnicode() to map the
char codes (from element.GetCharIterator()) in the page content stream
to Unicode. Another, possibly simpler, way to get the Unicode
representation of the entire text run is element.GetTextString(). Also,
please keep in mind that the Unicode standard contains several other
white space characters (besides code 32).
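
Once the text has been mapped to Unicode, a check for "any kind of
space" could look roughly like the sketch below. It uses only standard
java.lang.Character methods; the class and method names
(WhitespaceSketch, isAnySpace, containsSpace) are made up for
illustration, and whether control characters such as tab should count
as spaces is an assumption you may want to change.

```java
final class WhitespaceSketch {
    // True for ASCII space and control whitespace (tab, newline), and also
    // for Unicode space separators such as U+00A0 (no-break space), which
    // Character.isWhitespace() alone does not report.
    static boolean isAnySpace(int codePoint) {
        return Character.isWhitespace(codePoint)
            || Character.isSpaceChar(codePoint);
    }

    // True when the decoded text run contains at least one space-like
    // code point.
    static boolean containsSpace(String text) {
        return text.codePoints().anyMatch(WhitespaceSketch::isAnySpace);
    }
}
```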

So there is really no clear-cut way to get the size of a white space
character in PDF. In practice, the calculation may need to combine
several values and measures to come up with an estimate. For example,
if they are available, you may want to take into account the values
'AvgWidth', 'MaxWidth' and 'MissingWidth' from the font descriptor.
These values can be obtained as follows:

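// In Java
double avg_width = 0;  // stays 0 if the descriptor or the entry is missing
Obj font_descriptor = font.getDescriptor();
if (font_descriptor != null) {
   // AvgWidth is optional, so check that it exists before reading it
   Obj avg = font_descriptor.findObj("AvgWidth");
   if (avg != null) avg_width = avg.getNumber();
}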
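
As a rough illustration of how such values might be combined, here is
a minimal sketch of one possible heuristic. It is not part of the
PDFNet API: the class SpaceWidthSketch, its methods, the order in which
the candidates are tried, and the fraction used as a word-break
threshold are all assumptions that would need tuning against real
documents. Note that AvgWidth is an average over all glyphs, so it will
usually overestimate the width of a space.

```java
final class SpaceWidthSketch {
    // Returns the first positive candidate, in 1/1000 units: the width of
    // the font's own space glyph, then AvgWidth, then MissingWidth, and
    // finally a caller-supplied fallback.
    static double estimateSpaceWidth(double spaceGlyphWidth, double avgWidth,
                                     double missingWidth, double fallback) {
        if (spaceGlyphWidth > 0) return spaceGlyphWidth;
        if (avgWidth > 0) return avgWidth;
        if (missingWidth > 0) return missingWidth;
        return fallback;
    }

    // Converts a width in 1/1000 units to text-space units.
    static double toTextSpace(double width1000, double fontSize,
                              double horzScale) {
        return width1000 / 1000.0 * fontSize * horzScale;
    }

    // Treats a horizontal gap between two glyphs as a word break when it
    // reaches the given fraction of the estimated space width.
    static boolean isWordBreak(double gap, double spaceWidth,
                               double fraction) {
        return gap >= fraction * spaceWidth;
    }
}
```

For example, with an estimated width of 278 units at font size 10, a
gap of 2.0 text-space units would count as a word break at a fraction
of 0.5, while a gap of 0.5 would not.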