A question regarding PDF text editing

Aaron_Gravesdale · July 24, 2008, 12:33am

Q: We have encountered an unusual problem when replacing text in a PDF
file. We are using PDFNet to take the address details from one PDF file
and insert them into another. It all seems to work OK,
except when words contain an uppercase ‘R’ (e.g. ‘Road’).

It does not write the uppercase ‘R’ to the newly created page. It
does write a lowercase ‘r’.

We have also noticed that it does not write the following characters,
which we need to record details accurately: uppercase 'Q', 'R', 'V',
'X', 'Z', and lowercase 'z'.

Below is an example of the code used.

string elementText = "rRRRRRRRRRRRr";
byte[] elementByte = System.Text.ASCIIEncoding.ASCII.GetBytes(elementText);
int tmpByteCount = System.Text.ASCIIEncoding.ASCII.GetByteCount(elementText);
string tmpString = System.Text.ASCIIEncoding.ASCII.GetString(elementByte);
element.SetTextData(elementByte, tmpByteCount);

Here element is declared as type ‘Element’ (e.g. Element element;).

This writes ‘rr’ on the new page.

This is within a block of code that is similar to the pdftron example
‘EditTextTest.cs’

I’ve included the full block just so you can see. But the main part
is above and hard coded for test purposes.
------
A: The problem is that the font associated with a given text
element may use a custom encoding or may be missing glyphs (in the case
of subsetted fonts). It is usually much simpler to use a new font and
replace the entire text run than to edit the existing element. For more
information you may want to refer to the following articles:

  http://groups.google.com/group/pdfnet-sdk/browse_thread/thread/8394e67025f558d9
  http://groups.google.com/group/pdfnet-sdk/browse_thread/thread/74b709f5a426d348
  http://groups.google.com/group/pdfnet-sdk/browse_thread/thread/7b51b00ac6699609

or search for "replace text", "edit text", "search replace" etc.
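
To see why only some letters disappear, it can help to think of a
subsetted font as holding glyphs only for the characters the original
document used. The Java sketch below is a toy model and not PDFNet code:
the class SubsetFontDemo, its methods, and the sample text are
hypothetical, chosen so that the test string from the question loses
its uppercase 'R'. A real viewer may also show a placeholder glyph
instead of dropping the character.

```java
import java.util.HashSet;
import java.util.Set;

class SubsetFontDemo {
    /** Builds the glyph set a subset font would hold if it embedded only the sample text. */
    static Set<Character> glyphsUsedBy(String sample) {
        Set<Character> glyphs = new HashSet<>();
        for (char c : sample.toCharArray()) {
            glyphs.add(c);
        }
        return glyphs;
    }

    /** Keeps only the characters for which the (hypothetical) subset font has a glyph. */
    static String renderable(String text, Set<Character> embeddedGlyphs) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (embeddedGlyphs.contains(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The sample text contains a lowercase 'r' but no uppercase 'R'.
        Set<Character> glyphs = glyphsUsedBy("12 main street");
        System.out.println(renderable("rRRRRRRRRRRRr", glyphs)); // prints "rr"
    }
}
```

Switching to a new, complete font avoids the problem because every
character you write then has a glyph available.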

Aaron_Gravesdale · July 30, 2008, 8:05pm

Q: Thanks for the reply. What we have done is explicitly set the font
for each element as advised. That has worked.

There are some other issues that we are dealing with that you could
hopefully help us with.

1. When we read using reader.Next(), the elements returned are single
characters for PDFs produced by some programs, but multi-character for
PDFs produced by others. Is there a way of controlling this so that the
elements are single-character only?

2. We have a string that is a page number that needs to be right
aligned. It is in the source document. Is there a way of setting the
element to be written so that the anchor point is on the right hand
side rather than the left?
------
A: reader.Next() returns PDF elements as they are stored in PDF. By
definition, a text run may refer to more than one glyph. You can
however iterate over characters in a text run using
element.GetCharIterator(). For example:

for (CharIterator itr = element.GetCharIterator(); itr.HasNext(); itr.Next()) {
  var char_code = itr.Current().char_code;  // font-specific code, not Unicode
  double char_pos_x = itr.Current().x;      // position in text space
  double char_pos_y = itr.Current().y;
}

Most novice users of PDFNet prefer element.GetTextString(),
element.GetBBox(), etc., because they are easier to use. For example,
char_code in CharIterator is not mapped to Unicode, so you would need
to call element.GetGState().GetFont().MapToUnicode(char_code...) to
map it to Unicode. Also, the positioning information in CharIterator
is in text coordinate space; to get positions in the PDF page
coordinate system, the point would need to be transformed using
element.GetTextMatrix() followed by element.GetCTM().
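
The two-step transform can be sketched in plain Java without any
PDFNet calls. The class TextSpaceToPage and the matrix values below are
made up for illustration; in real code the two matrices would come from
element.GetTextMatrix() and element.GetCTM(). Each matrix is written in
the usual PDF form [a b c d h v], which maps (x, y) to
(a*x + c*y + h, b*x + d*y + v).

```java
class TextSpaceToPage {
    /** Applies a PDF matrix [a b c d h v] to the point (x, y). */
    static double[] apply(double[] m, double x, double y) {
        return new double[] {
            m[0] * x + m[2] * y + m[4],
            m[1] * x + m[3] * y + m[5]
        };
    }

    /** Text matrix first, then the CTM, in the order described above. */
    static double[] toPageSpace(double[] textMatrix, double[] ctm, double x, double y) {
        double[] p = apply(textMatrix, x, y);
        return apply(ctm, p[0], p[1]);
    }

    public static void main(String[] args) {
        double[] textMatrix = {1, 0, 0, 1, 100, 700}; // hypothetical: translate by (100, 700)
        double[] ctm = {2, 0, 0, 2, 10, 20};          // hypothetical: scale by 2, then translate
        double[] p = toPageSpace(textMatrix, ctm, 5, 0);
        System.out.println(p[0] + ", " + p[1]); // prints "220.0, 1420.0"
    }
}
```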

"Is there a way of setting the element to be written so that the
anchor point is on the right hand side rather than the left?"

You can align/justify the text by taking into account the length of
the text run (returned by element.GetTextLength()). For example:

In C#:
writer.WriteElement(element_builder.CreateTextBegin(Font.Create(doc, font), font_size));
Element element = element_builder.CreateTextRun("My text");

// To right justify
element.SetTextMatrix(1, 0, 0, 1, box_width - element.GetTextLength(), pos_Y);

// To 'center' justify
// element.SetTextMatrix(1, 0, 0, 1, (box_width - element.GetTextLength()) / 2, pos_Y);

writer.WriteElement(element);
writer.WriteElement(element_builder.CreateTextEnd());
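
The arithmetic behind the two text matrices can be checked on its own.
The Java sketch below is only an illustration: the class TextAlignDemo
and the glyph widths are hypothetical, and it assumes a simple font
whose glyph widths are given in 1/1000 of the font size. In real code
the length would come from element.GetTextLength() as in the snippet
above.

```java
class TextAlignDemo {
    /** Length of a run, assuming glyph widths in 1/1000 of the font size. */
    static double textLength(int[] glyphWidths, double fontSize) {
        long total = 0;
        for (int w : glyphWidths) {
            total += w;
        }
        return total * fontSize / 1000.0;
    }

    /** X offset that places the end of the text at the right edge of the box. */
    static double rightAlignedX(double boxWidth, double textLength) {
        return boxWidth - textLength;
    }

    /** X offset that centers the text in the box. */
    static double centeredX(double boxWidth, double textLength) {
        return (boxWidth - textLength) / 2.0;
    }

    public static void main(String[] args) {
        int[] widths = {500, 500, 250};          // hypothetical widths for a 3-glyph page number
        double length = textLength(widths, 12);  // 15.0
        System.out.println(rightAlignedX(200, length)); // prints "185.0"
        System.out.println(centeredX(200, length));     // prints "92.5"
    }
}
```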

Aaron_Gravesdale · July 30, 2008, 9:00pm

Q: We use PDFNet SDK for Java (on Linux) to edit text in existing PDFs
and generate finished PDFs. We have a problem when we try to insert the
pound sterling sign and other special characters ($, bullets, etc.). For
example, if we use the pound sterling character, we see
"Ã‚Â£" instead of "£" in the generated PDFs.

The only solution we have found is the following:
1. Use the CreateCIDTrueTypeFont method to create a CID font;
2. Use the createUnicodeTextRun method to insert text with special
characters into the PDF.

But this solution only works on a Windows box and also has some
limitations.
The problems with this approach are:
1. The CreateCIDTrueTypeFont method takes a font path as a parameter.
Because of this we cannot use fonts embedded in the PDF and must store
a big external font library for all cases.

2. According to the documentation, it seems that the
CreateCIDTrueTypeFont method is available only on Windows. We use
PDFNet with Java on Linux.
----
A: What is the char-code that you use to represent sterling (£)?
According to the PDF documentation (Appendix D.1) the char code should
be 163 (0xA3). So if you use this char-code with standard PDF fonts
(e.g. Font.Create()) and elementBuilder.CreateTextRun(), you should get
the correct output.

// Assuming C++ (similar code works in Java, C#, VB.NET...)
Element element = eb.CreateTextBegin(Font::Create(doc, Font::e_times_roman), 12);
writer.WriteElement(element);

element = eb.CreateTextRun("Hello \xA3 World!");
writer.WriteElement(element);
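
One possible cause of the extra characters, offered here as an
assumption rather than something confirmed in this thread, is that the
string reaches the text run as UTF-8 bytes. In UTF-8 the sterling sign
is two bytes (0xC2 0xA3), and a font that treats each byte as one
char-code will show something like "Â£". The Java sketch below uses
only the standard library; the class SterlingEncodingDemo and its
method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class SterlingEncodingDemo {
    static final String POUND = "\u00A3"; // the sterling sign, written as an escape

    /** One byte per character; sterling becomes the single byte 0xA3 (163). */
    static byte[] singleByte(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** UTF-8 encoding; sterling becomes the two bytes 0xC2 0xA3. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Interprets every byte as its own character, as a single-byte font encoding would. */
    static String readAsSingleByteCodes(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(singleByte(POUND).length); // prints "1"
        System.out.println(utf8(POUND).length);       // prints "2"
        // Two characters come back: U+00C2 followed by U+00A3.
        System.out.println(readAsSingleByteCodes(utf8(POUND)).length()); // prints "2"
    }
}
```

If this is what is happening, passing the single char-code 163 as shown
in the snippet above should give one sterling glyph.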

Font::CreateTrueTypeFont() & Font::CreateCIDTrueTypeFont() are
available on all supported platforms (i.e. Windows, Linux, Mac, etc.).

---
The only methods restricted to Windows are
Font::CreateTrueTypeFont2()/CreateCIDTrueTypeFont2(). These are
platform-specific utility methods.

"must store a big external font library for all cases."

It depends on how many fonts you would like to ship with your
application.
Most applications search for installed system fonts instead of
shipping their own fonts.
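
As a rough idea of what such a search could look like on Linux, the
Java sketch below walks a directory tree and collects .ttf files. It
uses only the standard library; the class SystemFontFinder and the
directory list are assumptions for illustration, and the right
locations depend on the target system. The resulting paths could then
be passed to whichever font creation method accepts a font path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class SystemFontFinder {
    // Assumed typical Linux font locations; adjust for the target system.
    static final String[] ASSUMED_FONT_DIRS = {
        "/usr/share/fonts", "/usr/local/share/fonts"
    };

    /** True if the file name ends in .ttf, ignoring case. */
    static boolean isTrueTypeName(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".ttf");
    }

    /** Returns all .ttf files under root, or an empty list if root is missing or unreadable. */
    static List<Path> findTrueTypeFonts(Path root) {
        if (!Files.isDirectory(root)) {
            return new ArrayList<>();
        }
        try (Stream<Path> files = Files.walk(root)) {
            return files
                .filter(Files::isRegularFile)
                .filter(p -> isTrueTypeName(p.getFileName().toString()))
                .sorted()
                .collect(Collectors.toList());
        } catch (IOException | UncheckedIOException e) {
            return new ArrayList<>();
        }
    }

    public static void main(String[] args) {
        for (String dir : ASSUMED_FONT_DIRS) {
            for (Path font : findTrueTypeFonts(Paths.get(dir))) {
                System.out.println(font);
            }
        }
    }
}
```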