Not getting the expected text from some annotations when extracting text

Ryan · May 17, 2018, 9:57pm

Question:

I am extracting text from FreeText annotations, and some of them return a different value than what I see on screen.

How do I get the text that I see on screen?

Answer:

Unfortunately, the PDF specification gives FreeText annotations two entries that hold the “contents”. There is also a third location, the optional appearance stream, which, if present, is definitely what you see on screen.

There is a Contents entry, which holds the plain-text contents, and an RC entry (rich text contents) that supports a subset of HTML. Ideally the two are kept synchronized, but this is not enforced or guaranteed. Furthermore, the appearance stream (AP) could show a third value; it should reflect either Contents or RC, but again this is not enforced or guaranteed.

To read the RC entry, if present, you can do the following:

    SDF.Obj rc_obj = annot.GetSDFObj().FindObj("RC");
    if (rc_obj != null && rc_obj.IsString())
    {
        string rc_str = rc_obj.GetAsPDFText();
        // Strip out all HTML syntax to get the raw text.
        // See this post: https://stackoverflow.com/a/5870471/3761687
        // Now you can compare rc_str to the string from Contents if you like,
        // and pick one, or always pick RC "if present".
    }
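For the "strip out all HTML syntax" step, here is a minimal sketch that does not depend on PDFNet. It assumes the RC string is usually well-formed XHTML, so it tries an XML parse first and falls back to a crude tag-stripping regex when that fails. The names `RichTextHelper`, `ToPlainText`, and `PreferRichText` are hypothetical helpers made up for this illustration, and this is not necessarily the approach used in the linked post.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of PDFNet.
public static class RichTextHelper
{
    // Converts an RC (rich text) string to plain text.
    public static string ToPlainText(string rc)
    {
        if (string.IsNullOrWhiteSpace(rc)) return string.Empty;
        try
        {
            // Assumption: RC is well-formed XHTML, so an XML parser can read it.
            XDocument doc = XDocument.Parse(rc);

            // Join the text of each <p> element with a newline.
            // Nested <p> elements are not handled in this sketch.
            var paragraphs = doc.Descendants()
                .Where(e => e.Name.LocalName == "p")
                .Select(e => e.Value)
                .ToList();

            return paragraphs.Count > 0
                ? string.Join("\n", paragraphs)
                : doc.Root.Value;
        }
        catch (XmlException)
        {
            // Fallback for markup that is not well-formed XML:
            // drop anything that looks like a tag, then decode entities.
            string noTags = Regex.Replace(rc, "<[^>]+>", string.Empty);
            return WebUtility.HtmlDecode(noTags);
        }
    }

    // Implements the "always pick RC if present" policy from the answer.
    public static string PreferRichText(string contents, string rc)
    {
        string fromRc = ToPlainText(rc);
        return fromRc.Length > 0 ? fromRc : (contents ?? string.Empty);
    }
}
```

With the snippet above, you would call something like `RichTextHelper.PreferRichText(annot.GetContents(), rc_str)`. Keep in mind that the appearance stream, if present, may still show something different from both values.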