Descriptions for options controlling PDF text extraction in 'TextExtractor'.

Q: Can you please explain what each of these options means when we
pass them as a parameter to the Begin function of the TextExtractor class:

e_no_ligature_exp
e_no_dup_remove
e_punct_break
e_remove_hidden_text
e_no_invisible_text
--------
A: The following descriptions of these flags are taken from the PDFNet
SDK API Reference (http://www.pdftron.com/net/apiref.html).

// Processing options that can be passed to the Begin() method to direct
// the flow of content recognition algorithms.
enum ProcessingFlags
{
  // Disables expanding of ligatures using a predefined mapping.
  // Default ligatures are: fi, ff, fl, ffi, ffl, ch, cl, ct, ll,
  // ss, fs, st, oe, OE.
  e_no_ligature_exp = 1,

  // Disables removing duplicated text that is frequently used to
  // achieve visual effects of drop shadow and fake bold.
  e_no_dup_remove = 2,

  // Treats punctuation (e.g. full stop, comma, semicolon) as
  // word break characters.
  e_punct_break = 4,

  // Enables removal of text that is obscured by images or
  // rectangles. Since this option carries a small performance
  // penalty for text extraction, it is not enabled by default.
  e_remove_hidden_text = 8,

  // Enables removing text that uses rendering mode 3 (i.e. invisible
  // text). Invisible text is usually used in 'PDF Searchable Images'
  // (i.e. scanned pages with corresponding OCR text). For that reason,
  // invisible text is extracted by default unless this flag is set.
  e_no_invisible_text = 16
};
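
Since the values above are powers of two, they look like they are meant to be combined with bitwise OR into a single flags value. The following is only a minimal sketch of that idea, not code from the SDK: the enum is a local copy of the values quoted above so that the snippet compiles without the PDFNet headers, and has_flag(), describe_flags() and example_mask() are hypothetical helpers made up for illustration. How the combined value is passed to Begin() should be checked against the API reference.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Local copy of the values quoted above (assumption: real code would use
// the enum declared by TextExtractor instead of redefining it).
enum ProcessingFlags : std::uint32_t {
  e_no_ligature_exp    = 1,
  e_no_dup_remove      = 2,
  e_punct_break        = 4,
  e_remove_hidden_text = 8,
  e_no_invisible_text  = 16
};

// Hypothetical helper: true if the single-bit `flag` is set in `mask`.
inline bool has_flag(std::uint32_t mask, ProcessingFlags flag) {
  // Precondition: `flag` must be exactly one bit.
  assert(flag != 0 && (flag & (flag - 1)) == 0);
  return (mask & flag) != 0;
}

// Hypothetical helper: names of the flags set in `mask`, for logging.
inline std::string describe_flags(std::uint32_t mask) {
  static const struct { ProcessingFlags flag; const char* name; } table[] = {
    { e_no_ligature_exp,    "e_no_ligature_exp" },
    { e_no_dup_remove,      "e_no_dup_remove" },
    { e_punct_break,        "e_punct_break" },
    { e_remove_hidden_text, "e_remove_hidden_text" },
    { e_no_invisible_text,  "e_no_invisible_text" },
  };
  std::string out;
  for (const auto& entry : table) {
    if (has_flag(mask, entry.flag)) {
      if (!out.empty()) out += " | ";
      out += entry.name;
    }
  }
  return out.empty() ? std::string("0") : out;
}

// Example combination: break words on punctuation, drop text hidden
// behind images or rectangles, and skip invisible (rendering mode 3) text.
inline std::uint32_t example_mask() {
  return e_punct_break | e_remove_hidden_text | e_no_invisible_text;
}
```

A value such as example_mask() is what would be handed to Begin() as its flags argument. Leaving the flags at 0 keeps the default behaviour described in the comments above: ligatures are expanded, duplicated text is removed, obscured text is kept, and invisible text is extracted.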