Q: Can you please explain what each of these options means when we
pass them as a parameter to the Begin function of the TextExtractor class:
e_no_ligature_exp
e_no_dup_remove
e_punct_break
e_remove_hidden_text
e_no_invisible_text
--------
A: The following is the description of these flags from the PDFNet SDK
API Reference (http://www.pdftron.com/net/apiref.html).
// Processing options that can be passed in Begin() method to direct
// the flow of content recognition algorithms
enum ProcessingFlags
{
// Disables expanding of ligatures using a predefined mapping.
// Default ligatures are: fi, ff, fl, ffi, ffl, ch, cl, ct, ll,
// ss, fs, st, oe, OE.
e_no_ligature_exp = 1,
// Disables removing duplicated text that is frequently used to
// achieve visual effects of drop shadow and fake bold.
e_no_dup_remove = 2,
// Treat punctuation (e.g. full stop, comma, semicolon, etc.) as
// word break characters.
e_punct_break = 4,
// Enables removal of text that is obscured by images or
// rectangles. Since this option carries a small performance
// penalty for text extraction, it is not enabled by default.
e_remove_hidden_text = 8,
// Enables removing text that uses rendering mode 3 (i.e. invisible text).
// Invisible text is usually used in 'PDF Searchable Images' (i.e. scanned
// pages with a corresponding OCR text). As a result, invisible text
// will be extracted by default.
e_no_invisible_text = 16
};
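
Since the flag values are powers of two, they can presumably be combined with bitwise OR into a single mask. The following is only a minimal sketch of that idea: it uses a local copy of the enum so that it compiles without PDFNet, and the names `has_flag` and `kExampleFlags` are hypothetical helpers introduced for illustration, not part of the SDK.

```cpp
// Local copy of the flag values quoted above (assumption: this mirrors the
// SDK enum; in real code you would use the one declared by TextExtractor).
enum ProcessingFlags
{
    e_no_ligature_exp    = 1,
    e_no_dup_remove      = 2,
    e_punct_break        = 4,
    e_remove_hidden_text = 8,
    e_no_invisible_text  = 16
};

// Hypothetical helper: returns true if the given flag is set in the mask.
constexpr bool has_flag(int mask, ProcessingFlags flag)
{
    return (mask & static_cast<int>(flag)) != 0;
}

// Example mask: treat punctuation as word breaks AND drop text that is
// obscured by images or rectangles. Each flag occupies its own bit.
constexpr int kExampleFlags = e_punct_break | e_remove_hidden_text;
```

With PDFNet itself, a combined value like `kExampleFlags` would be passed as the flags argument of `Begin()`; check the API Reference for the exact signature in your version of the SDK.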