PDFTron cannot precisely separate the PDF text and graphics layers, please help

Ryan · November 5, 2013, 7:11pm

Q:
I would like to separate the text and graphics layers of a PDF file. That is, from a single input PDF, I do some processing and output two PDF files: one with the text layer only and the other with the graphics layer only. The problem is that the text layer is not extracted precisely. Please refer to the attachments: one is the source file and the other two are the outputs. Notice that the title in the text layer looks different from the original PDF (there is too much white shading?).

A:

Good news: the latest version of PDFNet includes a new class called Flattener. If you run it in simple mode, it will do this for you automatically.

Flattener flattener;
flattener.Process(pdfdoc, Flattener::e_simple);

After this, if you run ElementReader on the document, you should get only one image per page, and all other elements will be text elements.

Note that by default some text will still be flattened if Flattener determines that this is necessary to keep the output looking like the original. You can disable this by calling:

flattener.SetThreshold(Flattener::e_keep_all);
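
Once the page contains only one image plus text elements, producing the two output files comes down to routing each element to one of two layers. The following is a minimal sketch of that routing step only. It assumes simplified stand-in types (ElementKind, PageElement, Layers, SplitLayers are all hypothetical names, not part of PDFNet); in a real program the elements would come from ElementReader and be written out with the library's own writer.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT PDFNet types.
enum class ElementKind { Text, Image, Path };

struct PageElement {
    ElementKind kind;
    std::string label;  // e.g. the text run or an image name
};

struct Layers {
    std::vector<PageElement> text;      // destined for the text-only output
    std::vector<PageElement> graphics;  // destined for the graphics-only output
};

// Route each element to exactly one layer: text elements go to the text
// layer, everything else (images, paths) goes to the graphics layer.
Layers SplitLayers(const std::vector<PageElement>& elements) {
    Layers out;
    for (const PageElement& e : elements) {
        if (e.kind == ElementKind::Text) {
            out.text.push_back(e);
        } else {
            out.graphics.push_back(e);
        }
    }
    return out;
}
```

For a flattened page holding one background image and two text runs, SplitLayers would place the two text runs in the text layer and the image in the graphics layer, preserving their original order within each layer.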