I am trying to improve our PDF text selection method. When I select columnized text in Acrobat and paste it into Word (or similar), the layout is preserved better than what I get with GetSelection().GetAsUnicode().
Is there an example of using GetAsHtml() somewhere? Or any suggestions on preserving some semblance of the original layout?
The PDF standard does not define exactly how text should be extracted, so each vendor is left to its own design. One vendor may handle a particular file “better” than another, and vice versa, but “better” can be very subjective: different people may read the same PDF in a different order (e.g. a magazine or newspaper page).
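To illustrate why the reading order is ambiguous, here is a minimal Python sketch. It is not PDFNet code and does not reflect how any particular viewer works; the fragment coordinates, the function names row_major and column_major, and the 50-unit gap threshold are all assumptions made up for the example. It takes the same four text fragments from a two-column page and orders them in two equally defensible ways.

```python
from typing import List, Tuple

# A text fragment: (text, x, y). Coordinates are hypothetical, with y growing downward.
Fragment = Tuple[str, float, float]


def row_major(fragments: List[Fragment]) -> List[str]:
    """Read straight across the page: top to bottom, then left to right."""
    return [f[0] for f in sorted(fragments, key=lambda f: (f[2], f[1]))]


def column_major(fragments: List[Fragment], gap: float = 50.0) -> List[str]:
    """Naive column detection: start a new column whenever the x position
    jumps by more than `gap`, then read each column top to bottom."""
    if not fragments:
        return []
    by_x = sorted(fragments, key=lambda f: f[1])
    columns = [[by_x[0]]]
    for frag in by_x[1:]:
        previous = columns[-1][-1]
        if frag[1] - previous[1] > gap:
            columns.append([frag])      # large horizontal jump: new column
        else:
            columns[-1].append(frag)    # same column as the previous fragment
    ordered: List[str] = []
    for column in columns:
        ordered.extend(f[0] for f in sorted(column, key=lambda f: (f[2], f[1])))
    return ordered


# Two columns (A on the left, B on the right), two lines each.
page = [("A1", 10, 10), ("B1", 300, 10), ("A2", 10, 30), ("B2", 300, 30)]

print(row_major(page))     # ['A1', 'B1', 'A2', 'B2']
print(column_major(page))  # ['A1', 'A2', 'B1', 'B2']
```

Neither output is wrong in general. The first suits a table row, the second suits newspaper columns, and a fixed gap threshold like this one breaks down quickly on real pages with uneven margins or mixed layouts.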
For more advanced column detection, please see our PDFGenie tool.