TextExtractor to Element

kenneth.cruz · July 22, 2021, 11:34am

Product: PDFNet

Product Version: 8.1

Please give a brief summary of your issue:
When using a TextExtractor object to read lines and then each word, is it possible to get the Element object(s) behind a given Word object? I don’t know which class methods to call to get from TextExtractor to an element.

shakthi124 · July 22, 2021, 8:54pm

Hi Kenneth,

To get a better understanding of your requirements, can I ask why you are looking to find the underlying element for the words that you are retrieving from TextExtractor? What sort of information are you looking to retrieve from the elements?

kenneth.cruz · July 22, 2021, 10:09pm

Thanks. I wanna get access to the marked content tied to that element.

shakthi124 · July 26, 2021, 8:05pm

Thank you for your response. In that case, the best way to do this is to use the ElementReader class to traverse the elements and find the specific text by comparing against the bounding boxes returned from the TextExtractor class.

For reference, you can also take a look at the logical structures sample if you haven’t already.
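The bounding-box matching described above can be sketched roughly as follows. This is a minimal illustration, not PDFNet code: the boxes are plain `(x1, y1, x2, y2)` tuples standing in for the rectangles you would collect from TextExtractor words and from text elements visited by ElementReader, and the names `overlap_area` and `match_words_to_elements` are hypothetical helpers. It assumes both sets of boxes are in the same coordinate space. Because the loop is driven by the words, the results stay in TextExtractor order regardless of the order in which the elements were read.

```python
from typing import List, Optional, Tuple

# (x1, y1, x2, y2) in page coordinates. A stand-in for the rectangles
# reported by the SDK, not an SDK type.
Box = Tuple[float, float, float, float]


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes, or 0.0 if they do not overlap."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def match_words_to_elements(
    word_boxes: List[Box], element_boxes: List[Box]
) -> List[Optional[int]]:
    """For each word, in the order given, return the index of the element
    whose box overlaps it most, or None if no element overlaps it."""
    matches: List[Optional[int]] = []
    for word in word_boxes:
        best_index, best_area = None, 0.0
        for index, element in enumerate(element_boxes):
            area = overlap_area(word, element)
            if area > best_area:
                best_index, best_area = index, area
        matches.append(best_index)
    return matches


# Words in reading order (hypothetical values).
words = [(10, 700, 40, 712), (45, 700, 80, 712), (10, 680, 50, 692)]
# Elements in content-stream order, which here differs from reading order.
elements = [(10, 680, 60, 692), (10, 700, 85, 712)]

matches = match_words_to_elements(words, elements)
```

One text element can cover several words, so more than one word may map to the same element index. Whatever marked-content information was recorded for that element during the ElementReader pass can then be looked up by index.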

kenneth.cruz · July 26, 2021, 10:26pm

I was thinking about that. The challenge is that iterating with the ElementReader class does not follow the same sequence as the text returned by the TextExtractor class. In our use case, we read the text with TextExtractor and would ideally read its marked content in that same sequence.