I’m using text extractor to read text from a PDF. I have some code that uses heuristics to analyze the extracted text and decide which snippets of text must be physically removed (redacted) from the PDF. How do I actually remove arbitrary text from the PDF document as it relates to the text read by the text extractor?
Is there some kind of mapping from the text extractor to the PDF element so that I can remove each particular element corresponding to the extracted text?
No, not in any straightforward way. Several layers of logic separate the low-level text graphics operators in the PDF from what TextExtractor returns to you, the client.
For instance, each letter is often its own element.
Finally, PDF page content is a stream, so there is no random access to the elements, and in the end there is no identifying information for them (you could have overlapping text, for instance).
Typically, though, using the bounding box is sufficient. You can pass the bounding boxes from the TextExtractor results to the Redactor class to redact the text under those areas (though technically speaking it's not 1:1, since overlapping text in the same area is redacted too).
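To make the bounding-box approach concrete, here is a minimal sketch in plain Python. It does not call the PDF library at all: `Word`, `overlaps`, and `plan_redactions` are hypothetical names, and it assumes you have already copied each extracted word's text, page number, and bounding box out of the TextExtractor results. It picks the boxes to redact and also lists other words that share area with those boxes, which is where the "not 1:1" caveat shows up.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Word:
    """Hypothetical record for one extracted word and its bounding box."""
    text: str
    page: int
    x1: float
    y1: float
    x2: float
    y2: float


def overlaps(a: Word, b: Word) -> bool:
    """True if the two boxes are on the same page and share any area."""
    return (a.page == b.page
            and a.x1 < b.x2 and b.x1 < a.x2
            and a.y1 < b.y2 and b.y1 < a.y2)


def plan_redactions(
    words: List[Word], should_redact: Callable[[str], bool]
) -> Tuple[List[Word], List[Word]]:
    """Return (targets, collateral).

    targets: words the heuristic flagged; their boxes are what you would
    hand to the redaction step.
    collateral: unflagged words whose boxes overlap a target box, so they
    may be removed as well when the area is redacted.
    """
    flagged = [should_redact(w.text) for w in words]
    targets = [w for w, f in zip(words, flagged) if f]
    collateral = [
        w for w, f in zip(words, flagged)
        if not f and any(overlaps(w, t) for t in targets)
    ]
    return targets, collateral


# Made-up sample data: a watermark-like word overlaps the name to redact.
words = [
    Word("Name:", 1, 10, 700, 50, 712),
    Word("Alice", 1, 55, 700, 90, 712),
    Word("DRAFT", 1, 80, 695, 140, 715),
]
targets, collateral = plan_redactions(words, lambda t: t == "Alice")
```

The boxes in `targets` are the areas you would pass on for redaction. Checking `collateral` first lets you decide whether losing the overlapping text is acceptable before anything is removed.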