Q: We have a PDF with a large area of background text that we would like to remove from the document. I’ve been looking at the Word object, trying to find a property that indicates whether a word is slanted/rotated, but have been unable to find one. When creating a PDFTron Doc object from a PDF, how would you go about excluding text like this? I have included an example PDF. Please let me know if you have any other questions.
A:
I modified the ElementReaderTest sample so that it checks for text rotation. It’s two extra lines of code, so you can easily translate it to whichever language you’re using.
def ProcessElements(reader):
    element = reader.Next()
    while element is not None:  # Read page contents
        if element.GetType() == Element.e_path:  # Process path data...
            data = element.GetPathData()
            points = data.GetPoints()
        elif element.GetType() == Element.e_text:  # Process text strings...
            ##################
            # check for text rotation
            mtx = element.GetTextMatrix()
            if mtx.m_a != 1 or mtx.m_b != 0 or mtx.m_c != 0 or mtx.m_d != 1:
                # Only text whose matrix is not the identity is printed
                data = element.GetTextString()
                print(data)
            #################
        elif element.GetType() == Element.e_form:  # Process form XObjects
            reader.FormBegin()
            ProcessElements(reader)  # Recurse into the form's content
            reader.End()
        element = reader.Next()
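Basically, all text in the content stream has a text matrix, and if the first four values of the Matrix2D object (m_a, m_b, m_c, m_d) are not 1, 0, 0, 1, then the text is transformed (rotated, scaled, or skewed). Note that pure scaling also changes m_a and m_d while m_b and m_c stay 0, so the check can match text that is scaled but not rotated. See the API documentation for the Matrix2D class if you’d like a more thorough explanation.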
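If you need to tell rotation apart from plain scaling, one option is to compute the angle from the first two matrix values. The following is a minimal sketch in plain Python and not part of the PDFTron API: `rotation_degrees` and `is_rotated` are hypothetical helper names, and it assumes the usual PDF convention that a rotation by θ is stored as a = cos θ, b = sin θ.

```python
import math


def rotation_degrees(a, b):
    """Angle in degrees encoded by the first two values (a, b) of a PDF matrix.

    Assumes the PDF convention a = s*cos(theta), b = s*sin(theta),
    so a uniform scale factor s does not affect the result.
    """
    return math.degrees(math.atan2(b, a))


def is_rotated(a, b, tol=1e-6):
    """True if the matrix rotates text by more than `tol` degrees.

    Hypothetical helper: unlike comparing against 1, 0, 0, 1, this
    ignores pure scaling such as a = 12, b = 0.
    """
    return abs(rotation_degrees(a, b)) > tol
```

In the sample above you would call it as `is_rotated(mtx.m_a, mtx.m_b)` in place of the four-way comparison. Like the sample, it looks only at the text matrix, so it is a starting point rather than a complete test.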