Removing all slanted/rotated text from a PDF

Ivanho · April 10, 2012, 1:03am

Q: We have a PDF with a large area of background text that we would like to remove from the document. I’ve been looking at the Word object, trying to find a property that indicates whether a word is slanted/rotated, but have been unable to find one. When creating a PDFTron Doc object from a PDF, how would you go about excluding text like this? I have included an example PDF. Please let me know if you have any other questions.

A:

I modified the ElementReaderTest sample so that it checks for text rotation. It’s two extra lines of code, so you can easily translate it to whichever language you’re using.

def ProcessElements(reader):
    element = reader.Next()
    while element is not None:  # Read page contents
        if element.GetType() == Element.e_path:  # Process path data…
            data = element.GetPathData()
            points = data.GetPoints()
        elif element.GetType() == Element.e_text:  # Process text strings…
            ##################
            # check for text rotation
            mtx = element.GetTextMatrix()
            if mtx.m_a != 1 or mtx.m_b != 0 or mtx.m_c != 0 or mtx.m_d != 1:
                data = element.GetTextString()
                print(data)
            #################
        elif element.GetType() == Element.e_form:  # Process form XObjects
            reader.FormBegin()
            ProcessElements(reader)
            reader.End()
        element = reader.Next()

Basically, all text in the content stream has a text matrix, and if the first four values of the Matrix2D object (a, b, c, d) are not 1, 0, 0, 1, then the text is rotated, scaled, or both. See the API for the Matrix2D class if you’d like a more thorough explanation.
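
To make the four values a bit more concrete, here is a minimal standalone sketch in plain Python (no PDFNet required). The helper names rotation_matrix, scale_matrix and differs_from_identity are hypothetical and are not part of the PDFNet API, and the tolerance of 1e-6 is an assumed value, used because matrix entries are floating-point numbers and an exact comparison with 1 or 0 can misfire. The sketch uses the standard PDF convention that a counter-clockwise rotation by an angle t has a = cos t, b = sin t, c = -sin t, d = cos t.

```python
import math

def rotation_matrix(degrees):
    """Return (a, b, c, d) for a counter-clockwise rotation (PDF convention)."""
    t = math.radians(degrees)
    return (math.cos(t), math.sin(t), -math.sin(t), math.cos(t))

def scale_matrix(sx, sy):
    """Return (a, b, c, d) for a pure scale with no rotation."""
    return (sx, 0.0, 0.0, sy)

def differs_from_identity(a, b, c, d, tol=1e-6):
    """Tolerance-based variant of the 1, 0, 0, 1 check (tol is an assumed value)."""
    return (abs(a - 1) > tol or abs(b) > tol
            or abs(c) > tol or abs(d - 1) > tol)

print(differs_from_identity(1, 0, 0, 1))            # identity -> False
print(differs_from_identity(*rotation_matrix(30)))  # rotated  -> True
print(differs_from_identity(*scale_matrix(2, 2)))   # scaled   -> True
```

In the sample above, the same test could be applied by passing mtx.m_a, mtx.m_b, mtx.m_c and mtx.m_d to differs_from_identity.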

Ryan · April 10, 2012, 5:19pm

Just to clarify, the code above prints all text that is rotated and/or scaled. If you just want to check for rotated text, use this instead (‘a’ and ‘d’ are also used in scaling, so we ignore them in this case).

# check for text rotation
mtx = element.GetTextMatrix()
if mtx.m_b != 0 or mtx.m_c != 0:
    data = element.GetTextString()
    print(data)

Finally, if you want to know the actual rotation value, see this forum posting…
https://groups.google.com/forum/?hl=en&fromgroups#!searchin/pdfnet-sdk/rotation/pdfnet-sdk/Mf0IXIrl_24/4ALYT9pHEwIJ
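
As a rough illustration of recovering the angle (not necessarily the method described in the linked posting), the rotation can be estimated from the ‘a’ and ‘b’ values with atan2. This is a sketch under stated assumptions: the matrix contains only rotation and positive scaling (no skew or mirroring), and only the text matrix is considered, so any rotation applied elsewhere on the page is ignored. The function names rotation_degrees and is_rotated are hypothetical, and the tolerance is an assumed value.

```python
import math

def rotation_degrees(a, b):
    """Estimate the counter-clockwise rotation in degrees from matrix values a and b.

    Assumes rotation plus positive scaling only (no skew, no mirroring).
    """
    return math.degrees(math.atan2(b, a))

def is_rotated(b, c, tol=1e-6):
    """Rotation-only check on b and c, with an assumed tolerance."""
    return abs(b) > tol or abs(c) > tol

# Text scaled by 2 and rotated by 45 degrees: a = 2*cos(45), b = 2*sin(45).
t = math.radians(45)
a, b, c, d = 2 * math.cos(t), 2 * math.sin(t), -2 * math.sin(t), 2 * math.cos(t)

if is_rotated(b, c):
    print(round(rotation_degrees(a, b), 3))
```

Because atan2 depends only on the ratio of b to a, a uniform scale factor does not change the result. With the element from the samples above, the call would be rotation_degrees(mtx.m_a, mtx.m_b).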