Save contents of text element in python

Spencer_Rathbun · November 15, 2011, 8:06pm

All,

I'm trying to read the contents of a PDF and extract the text. I need to save certain pieces of it for later. Going off the ElementReaderTest.py example, I have a while loop over all the pages that processes the elements and returns a list of all strings found:

elif mytype == Element.e_text:
    text.append(element.GetTextString())

Here, text was declared as a list earlier. This would work fine, except the list comes back empty. As far as I can determine, each new element destroys the string I was returned. Python treats it as a regular string just fine, so I thought that using copy or deepcopy would give me a separate version that wouldn't get wiped. But it doesn't work. All I get out is a bunch of empty lists.

Any ideas?

Spencer Rathbun

Aaron_Gravesdale · November 22, 2011, 2:08am

This is the correct behavior (as described in the API reference). ElementReader.Next() clears the previous element, which means you should copy any elements or strings you want to keep. Before going into details or alternative solutions: is there a specific reason why you are using ElementReader instead of ‘pdftron.PDF.TextExtractor’ (as shown in http://www.pdftron.com/pdfnet/samplecode.html#TextExtract)? ElementReader is low-level and is not the right choice if you are simply looking to extract text from a PDF.
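To illustrate the "copy before the next call" point in general terms, here is a minimal pure-Python sketch. FakeReader and collect are hypothetical stand-ins written for this example; they are not part of the PDFNet API and only mimic a reader that reuses one buffer for every element.

```python
class FakeReader(object):
    """Hypothetical stand-in for a reader that reuses one buffer per element."""

    def __init__(self, chunks):
        self._chunks = list(chunks)
        self._buf = bytearray()  # shared buffer, overwritten on every next()

    def next(self):
        # Clear the previous element's data before producing the next one.
        del self._buf[:]
        if not self._chunks:
            return None
        self._buf.extend(self._chunks.pop(0))
        return self._buf


def collect(reader, copy):
    """Gather every element, optionally copying it before advancing."""
    out = []
    buf = reader.next()
    while buf is not None:
        out.append(bytes(buf) if copy else buf)
        buf = reader.next()
    return out


# Without a copy, every list entry is the same (now cleared) buffer.
aliased = collect(FakeReader([b"Hello", b"World"]), copy=False)
# With a copy taken before the next call, the data survives.
copied = collect(FakeReader([b"Hello", b"World"]), copy=True)
```

The aliased list ends up holding references to one emptied buffer, which resembles the empty results described in the question; whether the real bindings behave exactly this way is an assumption.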

Aaron_Gravesdale · November 22, 2011, 2:23am

Q: OK, I tried using TextExtractor with the Python 2.x bindings to extract text from a PDF (http://www.pdftron.com/pdfnet/samplecode.html#TextExtract).

The TextExtractor text retrieval methods, such as Word.GetString(), are supposed to return Unicode according to the API reference. With the Python bindings it seems we get Python str objects. We currently do unicode(word.GetString(), 'UTF-8') in Python, but I am not sure whether this is reliable.

A: Although Word::GetString returns “const Unicode*” in the C++ API reference, in the Python 2 binding these are mapped to Python “str” objects. Those “str” objects can be used directly. You do not need to apply “unicode”, “encode”, “decode”, etc. in order to retrieve the text.

Aaron_Gravesdale · November 25, 2011, 12:07am

Q: unicode is not just a built-in function in Python 2, it is also a type!

Python 2 has two kinds of string types:

str objects are 8-bit strings
unicode objects are sequences of Unicode code points.

In Python 3 this was changed:

str objects are sequences of Unicode code points.
bytes replaces the old str object
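As a small illustration of the text-versus-bytes distinction listed above, the sketch below should run unchanged on Python 2.7 and on Python 3.3 or later. The variable names are chosen for this example only.

```python
# Text: a sequence of Unicode code points (unicode in Python 2, str in Python 3).
text = u"Gr\u00fc\u00dfe"

# Bytes: one particular encoding of that text (str in Python 2, bytes in Python 3).
data = text.encode("utf-8")

# The code-point count and the byte count differ for non-ASCII text.
n_chars = len(text)
n_bytes = len(data)

# Decoding with the same encoding recovers the original text.
round_trip = data.decode("utf-8")
```

Here the two non-ASCII characters take two bytes each in UTF-8, so the byte string is longer than the text.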

This major change is one of the reasons Python 3 still has low adoption.

The majority of production systems are on the Python 2.x branch…

To come back to our problem: by giving us Python 2 str objects via the Python 2 bindings, we get 8-bit strings without knowing their encoding. We need to output text in device-specific byte encodings. Having a unicode string, as the PDFNet API suggests, would make this as easy as:

myunicodestring.encode('UTF-16')

We cannot do this with byte strings whose encoding we don't know…

For that reason we currently decode the byte strings we get from the PDFNet SDK to unicode with the built-in function:

unicode(stringfromsdk, 'UTF-8')

But this assumes the PDFNet API always gives us UTF-8 encoded byte strings, which we are not sure about. If the PDFNet SDK always returns those byte strings in one specific encoding and you can tell us which encoding that is, we can work with that too…

Allow me another question related to this issue with font.GetName().

Is there a way to programmatically find out what encoding the returned byte string has?

Decoding it with 'utf-8' for the attached example document fails with UnicodeDecodeError…

A:

In Python 2, all Unicode strings (UStrings) are mapped to UTF-8 encoded str objects.

In Python 3, all Unicode strings (UStrings) are mapped to UTF-16 encoded str objects.
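Given the UTF-8 mapping for Python 2 described above, the decode-then-encode step from the question could look roughly like this. It is a sketch only: sdk_bytes_to_utf16 is a hypothetical helper name, the sample bytes stand in for a string returned by the SDK, and "utf-16-le" is chosen here to get an explicit byte order without a BOM (the plain 'UTF-16' codec prepends one).

```python
def sdk_bytes_to_utf16(raw):
    """Decode UTF-8 bytes strictly, then encode them as UTF-16 (little-endian)."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError if raw is not valid UTF-8
    return text.encode("utf-16-le")  # explicit byte order, no BOM


# Stand-in for a byte string returned by the SDK (assumed to be UTF-8).
sample = u"caf\u00e9".encode("utf-8")
utf16 = sdk_bytes_to_utf16(sample)
```

Decoding strictly means a byte string that is not valid UTF-8 fails loudly instead of silently producing wrong text.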

With regard to font.GetName(): the PDF specification does not indicate an encoding, so the function returns a null-terminated string (char*) instead of a Unicode string. In this case you can’t assume UTF-8 encoding, but you could treat the string as ASCII.
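One way to treat such a name as ASCII without risking a UnicodeDecodeError is to decode with the "replace" error handler. This is a sketch under that assumption; decode_font_name is a hypothetical helper and the byte strings below are made-up examples, not output from PDFNet.

```python
def decode_font_name(raw):
    """Decode a font-name byte string as ASCII, marking non-ASCII bytes as U+FFFD."""
    return raw.decode("ascii", "replace")


# Pure ASCII names decode unchanged.
plain = decode_font_name(b"Helvetica-Bold")
# A byte outside ASCII becomes the Unicode replacement character.
odd = decode_font_name(b"Font\xe9Name")
```

The replacement character makes it visible that a byte could not be interpreted, which may be preferable to guessing an encoding.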