Using 'pdftron.PDF.TextExtractor' in Ruby to extract text from PDF

Ivanho · April 30, 2012, 7:11pm

Q:
Do you have documentation (e.g. RDoc) for the Ruby API? I’m trying to use the TextExtractor and pass the e_no_ligature_exp flag to the Begin method, but it’s unclear how to pass in a Rect from Ruby. When I pass in Rect.new, no content is extracted.

When I use the GetAsText method, the returned string is encoded as ASCII-8BIT. How can I tell the library to return the string encoded as UTF-8 or UTF-16?

A:

The API for PDFNet’s other language bindings (Ruby/PHP/Python/etc.) is the same as for C/C++. As such, you can use the C/C++ documentation found here: http://www.pdftron.com/pdfnet/apiref.html. All method and class names should be identical across bindings. Additionally, we provide several examples here: http://www.pdftron.com/pdfnet/samplecode.html to help you get comfortable with the PDFNet Ruby API.

As for the TextExtractor questions, one likely reason no content was extracted when you passed Rect.new is that the default constructor creates a rectangle with zero width and zero height, which does not intersect any text on the page. Instead, specify the clip region explicitly by passing x1, y1, x2, and y2 to Rect.new(x1, y1, x2, y2). Please see this sample for more information: http://www.pdftron.com/pdfnet/samplecode/TextExtractTest.rb
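To illustrate why a zero-size clip rectangle matches nothing, here is a minimal pure-Ruby sketch. ClipRect and overlaps? are hypothetical stand-ins written for this example, not PDFNet's Rect class or its actual intersection logic, and the coordinates are made up. It assumes normalized rectangles (x1 <= x2, y1 <= y2).

```ruby
# Hypothetical stand-in for a clip rectangle; NOT PDFNet's Rect class.
ClipRect = Struct.new(:x1, :y1, :x2, :y2) do
  # True when the two rectangles share a region of non-zero area.
  # Assumes both rectangles are normalized (x1 <= x2 and y1 <= y2).
  def overlaps?(other)
    left   = [x1, other.x1].max
    bottom = [y1, other.y1].max
    right  = [x2, other.x2].min
    top    = [y2, other.y2].min
    right > left && top > bottom
  end
end

# Made-up bounding box for one line of text, in PDF points.
text_box = ClipRect.new(72, 700, 300, 712)

# A zero-width, zero-height rectangle, like the default one described above.
empty_clip = ClipRect.new(0, 0, 0, 0)

# A clip region covering a full US Letter page (612 x 792 points).
page_clip = ClipRect.new(0, 0, 612, 792)

puts "empty clip overlaps text: #{empty_clip.overlaps?(text_box)}"
puts "page clip overlaps text:  #{page_clip.overlaps?(text_box)}"
```

With PDFNet itself, the equivalent step would be to build the clip region with Rect.new(x1, y1, x2, y2) using real page coordinates before passing it to Begin.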

Finally, GetAsText does return a string whose encoding is tagged as ASCII-8BIT. If you take a closer look at the bytes, however, you will notice that they are valid UTF-8. As a temporary fix, you can safely relabel the string as UTF-8 by calling string.force_encoding("UTF-8"). Note that string.encoding only reports the current encoding; it does not change it. All our Ruby APIs return strings in UTF-8 character encoding. It is up to the user to convert them to any other desired encoding, such as UTF-16, by using string.encode.
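A minimal sketch of that relabeling step, using only Ruby's standard String methods. The raw string here is simulated rather than taken from PDFNet, and relabel_as_utf8 is a hypothetical helper name chosen for this example.

```ruby
# Hypothetical helper: relabel a binary-tagged string as UTF-8 without
# touching its bytes, and fail loudly if the bytes are not valid UTF-8.
def relabel_as_utf8(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "bytes are not valid UTF-8" unless text.valid_encoding?
  text
end

# Simulate the situation described above: UTF-8 bytes tagged as binary.
# String#b returns a copy of the string tagged ASCII-8BIT.
raw = "caf\u00E9 na\u00EFve".b

text = relabel_as_utf8(raw)
puts text.encoding.name   # name of the encoding after relabeling
puts text                 # now prints as readable text

# Converting to UTF-16 is a real transcoding step, so use encode here.
utf16 = text.encode(Encoding::UTF_16LE)
puts utf16.bytesize
```

force_encoding changes only the label, which is why it suits bytes that are already UTF-8; encode rewrites the bytes themselves and should be applied only after the string carries the correct label.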