Converting from PDF to SVG/Text/Image in a Ruby web service

Aaron_Gravesdale · January 26, 2012, 12:14am

Q: We need to be able to extract the text of the page, a PNG representation, an SVG representation and a PDF representation of every page in a given PDF document using Ruby.

I see lots of sample code for various scenarios using Ruby, but the specific PDF2SVG and PDF2Image sections don’t seem to cover Ruby samples.

Also, they need this to read in a file from Amazon S3 and process it in memory. Will that even be possible using code from your PDFDocMemory sample? Perhaps something like this:

PDFNet.Initialize

# Read a PDF document in a memory buffer.

file = StdFile.new((url_to_document_on_amazon), StdFile::E_read_mode)

file_sz = file.FileSize

file_reader = FilterReader.new(file)

mem = file_reader.Read(file_sz)

doc = PDFDoc.new(mem, file_sz)

doc.InitSecurityHandler

A:

# Read a PDF document in a memory buffer.

file = StdFile.new((url_to_document_on_amazon), StdFile::E_read_mode)

This most likely won’t work for downloading the data from an online source, since StdFile expects a local file path. Instead, you will need to use a Ruby-specific API to download the document into a memory buffer, and then use that buffer to create a PDFDoc. One way to implement this is as follows:

1. Use a technique similar to the PDFDocMemoryTest sample (http://www.pdftron.com/pdfnet/samplecode/PDFDocMemoryTest.rb) to create a PDFDoc from a memory buffer.
2. Call Convert::ToSVG on the document to convert the PDFDoc to SVG. (There is a somewhat more involved sample in http://www.pdftron.com/pdfnet/samplecode/ConvertTest.rb.)
3. Use PDFDraw as in the PDFDraw sample (http://www.pdftron.com/pdfnet/samplecode/PDFDrawTest.rb) to create a PNG file for each page. (Iterate through the pages as in example 2, but omit the “JPEG” and encoder_param arguments to output PNG.)
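
The steps above could be sketched roughly as follows. This is only an illustration under several assumptions, not tested against the SDK: the helper names (fetch_pdf_bytes, pdf_header?, page_png_path, convert_pdf_bytes) are made up for this example, the S3 object is assumed to be reachable over a public or pre-signed URL, the binding is assumed to be loadable as 'PDFNetRuby', and PDFDoc.new is assumed to accept a Ruby String the same way it accepts the buffer returned by FilterReader#Read in the PDFDocMemoryTest sample. The conversion call is written as ToSvg, which is an assumption about the binding's spelling; check ConvertTest.rb for your SDK version.

```ruby
require 'net/http'
require 'uri'

# Downloads `url` and returns the response body as a binary String.
# Follows at most `redirects_left` redirects.
def fetch_pdf_bytes(url, redirects_left = 3)
  response = Net::HTTP.get_response(URI.parse(url))
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    raise 'too many redirects' if redirects_left.zero?
    fetch_pdf_bytes(response['location'], redirects_left - 1)
  else
    raise "download failed with HTTP #{response.code}"
  end
end

# True when the buffer starts with the PDF header marker.
def pdf_header?(bytes)
  bytes.to_s.byteslice(0, 5) == '%PDF-'
end

# Output file name for one page,
# e.g. page_png_path('out/doc', 3) => "out/doc_3.png"
def page_png_path(prefix, page_number)
  "#{prefix}_#{page_number}.png"
end

# Builds a PDFDoc from the in-memory buffer, then writes SVG output
# and one PNG per page. The PDFNet binding is only loaded when this
# method is called.
def convert_pdf_bytes(bytes, out_prefix)
  raise ArgumentError, 'buffer does not look like a PDF' unless pdf_header?(bytes)

  require 'PDFNetRuby' # assumption: the binding is on $LOAD_PATH

  PDFNetRuby::PDFNet.Initialize
  # Assumption: a Ruby String is accepted as the memory buffer.
  doc = PDFNetRuby::PDFDoc.new(bytes, bytes.bytesize)
  doc.InitSecurityHandler

  # Assumption: the binding spells this ToSvg (see ConvertTest.rb).
  PDFNetRuby::Convert.ToSvg(doc, "#{out_prefix}.svg")

  draw = PDFNetRuby::PDFDraw.new
  itr = doc.GetPageIterator
  page_number = 1
  while itr.HasNext
    # Without a format argument, Export writes PNG.
    draw.Export(itr.Current, page_png_path(out_prefix, page_number))
    itr.Next
    page_number += 1
  end
  doc.Close
end

# Usage (url_to_document_on_amazon as in the question):
#   bytes = fetch_pdf_bytes(url_to_document_on_amazon)
#   convert_pdf_bytes(bytes, 'output/document')
```

Checking the header before handing the buffer to PDFNet is a cheap way to catch an S3 error page that was downloaded in place of the document.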

Aaron_Gravesdale · January 26, 2012, 11:40pm

Q: I can’t seem to get this library to load correctly in a Rails 3.1 app. Do you know what the correct procedure is with Rails? Here is what I did so far:

1. Placed the PDFNetRuby.so file into the vendor/lib directory. (I’m on a Mac and the production server is Linux, so it looks like I might need to use libPDFNetC for development. Is that correct, or can I use PDFNetRuby for both?)
2. Added the vendor/lib directory to config.autoload_paths: config.autoload_paths += %W(#{config.root}/vendor/lib)
3. Added a config/initializers file called pdf_net.rb with the require statement: require 'PDFNetRuby'
4. When I try to load the console with “rails console”, it fails with: `require': no such file to load -- PDFNetRuby (LoadError)

Is there something I am missing here? I also have include PDFNetRuby in my class file for processing the PDF, but that will fail too until the require statement works.

A: After extracting PDFNet SDK for Mac, did you read ‘readme.txt’ and run the install script?

sh setup.sh

Also, were you able to run the included samples?

null-002500431691:Samples user$ sh runall_ruby.sh

If, at this point, you get the following

…/…/…/Lib/PDFNetRuby.bundle: dlopen(…/…/…/Lib/PDFNetRuby.bundle, 9):

Library not loaded: /usr/local/lib/libruby.1.9.1.dylib (LoadError)

you need to copy libruby.1.9.1.dylib to /usr/lib/. This file should be included with your RVM Ruby installation; by default it is at pathToRVM/.rvm/rubies/ruby-1.9.2-p290/lib/libruby.1.9.1.dylib. You might not be able to see this directory in the Finder, in which case you can do the copying in the terminal with the following command:

sudo cp …/.rvm/rubies/ruby-1.9.2-p290/lib/libruby.1.9.1.dylib /usr/lib/
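
Two details in this thread depend on the platform the code runs on: the file name of the native binding (the question mentions PDFNetRuby.so, while the error above mentions PDFNetRuby.bundle) and the location of libruby. A minimal sketch of what config/initializers/pdf_net.rb might do about both is shown below. It assumes the binding sits in vendor/lib as described in the question; the helper names (pdfnet_binding_path, libruby_path, load_pdfnet) are made up for this example, and an extension mismatch is only one possible cause of the LoadError, not a confirmed diagnosis.

```ruby
require 'rbconfig'

# Path of the binding with the extension this Ruby uses for native
# libraries: "bundle" on Mac OS X, "so" on Linux.
def pdfnet_binding_path(dir, name = 'PDFNetRuby')
  File.join(dir, "#{name}.#{RbConfig::CONFIG['DLEXT']}")
end

# Where the running Ruby reports its shared libruby, i.e. the file
# that the answer suggests copying.
def libruby_path
  File.join(RbConfig::CONFIG['libdir'].to_s, RbConfig::CONFIG['LIBRUBY_SO'].to_s)
end

# Requires the binding by absolute path. Returns false and prints a
# hint when the file for this platform is not present.
def load_pdfnet(dir)
  path = pdfnet_binding_path(dir)
  unless File.exist?(path)
    warn "PDFNet binding not found at #{path}"
    warn "libruby for this Ruby: #{libruby_path}"
    return false
  end
  require path
  true
end

# In config/initializers/pdf_net.rb (Rails.root exists only inside Rails):
#   load_pdfnet(Rails.root.join('vendor', 'lib').to_s)
```

Requiring by absolute path sidesteps any question about the load path, and the printed hint shows which file name the current platform expects.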