How to Convert PDF to Excel or CSV in Rails 4

OK, after a lot of research I couldn't find an API, or even a proper piece of software, that does this in one step. Here is how I did it.

I first extract the table from the PDF into an HTML table with the pdftables API. It is cheap.

Then I convert the HTML table to CSV.

(This is not ideal but it works)

Here is the code:

require 'httmultiparty'
require 'nokogiri'
require 'csv'

class PageTextReceiver
  include HTTMultiParty

  # Upload the PDF to pdftables and save the HTML response to disk.
  def run
    File.open("/path/to/pdf/uploaded_pdf.pdf", "r") do |pdf|
      response = PageTextReceiver.post('https://pdftables.com/api?key=myapikey', :query => { f: pdf })

      File.open('/path/to/save/as/html/response.html', 'w') do |f|
        f.puts response.body
      end
    end
  end

  # Read the saved HTML and write each table row as one line of CSV.
  def convert
    doc = File.open("/path/to/saved/html/response.html") { |f| Nokogiri::HTML(f) }
    # Options are passed as keyword arguments; Ruby 3 no longer accepts a positional hash here.
    CSV.open("path/to/csv/t.csv", 'w', col_sep: ",", quote_char: "'", force_quotes: true) do |csv|
      # '//table//tr' also matches rows wrapped in <tbody>; 'th|td' keeps header cells.
      doc.xpath('//table//tr').each do |row|
        csv << row.xpath('th|td').map(&:text)
      end
    end
  end
end

Now run it like this:

#> page = PageTextReceiver.new
#> page.run
#> page.convert
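
One detail worth checking in the convert step is the quote character. The code above quotes every field with a single quote, while most spreadsheet tools, Excel included, expect double quotes. Here is a minimal sketch of the difference using only Ruby's standard csv library. The helper name rows_to_csv and the sample rows are made up for illustration.

```ruby
require 'csv'

# Hypothetical helper: turn an array of rows into a CSV string,
# quoting every field with the given quote character.
def rows_to_csv(rows, quote_char: '"')
  CSV.generate(quote_char: quote_char, force_quotes: true) do |csv|
    rows.each { |row| csv << row }
  end
end

# Made-up sample data with an apostrophe and commas inside fields.
rows = [["Name", "Amount"], ["O'Brien, Pat", "1,200"]]

puts rows_to_csv(rows)                   # fields wrapped in double quotes
puts rows_to_csv(rows, quote_char: "'")  # fields wrapped in single quotes
```

With single quotes, an apostrophe inside a field is doubled on output, which a tool expecting standard double-quoted CSV may not read back correctly. If the file is meant for Excel, leaving quote_char at its default is probably the safer choice.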

It is not refactored; it is just a proof of concept. You need to consider performance.

I might use the Sidekiq gem to run it in the background and pass the result back to the main thread.

Ruby on Rails: Plugins/gems for converting a .xls file to a .pdf file?

I don't know of anything that will do it without shelling out some cash. You might be able to roll your own by combining the roo gem with the nice swanky pdfkit gem. The roo gem would let you read the contents of the Excel file. You would then need to construct an HTML document and pass it to pdfkit, which converts HTML to PDF. That is a little indirect, but it should get the job done.
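
The middle step, turning spreadsheet rows into an HTML document, can be sketched in plain Ruby. The helper names escape_html and rows_to_html are hypothetical, and the roo and pdfkit calls in the trailing comment are an assumption about how those gems are typically used, so check each gem's README before relying on them.

```ruby
# Characters that must be escaped inside HTML text.
HTML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Hypothetical helper: escape one cell value (nil becomes an empty string).
def escape_html(value)
  value.to_s.gsub(/[&<>"]/, HTML_ESCAPES)
end

# Hypothetical helper: build a minimal HTML document containing one table
# from an array of rows, where each row is an array of cell values.
def rows_to_html(rows)
  body = rows.map do |row|
    cells = row.map { |cell| "<td>#{escape_html(cell)}</td>" }.join
    "<tr>#{cells}</tr>"
  end.join("\n")
  "<html><body><table>\n#{body}\n</table></body></html>"
end

# Made-up rows standing in for what roo would read from the spreadsheet.
rows = [["Item", "Price"], ["Nuts & bolts", 4.5], ["<none>", nil]]
puts rows_to_html(rows)

# Assumed wiring with the gems (not run here):
#   rows = Roo::Spreadsheet.open('report.xls').to_a
#   PDFKit.new(rows_to_html(rows)).to_file('report.pdf')
```

Escaping the cell values matters because spreadsheet text can contain characters such as & or <, which would otherwise break the generated HTML.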

Extracting Tables from PDF files in Ruby

You can extract data from a PDF with poppler. Depending on your exact requirements, this may be sufficient.

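require 'shellwords'

# Runs poppler's pdftotext on the given PDF.
def extract_to_text(pdf_path)
  command = ['pdftotext', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end

# Runs poppler's pdftohtml on the given PDF.
def extract_to_html(pdf_path)
  command = ['pdftohtml', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end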

These commands will extract the PDF to a text file and an HTML file, respectively, saved in the same location as your PDF.

You can install poppler on a Mac with Homebrew:

brew install poppler
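
If the goal is tabular data, one possible follow-up is to run pdftotext with its -layout flag, which keeps the physical layout of the page, and then split each line into columns. The sketch below assumes that columns are separated by at least two spaces, which is only a heuristic; the function names and the sample text are made up for illustration.

```ruby
require 'open3'

# Run pdftotext with -layout and return the text. The trailing "-" sends
# the output to stdout instead of a file. Passing arguments as a list
# bypasses the shell, so no escaping is needed.
def extract_layout_text(pdf_path)
  text, status = Open3.capture2('pdftotext', '-layout', pdf_path, '-')
  raise "pdftotext failed for #{pdf_path}" unless status.success?
  text
end

# Hypothetical heuristic: drop blank lines, then treat every run of two
# or more whitespace characters as a gap between columns.
def layout_text_to_rows(text)
  text.each_line.map(&:strip).reject(&:empty?).map { |line| line.split(/\s{2,}/) }
end

# Made-up sample of what layout-preserving output might look like.
sample = "Item        Qty    Price\nWidget A      2     9.99\n\nGadget       10    19.50\n"
layout_text_to_rows(sample).each { |row| puts row.inspect }
```

This breaks down when a cell is empty, when a cell itself contains a double space, or when two columns are separated by a single space, so treat it as a starting point rather than a general table parser.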

