How to convert PDF to Excel or CSV in Rails 4
OK, after a lot of research I couldn't find an API or any proper software that does this directly. Here is how I did it.
I first extract the table from the PDF into an HTML table with the pdftables API. It is cheap.
Then I convert the HTML table to CSV.
(This is not ideal, but it works.)
Here is the code:
require 'httmultiparty'
require 'nokogiri'
require 'csv'

class PageTextReceiver
  include HTTMultiParty
  base_uri 'http://localhost:3000'

  # Upload the PDF to pdftables and save the returned HTML table to disk.
  def run
    response = PageTextReceiver.post('https://pdftables.com/api?key=myapikey', :query => { f: File.new("/path/to/pdf/uploaded_pdf.pdf", "r") })
    File.open('/path/to/save/as/html/response.html', 'w') do |f|
      f.puts response
    end
  end

  # Parse the HTML file written by run and write one CSV row per table row.
  def convert
    # The block form closes the file handle once the document is parsed.
    doc = File.open('/path/to/save/as/html/response.html') { |f| Nokogiri::HTML(f) }
    CSV.open("path/to/csv/t.csv", 'w', col_sep: ",", quote_char: '\'', force_quotes: true) do |csv|
      doc.xpath('//table/tr').each do |row|
        # Only <td> cells are collected, as in the original version.
        csv << row.xpath('td').map(&:text)
      end
    end
  end
end
Now run it like this:
#> page = PageTextReceiver.new
#> page.run
#> page.convert
It is not refactored; it is just a proof of concept. You still need to consider performance.
I might use the Sidekiq gem to run it in the background and hand the result back to the main thread.
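To see what the CSV options passed in convert actually produce, here is a minimal sketch that uses only Ruby's standard csv library and writes to a string instead of a file. The sample rows are made up for illustration. With quote_char set to a single quote and force_quotes enabled, every field is wrapped in single quotes, and a single quote inside a field is doubled.

```ruby
require 'csv'

# Hypothetical sample rows standing in for the cells pulled out of the HTML table.
rows = [
  ["Item", "Price"],
  ["Widget, large", "9.99"],
  ["Bob's gadget", "12.50"]
]

# Same options as the CSV.open call in convert, but generating a string.
csv_text = CSV.generate(col_sep: ",", quote_char: "'", force_quotes: true) do |csv|
  rows.each { |row| csv << row }
end

puts csv_text
```

Many spreadsheet tools assume a double quote as the quote character, so if the file is meant for Excel it may be simpler to drop the quote_char option and keep the default.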
Ruby on Rails: Plugins/gems for converting a .xls file to a .pdf file?
I don't know of anything that will do it without shelling out some cash. You might be able to roll your own by combining the roo gem with the pdfkit gem. The roo gem lets you read the contents of the Excel file. You would then construct an HTML document from those contents and pass it to pdfkit, which converts HTML to PDF. That is a little indirect, but it should get the job done.
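A rough sketch of that roo plus pdfkit route might look like the following. The method names rows_to_html and xls_to_pdf are hypothetical, the sketch reads only the first sheet, and it assumes the sheet is not empty. It also assumes that the roo-xls add-on is installed for the legacy .xls format and that wkhtmltopdf, which pdfkit relies on, is available on the machine. The gems are required inside xls_to_pdf so the HTML helper can be tried on its own.

```ruby
require 'cgi'

# Build a bare HTML table from an array of rows (each row is an array of cells).
def rows_to_html(rows)
  body = rows.map do |row|
    cells = row.map { |cell| "<td>#{CGI.escapeHTML(cell.to_s)}</td>" }.join
    "<tr>#{cells}</tr>"
  end.join("\n")
  "<html><body><table>\n#{body}\n</table></body></html>"
end

# Read the first sheet with roo, then let pdfkit render the HTML to a PDF file.
def xls_to_pdf(xls_path, pdf_path)
  require 'roo'     # assumption: roo (plus roo-xls for .xls files) is installed
  require 'pdfkit'  # assumption: pdfkit and wkhtmltopdf are installed

  sheet = Roo::Spreadsheet.open(xls_path).sheet(0)
  rows  = (sheet.first_row..sheet.last_row).map { |i| sheet.row(i) }
  PDFKit.new(rows_to_html(rows)).to_file(pdf_path)
end
```

Calling xls_to_pdf('/path/to/book.xls', '/path/to/book.pdf') would then write the PDF. Any styling has to be added to the generated HTML, since the spreadsheet's own formatting is not carried over.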
Extracting Tables from PDF files in Ruby
You can extract data from a PDF with Poppler. Depending on your exact requirements, this may be sufficient.
require 'shellwords'

# Writes a .txt file next to the PDF.
def extract_to_text(pdf_path)
  command = ['pdftotext', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end

# Writes HTML output next to the PDF.
def extract_to_html(pdf_path)
  command = ['pdftohtml', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end
These commands will extract the PDF to a text file and an HTML file, respectively, saved in the same location as your PDF.
You can install Poppler on a Mac with Homebrew:
brew install poppler
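If the goal is table rows rather than raw text, one possible follow-up is to run pdftotext with its -layout option, which tries to keep the physical column alignment, and split each line on runs of spaces. This is only a sketch: the method names are hypothetical, and treating two or more spaces as a column separator is a heuristic that will break on cells containing wide gaps or on tables with empty cells.

```ruby
require 'open3'

# Run pdftotext with -layout and send the text to stdout ("-") instead of a file.
# Passing arguments separately avoids the shell, so no escaping is needed.
def pdf_layout_text(pdf_path)
  text, status = Open3.capture2('pdftotext', '-layout', pdf_path, '-')
  raise "pdftotext failed for #{pdf_path}" unless status.success?
  text
end

# Heuristic (an assumption): columns are separated by two or more whitespace characters.
def split_columns(line)
  line.strip.split(/\s{2,}/)
end

# Turn the extracted text into an array of rows, skipping blank lines.
def text_to_rows(text)
  text.each_line.map { |line| split_columns(line) }.reject(&:empty?)
end
```

With Poppler installed, text_to_rows(pdf_layout_text('/path/to/file.pdf')) returns an array of arrays that can be written out with the csv library.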