Rails 3, Check CSV File Encoding Before Import

You can use Charlock Holmes, a character encoding detecting library for Ruby.


To use it, read the file and pass its contents to the detect method.

contents = File.read('test.xml')
detection = CharlockHolmes::EncodingDetector.detect(contents)
# => {:encoding => 'UTF-8', :confidence => 100, :type => :text}

You can also convert the encoding to UTF-8 if it is not in the correct format:

utf8_encoded_content = CharlockHolmes::Converter.convert contents, detection[:encoding], 'UTF-8'

This saves users from having to convert the file themselves and upload it again.
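If adding a gem is not an option, a much cruder check is possible with plain Ruby. The sketch below is not Charlock Holmes and does not detect the encoding; it only reports whether the raw bytes are valid UTF-8, which is often enough to flag an upload before import. The helper name `valid_utf8?` is made up for this example.

```ruby
# Hypothetical helper (not part of Charlock Holmes): returns true when the
# given raw bytes form a valid UTF-8 string, false otherwise.
def valid_utf8?(bytes)
  # dup so the caller's string keeps its original encoding tag
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end
```

You would call it as `valid_utf8?(File.binread('test.xml'))`. Keep in mind that a file in another encoding can occasionally be valid UTF-8 by chance, so treat a `true` result as "probably fine" rather than proof.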

Before Action on Import from CSV

You don't need a before action.

You need a pre-processor; well, actually, you need to pre-process the data yourself.

Your CSV comes with columns: column 0, 1, 2, 3, etc. (since you don't use headers).

For the sake of the example, let's say your text columns are columns 1, 3, and 5.

def self.import(file)
  text_cols = [1, 3, 5] # for example
  SmarterCSV.process(file.path) do |row|
    text_cols.each do |column|
      # ... pre-process row[column] here
    end
  end
end

Or simply, for your particular case:

def self.import(file)
  SmarterCSV.process(file.path) do |row|
    # ... pre-process the row here
  end
end
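To make the idea concrete, here is a minimal sketch that uses the standard csv library instead of SmarterCSV. The answer does not say what the pre-processing should be, so the `clean_text` step (stripping surrounding whitespace) is purely an assumption, and both helper names are made up for illustration.

```ruby
require "csv"

TEXT_COLS = [1, 3, 5].freeze # example column indexes, as in the answer

# Hypothetical clean-up step; replace with whatever your text columns need.
def clean_text(value)
  value.to_s.strip
end

# Parses headerless CSV text and pre-processes only the text columns.
def preprocess_rows(csv_text, text_cols: TEXT_COLS)
  CSV.parse(csv_text).map do |row|
    # Skip indexes that fall outside a short row
    text_cols.each { |i| row[i] = clean_text(row[i]) if i < row.length }
    row
  end
end
```

The rows that come back can then be handed to whatever creates your records.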

Ruby/Rails CSV parsing, invalid byte sequence in UTF-8

You need to tell Ruby that the file is in ISO-8859-1. Change your file open line to this:

file = File.open("input_file", "r:ISO-8859-1")

The second argument tells Ruby to open the file read-only and to treat its contents as ISO-8859-1.
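If you also want the strings converted to UTF-8 as you read them, the mode string can carry a second encoding after another colon. A small sketch, assuming the file really is ISO-8859-1 (the method name `read_latin1_lines` is just for illustration):

```ruby
# Reads a file assumed to be ISO-8859-1 and returns its lines as UTF-8 strings.
def read_latin1_lines(path)
  # "r:ISO-8859-1:UTF-8" = read-only, external encoding ISO-8859-1,
  # transcoded to UTF-8 on the way in
  File.open(path, "r:ISO-8859-1:UTF-8") do |file|
    file.each_line.map(&:chomp)
  end
end
```

Every string you get back is tagged as UTF-8, so later string operations or database writes should not trip over "invalid byte sequence" errors caused by the Latin-1 bytes.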

CSV importing in Rails - invalid byte sequence in UTF-8 with non-English characters

I solved it with a different approach. This is a much easier way to import CSV files into a Rails 3 model than using an external gem:

require 'csv'

CSV.foreach('doc/socios_full.csv') do |row|
  record = Associate.new(
    :media_format => row[0],
    :group => row[0],
    :member => row[1],
    :family_relationship_code => row[2],
    :family_relationship_description => row[3],
    :last_name => row[4],
    :names => row[5]
    # ... remaining attributes
  )
  record.save
end

It works flawlessly, even with non-English characters (I just tried it on a 75k import file!). Hope it's helpful for someone.

How to read data from a CSV file of two possible encodings?

Once you know what encoding your file has, you can pass it in the CSV options, e.g.:

external_encoding: Encoding::ISO_8859_15, 
internal_encoding: Encoding::UTF_8

(This establishes that the file is ISO-8859-15, but that you want the strings internally as UTF-8.)

So the strategy is to decide first (before opening the file) which encoding you want, and then use the appropriate options Hash.
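A rough sketch of that strategy, assuming the two candidates are UTF-8 and ISO-8859-15 and that "valid UTF-8 bytes" is a good enough test to tell them apart. The helper names are hypothetical, and the options use the `encoding: "external:internal"` string form accepted by `CSV.read` rather than the two separate keys:

```ruby
require "csv"

# One options Hash per expected encoding (the two candidates are an assumption).
CSV_OPTIONS = {
  Encoding::UTF_8       => { encoding: "UTF-8" },
  Encoding::ISO_8859_15 => { encoding: "ISO-8859-15:UTF-8" }
}.freeze

# Hypothetical helper: decide the encoding before the CSV is opened.
# Valid UTF-8 bytes are taken as UTF-8; anything else as ISO-8859-15.
def pick_encoding(path)
  bytes = File.binread(path)
  if bytes.force_encoding(Encoding::UTF_8).valid_encoding?
    Encoding::UTF_8
  else
    Encoding::ISO_8859_15
  end
end

# Reads all rows; strings come back as UTF-8 in both cases.
def read_rows_as_utf8(path)
  CSV.read(path, **CSV_OPTIONS.fetch(pick_encoding(path)))
end
```

This reads the file twice (once to check, once to parse), which is fine for typical uploads but may be worth revisiting for very large files.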
