Ruby read CSV file as UTF-8 and/or convert ASCII-8Bit encoding to UTF-8
deceze is right, that is ISO8859-1 (AKA Latin-1) encoded text. Try this:
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1")
And if that doesn't work, you can use Iconv to fix up the individual strings with something like this:
require 'iconv'
utf8_string = Iconv.iconv('utf-8', 'iso8859-1', latin1_string).first
If latin1_string is "Non sp\xE9cifi\xE9", then utf8_string will be "Non spécifié". Also, Iconv.iconv can unmangle whole arrays at a time:
utf8_strings = Iconv.iconv('utf-8', 'iso8859-1', *latin1_strings)
With newer Rubies, you can do things like this:
utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8')
where latin1_string thinks it is in ASCII-8BIT but is really in ISO-8859-1.
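As a rough sketch of the force_encoding/encode approach above, covering both a single string and a whole array without Iconv. The helper name latin1_to_utf8 and the sample strings are made up for illustration, and the sketch assumes the bytes really are ISO-8859-1:

```ruby
# Hypothetical helper: relabel the bytes as ISO-8859-1, then transcode to UTF-8.
def latin1_to_utf8(str)
  # dup first, because force_encoding changes its receiver in place.
  str.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

latin1_string = "Non sp\xE9cifi\xE9".b # .b tags the bytes as ASCII-8BIT
utf8_string = latin1_to_utf8(latin1_string)
puts utf8_string

# The whole-array case, without Iconv.
latin1_strings = ["caf\xE9".b, "na\xEFve".b]
utf8_strings = latin1_strings.map { |s| latin1_to_utf8(s) }
puts utf8_strings.inspect
```

The dup is only there so the caller's string keeps its original encoding label; drop it if mutating the input is fine.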
Read a CSV file containing special characters (different spoken languages)
require 'open-uri'
URI.open(csv_aws_url).read.force_encoding('utf-8')
When we import CSV data, how to eliminate invalid byte sequences in UTF-8
Ruby 1.9's CSV has a new parser that works with m17n. The parser works with the Encoding of the IO object or String it reads. The methods ::foreach, ::open, ::read, and ::readlines take an optional :encoding option, with which you can specify the Encoding.
For example:
CSV.read('/path/to/file', :encoding => 'windows-1251:utf-8')
Would convert all strings to UTF-8.
You can also use the more standard encoding name 'ISO-8859-1':
CSV.read('/..', {:headers => true, :col_sep => ';', :encoding => 'ISO-8859-1'})
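Here is a small self-contained sketch of the "external:internal" form of the :encoding option, assuming the file really is ISO-8859-1. The helper name read_latin1_csv, the column names and the sample data are invented for the example:

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: read a Latin-1 CSV and get UTF-8 strings back.
# "ISO-8859-1:UTF-8" means external encoding Latin-1, internal encoding UTF-8.
def read_latin1_csv(path, col_sep: ';')
  CSV.read(path, headers: true, col_sep: col_sep, encoding: 'ISO-8859-1:UTF-8')
end

# Build a small Latin-1 sample file so the example runs on its own.
Tempfile.create(['latin1', '.csv']) do |file|
  file.binmode
  file.write("name;status\nAndr\xE9;Non sp\xE9cifi\xE9\n".b)
  file.flush

  table = read_latin1_csv(file.path)
  row = table.first
  puts row['name']
  puts row['status'].encoding
end
```

Passing only 'ISO-8859-1' reads the file correctly but leaves the strings tagged as ISO-8859-1; the ":UTF-8" part is what asks for transcoding.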
Rails parse upload file \xDE from ASCII-8BIT to UTF-8
You need to start from a known encoding with valid content (and compatible characters for input and output) before you will be able to successfully convert a string.
ASCII-8BIT doesn't assign Unicode-compatible characters to values 128..255, so bytes in that range cannot be converted to Unicode.
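A quick way to see this for yourself, as a sketch (the helper name convertible_to_utf8? is made up for illustration):

```ruby
# Hypothetical helper: true if the string can be transcoded to UTF-8 as labeled.
def convertible_to_utf8?(bytes)
  bytes.encode('UTF-8')
  true
rescue Encoding::UndefinedConversionError
  false
end

puts convertible_to_utf8?("abc".b)  # only bytes below 128
puts convertible_to_utf8?("\xDE".b) # a byte in 128..255, still tagged ASCII-8BIT
```

Only the second call hits the rescue branch, because ASCII-8BIT has no character assigned to byte 0xDE.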
The chances are that the input - as you say it is text - is in some other encoding to start with. You could start by assuming ISO-8859-1 ("Latin-1"), which is quite a common encoding, although you may have some other clue, or know what characters to expect in the file, in which case you should try others.
I suggest you try something like this:
file_data = params[:import_file].tempfile.read.force_encoding('ISO-8859-1')
utf8_file_data = file_data.encode('UTF-8')
This probably will not give you an error, but if my guess at 'ISO-8859-1' is wrong, it will give you gibberish unfortunately.
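To illustrate what a wrong guess looks like, here is a sketch where the bytes are actually UTF-8 but get decoded as ISO-8859-1. The helper name decode_as and the sample bytes are assumptions for the example:

```ruby
# Hypothetical helper: label the bytes with a guessed encoding, then transcode to UTF-8.
def decode_as(bytes, guessed_encoding)
  bytes.dup.force_encoding(guessed_encoding).encode('UTF-8')
end

raw = "sp\xC3\xA9cifi\xC3\xA9".b # these bytes are really UTF-8

puts decode_as(raw, 'UTF-8')      # right guess: readable text
puts decode_as(raw, 'ISO-8859-1') # wrong guess: no error, but each accented letter becomes two characters
```

Because every byte value is a valid ISO-8859-1 character, the wrong guess never raises; the only symptom is the garbled output.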
Perl CSV reading characters that aren't there
Your file is not a valid UTF-8 file. Byte E9 appears where it's not expected.
Followed by two continuation bytes = ok
$ perl -M5.010 -MEncode=decode -e'
decode("UTF-8", "\xE9\xBF\xBF", Encode::FB_WARN | Encode::LEAVE_SRC);
say "done";
'
done
Not followed by two continuation bytes = bad
$ perl -M5.010 -MEncode=decode -e'
decode("UTF-8", "\xE9\x41", Encode::FB_WARN | Encode::LEAVE_SRC);
say "done";
'
UTF-8 "\xE9" does not map to Unicode at -e line 2.
done
Fix your bad data.
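For comparison, roughly the same check in Ruby, the language used in the other answers on this page. This is only a sketch: the helper name utf8_report is made up, and String#scrub needs Ruby 2.1 or later.

```ruby
# Hypothetical helper: report whether the bytes are valid UTF-8, and
# what they look like with each invalid sequence replaced by "?".
def utf8_report(bytes)
  s = bytes.dup.force_encoding('UTF-8')
  { valid: s.valid_encoding?, scrubbed: s.scrub('?') }
end

p utf8_report("\xE9\xBF\xBF".b) # E9 followed by two continuation bytes
p utf8_report("\xE9\x41".b)     # E9 followed by a plain "A"
```

Scrubbing throws information away, so it is a last resort; if the data is really in another encoding such as ISO-8859-1, transcoding from that encoding keeps the characters.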