Ruby read CSV file as UTF-8 and/or convert ASCII-8BIT encoding to UTF-8

deceze is right; that is ISO-8859-1 (a.k.a. Latin-1) encoded text. Try this:

file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1")

And if that doesn't work, you can use Iconv to fix up the individual strings with something like this (note that Iconv was deprecated in Ruby 1.9.3 and later removed from the standard library):

require 'iconv'
utf8_string = Iconv.iconv('utf-8', 'iso8859-1', latin1_string).first

If latin1_string is "Non sp\xE9cifi\xE9", then utf8_string will be "Non spécifié". Also, Iconv.iconv can unmangle whole arrays at a time:

utf8_strings = Iconv.iconv('utf-8', 'iso8859-1', *latin1_strings)

With newer Rubies, you can do things like this:

utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8')

where latin1_string thinks it is in ASCII-8BIT but is really in ISO-8859-1.
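
As a concrete illustration, here is a minimal sketch of that relabel-then-transcode step; the byte string is a made-up Latin-1 example ("Non spécifié"):

# Ruby tags this literal ASCII-8BIT because \xE9 is not valid UTF-8.
latin1_string = "Non sp\xE9cifi\xE9"
latin1_string.encoding            # => #<Encoding:ASCII-8BIT>

# force_encoding only relabels the bytes; encode actually transcodes them.
utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8')
utf8_string                       # => "Non spécifié"
utf8_string.encoding              # => #<Encoding:UTF-8>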

Read a CSV file containing special characters (different spoken language)

open(csv_aws_url).read.force_encoding('utf-8')
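
That one-liner assumes open-uri has already been required. A slightly fuller, hedged sketch that also parses the result (csv_aws_url is whatever URL string you already have; on Ruby 2.5+ URI.open is the preferred spelling of open-uri's open):

require 'open-uri'
require 'csv'

# The downloaded body may come back tagged ASCII-8BIT, so relabel it as UTF-8.
raw  = URI.open(csv_aws_url).read
rows = CSV.parse(raw.force_encoding('utf-8'), headers: true)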

When we import CSV data, how do we eliminate "invalid byte sequence in UTF-8"?

Ruby 1.9's CSV library has a new parser that works with m17n (multilingualization). The parser works in the Encoding of the IO or String object it reads from. The methods ::foreach, ::open, ::read, and ::readlines take an optional :encoding option in which you can specify the Encoding.

For example:

CSV.read('/path/to/file', :encoding => 'windows-1251:utf-8')

This would read the file as Windows-1251 and convert all strings to UTF-8; the value is an external:internal encoding pair.

You can also use the more standard encoding name 'ISO-8859-1':

CSV.read('/..', {:headers => true, :col_sep => ';', :encoding => 'ISO-8859-1'})
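
If you cannot identify the real source encoding and simply want to get rid of the offending bytes, String#scrub (Ruby 2.1+) is another option. A hedged sketch, with a placeholder path:

require 'csv'

raw  = File.read('/path/to/file', encoding: 'utf-8')
rows = CSV.parse(raw.scrub(''), headers: true)   # scrub('') drops invalid bytes outright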

Rails: parse uploaded file \xDE from ASCII-8BIT to UTF-8

You need to start from a known encoding with valid content (and compatible characters for input and output) before you will be able to successfully convert a string.

ASCII-8BIT doesn't assign Unicode-compatible characters to byte values 128..255, so it cannot be converted to Unicode directly.

The chances are that the input, which you say is text, is in some other encoding to start with. You could start by assuming ISO-8859-1 ("Latin-1"), which is quite a common encoding. You may have some other clue, or know what characters to expect in the file, in which case you should try other encodings.

I suggest you try something like this:

file_data = params[:import_file].tempfile.read.force_encoding('ISO-8859-1')
utf8_file_data = file_data.encode('UTF-8')

This probably will not raise an error, but if my guess of 'ISO-8859-1' is wrong, it will unfortunately give you gibberish.
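
One way to sanity-check the guess before committing to it is to try a few candidate encodings and keep the first one that yields a valid string; the candidate list below is only an assumption:

raw = params[:import_file].tempfile.read

# ISO-8859-1 accepts every byte value, so it always validates and acts as the
# catch-all; validity alone cannot tell it apart from Windows-1252.
candidates = ['UTF-8', 'Windows-1252', 'ISO-8859-1']
guess = candidates.find { |enc| raw.dup.force_encoding(enc).valid_encoding? }

utf8_file_data = raw.force_encoding(guess).encode('UTF-8') if guess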

Perl CSV reading characters that aren't there

Your file is not a valid UTF-8 file. Byte E9 appears where it's not expected.

Followed by two continuation bytes = ok

$ perl -M5.010 -MEncode=decode -e'
decode("UTF-8", "\xE9\xBF\xBF", Encode::FB_WARN | Encode::LEAVE_SRC);
say "done";
'
done

Not followed by two continuation bytes = bad

$ perl -M5.010 -MEncode=decode -e'
decode("UTF-8", "\xE9\x41", Encode::FB_WARN | Encode::LEAVE_SRC);
say "done";
'
UTF-8 "\xE9" does not map to Unicode at -e line 2.
done

Fix your bad data.
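
If you want to see which byte sequences are actually invalid before repairing the file, the same check can be done from Ruby with String#scrub and a block; a hedged sketch with a placeholder file name:

raw = File.binread('file.csv').force_encoding('UTF-8')

bad = []
raw.scrub { |bytes| bad << bytes.unpack1('H*'); '' }   # collect each invalid sequence as hex
puts "invalid byte sequences: #{bad.uniq.inspect}" unless bad.empty?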


