When Importing CSV Data, How to Eliminate "Invalid Byte Sequence in UTF-8"

When we import CSV data, how do we eliminate "invalid byte sequence in UTF-8"?

Ruby 1.9's CSV has a new parser that works with m17n. The parser works in the Encoding of the IO or String object being read from or written to. The methods ::foreach, ::open, ::read, and ::readlines take an optional :encoding option in which you can specify the Encoding.

For example:

CSV.read('/path/to/file', :encoding => 'windows-1251:utf-8')

This would read the file as Windows-1251 and convert all strings to UTF-8.

You can also use the more standard encoding name 'ISO-8859-1':

CSV.read('/..', {:headers => true, :col_sep => ';', :encoding => 'ISO-8859-1'})

Rails Import CSV Error: invalid byte sequence in UTF-8

Specify the encoding with the encoding option:

CSV.foreach(file.path, headers: true, encoding: 'iso-8859-1:utf-8') do |row|
# your code here
end
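
For illustration, here is a minimal sketch of what the full loop might look like; 'data.csv' and the 'name' and 'city' headers are made up. The 'iso-8859-1:utf-8' pair means each field is transcoded to UTF-8 before your block ever sees it:

require 'csv'

CSV.foreach('data.csv', headers: true, encoding: 'iso-8859-1:utf-8') do |row|
  # Every field has already been transcoded to UTF-8 at this point.
  puts row.to_h             # e.g. {"name" => "Müller", "city" => "Köln"}
  puts row['name'].encoding # => UTF-8
end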

Ruby/Rails CSV parsing, invalid byte sequence in UTF-8

You need to tell Ruby that the file is in ISO-8859-1. Change your file open line to this:

file = File.open("input_file", "r:ISO-8859-1")

The second argument tells Ruby to open the file read-only with the external encoding ISO-8859-1.
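
Ruby also accepts a second encoding in the mode string ("external:internal"), which transcodes while reading. A minimal sketch, assuming the same input_file:

# "r:ISO-8859-1:UTF-8" reads the file as ISO-8859-1 and transcodes
# each string to UTF-8 on the fly.
file = File.open("input_file", "r:ISO-8859-1:UTF-8")
contents = file.read
puts contents.encoding  # => UTF-8
file.close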

invalid byte sequence for encoding UTF8

If you need to store UTF-8 data in your database, you need a database whose encoding accepts UTF-8. You can check the encoding of your database in pgAdmin: right-click the database and select "Properties".
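
If you prefer a query to clicking through pgAdmin, a small sketch using the Ruby pg gem shows the same information (the connection details are placeholders):

require 'pg'

conn = PG.connect(dbname: 'your_db')  # placeholder connection details
# SHOW server_encoding reports the encoding the database was created with.
puts conn.exec('SHOW server_encoding').getvalue(0, 0)  # e.g. "UTF8"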

But that error seems to be telling you there's some invalid UTF8 data in your source file. That means that the copy utility has detected or guessed that you're feeding it a UTF8 file.
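
If you want to pinpoint the bad bytes rather than guess, here is a minimal Ruby sketch (the filename is a placeholder) that flags every line that is not valid UTF-8:

# Read raw bytes, label each line as UTF-8, and report the ones
# whose bytes do not form valid UTF-8 sequences.
File.open('yourfilename', 'rb') do |f|
  f.each_line.with_index(1) do |line, lineno|
    line.force_encoding('UTF-8')
    puts "invalid UTF-8 on line #{lineno}" unless line.valid_encoding?
  end
end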

If you're running under some variant of Unix, you can check the encoding (more or less) with the file utility.

$ file yourfilename
yourfilename: UTF-8 Unicode English text

(I think that will work on Macs in the terminal, too.) Not sure how to do that under Windows.

If you use that same utility on a file that came from a Windows system (that is, a file that's not encoded in UTF8), it will probably show something like this:

$ file yourfilename
yourfilename: ASCII text, with CRLF line terminators

If things stay weird, you might try to convert your input data to a known encoding, to change your client's encoding, or both. (We're really stretching the limits of my knowledge about encodings.)

You can use the iconv utility to change the encoding of the input data.

iconv -f original_charset -t utf-8 originalfile > newfile

You can change the psql (client) encoding by following the instructions in the PostgreSQL documentation on Character Set Support. On that page, search for the phrase "To enable automatic character set conversion".
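
The SQL-level equivalent described on that page is SET client_encoding. A minimal sketch from Ruby, again with placeholder connection details:

require 'pg'

conn = PG.connect(dbname: 'your_db')  # placeholder connection details
# Declare that the bytes this client sends are LATIN1; the server
# converts them to the database encoding automatically.
conn.exec("SET client_encoding TO 'LATIN1'")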

Invalid byte sequence importing CSV created with R to Postgres

From my comment:

write.csv(df, out_file, fileEncoding = "UTF-8")
# or open a connection with the encoding set:
# con <- file(out_file, encoding = "UTF-8"); write.csv(df, con)

Either of the above will work: pass fileEncoding directly to write.csv, or set the encoding when the connection is created.

PostgreSQL invalid byte sequence for encoding utf8 0xbf

Your COPY statement is correct, but your data are not in UTF8 encoding.

They are probably in Latin-1 or Windows-1252, where 0xBF is ¿.

Specify the encoding correctly, e.g.:

COPY edmonton.general_filtered (descriptive)
FROM 'D:/property_own/descriptive_details.csv'
(FORMAT 'csv', HEADER, ENCODING 'WIN1252');

invalid byte sequence for encoding “UTF8”

As posted in another thread: use the iconv command to strip these characters out of your file. A Greenplum database is initialized with a character set, UTF-8 by default, and requires that all characters be in the designated character set. You can also choose to log these errors with the LOG ERRORS clause of the EXTERNAL TABLE; this traps the bad rows and allows the load to continue up to the reject LIMIT that you specify when creating the table.

iconv -f utf-8 -t utf-8 -c file.txt

will clean up your UTF-8 file, skipping all the invalid characters.

-f is the source encoding
-t is the target encoding
-c skips any invalid sequences
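
If you would rather do the same cleanup without leaving Ruby, String#scrub (Ruby 2.1+) is the closest analogue of iconv's -c flag. A minimal sketch, assuming the same file.txt:

# Read the raw bytes, label them UTF-8, and drop any sequences
# that are not valid: the Ruby analogue of iconv -f utf-8 -t utf-8 -c.
raw = File.read('file.txt', mode: 'rb').force_encoding('UTF-8')
File.write('file_clean.txt', raw.scrub(''))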

