Ruby 'split': Invalid Byte Sequence in UTF-8 (ArgumentError)

Invalid byte sequence in UTF-8 (ArgumentError)

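Your string is probably not valid UTF-8. Check it with valid_encoding? and, if the check fails, replace the invalid bytes by round-tripping through UTF-16BE:

s = file_content
unless s.valid_encoding?
  # Round-trip through UTF-16BE so that invalid bytes are replaced with "?"
  s = s.encode("UTF-16BE", invalid: :replace, replace: "?").encode("UTF-8")
end
s.gsub(/dr/i, 'med')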

See "Ruby 2.0.0 String#Match ArgumentError: invalid byte sequence in UTF-8".
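As a rough sketch of how the same check could guard a split call, the snippet below wraps the round trip in a helper. The method name to_valid_utf8 and the sample string are made up for illustration; only valid_encoding? and encode come from the answer above.

```ruby
# Hypothetical helper (not part of Ruby): returns a valid UTF-8 copy of +str+,
# replacing bytes that cannot be decoded with +replacement+.
def to_valid_utf8(str, replacement = "?")
  return str if str.valid_encoding?

  # Round trip through UTF-16BE so that invalid bytes are actually replaced.
  str.encode("UTF-16BE", invalid: :replace, replace: replacement).encode("UTF-8")
end

# Sample input: the lone \xC3 byte before the second comma is not valid UTF-8.
raw = "Dr Smith,caf\xC3,Paris"

fields = to_valid_utf8(raw).split(",")
puts fields.inspect
```

Strings that are already valid are returned unchanged, so the helper can be applied to every input without checking first.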

Invalid Byte Sequence In UTF-8 Ruby

As Arie already answered, this error occurs because the string contains the invalid byte sequence \xC3.

If you are using Ruby 2.1+, you can also use String#scrub to replace invalid bytes with a given replacement character. For example:

a = "abce\xC3"
# => "abce\xC3"
a.scrub
# => "abce�"
a.scrub.sub("a","A")
# => "Abce�"
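String#scrub also accepts an explicit replacement string or a block. The following is a small sketch of those variants using the same sample string; the variable names are only for illustration.

```ruby
broken = "abce\xC3"

# Replace each invalid byte sequence with a chosen string.
question_mark = broken.scrub("?")

# An empty replacement drops the invalid bytes entirely.
dropped = broken.scrub("")

# A block receives the invalid bytes and returns their replacement,
# here a hex dump so the bad byte stays visible.
tagged = broken.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

puts question_mark, dropped, tagged
```

The block form can be handy while debugging, since it shows which bytes were rejected instead of hiding them behind a placeholder.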

`scan': invalid byte sequence in UTF-8 (ArgumentError)

The linked text file contains the following line:

Character set encoding: ISO-8859-1

If converting it isn't desired or possible then you have to tell Ruby that this file is ISO-8859-1 encoded. Otherwise the default external encoding is used (UTF-8 in your case). A possible way to do that is:

s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1')
s.encoding # => #<Encoding:ISO-8859-1>

Or even like this if you prefer your string UTF-8 encoded (see utf8everywhere.org):

s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1:UTF-8')
s.encoding # => #<Encoding:UTF-8>
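To see the difference between the three ways of reading, here is a self-contained sketch that writes a few ISO-8859-1 bytes to a temporary file instead of relying on alice_in_wonderland.txt. The file prefix and variable names are assumptions made for the example.

```ruby
require "tempfile"

# Write the bytes for "café\n" as ISO-8859-1 (0xE9 is "é") to a scratch file,
# then read them back three ways. The file name prefix is arbitrary.
as_utf8, as_latin1, transcoded = Tempfile.create("latin1-demo") do |f|
  f.binmode
  f.write("caf\xE9\n".b)
  f.flush

  [
    File.read(f.path, encoding: "UTF-8"),            # wrong label: bytes are not valid UTF-8
    File.read(f.path, encoding: "ISO-8859-1"),       # correct label, bytes untouched
    File.read(f.path, encoding: "ISO-8859-1:UTF-8")  # correct label, transcoded to UTF-8
  ]
end

puts as_utf8.valid_encoding?    # the UTF-8 label does not fit these bytes
puts as_latin1.valid_encoding?  # the ISO-8859-1 label does
puts transcoded.encoding
```

Only the first string would trigger the "invalid byte sequence" error in scan or split; the other two are safe to process.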

Ruby Invalid Byte Sequence in UTF-8

What solved the issue was the combination of reading the file with @file = IO.read(file).force_encoding("ISO-8859-1").encode("utf-8", replace: nil) and adding the magic comment # encoding: UTF-8 at the top of the source file.
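It may help to separate the two steps in that chain: force_encoding only changes the label on the existing bytes, while encode converts the bytes themselves. The sketch below illustrates this with a made-up byte string standing in for the file contents. Note that the magic comment affects string literals in the source file, not data read from disk.

```ruby
# Stand-in for bytes read from disk: "caf" plus 0xE9 ("é" in ISO-8859-1).
raw = "caf\xE9".b

# Step 1: relabel. The bytes are unchanged, only the encoding tag differs.
relabelled = raw.dup.force_encoding("ISO-8859-1")

# Step 2: transcode. 0xE9 becomes the two-byte UTF-8 sequence 0xC3 0xA9.
utf8 = relabelled.encode("UTF-8")

puts raw.bytes.inspect
puts relabelled.bytes.inspect
puts utf8.bytes.inspect
```

If the relabelling step is skipped, encode has no way to know what the source bytes mean, which is why the order of the two calls matters.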

ArgumentError invalid byte sequence in UTF-8

You get these errors because the Zip gem assumes the filenames to be encoded in UTF-8 but they are actually in a different encoding.

To fix the error, you first have to find the correct encoding. Let's re-create the string from its bytes:

bytes = [111, 117, 116, 112, 117, 116, 50, 48, 50, 48, 49,
         50, 48, 55, 95, 49, 52, 49, 54, 48, 50, 47, 87,
         78, 83, 95, 85, 80, 151, 112, 131, 102, 129, 91,
         131, 94, 46, 116, 120, 116]

string = bytes.pack('c*')
#=> "output20201207_141602/WNS_UP\x97p\x83f\x81[\x83^.txt"

We can now traverse the Encoding.list and select those that return the expected result:

Encoding.list.select do |enc|
  s = string.encode('UTF-8', enc) rescue next
  s.end_with?('WNS_UP用データ.txt')
end
#=> [
# #<Encoding:Windows-31J>,
# #<Encoding:Shift_JIS>,
# #<Encoding:SJIS-DoCoMo>,
# #<Encoding:SJIS-KDDI>,
# #<Encoding:SJIS-SoftBank>
# ]

All of the above encodings result in the correct output.

Back to your code, you could use:

path = entry.name.encode('UTF-8', 'Windows-31J')
#=> "output20201207_141602/WNS_UP用データ.txt"

ext = File.extname(path)
#=> ".txt"

file_name = File.basename(path)
#=> "WNS_UP用データ.txt"

The Zip gem also has an option to set an explicit encoding for non-ASCII file names. You might want to give it a try by setting Zip.force_entry_names_encoding = 'Windows-31J' (I haven't tried it).

File.readlines invalid byte sequence in UTF-8 (ArgumentError)

I am trying to get this solution working. I have seen people doing

   .encode!('UTF-8', 'UTF-8', :invalid => :replace)

but it doesn't appear to work with File.readlines.

File.readlines returns an Array. Arrays don't have an encode method. On the other hand, strings do have an encode method.
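To make the distinction concrete, the following sketch cleans each line of the array individually. It uses String#scrub (Ruby 2.1+) rather than encode!, and a temporary file with made-up contents in place of the real log, so both are assumptions of the example.

```ruby
require "tempfile"

# Scratch file whose second line contains 0xFF, a byte that is never valid in UTF-8.
lines = Tempfile.create("readlines-demo") do |f|
  f.binmode
  f.write("ok line\nbad \xFF line\n".b)
  f.flush

  # File.readlines returns an Array of Strings, so clean each String in turn.
  File.readlines(f.path, encoding: "UTF-8").map { |line| line.scrub("?") }
end

puts lines.inspect
```

After the map, every element is valid UTF-8 and can be passed to match, split, or scan.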

Could you please provide an example of the alternative above?

require 'csv'

CSV.foreach("log.csv", encoding: "utf-8") do |row|
  md = row[0].match(/watch\?v=/)
  puts row[0], row[1], row[3] if md
end

Or,

CSV.foreach("log.csv", 'rb:utf-8') do |row|

If you need more speed on Ruby 1.8, use the fastercsv gem; from Ruby 1.9 onward the standard csv library is FasterCSV.

This seems to have worked for me.

File.readlines('log.csv', :encoding => 'ISO-8859-1')

Yes, in order to read a file you have to know its encoding.
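When the encoding is not known for certain, one possible heuristic is to try a short list of candidates in order. The helper below is a hypothetical sketch, not a standard Ruby method, and the candidate list is an assumption. Because every byte string is valid ISO-8859-1, that entry acts as a catch-all and should come last.

```ruby
# Hypothetical helper: decode +bytes+ using the first candidate encoding
# for which the bytes are valid, and return the result as UTF-8.
def decode_with_fallback(bytes, candidates = ["UTF-8", "ISO-8859-1"])
  candidates.each do |name|
    attempt = bytes.dup.force_encoding(name)
    return attempt.encode("UTF-8") if attempt.valid_encoding?
  end

  # No candidate fit: keep what can be kept and mark the rest.
  bytes.dup.force_encoding("UTF-8").scrub("?")
end

from_utf8   = decode_with_fallback("caf\xC3\xA9".b)  # already valid UTF-8
from_latin1 = decode_with_fallback("caf\xE9".b)      # falls back to ISO-8859-1

puts from_utf8 == from_latin1
```

For a real file, the bytes could be obtained with File.binread('log.csv') and passed to the helper. This only guesses; if the file documents its encoding, as in the ISO-8859-1 example earlier, use that instead.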


