Invalid Byte Sequence In UTF-8 Ruby
As Arie already answered, this error occurs because of the invalid byte sequence \xC3.
If you are using Ruby 2.1+, you can also use String#scrub to replace invalid bytes with a given replacement character. For example:
a = "abce\xC3"
# => "abce\xC3"
a.scrub
# => "abce�"
a.scrub.sub("a","A")
# => "Abce�"
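String#scrub also accepts an explicit replacement string or a block, which helps when the default U+FFFD character is not wanted. A minimal sketch, assuming Ruby 2.1+; the variable names are only for illustration:

```ruby
a = "abce\xC3"

# Replace each invalid byte sequence with a custom string.
question = a.scrub("?")   # => "abce?"

# Drop invalid bytes entirely by passing an empty string.
dropped = a.scrub("")     # => "abce"

# A block receives the invalid bytes and returns the replacement text.
hex = a.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }  # => "abce<c3>"
```

The block form is handy for logging which bytes were bad before they are discarded.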
Ruby String.encode still gives invalid byte sequence in UTF-8
I'd guess that "\xBF" already thinks it is encoded in UTF-8, so when you call encode, it thinks you're trying to encode a UTF-8 string in UTF-8 and does nothing:
>> s = "\xBF"
=> "\xBF"
>> s.encoding
=> #<Encoding:UTF-8>
\xBF isn't valid UTF-8 so this is, of course, nonsense. But if you use the three-argument form of encode:
encode(dst_encoding, src_encoding [, options] ) → str
[...] The second form returns a copy of str transcoded from src_encoding to dst_encoding.
You can force the issue by telling encode to ignore what the string thinks its encoding is and treat it as binary data:
>> foo = s.encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
=> "�"
Where s is the "\xBF" that thinks it is UTF-8 from above.
You could also use force_encoding on s to force it to be binary and then use the two-argument encode:
>> s.encoding
=> #<Encoding:UTF-8>
>> s.force_encoding('binary')
=> "\xBF"
>> s.encoding
=> #<Encoding:ASCII-8BIT>
>> foo = s.encode('utf-8', :invalid => :replace, :undef => :replace)
=> "�"
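Note that force_encoding changes the receiver in place, so after the session above s stays tagged as ASCII-8BIT. Here is a small sketch of the same conversion that works on a copy and uses the replace option to pick the substitute character; the names raw and cleaned are made up for this example:

```ruby
raw = "abc\xBF"   # tagged UTF-8, but \xBF is not valid UTF-8

# dup first so the original string keeps its encoding tag.
cleaned = raw.dup.force_encoding("BINARY")
             .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")

cleaned            # => "abc?"
raw.encoding       # => #<Encoding:UTF-8>
cleaned.encoding   # => #<Encoding:UTF-8>
```

Working on a copy also avoids a FrozenError when the input happens to be a frozen string.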
How can I globally ignore invalid byte sequences in UTF-8 strings?
I don't think you can globally turn off the UTF-8 checking without much difficulty. I would instead focus on fixing up all the strings that enter your application, at the boundary where they come in (e.g. when you query the database or receive HTTP requests).
Let's suppose the incoming strings have the BINARY (a.k.a. ASCII-8BIT) encoding. This can be simulated like this:
s = "Men\xFC".force_encoding('BINARY') # => "Men\xFC"
Then we can convert them to UTF-8 using String#encode and replace any undefined characters with the UTF-8 replacement character:
s = s.encode("UTF-8", invalid: :replace, undef: :replace) # => "Men\uFFFD"
s.valid_encoding? # => true
Unfortunately, the steps above would end up mangling a lot of UTF-8 codepoints because the bytes in them would not be recognized. If you had a three-byte UTF-8 character like "\uFFFD", it would be interpreted as three separate bytes and each one would get converted to the replacement character. Maybe you could do something like this:
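def to_utf8(str)
  # Work on a copy so the caller's string is not re-tagged in place.
  str = str.dup.force_encoding("UTF-8")
  # Already valid UTF-8: return it unchanged.
  return str if str.valid_encoding?
  # Otherwise treat the bytes as binary and replace every non-ASCII byte.
  str = str.force_encoding("BINARY")
  str.encode("UTF-8", invalid: :replace, undef: :replace)
end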
That's the best I could think of. Unfortunately, I don't know of a great way to tell Ruby to treat the string as UTF-8 and just replace all the invalid bytes.
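On Ruby 2.1 or later, String#scrub (mentioned earlier on this page) appears to cover this case: it keeps valid multi-byte characters and replaces only the bytes that are actually invalid. A minimal sketch under that assumption; the helper name scrub_to_utf8 and the sample data are hypothetical:

```ruby
# Hypothetical helper: tag the bytes as UTF-8, then replace only what is invalid.
def scrub_to_utf8(str)
  str.dup.force_encoding("UTF-8").scrub
end

# Binary input: a valid two-byte "ü" (C3 BC), then a stray Latin-1 byte (FC).
mixed = "Men\xC3\xBC \xFC".b

scrubbed = scrub_to_utf8(mixed)
# => "Menü \uFFFD"  (the valid character survives)

via_binary = mixed.encode("UTF-8", invalid: :replace, undef: :replace)
# => "Men\uFFFD\uFFFD \uFFFD"  (every high byte is replaced)
```

The second result shows the mangling described above: converting from BINARY replaces each high byte, including the two bytes of the valid "ü".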
Invalid byte sequence in UTF-8 (ArgumentError)
Your string is probably not valid UTF-8, so use:
if !file_content.valid_encoding?
  s = file_content.encode("UTF-16be", :invalid => :replace, :replace => "?").encode('UTF-8')
  s.gsub(/dr/i, 'med')
end
See "Ruby 2.0.0 String#Match ArgumentError: invalid byte sequence in UTF-8".
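The snippet above only produces a result when the input is invalid. A slightly fuller sketch of the same UTF-16 round trip that yields a usable string in both cases; the sample file_content value is invented for illustration:

```ruby
# Made-up sample: Latin-1 bytes that Ruby has tagged as UTF-8.
file_content = "Dr. M\xFCller"

# Use the round trip only when needed; valid strings pass through unchanged.
cleaned =
  if file_content.valid_encoding?
    file_content
  else
    file_content.encode("UTF-16BE", invalid: :replace, replace: "?").encode("UTF-8")
  end

result = cleaned.gsub(/dr/i, "med")  # => "med. M?ller"
```

The detour through UTF-16BE forces Ruby to actually transcode the string, which is what triggers the invalid-byte replacement.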
Ruby 1.9.3 Invalid byte sequence in UTF-8 explanation needed
I have 64-bit Cygwin, Ruby 2.0.0 and gem 2.4.1 and was experiencing the same issue. gem install ..., gem update, everything ended with "ERROR: While executing gem ... (ArgumentError) invalid byte sequence in UTF-8".
I also had all locales set to "en_US.UTF-8".
I read somewhere that it should help to set LANG to an empty string or "C.BINARY", but it didn't help. It was a good hint to start experimenting, though.
I finally solved it by setting both LANG and LC_ALL to an empty string. All other locale environment variables (LC_CTYPE etc.) were automatically set to "C.UTF-8" by that, while LANG and LC_ALL remained empty.
Now gem is finally working.
UPDATE
It seems that LC_CTYPE specifically is causing the issue if it's set to UTF-8, so setting it to C.BINARY should help. Other locale environment variables can be set to UTF-8 without affecting it.
export LC_CTYPE=C.BINARY
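To check what Ruby actually picked up from the environment before and after changing these variables, a small diagnostic script can help. This is only a sketch: it reports values and changes nothing, and the choice of variables to print is an assumption based on the ones discussed above.

```ruby
# Print the locale variables Ruby inherited from the shell.
locale_vars = %w[LANG LC_ALL LC_CTYPE]
locale_vars.each { |name| puts "#{name}=#{ENV[name].inspect}" }

# Encodings Ruby derived at startup; these normally follow the locale settings.
external = Encoding.default_external
locale   = Encoding.find("locale")
puts "default_external=#{external}"
puts "locale=#{locale}"
```

Running it once with the old settings and once after the export should show whether the encoding Ruby uses for external data has changed.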