Ruby String.Encode Still Gives "Invalid Byte Sequence in Utf-8"

Invalid Byte Sequence In UTF-8 Ruby

As Arie already answered, this error is caused by the invalid byte sequence \xC3.

If you are using Ruby 2.1+, you can also use String#scrub to replace invalid bytes with a given replacement character (U+FFFD by default). For example:

a = "abce\xC3"
# => "abce\xC3" 
a.scrub
# => "abce�"
a.scrub.sub("a","A")
# => "Abce�"
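
String#scrub also accepts an explicit replacement string, or a block that receives the invalid bytes, which can help when U+FFFD is not wanted in the output. A small sketch (the "?" replacement and the hex-tag format are arbitrary choices for illustration):

```ruby
a = "abce\xC3"

# Pass a replacement string instead of the default U+FFFD.
with_question_mark = a.scrub("?")

# Pass a block to build the replacement from the offending bytes.
with_hex_tag = a.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

puts with_question_mark  # abce?
puts with_hex_tag        # abce<c3>
puts a.valid_encoding?   # false: scrub returned new strings and left a untouched
```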

Ruby String.encode still gives invalid byte sequence in UTF-8

I'd guess that "\xBF" already thinks it is encoded in UTF-8 so when you call encode, it thinks you're trying to encode a UTF-8 string in UTF-8 and does nothing:

>> s = "\xBF"
=> "\xBF"
>> s.encoding
=> #<Encoding:UTF-8>

\xBF isn't valid UTF-8 so this is, of course, nonsense. But if you use the three argument form of encode:

encode(dst_encoding, src_encoding [, options] ) → str

[...] The second form returns a copy of str transcoded from src_encoding to dst_encoding.

You can force the issue by telling encode to ignore what the string thinks its encoding is and treat it as binary data:

>> foo = s.encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
=> "�"

Where s is the "\xBF" that thinks it is UTF-8 from above.

You could also use force_encoding on s to force it to be binary and then use the two-argument encode:

>> s.encoding
=> #<Encoding:UTF-8>
>> s.force_encoding('binary')
=> "\xBF"
>> s.encoding
=> #<Encoding:ASCII-8BIT>
>> foo = s.encode('utf-8', :invalid => :replace, :undef => :replace)
=> "�"
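
The practical difference between the two approaches is that force_encoding relabels the receiver in place, while the three-argument encode leaves it alone. A short sketch of both, reusing the "\xBF" string from above (the variable names are only for illustration):

```ruby
s = "\xBF"

# Three-argument form: the source encoding is overridden for this call only.
fixed = s.encode('utf-8', 'binary', invalid: :replace, undef: :replace)
puts fixed.valid_encoding?  # true
puts s.encoding             # UTF-8, s itself is unchanged

# force_encoding changes the receiver's label, so work on a copy
# if the original label should survive.
copy = s.dup.force_encoding('binary')
fixed_again = copy.encode('utf-8', invalid: :replace, undef: :replace)
puts fixed_again == fixed   # true
puts s.encoding             # still UTF-8
```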

How can I globally ignore invalid byte sequences in UTF-8 strings?

I don't think you can globally turn off the UTF-8 checking without much difficulty. I would instead focus on fixing up all the strings that enter your application, at the boundary where they come in (e.g. when you query the database or receive HTTP requests).

Let's suppose the incoming strings have the BINARY (a.k.a. ASCII-8BIT) encoding. This can be simulated like this:

s = "Men\xFC".force_encoding('BINARY')  # => "Men\xFC"

Then we can convert them to UTF-8 using String#encode and replace any undefined characters with the UTF-8 replacement character:

s = s.encode("UTF-8", invalid: :replace, undef: :replace)  # => "Men\uFFFD"
s.valid_encoding?  # => true

Unfortunately, the steps above would end up mangling a lot of UTF-8 codepoints because the bytes in them would not be recognized. If you had a three-byte UTF-8 character like "\uFFFD", it would be interpreted as three separate bytes and each one would be converted to the replacement character. Maybe you could do something like this:

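def to_utf8(str)
  # Work on a copy so force_encoding does not relabel the caller's string.
  str = str.dup.force_encoding("UTF-8")
  return str if str.valid_encoding?
  # Not valid UTF-8: treat the bytes as binary and replace whatever cannot be mapped.
  str.force_encoding("BINARY").encode("UTF-8", invalid: :replace, undef: :replace)
end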

That's the best I could think of. Unfortunately, I don't know of a great way to tell Ruby to treat the string as UTF-8 and just replace all the invalid bytes.
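
On Ruby 2.1+, the String#scrub method mentioned earlier appears to cover this case: it keeps well-formed multi-byte characters and replaces only the invalid bytes. A possible sketch of a boundary helper built on it (to_utf8_scrub is a hypothetical name, not part of Ruby or of the answer above):

```ruby
# Hypothetical helper: label the bytes as UTF-8, then replace only the
# invalid ones. dup keeps the caller's string untouched.
def to_utf8_scrub(str)
  str.dup.force_encoding("UTF-8").scrub
end

valid_bytes   = "caf\u00E9".b  # well-formed UTF-8 bytes labelled as binary
invalid_bytes = "Men\xFC".b    # 0xFC is not valid UTF-8 on its own

puts to_utf8_scrub(valid_bytes)                   # café
puts to_utf8_scrub(invalid_bytes) == "Men\uFFFD"  # true

# For comparison, transcoding from BINARY mangles the two-byte "é".
mangled = valid_bytes.encode("UTF-8", invalid: :replace, undef: :replace)
puts mangled.length  # 5: "caf" plus one replacement character per byte
```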

Invalid byte sequence in UTF-8 (ArgumentError)

Your string is probably not valid UTF-8, so round-trip it through UTF-16 to replace the invalid bytes:

if ! file_content.valid_encoding?
  s = file_content.encode("UTF-16be", :invalid=>:replace, :replace=>"?").encode('UTF-8')
  s.gsub(/dr/i,'med')
end

See "Ruby 2.0.0 String#Match ArgumentError: invalid byte sequence in UTF-8".
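
A self-contained sketch of that round trip, with a made-up sample string standing in for file_content. The result of gsub is assigned here because gsub returns a new string rather than modifying s:

```ruby
file_content = "Dr. Smith\xFF"  # sample data, assumed for illustration

s = file_content
unless s.valid_encoding?
  # Round-trip through UTF-16BE so encode actually inspects the bytes;
  # invalid ones come back as "?".
  s = s.encode("UTF-16be", invalid: :replace, replace: "?").encode("UTF-8")
end
result = s.gsub(/dr/i, "med")

puts result  # med. Smith?
```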

Ruby 1.9.3 Invalid byte sequence in UTF-8 explanation needed

I have 64-bit Cygwin, Ruby 2.0.0 and gem 2.4.1 and was experiencing the same issue. gem install ..., gem update, everything ended with "ERROR: While executing gem ... (ArgumentError) invalid byte sequence in UTF-8".

I also had all locales set to "en_US.UTF-8".

I read somewhere that setting LANG to an empty string or "C.BINARY" should help, but it didn't. It was a good hint to start experimenting, though.

I finally solved it by setting both LANG and LC_ALL to an empty string. As a result, all other locale environment variables (LC_CTYPE etc.) were automatically set to "C.UTF-8", while LANG and LC_ALL remained empty.

Now gem is finally working.

UPDATE

It seems that specifically LC_CTYPE is causing that issue if it's set to UTF-8. So setting it to C.BINARY should help. Other locale environment variables can be set to UTF-8 without affecting it.

export LC_CTYPE=C.BINARY
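
To see what Ruby actually picked up from the environment before and after changing these variables, it may help to print the encodings it derived at startup. A small diagnostic sketch (the script is not from the original answer):

```ruby
# Show the locale variables and the encodings Ruby derived from them.
%w[LANG LC_ALL LC_CTYPE].each do |name|
  puts "#{name}=#{ENV[name].inspect}"
end

puts "locale charmap:   #{Encoding.locale_charmap}"
puts "default external: #{Encoding.default_external}"
```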