Is There a Way in Ruby 1.9 to Remove Invalid Byte Sequences from Strings

Is there a way in Ruby 1.9 to remove invalid byte sequences from strings?


"€foo\xA0".chars.select(&:valid_encoding?).join
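The one-liner above works because String#chars hands back invalid bytes as pieces that fail valid_encoding?, so select drops them. A minimal sketch wrapping it in a method follows; the name drop_invalid_bytes is hypothetical, and the sample is spelled out as raw bytes so it does not depend on the source file's encoding.

```ruby
# Hypothetical helper name; it only wraps the one-liner shown above.
def drop_invalid_bytes(str)
  # Keep only the characters that are valid in the string's current encoding.
  str.chars.select(&:valid_encoding?).join
end

# "\xE2\x82\xAC" is the UTF-8 encoding of the euro sign; "\xA0" is a stray byte.
sample  = "\xE2\x82\xACfoo\xA0".force_encoding("UTF-8")
cleaned = drop_invalid_bytes(sample)
```

Note that this silently removes the offending bytes rather than marking where they were.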

How can I globally ignore invalid byte sequences in UTF-8 strings?

I don't think you can globally turn off the UTF-8 checking without much difficulty. I would instead focus on fixing up all the strings that enter your application, at the boundary where they come in (e.g. when you query the database or receive HTTP requests).

Let's suppose the strings coming in have the BINARY encoding (a.k.a. ASCII-8BIT). This can be simulated like this:

s = "Men\xFC".force_encoding('BINARY')  # => "Men\xFC"

Then we can convert them to UTF-8 using String#encode and replace any undefined characters with the UTF-8 replacement character:

s = s.encode("UTF-8", invalid: :replace, undef: :replace)  # => "Men\uFFFD"
s.valid_encoding? # => true

Unfortunately, the steps above would end up mangling a lot of UTF-8 codepoints: because the string is tagged BINARY, every byte above 0x7F is treated as undefined. If you had a three-byte UTF-8 character like "\uFFFD", it would be interpreted as three separate bytes and each one would get converted to the replacement character. Maybe you could do something like this:

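def to_utf8(str)
  # Work on a copy: force_encoding re-tags the receiver in place.
  str = str.dup.force_encoding("UTF-8")
  return str if str.valid_encoding?
  # Not valid UTF-8: treat as raw bytes and replace every non-ASCII byte.
  str.force_encoding("BINARY").encode("UTF-8", invalid: :replace, undef: :replace)
end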

That's the best I could think of. Unfortunately, I don't know of a great way to tell Ruby to treat the string as UTF-8 and just replace all the invalid bytes.
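On Ruby 2.1 and later (not 1.9, which the question asks about), String#scrub covers this case: it treats the string as UTF-8 and replaces only the bytes that are actually invalid, leaving valid multi-byte characters alone. A minimal sketch, assuming Ruby 2.1+; the helper name scrub_utf8 is hypothetical.

```ruby
# Hypothetical helper; assumes Ruby 2.1 or later, where String#scrub exists.
def scrub_utf8(str, replacement = "\uFFFD")
  # dup so the caller's string is not re-tagged in place
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  utf8.valid_encoding? ? utf8 : utf8.scrub(replacement)
end

# One invalid byte (\xFC) followed by a valid three-byte euro sign.
raw   = "Men\xFC \xE2\x82\xAC".force_encoding("BINARY")
fixed = scrub_utf8(raw)
```

Here only the \xFC byte becomes U+FFFD; the euro sign survives intact, which the BINARY round trip above would not preserve.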

Remove invalid byte sequence in UTF-8 after an apparently successful encoding

The text contains the invalid sequence \xA3. This represents a pound sign in Latin-1 (ISO-8859-1).

"\xA3".force_encoding('ISO-8859-1').encode('UTF-8')
#=> "£"

The quick fix is to clean the invalid byte sequences in body with String#scrub, but with an empty replacement that removes them outright, so the pound sign is lost:

"\xA326.97".scrub('')
#=> "26.97"

However, to solve the "real" problem you should look earlier in the pipeline. The supplied charset seems to be wrong. Apparently the message is encoded in Latin-1, although the charset suggests something different. Maybe the problem is on the side of the sender.
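If the sender cannot be fixed, one possible workaround is to accept the bytes as UTF-8 when they are valid and otherwise reinterpret them as Latin-1. This is only a sketch: the helper name decode_body and the Latin-1 fallback are assumptions, and because every byte sequence is valid Latin-1, a wrong guess produces garbled characters instead of an error.

```ruby
# Hypothetical helper; the ISO-8859-1 fallback is an assumption about the sender.
def decode_body(bytes, fallback = Encoding::ISO_8859_1)
  utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?
  # Reinterpret the raw bytes in the fallback encoding, then transcode to UTF-8.
  bytes.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

price = decode_body("\xA326.97".force_encoding("BINARY"))
```

Unlike scrub(''), this keeps the pound sign instead of deleting it.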

ArgumentError (invalid byte sequence in UTF-8): Ruby 1.9.3 render view

To fix this, I switched to the mysql2 gem, changed the adapter in my database.yml, and set the encoding:

staging:
  adapter: mysql2
  database: data_basename
  username: root
  encoding: utf8

How to remove invalid byte sequence?

I figured out the solution to my problem. It turns out it was the encoding of the XML document that I was scraping that was a problem. To fix this, I am now making the encoding option explicit:

doc = Nokogiri::XML::Reader(open(url),nil,'ISO-8859-1')

Before I just had:

doc = Nokogiri::XML::Reader(open(url))

Hope this helps someone.



