Is there a way in ruby 1.9 to remove invalid byte sequences from strings?
"€foo\xA0".chars.select(&:valid_encoding?).join
How can I globally ignore invalid byte sequences in UTF-8 strings?
I don't think you can globally turn off the UTF-8 checking without much difficulty. I would instead focus on fixing up all the strings that enter your application, at the boundary where they come in (e.g. when you query the database or receive HTTP requests).
Let's suppose the strings coming in have the BINARY encoding (a.k.a. ASCII-8BIT). This can be simulated like this:
s = "Men\xFC".force_encoding('BINARY') # => "Men\xFC"
Then we can convert them to UTF-8 using String#encode and replace any undefined characters with the UTF-8 replacement character:
s = s.encode("UTF-8", invalid: :replace, undef: :replace) # => "Men\uFFFD"
s.valid_encoding? # => true
Unfortunately, the steps above would end up mangling a lot of UTF-8 codepoints, because the bytes in them would not be recognized. If you had a three-byte UTF-8 character like "\uFFFD", it would be interpreted as three separate bytes, and each one would be converted to the replacement character. Maybe you could do something like this:
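def to_utf8(str)
  # Work on a copy so the caller's string is not re-tagged in place.
  str = str.dup.force_encoding("UTF-8")
  return str if str.valid_encoding?
  # Not valid UTF-8: treat the bytes as BINARY and replace every non-ASCII byte.
  str = str.force_encoding("BINARY")
  str.encode("UTF-8", invalid: :replace, undef: :replace)
end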
That's the best I could think of. Unfortunately, I don't know of a great way to tell Ruby to treat the string as UTF-8 and just replace all the invalid bytes.
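For what it's worth, on Ruby 2.1 and later String#scrub (mentioned further down this page) appears to cover exactly that case. The following is only a sketch: `scrub_to_utf8` is a hypothetical helper name, and the sample input is made up to contrast scrubbing with the BINARY route shown above.

```ruby
# Hypothetical helper (not part of the answer above); assumes Ruby 2.1+ for String#scrub.
def scrub_to_utf8(str)
  # Copy first so the caller's string keeps its original encoding tag.
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  # With no argument, scrub swaps each invalid byte sequence for U+FFFD
  # and leaves valid multi-byte characters alone.
  utf8.scrub
end

# "\xC3\xA9" is a valid two-byte UTF-8 "é"; the trailing \xFC is invalid.
incoming = "caf\xC3\xA9 Men\xFC".b

via_scrub  = scrub_to_utf8(incoming)
via_binary = incoming.encode("UTF-8", invalid: :replace, undef: :replace)

# Count the replacement characters produced by each route.
puts via_scrub.count("\uFFFD")
puts via_binary.count("\uFFFD")
```

The scrubbed string keeps the "é" and replaces only the stray byte, while the BINARY route also replaces both bytes of the "é".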
Remove invalid byte sequence in UTF-8 after an apparently successful encoding
The text contains the invalid sequence \xA3. This represents a pound sign in Latin-1 (ISO-8859-1).
"\xA3".force_encoding('ISO-8859-1').encode('UTF-8')
#=> "£"
The quick fix is to clean the invalid byte sequences out of body with String#scrub, but with an empty replacement string that simply removes them:
"\xA326.97".scrub('')
#=> "26.97"
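If silently dropping the bytes is not what you want, scrub also accepts a replacement string or a block. A small sketch, reusing the sample string from above (the "?" marker and the angle-bracket hex format are arbitrary choices for illustration):

```ruby
body = "\xA326.97"

# An explicit replacement string instead of the empty string.
marked = body.scrub("?")

# The block form receives the invalid bytes, so they can be logged or hex-escaped.
escaped = body.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

puts marked   # prints ?26.97
puts escaped  # prints <a3>26.97
```

Either way the information that the byte was a pound sign is still lost; scrub only makes the string valid.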
However, to solve the "real" problem you should look earlier in the pipeline. The supplied charset seems to be wrong. Apparently the message is encoded in Latin-1, although the charset suggests something different. Maybe the problem is on the side of the sender.
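If the sender cannot be fixed, one possible workaround is to distrust the declared charset only when the bytes are not valid UTF-8. This is a sketch under assumptions: `decode_body` is a hypothetical helper, and the Latin-1 fallback is a guess about the sender, not something Ruby can detect for you.

```ruby
# Hypothetical helper; the ISO-8859-1 fallback is an assumption about the sender.
def decode_body(raw, fallback: Encoding::ISO_8859_1)
  # First take the declared charset at its word.
  text = raw.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  # Otherwise reinterpret the same bytes as the fallback charset and transcode.
  raw.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

puts decode_body("\xA326.97".b)    # the pound sign survives
puts decode_body("caf\xC3\xA9".b)  # valid UTF-8 passes through unchanged
```

Note that every byte string decodes as Latin-1, so this never raises, but it will produce the wrong characters if the real encoding is something else.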
ArgumentError (invalid byte sequence in UTF-8): Ruby 1.9.3 render view
To fix this, I only had to use the mysql2 gem, change the adapter in my database.yml, and set the encoding:
staging:
  adapter: mysql2
  database: data_basename
  username: root
  encoding: utf8
How to remove invalid byte sequence?
I figured out the solution to my problem. It turns out it was the encoding of the XML document that I was scraping that was a problem. To fix this, I am now making the encoding option explicit:
doc = Nokogiri::XML::Reader(open(url),nil,'ISO-8859-1')
Before I just had:
doc = Nokogiri::XML::Reader(open(url))
Hope this helps someone.