How to Remove 4 Byte Utf-8 Characters in Ruby

How to remove 4 byte utf-8 characters in Ruby?

The following seems to work for me in Ruby 1.9.3:

input.each_char.select{|c| c.bytes.count < 4 }.join('')

For example:

input = "hello \xF0\xA9\xB6\x98 world"                  # includes U+29D98
input.each_char.select{|c| c.bytes.count < 4 }.join('') # => "hello  world" (both surrounding spaces remain)
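
An alternative sketch, not part of the answer above: every character that needs four bytes in UTF-8 lies outside the Basic Multilingual Plane (U+10000 to U+10FFFF), so a single gsub over that range should remove the same characters. The method name strip_four_byte_chars is hypothetical, and the sketch assumes the input is valid UTF-8, since gsub raises on invalid byte sequences.

```ruby
# Hypothetical helper: removes every code point that UTF-8 encodes in four bytes
# (U+10000..U+10FFFF). Assumes str is valid UTF-8.
def strip_four_byte_chars(str)
  str.gsub(/[\u{10000}-\u{10FFFF}]/, '')
end

input = "hello \u{29D98} world"
puts strip_four_byte_chars(input).inspect # => "hello  world"
```

Characters of one to three bytes, such as "é" or "●", are left alone.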

How do I remove non UTF-8 characters from a String?

We have a few problems.

The biggest is that a Ruby String stores arbitrary bytes along with a supposed encoding, with no guarantee that the bytes are valid in that encoding and with no obvious reason for that encoding to have been chosen. (I might be biased as a heavy user of Python 3. We would never speak of "changing a string from one encoding to another".)

Fortunately, the editor did not eat your post, but it's hard to see that. I'm guessing that you decoded the string as Windows-1252 in order to display it, which only obscures the issue.

Here's your string of bytes as I see it:

>> s = "\x06-~$A\xA7ruG\xF9\"\x9A\f\xB6/K".b
=> "\x06-~$A\xA7ruG\xF9\"\x9A\f\xB6/K"
>> s.bytes
=> [6, 45, 126, 36, 65, 167, 114, 117, 71, 249, 34, 154, 12, 182, 47, 75]

And it does contain bytes that are not valid UTF-8.

>> s.encoding
=> #<Encoding:ASCII-8BIT>
>> String::new(s).force_encoding(Encoding::UTF_8).valid_encoding?
=> false

We can transcode it from binary to UTF-8 and insert � (U+FFFD) for every byte that has no UTF-8 equivalent, which for a binary source means every byte above 0x7F:

>> s.encode('utf-8', 'binary', :undef => :replace)
=> "\u0006-~$A�ruG�\"�\f�/K"
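
If the goal is to remove the offending bytes rather than mark them, one option is to pass an empty replacement string to String#encode. This is a minimal sketch using the same byte string as above; it assumes that dropping every non-ASCII byte is acceptable for your data.

```ruby
s = "\x06-~$A\xA7ruG\xF9\"\x9A\f\xB6/K".b

# replace: '' substitutes nothing for each byte that cannot be converted,
# so only the ASCII bytes survive.
cleaned = s.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

puts cleaned.inspect         # => "\u0006-~$AruG\"\f/K"
puts cleaned.encoding        # => UTF-8
puts cleaned.valid_encoding? # => true
```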

Remove from a string the characters where bytesize is greater than 2 with Ruby

Elaborating on OP's efforts, not using regular expressions:

string = "hèllö>●!"

cleaned = string.each_char.with_object("") do |char, str|
  str << char unless char.bytesize > 2
end

p cleaned # => "hèllö>!"
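
The same idea can be written as a small reusable method with the byte limit as a parameter. This is only a sketch: the name drop_wide_chars and the max_bytes parameter are hypothetical, and it assumes a validly encoded UTF-8 string.

```ruby
# Hypothetical helper: keeps only characters whose UTF-8 encoding
# takes at most max_bytes bytes.
def drop_wide_chars(str, max_bytes = 2)
  str.each_char.reject { |char| char.bytesize > max_bytes }.join
end

p drop_wide_chars("hèllö>●!")    # => "hèllö>!"
p drop_wide_chars("hèllö>●!", 1) # => "hll>!"
```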

How can I globally ignore invalid byte sequences in UTF-8 strings?

I don't think you can globally turn off the UTF-8 checking without much difficulty. I would instead focus on fixing up all the strings that enter your application, at the boundary where they come in (e.g. when you query the database or receive HTTP requests).

Let's suppose the incoming strings have the BINARY (a.k.a. ASCII-8BIT) encoding. This can be simulated like this:

s = "Men\xFC".force_encoding('BINARY')  # => "Men\xFC"

Then we can convert them to UTF-8 using String#encode and replace any undefined characters with the UTF-8 replacement character:

s = s.encode("UTF-8", invalid: :replace, undef: :replace)  # => "Men\uFFFD"
s.valid_encoding? # => true

Unfortunately, the steps above would end up mangling a lot of UTF-8 codepoints because the bytes in them would not be recognized. If you had a three-byte UTF-8 character like "\uFFFD", it would be interpreted as three separate bytes and each one would be converted to the replacement character. Maybe you could do something like this:

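def to_utf8(str)
  # Work on a copy so the caller's string is not relabeled in place.
  str = str.dup.force_encoding("UTF-8")
  return str if str.valid_encoding?
  # Not valid UTF-8: treat the bytes as binary and replace every non-ASCII byte.
  str.force_encoding("BINARY").encode("UTF-8", invalid: :replace, undef: :replace)
end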

That's the best I could think of. Unfortunately, I don't know of a great way to tell Ruby to treat the string as UTF-8 and just replace all the invalid bytes.
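
On Ruby 2.1 and later, String#scrub appears to cover this case: it keeps valid UTF-8 sequences and replaces only the invalid bytes. The sketch below assumes that Ruby version; the method name to_utf8_scrubbed is hypothetical.

```ruby
# Hypothetical helper (assumes Ruby >= 2.1 for String#scrub).
# Labels a copy of the bytes as UTF-8, then replaces only the invalid
# sequences with U+FFFD; valid multi-byte characters are kept.
def to_utf8_scrubbed(str)
  str.dup.force_encoding(Encoding::UTF_8).scrub
end

p to_utf8_scrubbed("Men\xFC".b)           # => "Men\uFFFD"
p to_utf8_scrubbed("caf\xC3\xA9 \xFF".b)  # => "café \uFFFD"
```

Passing an argument, as in scrub(''), drops the invalid bytes instead of marking them.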

Ruby: Limiting a UTF-8 string by byte-length

For Rails >= 3.0, you can use the limit method of ActiveSupport::Multibyte::Chars.

From API docs:

- (Object) limit(limit) 

Limit the byte size of the string to a number of bytes without breaking characters. Usable when the storage for a string is limited for some reason.

Example:

'こんにちは'.mb_chars.limit(7).to_s # => "こん"
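
Outside Rails, a similar result can be approximated with String#byteslice followed by String#scrub (Ruby 2.1+). This is a sketch rather than a drop-in replacement for limit: the name limit_bytes is hypothetical, and it assumes the input is valid UTF-8, because scrub('') would also drop any invalid bytes that were already in the string.

```ruby
# Hypothetical helper: cuts str to at most max_bytes bytes, then drops the
# partial character (if any) left dangling at the cut point.
def limit_bytes(str, max_bytes)
  str.byteslice(0, max_bytes).scrub('')
end

p limit_bytes('こんにちは', 7) # => "こん"
p limit_bytes('abc', 10)       # => "abc"
```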

Is there a way in ruby 1.9 to remove invalid byte sequences from strings?

"€foo\xA0".chars.select(&:valid_encoding?).join # => "€foo"

Ruby: Remove invisible characters after converting string to UTF-8

Without seeing your code, it's hard to know exactly what's going on for you. I'll point out, however, that String#force_encoding doesn't transcode the String; it's a way of saying, "No, really, this is UTF-8", for example. To transcode from one encoding to another, use String#encode.

This seems to work for me:

require 'net/http'
s = Net::HTTP.get('www.eximsystems.com', '/LaVerdad/Antiguo/Gn/Genesis.htm')
s.force_encoding('windows-1252')
s.encode!('utf-8')
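
To make the difference between the two methods concrete, here is a small offline sketch. The sample bytes are an assumption chosen for illustration: "café" as it would arrive in Windows-1252.

```ruby
raw = "caf\xE9".b # bytes as received: "café" in Windows-1252

# force_encoding only changes the label; the bytes stay the same,
# and 0xE9 on its own is not valid UTF-8.
relabeled = raw.dup.force_encoding('UTF-8')
p relabeled.bytes == raw.bytes # => true
p relabeled.valid_encoding?    # => false

# Label the bytes correctly first, then transcode: the bytes change.
transcoded = raw.dup.force_encoding('Windows-1252').encode('UTF-8')
p transcoded       # => "café"
p transcoded.bytes # => [99, 97, 102, 195, 169]
```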

In general, /[[:space:]]/ should capture more kinds of whitespace than /\s/ (which is equivalent to /[ \t\r\n\f]/), but it doesn't appear to be necessary in this case. I can't find any abnormal whitespace in s at this point. If you're still having problems, you'll need to post your code and a more precise description of the issue.

Update: Thanks for updating your question with your code and an example of the problem. It looks like the issue is non-breaking spaces. I think it's simplest to get rid of them at the source:

require 'nokogiri'
require 'open-uri'

URL = 'http://www.eximsystems.com/LaVerdad/Antiguo/Gn/Genesis.htm'
# Read, replace, and parse in three separate steps so that &nbsp; becomes a
# normal ' ' in the source, before Nokogiri converts it to a Unicode non-breaking space.
s = URI.open(URL).read # URI.open: a bare open(url) no longer works in Ruby 3.0+
s.gsub!('&nbsp;', ' ')
html = Nokogiri.HTML(s)

# Extract Paragraphs
text = ''
html.css('p').each do |p|
  text += p.text
end

# Clean Up Text
text.gsub!(/\s+/, ' ')

puts text

There's now just a single, normal space between the period at the end of 15 and the number 16:

15) Besó también José a todos sus hermanos, orando sobre cada uno de ellos; después de cuyas demostraciones cobraron aliento para conversar con él. 16 Al punto corrió la voz, y se divulgó generalmente esta noticia en el palacio del rey: Han venido los hermanos de José; y holgóse de ello Faraón y toda su corte.
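
If the text has already been converted and contains Unicode non-breaking spaces (U+00A0), the POSIX bracket class is another way to normalize them, since /\s/ only matches ASCII whitespace. A minimal sketch with a made-up sample string:

```ruby
# Sample text (an assumption for illustration): two non-breaking spaces
# and one normal space between "15." and "16".
text = "15.\u00A0\u00A0 16"

# /\s/ leaves the non-breaking spaces in place.
p text.gsub(/\s+/, ' ') == text # => true

# /[[:space:]]/ matches them, so the run collapses to one normal space.
p text.gsub(/[[:space:]]+/, ' ') # => "15. 16"
```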

In Ruby, how to UTF-8 encode this weird character?

I had this problem with Fixing Incorrect String Encoding From MySQL. You need to set the proper encoding and then force it back.

fallback = {
  "\u0081" => "\x81".force_encoding("CP1252"),
  "\u008D" => "\x8D".force_encoding("CP1252"),
  "\u008F" => "\x8F".force_encoding("CP1252"),
  "\u0090" => "\x90".force_encoding("CP1252"),
  "\u009D" => "\x9D".force_encoding("CP1252")
}

str.encode('CP1252', fallback: fallback).force_encoding('UTF-8')

The fallback may not be necessary depending on your data, but it ensures that it won't raise an error by handling the five bytes which are undefined in CP1252.
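
A small worked example may help show why this round trip works. It assumes the classic double-encoding case, where UTF-8 bytes were misread as CP1252, and it omits the fallback because the sample contains none of the five undefined bytes. The method name fix_double_encoding is hypothetical.

```ruby
# Hypothetical helper: undoes one round of "UTF-8 bytes read as CP1252".
def fix_double_encoding(str)
  # encode maps each character back to the single byte it came from;
  # force_encoding then relabels those bytes as the UTF-8 they always were.
  str.encode('CP1252').force_encoding('UTF-8')
end

garbled = "\u00C3\u00A9" # "Ã©": the two UTF-8 bytes of "é" misread as CP1252
fixed = fix_double_encoding(garbled)

p fixed                 # => "é"
p fixed.bytes           # => [195, 169]
p fixed.valid_encoding? # => true
```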


