How to Replace the Unicode Gem on Ruby 1.9

How to replace the Unicode gem on Ruby 1.9?

Update: a better option may be the unicode_utils gem, which was created specifically to supply these missing features:

require "unicode_utils"
UnicodeUtils.nfkd("áéíóúç").gsub(/[^\x00-\x7F]/,'').to_s
#=> "aeiouc"

If you can depend on Rails' ActiveSupport, you can do the following:

require "active_support" # the old "activesupport" spelling was deprecated in Rails 2.3 and removed in Rails 3
mb_str = ActiveSupport::Multibyte::Chars.new("áéíóúç")
mb_str.normalize(:kd).gsub(/[^\x00-\x7F]/,'').to_s
#=> "aeiouc"

ActiveSupport::Multibyte was written to bring UTF-8/Unicode support to Ruby 1.8, but works fine in 1.9 too. You may be able to borrow some of the code if you don't want it as an external dependency.
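If upgrading is an option, neither gem may be needed for this particular job. The sketch below assumes Ruby 2.2 or later, where String#unicode_normalize ships with the standard library; it is not available on 1.9. The helper name strip_accents is made up for illustration.

```ruby
# encoding: UTF-8
# Assumption: Ruby 2.2+ (String#unicode_normalize does not exist on 1.9).

# Hypothetical helper: decompose accented letters, then drop the accents.
def strip_accents(text)
  # NFKD splits "á" into a plain "a" followed by a combining accent.
  decomposed = text.unicode_normalize(:nfkd)
  # Removing every non-ASCII character leaves only the base letters.
  decomposed.gsub(/[^\x00-\x7F]/, "")
end

result = strip_accents("áéíóúç")
puts result
#=> aeiouc
```

Like the gem-based versions above, this silently drops any character that has no ASCII base letter, so it only suits text where that loss is acceptable.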

Ruby 1.9.x replace sets of characters with specific cleaned up characters in a string

Here is a ready-made mapping that you can pass to String#encode as its :fallback option:

#encoding: UTF-8
t = 'ŠšÐŽžÀÁÂÃÄAÆAÇÈÉÊËÌÎÑNÒOÓOÔOÕOÖOØOUÚUUÜUÝYÞBßSàaáaâäaaæaçcèéêëìîðñòóôõöùûýýþÿƒ'
fallback = {
'Š'=>'S', 'š'=>'s', 'Ð'=>'Dj','Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A',
'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E', 'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I',
'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U',
'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss','à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a',
'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i',
'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u',
'ú'=>'u', 'û'=>'u', 'ü'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y', 'ƒ'=>'f'
}

p t.encode('us-ascii', :fallback => fallback)
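One caveat with a Hash fallback: a character that is missing from the mapping makes the lookup return nil, and encode then raises Encoding::UndefinedConversionError. The :fallback option also accepts a Proc, which allows a catch-all. The following is a minimal sketch with a deliberately tiny, hypothetical mapping (ASCII_MAP) and "?" as an arbitrary placeholder.

```ruby
# encoding: UTF-8
# Hypothetical, deliberately small mapping; extend it as needed.
ASCII_MAP = { "é" => "e", "ß" => "ss", "Š" => "S" }

# The lambda receives each character that has no US-ASCII equivalent.
# Unmapped characters become "?" instead of raising an exception.
ascii_fallback = lambda { |char| ASCII_MAP.fetch(char, "?") }

result = "Škoda café straße ☃".encode("us-ascii", :fallback => ascii_fallback)
puts result
#=> Skoda cafe strasse ?
```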

Unicode characters in Ruby 1.9.3 IRB with RVM

RVM has issues with readline installed via Homebrew. This gist worked perfectly for me:

$ rvm get latest
$ rvm pkg install readline
$ rvm install 1.9.3 --with-readline-dir=$rvm_path/usr

If 1.9.3 is already installed, use rvm reinstall instead of rvm install.

How to convert character encoding with Ruby 1.9

As the exception points out, your string is ASCII-8BIT encoded, so you need to change its encoding. There is a long story behind that, but if you just want a quick solution, call force_encoding on the string before you do any processing:

s = "Learn Objective\xE2\x80\x93C on the Mac"
# => "Learn Objective\xE2\x80\x93C on the Mac"
s.encoding
# => #<Encoding:ASCII-8BIT>
s.force_encoding 'utf-8'
# => "Learn Objective–C on the Mac"
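The reason force_encoding works here while encode does not can be seen in a short sketch: force_encoding only changes the encoding label and leaves the bytes alone, whereas encode tries to transcode bytes that, under the ASCII-8BIT label, have no defined meaning. The variable names are illustrative only.

```ruby
# encoding: UTF-8
# Simulate binary-tagged input, as a socket or binary-mode file read might return it.
raw = "Learn Objective\xE2\x80\x93C on the Mac".dup.force_encoding("ASCII-8BIT")

# encode attempts a real conversion and fails on the first high byte.
conversion_error = nil
begin
  raw.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  conversion_error = e
end

# force_encoding just relabels the same bytes as UTF-8.
utf8 = raw.dup.force_encoding("UTF-8")

puts conversion_error.class              # Encoding::UndefinedConversionError
puts utf8.valid_encoding?                # true
puts utf8.bytes.to_a == raw.bytes.to_a   # true, the bytes are unchanged
```

This is only safe when you already know the bytes really are UTF-8; checking valid_encoding? afterwards is a cheap sanity test.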

How can I replace UTF-8 errors in Ruby without converting to a different encoding?

Ruby 2.1 has added a String#scrub method that does what you want:

2.1.0dev :001 > x = "foo\x92bar"
=> "foo\x92bar"
2.1.0dev :002 > x.valid_encoding?
=> false
2.1.0dev :003 > y = x.scrub
=> "foo�bar"
2.1.0dev :004 > y.valid_encoding?
=> true

The same commit also changes the behaviour of encode so that it works when the source and destination encodings are the same:

2.1.0dev :005 > x = "foo\x92bar".encode('utf-8', :invalid=>:replace)
=> "foo�bar"
2.1.0dev :006 > x.valid_encoding?
=> true

As far as I know, there is no built-in way to do this before 2.1 (otherwise scrub wouldn’t be needed), so you’ll need to use a workaround until 2.1 is released and you can upgrade.
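One possible workaround, offered as a sketch rather than a drop-in replacement for scrub: walk the string character by character and swap out every piece that fails valid_encoding?. The helper name replace_invalid_bytes is hypothetical, and for broken multi-byte sequences it may emit more replacement characters than scrub would.

```ruby
# encoding: UTF-8
# Hypothetical helper for Rubies that lack String#scrub.
def replace_invalid_bytes(text, replacement = "\uFFFD")
  # Invalid bytes come back from String#chars as separate pieces that
  # fail valid_encoding?, so they can be replaced one at a time.
  text.chars.map { |char| char.valid_encoding? ? char : replacement }.join
end

cleaned = replace_invalid_bytes("foo\x92bar")
puts cleaned.valid_encoding?   # true
puts cleaned.encoding          # UTF-8
```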

Ruby 1.9 doesn't support Unicode normalization yet

If you are aware of the consequences (accented characters will not be transliterated under Ruby 1.9.1 with Rails 2.3.x), place this in a file under config/initializers to silence the warning:

# http://stackoverflow.com/questions/2135247/ruby-1-9-doesnt-support-unicode-normalization-yet
module ActiveSupport
  module Inflector
    # Calling String#parameterize prints a warning under Ruby 1.9,
    # even if the data in the string doesn't need transliterating.
    if Rails.version =~ /^2\.3/
      undef_method :transliterate
      def transliterate(string)
        string.dup
      end
    end
  end
end

Rails 3 does indeed solve this issue, so a more future-proof solution would be to migrate towards that.

Ruby 1.9: how can I properly upcase & downcase multibyte strings?

Case conversion is locale-dependent and doesn't always round-trip, which is why Ruby 1.9 doesn't cover it.

The unicode_utils gem should address your needs.
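Both problems, locale dependence and lossy round trips, are easy to see on a newer Ruby. The sketch below assumes Ruby 2.4 or later, where String#upcase and String#downcase gained Unicode case mapping; on 1.9 the same calls only touch ASCII letters.

```ruby
# encoding: UTF-8
# Assumption: Ruby 2.4+ (built-in Unicode case mapping).

# "ß" uppercases to two letters, so the round trip loses information.
upper      = "straße".upcase    # "STRASSE"
round_trip = upper.downcase     # "strasse", not "straße"

# The uppercase of "i" depends on the language: Turkic languages
# use a dotted capital "İ" (U+0130).
default_i = "i".upcase            # "I"
turkic_i  = "i".upcase(:turkic)   # "İ"

puts upper, round_trip, default_i, turkic_i
```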

Ruby convert IDN domain from Punycode to Unicode

Try the simpleidn gem. It works with Ruby 1.8.7 and 1.9.2.

Add it to your Gemfile:

gem 'simpleidn'

Then call it as follows:

SimpleIDN.to_unicode("xn--mllerriis-l8a.com")
=> "møllerriis.com"

SimpleIDN.to_ascii("møllerriis.com")
=> "xn--mllerriis-l8a.com"

