How to Convert Character Encoding with Ruby 1.9

As the exception indicates, your string is ASCII-8BIT encoded. You need to change the encoding. There is a longer story behind that, but if you just want a quick solution, call force_encoding on the string before you do any processing:

s = "Learn Objective\xE2\x80\x93C on the Mac"
# => "Learn Objective\xE2\x80\x93C on the Mac"
s.encoding
# => #<Encoding:ASCII-8BIT>
s.force_encoding 'utf-8'
# => "Learn Objective–C on the Mac"

Set UTF-8 as default for Ruby 1.9.3

To change the source encoding (i.e. the encoding your actual written source code is in), you currently have to use the magic comment:

# encoding: utf-8

It is not enough to set only the internal encoding (the encoding of the internal string representation after conversion) or the external encoding (the assumed encoding of read files). You actually have to put the magic encoding comment at the top of each file to set the source encoding.
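
For reference, here is a rough sketch of where each of the three encodings gets set; only the first one is covered by the magic comment:

# encoding: utf-8                             # source encoding of this very file
Encoding.default_external = Encoding::UTF_8   # assumed encoding of data read from files/IO
Encoding.default_internal = Encoding::UTF_8   # encoding read data is transcoded to internally
# The external/internal defaults can also be set on the command line:
#   ruby -E UTF-8:UTF-8 script.rb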

In ChiliProject we have a rake task which sets the correct encoding header in all files automatically before a release.
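
The task itself is specific to ChiliProject, but the idea is simple enough to sketch as a hypothetical Rakefile snippet (not the ChiliProject code; a real task would also have to handle shebang lines):

task :add_encoding_comments do
  Dir.glob('**/*.rb') do |path|
    source = File.read(path)
    next if source.start_with?('# encoding:')       # header already present
    File.write(path, "# encoding: utf-8\n" + source)
  end
end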

As for encoding defaults:

  • Ruby 1.8 and below didn't know the concept of string encodings at all. Strings were more or less byte arrays.
  • Ruby 1.9: the default string encoding is US-ASCII everywhere.
  • Ruby 2.0 and above: the default string encoding is UTF-8.

Thus, if you use Ruby 2.0, you could skip the encoding comment and correctly assume UTF-8 encoding everywhere by default.
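
A quick way to see the difference (a small sketch; run it from a file with no magic comment):

# With no magic comment at the top of the file:
__ENCODING__   # => #<Encoding:US-ASCII> on Ruby 1.9
               # => #<Encoding:UTF-8>    on Ruby 2.0 and above
"résumé"       # parse error on 1.9 ("invalid multibyte char (US-ASCII)"),
               # an ordinary UTF-8 string literal on 2.0 and above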

Ruby 1.9.x: replace sets of characters with specific cleaned-up characters in a string

I'll make it easy for you to implement:

#encoding: UTF-8
t = 'ŠšÐŽžÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûýþÿƒ'
fallback = {
'Š'=>'S', 'š'=>'s', 'Ð'=>'Dj','Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A',
'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E', 'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I',
'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U',
'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss','à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a',
'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i',
'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u',
'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y', 'ƒ'=>'f'
}

p t.encode('us-ascii', :fallback => fallback)
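
If you don't want to enumerate every possible character, note that according to the String#encode documentation :fallback also accepts any object that responds to [], for example a proc; a small sketch:

# any character that can't be expressed in US-ASCII and isn't in the hash becomes '?'
fallback_proc = lambda { |char| fallback.fetch(char, '?') }
p t.encode('us-ascii', :fallback => fallback_proc)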

ruby 1.9, force_encoding, but check

(update: see https://github.com/jrochkind/scrub_rb)

So I coded up a solution to what I needed here: https://github.com/jrochkind/ensure_valid_encoding/blob/master/lib/ensure_valid_encoding.rb

But only much more recently did I realize this actually IS built into the stdlib, you just need to, somewhat counter-intuitively, pass 'binary' as the "source encoding":

a = "bad: \xc3\x28 okay".force_encoding("utf-8")
a.encode("utf-8", "binary", :undef => :replace)
=> "bad: �( okay"

Yep, that's exactly what I wanted. So it turns out this IS built into the 1.9 stdlib; it's just undocumented, and few people know about it (or maybe few English-speaking people know about it?). I did see these arguments used this way on a blog somewhere, though, so someone else knew it!

Character encoding with Ruby 1.9.3 and the mail gem

After playing a bit, I found this:

body.decoded.force_encoding("ISO-8859-1").encode("UTF-8") # => "This reply has accents: Résumé..."
message.parts.map { |part| part.decoded.force_encoding("ISO-8859-1").encode(part.charset) } # multi-part

You can extract the charset from the message like so:

message.charset #=> for simple, non-multipart
message.parts.map { |part| part.charset } #=> for multipart, each part can have its own charset

Be careful with non-multipart, as the following can cause trouble:

body.charset #=> returns "US-ASCII" which is WRONG!
body.force_encoding(body.charset).encode("UTF-8") #=> Conversion error...

body.force_encoding(message.charset).encode("UTF-8") #=> Correct conversion :)
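
Putting the two cases together, something like the following should work (a hedged sketch using the mail gem methods shown above; to_utf8 is just a hypothetical helper name, and the charset is assumed to default to UTF-8 when a part doesn't declare one):

def to_utf8(message)
  if message.multipart?
    message.parts.map do |part|
      part.decoded.force_encoding(part.charset || 'UTF-8').encode('UTF-8')
    end
  else
    message.body.decoded.force_encoding(message.charset || 'UTF-8').encode('UTF-8')
  end
end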

Batch convert to UTF8 using Ruby

Unfortunately, that's not how it turned out: the file is still in ANSI. At least that's what my Notepad++ says.

UTF-8 was designed to be a superset of ASCII, which means that every ASCII character is represented by exactly the same byte in UTF-8. For this reason it's not possible to distinguish ASCII from UTF-8 unless the text contains "special" characters, which UTF-8 represents using multiple bytes.

It's quite possible that your conversion is actually working; you can double-check by running your program on input that contains such special characters.

Also, one of the best utilities for converting between encodings is iconv, which also has ruby bindings.
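
If you'd rather stay in plain Ruby than shell out to iconv, a conversion loop could look roughly like this (assuming the source files are Windows-1252; adjust the source encoding and glob to your actual setup):

Dir.glob('input/*.txt') do |path|
  text = File.read(path, :mode => 'rb').force_encoding('Windows-1252')
  File.write(path, text.encode('UTF-8'))
end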

Can I set the default string encoding on Ruby 1.9?

Don't confuse file encoding with string encoding

The purpose of the #encoding comment at the top of a file is to tell Ruby (when reading and interpreting your code) and your editor (when editing and displaying the file) how to handle any non-ASCII characters. It is only necessary if the file contains at least one non-ASCII character, e.g. it's necessary in your config/locale files.

To define the encoding in all your files at once, you can use the magic_encoding gem, which can insert the utf-8 magic comment into all Ruby files in your app.

The error you're getting at runtime, Encoding::CompatibilityError, happens when you try to concatenate two Strings whose encodings differ and are incompatible during program execution (a minimal reproduction follows the list below).

This most likely happens when:

  • you are using L10N strings (e.g. UTF-8) and concatenating them with, e.g., an ASCII string (in your view)

  • the user types in a string in a foreign language (e.g. UTF-8), and your view tries to print it alongside some fixed string which you pre-defined (ASCII). force_encoding will help there. Ruby 1.9 also lets you set the default encodings via Encoding.default_external and Encoding.default_internal,
    and there is config.encoding in Rails in the config/application.rb file.

  • Strings which come from your database and are then combined with other Strings in your view
    (their encodings could be either way around, and incompatible).
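
Here is a minimal reproduction of that incompatibility; the binary string contains at least one non-ASCII byte, which is what triggers the error:

utf8  = "résumé"                                # UTF-8 literal
ascii = "caf\xE9".force_encoding('ASCII-8BIT')  # e.g. raw bytes from an external source
utf8 + ascii
# => Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT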

Side-Note: Make sure to specify a default encoding when you create your database!

    create database yourproject  DEFAULT CHARACTER SET utf8;

If you want to use EMOJIs in your strings:

    create database yourproject DEFAULT CHARACTER SET utf8mb4 collate utf8mb4_bin;

and all indexes on string columns which may contain emoji need to be limited to 191 characters in length (CHARACTER SET utf8mb4 COLLATE utf8mb4_bin).

The reason for this is that MySQL's plain utf8 charset stores at most 3 bytes per character, whereas emoji need 4 bytes of storage; with utf8mb4, 191 characters × 4 bytes = 764 bytes, which still fits under InnoDB's 767-byte index key limit.
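
In a Rails app that limit would translate into something like this hypothetical migration (table and column names are made up):

class AddNicknameToUsers < ActiveRecord::Migration
  def change
    # 191 characters * 4 bytes = 764 bytes, which fits under InnoDB's 767-byte index key limit
    add_column :users, :nickname, :string, :limit => 191
    add_index  :users, :nickname
  end
end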

Please check these Yehuda Katz articles, which cover this in depth and explain it very well (there is specifically a section on 'Incompatible Encodings'):

http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/

http://yehudakatz.com/2010/05/17/encodings-unabridged/

and:

http://zargony.com/2009/07/24/ruby-1-9-and-file-encodings

http://graysoftinc.com/character-encodings

Ruby String encoding changed over versions

If you compare the documentation between 2.0 and 2.1, you will see that the following text disappeared in 2.1:

Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

So this behaviour, where 2.0 and lower did not modify the string when the source and target encodings were the same while 2.1+ does, appears to be an intended change.
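
In other words (a small sketch of the difference, using a string with an invalid byte):

s = "Is your pl\xFFace available?"   # tagged UTF-8, but \xFF is not valid UTF-8
s.encode('UTF-8', 'UTF-8', :invalid => :replace)
# Ruby 2.0 and lower: => "Is your pl\xFFace available?"   (bytes untouched, still invalid)
# Ruby 2.1 and above: => "Is your pl�ace available?"      (invalid byte replaced)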

I'm not 100% sure what your code is trying to do, but if it's trying to clean up the string from invalid UTF-8 byte sequence, you can use valid_encoding? and scrub as of Ruby 2.1:

irb(main):055:0* content = "Is your pl\xFFace available?"
=> "Is your pl\xFFace available?"
irb(main):056:0> content.valid_encoding?
=> false
irb(main):057:0> new = content.scrub
=> "Is your pl�ace available?"
irb(main):059:0> new.valid_encoding?
=> true

EDIT:

If you look through the 2.0 source code, you will see that the str_transcode0 function exits immediately if senc (the source encoding) is the same as denc (the destination encoding):

    if (senc && senc == denc) {
        return NIL_P(arg2) ? -1 : dencidx;
    }

In 2.1 it scrubs the data when the encodings are the same and you explicitly asked to replace invalid sequences:

    if (senc && senc == denc) {
        ...
        if ((ecflags & ECONV_INVALID_MASK) && explicitly_invalid_replace) {
            dest = rb_str_scrub(str, rep);
        }
        ...
    }

