Set UTF-8 as Default for Ruby 1.9.3

To change the source encoding (i.e. the encoding your actual source code is written in), you currently have to use the magic comment:

# encoding: utf-8

It is not enough to set either the internal encoding (the encoding of the internal string representation after conversion) or the external encoding (the assumed encoding of read files). You actually have to set the magic encoding comment at the top of each file to set the source encoding.

In ChiliProject we have a rake task which sets the correct encoding header in all files automatically before a release.
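
That task is specific to ChiliProject, but the idea is simple. Here is a minimal sketch of such a task, assuming a plain Rakefile and a **/*.rb glob (neither is ChiliProject's actual code):

    # Rakefile -- minimal sketch, not ChiliProject's actual task
    MAGIC_COMMENT = "# encoding: utf-8\n"

    desc "Prepend the utf-8 magic comment to Ruby files that lack one"
    task :add_encoding_comments do
      Dir.glob("**/*.rb").each do |path|
        source = File.read(path)
        next if source.start_with?(MAGIC_COMMENT)  # already tagged
        # Note: files with a shebang line would need the comment on line 2 instead.
        File.write(path, MAGIC_COMMENT + source)
      end
    end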

As for encoding defaults:

  • Ruby 1.8 and below didn't know the concept of string encodings at all. Strings were more or less byte arrays.
  • Ruby 1.9: the default string encoding is US-ASCII everywhere.
  • Ruby 2.0 and above: the default string encoding is UTF-8.

Thus, if you use Ruby 2.0, you could skip the encoding comment and correctly assume UTF-8 encoding everywhere by default.
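
You can check what your interpreter assumes with a quick probe; in a file without a magic comment, the output reflects the defaults above (shown here for Ruby 2.0+):

    puts __ENCODING__              # the source encoding, e.g. UTF-8
    puts "some literal".encoding   # string literals inherit the source encoding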

Can I set the default string encoding on Ruby 1.9?

Don't confuse file encoding with string encoding

The purpose of the # encoding comment at the top of a file is to let Ruby know, when reading/interpreting your code, and your editor know, when editing/reading the file, how to handle any non-ASCII characters -- it is only necessary if the file contains at least one non-ASCII character, e.g. in your config/locale files.

To define the encoding in all your files at once, you can use the magic_encoding gem, which can insert the utf-8 magic comment into every Ruby file in your app.
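
If I remember the gem's interface correctly (this is from memory, so check its README for the exact invocation), it ships a small command-line tool that you run from your application root:

    gem install magic_encoding
    magic_encoding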

The error you're getting at runtime, Encoding::CompatibilityError, happens when you try to concatenate two Strings with different encodings during program execution and their encodings are incompatible (a minimal sketch follows the list below).

This most likely happens when:

  • you are using L10N strings (e.g. UTF-8) and concatenate them with e.g. an ASCII string (in your view)

  • the user types in a string in a foreign language (e.g. UTF-8), and your view tries to print it out along with some fixed string which you pre-defined (ASCII). force_encoding will help there. Since Ruby 1.9 you can also set Encoding.default_external / Encoding.default_internal to control the default encodings for Strings.
    And there is config.encoding in Rails, in the config/application.rb file.

  • Strings which come from your database and are then combined with other Strings in your view
    (their encodings could be either way around, and incompatible).
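
As a concrete illustration of these cases, a minimal sketch (the strings are made up for the example):

    # encoding: utf-8
    utf8   = "héllo"                              # a UTF-8 string literal
    binary = "\xE7".force_encoding("ASCII-8BIT")  # a high byte tagged as binary
    utf8 + binary
    # => Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT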

Side-Note: Make sure to specify a default encoding when you create your database!

    create database yourproject  DEFAULT CHARACTER SET utf8;

If you want to use emoji in your strings:

    create database yourproject DEFAULT CHARACTER SET utf8mb4 collate utf8mb4_bin;

and all indexes on string columns which may contain emoji need to be limited to 191 characters in length (with CHARACTER SET utf8mb4 COLLATE utf8mb4_bin).

The reason for this is that MySQL's regular utf8 uses up to 3 bytes per character, whereas emoji need 4 bytes of storage; 191 characters × 4 bytes stays within InnoDB's 767-byte index key limit.
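
On the Ruby side, the client connection should use the same character set. With the mysql2 gem, the encoding is a connection option (host and credentials below are placeholders):

    require "mysql2"

    client = Mysql2::Client.new(
      host:     "localhost",   # placeholder
      username: "root",        # placeholder
      database: "yourproject",
      encoding: "utf8mb4"      # must match the database charset
    )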

Please check these Yehuda Katz articles, which cover this in depth and explain it very well (there is specifically a section 'Incompatible Encodings'):

http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/

http://yehudakatz.com/2010/05/17/encodings-unabridged/

and:

http://zargony.com/2009/07/24/ruby-1-9-and-file-encodings

http://graysoftinc.com/character-encodings

Ruby 1.9.3 "Invalid byte sequence in UTF-8" explanation needed

I have 64-bit Cygwin, Ruby 2.0.0 and gem 2.4.1, and was experiencing the same issue: gem install ..., gem update, everything ended with "ERROR: While executing gem ... (ArgumentError) invalid byte sequence in UTF-8".

I also had all locales set to "en_US.UTF-8".

I had read somewhere that it should help to set LANG to an empty string or to "C.BINARY", but it didn't. It was a good hint to start experimenting, though.

Finally I solved it by setting both LANG and LC_ALL to an empty string. All other locale environment variables (LC_CTYPE etc.) were automatically set to "C.UTF-8" by that; LANG and LC_ALL remained empty.
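
In shell terms (for a Bourne-compatible shell), that amounts to:

    export LANG=
    export LC_ALL=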

Now gem is finally working.



UPDATE

It seems that specifically LC_CTYPE causes the issue if it's set to a UTF-8 locale, so setting it to C.BINARY should help. The other locale environment variables can be set to UTF-8 without affecting it.

export LC_CTYPE=C.BINARY

Determine character encoding in Ruby 1.9.3

The character ç is encoded in the URL as %E7. This is how ISO-8859-1 encodes ç. The ISO-8859-1 character set represents each character with a single byte; the byte which represents ç can be expressed in hex as E7.

In Unicode, ç has a code point of U+00E7. Unlike ISO-8859-1, in which the code point (E7) is the same as its encoding (E7 in hex), Unicode has multiple encoding schemes such as UTF-8, UTF-16 and UTF-32. UTF-8 encodes U+00E7 (ç) as two bytes - C3 A7.


As to why U+00E7 and E7 in ISO-8859-1 both use "E7": the first 256 code points in Unicode were made identical to ISO-8859-1.
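
You can observe both encodings of ç from Ruby itself:

    # encoding: utf-8
    "ç".bytes                       # => [195, 167]  (0xC3 0xA7 -- UTF-8)
    "ç".encode("ISO-8859-1").bytes  # => [231]       (0xE7 -- ISO-8859-1)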

If this URL were UTF-8, ç would be encoded as %C3%A7. My (very limited) understanding of RFC 2616 is that the default encoding for a URL is (currently) ISO-8859-1. Therefore, this is most likely an ISO-8859-1 encoded URL, which means the best approach is probably to check whether the encoding is valid and, if not, assume it is ISO-8859-1 and transcode it to UTF-8:

unless query.valid_encoding?
  query.encode!("UTF-8", "ISO-8859-1", :invalid => :replace, :undef => :replace, :replace => "")
end

Here's the process in IRB, plus an escaping at the end for fun (CGI has to be required first):

require 'cgi'
=> true
a = CGI.unescape("%E7")
=> "\xE7"
a.encoding
=> #<Encoding:UTF-8>
a.valid_encoding?
=> false
b = a.encode("UTF-8", "ISO-8859-1") # From ISO-8859-1 -> UTF-8
=> "ç"
b.encoding
=> #<Encoding:UTF-8>
CGI.escape(b)
=> "%C3%A7"

Encoding error with Rails 2.3 on Ruby 1.9.3

I finally figured out what my issue was. While my databases were encoded with utf8, the app with the original mysql gem was injecting latin1 text into the utf8 tables.

What threw me off was that the output from the mysql command line client looked correct. It is important to verify that your terminal, the database fields and the MySQL client are all running in utf8.

MySQL's client runs in latin1 by default. You can discover what it is running in by issuing this query:

show variables like 'char%';

If set up properly for utf8, you should see:

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

If these don't look correct, make sure the following is set in the [client] section of your my.cnf config file:

default-character-set = utf8

And add the following to the [mysqld] section:

# use utf8 by default
character-set-server=utf8
collation-server=utf8_general_ci

Make sure to restart the mysql daemon before relaunching the client and then verify.

NOTE: This doesn't change the charset or collation of existing databases; it just ensures that any new databases will default to utf8 and that the client will display in utf8.

After I did this I saw characters in the mysql client that matched what I was getting from the mysql2 gem. I was also able to verify that this content was latin1 by switching to "encoding: latin1" temporarily in my database.conf.

One extremely handy query for finding problem rows is to compare byte length with character length, which flags rows containing multi-byte characters:

SELECT id, name FROM items WHERE LENGTH(name) != CHAR_LENGTH(name);

There are a lot of scripts out there to convert latin1 contents to utf8, but what worked best for me was dumping all of the databases as latin1 and stuffing the contents back in as utf8:

mysqldump -u root -p --opt --default-character-set=latin1 --skip-set-charset  DBNAME > DBNAME.sql

mysql -u root -p --default-character-set=utf8 DBNAME < DBNAME.sql

I backed up my primary db first, then dumped into a test database and verified like crazy before rolling over to the corrected DB.

My understanding is that MySQL's translation can leave some things to be desired with certain more complex characters but since most of my multibyte chars are fairly common things (accent marks, quotes, etc), this worked great for me.

Some resources that proved invaluable in sorting all of this out:

  • Derek Sivers' guide on transforming MySQL data from latin1-in-utf8 to proper utf8
  • Blue Box article on MySQL character set hell
  • Simple table conversion instructions on Stack Overflow

Ruby 1.9.3: Why does "\x03".force_encoding("UTF-8") give "\u0003", but "\x03".force_encoding("UTF-16") give "\x03"?

Because "\x03" is not a valid code point in UTF-16, but it is a valid one in UTF-8 (ASCII 03, ETX, end of text). You have to use at least two bytes to represent a Unicode code point in UTF-16.

That's why "\x03" can be treated as unicode \u0003 in UTF-8 but not in UTF-16.

To represent "\u0003" in UTF-16, you have to use two bytes, either 00 03 or 03 00, depending on the byte order. That's why we need to specify the byte order in UTF-16. For the big-endian version, the byte sequence should be

FE FF 00 03

For the little-endian, the byte sequence should be

FF FE 03 00

The byte order mark should appear at the beginning of a string, or at the beginning of a file.
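
Ruby's transcoding API shows these variants directly (note that encoding to plain "UTF-16" prepends the BOM):

    "\u0003".encode("UTF-16BE").bytes  # => [0, 3]            big-endian, no BOM
    "\u0003".encode("UTF-16LE").bytes  # => [3, 0]            little-endian, no BOM
    "\u0003".encode("UTF-16").bytes    # => [254, 255, 0, 3]  BOM (FE FF) + big-endian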

Starting from Ruby 1.9, a String is just a byte sequence tagged with a specific encoding. force_encoding is a method that changes the encoding tag; it won't affect the byte sequence. You can verify that by inspecting "\x03".force_encoding("UTF-8").bytes.
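
For example:

    s = "\x03"
    s.force_encoding("UTF-8").bytes               # => [3] -- the byte is unchanged
    s.force_encoding("UTF-8") == "\u0003"         # => true
    s.force_encoding("UTF-16BE").valid_encoding?  # => false -- one byte is too short for UTF-16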

If you see "\u0003", that doesn't mean you got a String represented by the two bytes 00 03, but rather some byte(s) that represent the Unicode code point 0003 under the specific encoding carried by that String. It may be:

03                        // tagged as UTF-8
FE FF 00 03               // tagged as UTF-16 (big-endian, BOM)
FF FE 03 00               // tagged as UTF-16 (little-endian, BOM)
03                        // tagged as GBK
03                        // tagged as ASCII
00 00 FE FF 00 00 00 03   // tagged as UTF-32 (big-endian, BOM)
FF FE 00 00 03 00 00 00   // tagged as UTF-32 (little-endian, BOM)

UTF-8 encoding not working with the gets method in Ruby

Setting the file encoding using the "magic" comment at the top of the file only specifies the encoding of your source code in that file (that is: the encoding of string literals created directly by the parser from your code).

Ruby knows two other default encodings (both can be inspected with the probe shown after this list):

  • the external encoding - this specifies the default encoding of data read from external sources (such as the console, opened files, network sockets, ...)
  • the internal encoding - data read from external sources will be transformed into the default internal encoding after reading, to ensure you can use compatible encodings everywhere (this is not used by default; the external encoding is thus preserved).
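
Both defaults can be inspected, and overridden, from Ruby itself; what you see depends on your platform and locale:

    Encoding.default_external  # e.g. #<Encoding:UTF-8>, or the console codepage on older Windows
    Encoding.default_internal  # => nil unless explicitly set

    # Programmatic equivalent of the -E switch shown further below:
    Encoding.default_external = Encoding::UTF_8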

In your case, you have not set the external encoding. On Windows and with Ruby before version 3.0, Ruby assumes the local console encoding of your Windows installation here (such as cp850 in Western Europe).

When Ruby reads your String, it assumes it to be in cp850 (or whatever your default encoding is) while you likely provide utf-8 encoded data. As soon as you start to operate on this incorrectly tagged data, you will get errors similar to the one you have seen.

Thus, to be able to read data correctly, you need to either provide it in an encoding matching your shell encoding, or tell Ruby which encoding it should assume.

If you are providing UTF-8 encoded data, you can set the expected encoding using the -E switch when invoking ruby, e.g.:

ruby -E utf-8 your_program.rb

You can also set this in an environment variable of your Windows shell using

set RUBYOPT=-Eutf-8

In Ruby 3.0, the default external encoding on Windows was changed so that it now defaults to UTF-8 on Windows, similar to other platforms. See https://bugs.ruby-lang.org/issues/16604 for details.


