Set UTF-8 as default for Ruby 1.9.3
To change the source encoding (i.e. the encoding your actual written source code is in), you currently have to use the magic comment:
# encoding: utf-8
It is not enough to either set the internal encoding (the encoding of the internal string representation after conversion) or the external encoding (the assumed encoding of read files). You actually have to set the magic encoding comment on top of files to set the source encoding.
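The three settings are independent and can be inspected separately. A minimal sketch, assuming Ruby 1.9 or later; the helper name encoding_report is made up for illustration:

```ruby
# Collects the three encodings discussed above so they can be compared.
def encoding_report
  {
    # Source encoding of this file, controlled by the magic comment.
    :source   => __ENCODING__,
    # Assumed encoding of data read from files, sockets, the console.
    :external => Encoding.default_external,
    # Target of automatic conversion on read; nil means "no conversion".
    :internal => Encoding.default_internal
  }
end

encoding_report.each do |name, value|
  puts "#{name}: #{value.inspect}"
end
```

Changing the external or internal default leaves the :source entry untouched, which is why the magic comment is still needed on Ruby 1.9.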
In ChiliProject we have a rake task which sets the correct encoding header in all files automatically before a release.
As for encoding defaults:
- Ruby 1.8 and below didn't know the concept of string encodings at all. Strings were more or less byte arrays.
- Ruby 1.9: default string encoding is US_ASCII everywhere.
- Ruby 2.0 and above: default string encoding is UTF-8.
Thus, if you use Ruby 2.0, you could skip the encoding comment and correctly assume UTF-8 encoding everywhere by default.
Can I set the default string encoding on Ruby 1.9?
Don't confuse file encoding with string encoding
The purpose of the # encoding statement at the top of files is to let Ruby know, while reading / interpreting your code, and your editor know, while editing / reading the file, how to handle any non-ASCII characters -- it is only necessary if you have at least one non-ASCII character in the file, e.g. it's necessary in your config/locale files.
To define the encoding in all your files at once, you can use the magic_encoding gem; it can insert the utf-8 magic comment into all Ruby files in your app.
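What such a tool has to do per file can be sketched roughly as follows. This is not the magic_encoding gem's or ChiliProject's actual code; with_magic_comment is a hypothetical helper that works on the file contents as a string:

```ruby
# Returns the source text with "# encoding: utf-8" added at the top,
# unless a magic comment is already in the position Ruby honours
# (line 1, or line 2 when line 1 is a shebang).
def with_magic_comment(source)
  magic_pattern = /\A#.*coding[:=]\s*[\w.-]+/
  comment = "# encoding: utf-8\n"

  lines = source.lines
  has_shebang = !lines.empty? && lines.first.start_with?("#!")
  candidate = has_shebang ? lines[1] : lines.first
  return source if candidate && candidate.match?(magic_pattern)

  if has_shebang
    # Keep the shebang first, then the magic comment, then the rest.
    lines.first.chomp + "\n" + comment + lines.drop(1).join
  else
    comment + source
  end
end

puts with_magic_comment("puts 'hello'\n")
```

A real task would additionally walk the project's .rb files and write the result back; that part is left out here.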
The error you're getting at runtime, Encoding::CompatibilityError, is an error which happens when you try to concatenate two Strings with different encodings during program execution, and their encodings are incompatible.
This most likely happens when:
- you are using L10N strings (e.g. UTF-8), and concatenate them to e.g. an ASCII string (in your view)
- the user types in a string in a foreign language (e.g. UTF-8), and your view tries to print it out in some view, along with some fixed string which you pre-defined (ASCII). force_encoding will help there. There's also Encoding::primary_encoding in Rails 1.9 to set the default encoding for new Strings. And there is config.encoding in Rails in the config/application.rb file.
- Strings which come from your database, and then are combined with other Strings in your view (their encodings could be either way around, and incompatible).
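The failure and one way around it can be reproduced in a few lines. This is only a sketch: concat_as_utf8 is a hypothetical helper, and it assumes both strings carry the correct encoding tag, so transcoding with encode is the right fix (force_encoding is for strings whose bytes are fine but whose tag is wrong):

```ruby
# Tries a plain concatenation first; if the encodings are incompatible,
# transcodes both sides to UTF-8 and concatenates again.
def concat_as_utf8(left, right)
  left + right
rescue Encoding::CompatibilityError
  left.encode("UTF-8") + right.encode("UTF-8")
end

# "\u..." escapes always yield a UTF-8 string.
greeting = "Gr\u00FC\u00DFe, "
# Simulates data that arrived as ISO-8859-1 bytes (0xE9 is "é" there).
name = "Ren\xE9".dup.force_encoding("ISO-8859-1")

begin
  greeting + name
rescue Encoding::CompatibilityError => e
  puts "plain + failed: #{e.message}"
end

puts concat_as_utf8(greeting, name)
```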
Side-Note: Make sure to specify a default encoding when you create your database!
create database yourproject DEFAULT CHARACTER SET utf8;
If you want to use EMOJIs in your strings:
create database yourproject DEFAULT CHARACTER SET utf8mb4 collate utf8mb4_bin;
and all indexes on string columns which may contain emoji need to be limited to 191 characters in length (column definition: CHARACTER SET utf8mb4 COLLATE utf8mb4_bin).
The reason for this is that MySQL's utf8 character set uses up to 3 bytes per character, whereas emoji need 4 bytes of storage.
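The 191 figure falls out of simple arithmetic. The sketch below assumes a 767-byte index key prefix limit, as on older InnoDB row formats; that number does not come from the text above, and newer row formats allow more. max_indexed_chars is a hypothetical helper:

```ruby
# How many characters fit into an index prefix of limit_bytes when each
# character may need up to bytes_per_char bytes.
def max_indexed_chars(limit_bytes, bytes_per_char)
  limit_bytes / bytes_per_char
end

# Byte sizes of a few characters in UTF-8.
puts "\u00E7".bytesize      # c with cedilla: 2
puts "\u20AC".bytesize      # euro sign: 3
puts "\u{1F600}".bytesize   # grinning face emoji: 4

puts max_indexed_chars(767, 3)  # 3-byte utf8: 255
puts max_indexed_chars(767, 4)  # utf8mb4: 191
```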
Please check this Yehuda Katz article, which covers this in-depth, and explains it very well:
(there is specifically a section 'Incompatible Encodings')
http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/
http://yehudakatz.com/2010/05/17/encodings-unabridged/
and:
http://zargony.com/2009/07/24/ruby-1-9-and-file-encodings
http://graysoftinc.com/character-encodings
Ruby 1.9.3 Invalid byte sequence in UTF-8 explanation needed
I have 64 bit Cygwin, Ruby 2.0.0 and gem 2.4.1 and was experiencing the same issue. gem install ..., gem update, everything ended with "ERROR: While executing gem ... (ArgumentError) invalid byte sequence in UTF-8".
I also had all locales set to "en_US.UTF-8".
I have read somewhere that it should help to set LANG to an empty string or "C.BINARY", but it didn't help. But it was a good hint to start experimenting.
Finally I solved it by setting both LANG and LC_ALL to an empty string. All other locale environment variables (LC_CTYPE etc.) were automatically set to "C.UTF-8" by that; LANG and LC_ALL remained empty.
Now gem is finally working.
UPDATE
It seems that specifically LC_CTYPE is causing that issue if it's set to UTF-8. So setting it to C.BINARY should help. Other locale environment variables can be set to UTF-8 without affecting it.
export LC_CTYPE=C.BINARY
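While experimenting with these variables it can help to see what Ruby itself derives from them. A small diagnostic sketch; locale_report is a hypothetical helper, and the values it prints depend entirely on your environment:

```ruby
# Shows the locale variables next to the encodings Ruby picked up.
def locale_report(env = ENV)
  {
    "LANG"             => env["LANG"],
    "LC_ALL"           => env["LC_ALL"],
    "LC_CTYPE"         => env["LC_CTYPE"],
    # Character map of the current locale as seen by Ruby.
    "locale_charmap"   => Encoding.locale_charmap,
    # Encoding Ruby assumes for data it reads from outside.
    "default_external" => Encoding.default_external.name
  }
end

locale_report.each do |key, value|
  puts "#{key}: #{value.inspect}"
end
```

Run it once before and once after changing LC_CTYPE to compare the two results.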
Determine character encoding in Ruby 1.9.3
The character ç is encoded in the URL as %E7. This is how ISO-8859-1 encodes ç. The ISO-8859-1 character set represents a character with a single byte. The byte which represents ç can be expressed in hex as E7.
In Unicode, ç has a code point of U+00E7. Unlike ISO-8859-1, in which the code point (E7) is the same as its encoding (E7 in hex), Unicode has multiple encoding schemes such as UTF-8, UTF-16 and UTF-32. UTF-8 encodes U+00E7 (ç) as two bytes - C3 A7.
See here for other ways to encode ç.
As to why U+00E7 and E7 in ISO-8859-1 both use "E7", the first 256 code points in Unicode were made identical to ISO-8859-1.
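The relationship between the ISO-8859-1 byte and the Unicode code point can be checked from Ruby. A minimal sketch, relying on Ruby's built-in ISO-8859-1 transcoder; latin1_byte_to_code_point is a hypothetical helper:

```ruby
# Interprets a single byte as ISO-8859-1 and returns the Unicode
# code point of the resulting character.
def latin1_byte_to_code_point(byte)
  [byte].pack("C").force_encoding("ISO-8859-1").encode("UTF-8").ord
end

puts format("U+%04X", latin1_byte_to_code_point(0xE7))  # U+00E7

# The same character needs two bytes once it is encoded as UTF-8.
puts "\u00E7".bytes.map { |b| format("%02X", b) }.join(" ")  # C3 A7

# Every ISO-8859-1 byte value maps to the code point of the same number.
puts (0..255).all? { |byte| latin1_byte_to_code_point(byte) == byte }
```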
If this URL were UTF-8, ç would be encoded as %C3%A7. My (very limited) understanding of RFC 2616 is that the default encoding for a URL is (currently) ISO-8859-1. Therefore, this is most likely an ISO-8859-1 encoded URL. This means the best approach is probably to check that the encoding is valid and, if not, assume it is ISO-8859-1 and transcode it to UTF-8:
unless query.valid_encoding?
  query.encode!("UTF-8", "ISO-8859-1", :invalid => :replace, :undef => :replace, :replace => "")
end
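Wrapped into a reusable method, the same check might look like the sketch below. ensure_utf8 is a hypothetical name, and the fallback parameter is an assumption added for illustration; the input is expected to be tagged as UTF-8, as an unescaped query string is in the IRB session:

```ruby
# Returns the text unchanged when it is already valid UTF-8; otherwise
# treats its bytes as the fallback encoding and transcodes them to UTF-8.
def ensure_utf8(text, fallback = "ISO-8859-1")
  return text if text.encoding == Encoding::UTF_8 && text.valid_encoding?

  text.encode("UTF-8", fallback,
              :invalid => :replace, :undef => :replace, :replace => "")
end

# "\xE7" tagged as UTF-8 is what unescaping "%E7" gives.
raw = "\xE7".dup.force_encoding("UTF-8")
puts ensure_utf8(raw)        # ç
puts ensure_utf8("\u00E7")   # ç (already valid, returned as-is)
```

This remains a heuristic: some ISO-8859-1 byte sequences also happen to be valid UTF-8 and would pass through untouched.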
Here's the process in IRB (plus an escaping at the end for fun)
a = CGI.unescape("%E7")
=> "\xE7"
a.encoding
=> #<Encoding:UTF-8>
a.valid_encoding?
=> false
b = a.encode("UTF-8", "ISO-8859-1") # From ISO-8859-1 -> UTF-8
=> "ç"
b.encoding
=> #<Encoding:UTF-8>
CGI.escape(b)
=> "%C3%A7"
Encoding error with Rails 2.3 on Ruby 1.9.3
I finally figured out what my issue was. While my databases were encoded with utf8, the app with the original mysql gem was injecting latin1 text into the utf8 tables.
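What this kind of mix-up does to the data can be simulated in plain Ruby. The sketch assumes the usual mechanism (UTF-8 bytes get labelled latin1 and are then converted to UTF-8 a second time) and uses ISO-8859-1 as a stand-in for MySQL's latin1; both method names are hypothetical:

```ruby
# Mislabels UTF-8 bytes as ISO-8859-1 and converts them to UTF-8 again,
# which is what a wrongly configured connection effectively does.
def double_encode(text)
  text.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Reverses the damage: back to the original bytes, then retag as UTF-8.
def repair_double_encoding(text)
  text.encode("ISO-8859-1").force_encoding("UTF-8")
end

original = "caf\u00E9"
broken = double_encode(original)

puts broken                          # cafÃ©
puts repair_double_encoding(broken)  # café
```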
What threw me off was that the output from the mysql command line client looked correct. It is important to verify that your terminal, the database fields and the MySQL client are all running in utf8.
MySQL's client runs in latin1 by default. You can discover what it is running in by issuing this query:
show variables like 'char%';
If set up properly for utf8 you should see:
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
If these don't look correct, make sure the following is set in the [client] section of your my.cnf config file:
default-character-set = utf8
And add the following to the [mysqld] section:
# use utf8 by default
character-set-server=utf8
collation-server=utf8_general_ci
Make sure to restart the mysql daemon before relaunching the client and then verify.
NOTE: This doesn't change the charset or collation of existing databases, just ensures that any new databases created will default into utf8 and that the client will display in utf8.
After I did this I saw characters in the mysql client that matched what I was getting from the mysql2 gem. I was also able to verify that this content was latin1 by switching to "encoding: latin1" temporarily in my database.conf.
One extremely handy query to find issues is using char length to find the rows with multi-byte characters:
SELECT id, name FROM items WHERE LENGTH(name) != CHAR_LENGTH(name);
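The same comparison works on the Ruby side, where bytesize corresponds to LENGTH and length to CHAR_LENGTH. A small sketch with made-up sample data; multibyte? is a hypothetical helper:

```ruby
# True when the string contains at least one character that takes
# more than one byte in its current encoding.
def multibyte?(text)
  text.bytesize != text.length
end

names = ["plain", "caf\u00E9", "na\u00EFve", "ok"]
suspicious = names.select { |name| multibyte?(name) }
puts suspicious.inspect
```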
There are a lot of scripts out there to convert latin1 contents to utf8, but what worked best for me was dumping all of the databases as latin1 and stuffing the contents back in as utf8:
mysqldump -u root -p --opt --default-character-set=latin1 --skip-set-charset DBNAME > DBNAME.sql
mysql -u root -p --default-character-set=utf8 DBNAME < DBNAME.sql
I backed up my primary db first, then dumped into a test database and verified like crazy before rolling over to the corrected DB.
My understanding is that MySQL's translation can leave some things to be desired with certain more complex characters but since most of my multibyte chars are fairly common things (accent marks, quotes, etc), this worked great for me.
Some resources that proved invaluable in sorting all of this out:
- Derek Sivers' guide on transforming MySQL data latin1 in utf8 -> utf8
- Blue Box article on MySQL character set hell
- Simple table conversion instructions on Stack Overflow
Ruby 1.9.3 Why does "\x03".force_encoding("UTF-8") get \u0003, but "\x03".force_encoding("UTF-16") gets "\x03"
Because the single byte "\x03" is not a valid sequence in UTF-16, but it is a valid one in UTF-8 (ASCII 03, ETX, end of text). You have to use at least two bytes to represent a Unicode code point in UTF-16.
That's why "\x03" can be treated as Unicode \u0003 in UTF-8 but not in UTF-16.
To represent "\u0003" in UTF-16, you have to use two bytes, either 00 03 or 03 00, depending on the byte order. That's why we need to specify byte order in UTF-16. For the big-endian version, the byte sequence should be
FE FF 00 03
For the little-endian, the byte sequence should be
FF FE 03 00
The byte order mark should appear at the beginning of a string, or at the beginning of a file.
Starting from Ruby 1.9, String is just a byte sequence with a specific encoding as a tag. force_encoding is a method to change the encoding tag; it won't affect the byte sequence. You can verify that by inspecting "\x03".force_encoding("UTF-8").bytes.
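The difference between retagging and transcoding is easy to see side by side. A minimal sketch; retagged is a hypothetical helper, and UTF-16BE is used instead of plain UTF-16 so the byte order is explicit:

```ruby
# Changes only the encoding tag of a copy and reports what that did.
def retagged(text, encoding_name)
  copy = text.dup.force_encoding(encoding_name)
  { :bytes => copy.bytes, :valid => copy.valid_encoding? }
end

p retagged("\x03", "UTF-8")     # bytes stay [3], valid
p retagged("\x03", "UTF-16BE")  # bytes stay [3], but one byte is not valid UTF-16

# encode, in contrast, produces new bytes for the same code point.
p "\u0003".encode("UTF-16BE").bytes  # [0, 3]
```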
If you see "\u0003", that doesn't mean you got a String which is represented in two bytes 00 03, but some byte(s) that represent the Unicode code point 0003 under the specific encoding carried by that String. It may be:
03 //tagged as UTF-8
FE FF 00 03 //tagged as UTF-16
FF FE 03 00 //tagged as UTF-16
03 //tagged as GBK
03 //tagged as ASCII
00 00 FE FF 00 00 00 03 // tagged as UTF-32
FF FE 00 00 03 00 00 00 // tagged as UTF-32
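A list like this can be produced by transcoding the same code point into each encoding. In this sketch byte_table is a hypothetical helper; note that Ruby's own "UTF-16" and "UTF-32" converters write a big-endian byte order mark, so the little-endian rows need the explicit LE encodings:

```ruby
# Maps each encoding name to the hex bytes of text in that encoding.
def byte_table(text, encoding_names)
  encoding_names.each_with_object({}) do |name, table|
    bytes = text.encode(name).bytes
    table[name] = bytes.map { |b| format("%02X", b) }.join(" ")
  end
end

names = %w[UTF-8 US-ASCII GBK UTF-16BE UTF-16LE UTF-16 UTF-32]
byte_table("\u0003", names).each do |name, hex|
  puts "#{name.ljust(8)} #{hex}"
end
```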
UTF-8 encoding does not work with gets method in Ruby
Setting the file encoding using the "magic" comment on top of the file only specifies the encoding of your source code in the file (that is: the encoding of string literals created directly from the parser in your code).
Ruby knows two other default encodings:
- the external encoding - this specifies the default encoding of data read from external sources (such as the console, opened files, network sockets, ...)
- the internal encoding - data read from external sources will be transformed into the default internal encoding after reading to ensure you can use compatible encodings everywhere (this is not used by default, the external encoding is thus preserved).
In your case, you have not set the external encoding. On Windows and with Ruby before version 3.0, Ruby assumes the local console encoding of your Windows installation here (such as cp850 in Western Europe).
When Ruby reads your String, it assumes it to be in cp850 encoding (or whatever your default encoding is) while you likely provide utf-8 encoded data. As soon as you start to operate on this incorrectly encoded data, you will get errors similar to the one you have seen there.
Thus, to be able to correctly read data you need to either provide it with an encoding matching your shell encoding, or you need to tell Ruby which encoding it should assume there.
If you are providing UTF-8 encoded data, you can set the expected encoding using the -E switch when invoking ruby, e.g.:
ruby -E utf-8 your_program.rb
You can also set this in an environment variable of your Windows shell using
set RUBYOPT=-Eutf-8
In Ruby 3.0, the default external encoding on Windows was changed so that it now defaults to UTF-8 on Windows, similar to other platforms. See https://bugs.ruby-lang.org/issues/16604 for details.
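The encoding can also be set per stream from inside the program with IO#set_encoding, which works on STDIN as well. The sketch below uses a pipe as a stand-in for the console so that it runs anywhere; read_line_as is a hypothetical helper, and "CP850:UTF-8" means "external cp850, converted to UTF-8 on read":

```ruby
# Writes raw bytes into a pipe and reads them back with gets, using the
# given encoding specification on the reading side.
def read_line_as(bytes, encoding_spec)
  reader, writer = IO.pipe
  writer.binmode
  writer.write(bytes)
  writer.close
  reader.set_encoding(encoding_spec)
  reader.gets.chomp
ensure
  reader.close if reader && !reader.closed?
  writer.close if writer && !writer.closed?
end

utf8_input  = "caf\xC3\xA9\n".b  # "café" as UTF-8 bytes
cp850_input = "caf\x82\n".b      # "café" as cp850 bytes

puts read_line_as(utf8_input, "UTF-8")
puts read_line_as(cp850_input, "CP850:UTF-8")
```

For real console input the equivalent call would be STDIN.set_encoding with the same kind of argument before the first gets.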