Ruby 1.9, Force_Encoding, But Check

ruby 1.9, force_encoding, but check

(update: see https://github.com/jrochkind/scrub_rb)

So I coded up a solution to what I needed here: https://github.com/jrochkind/ensure_valid_encoding/blob/master/lib/ensure_valid_encoding.rb

But only much more recently did I realize this actually IS built into the stdlib, you just need to, somewhat counter-intuitively, pass 'binary' as the "source encoding":

a = "bad: \xc3\x28 okay".force_encoding("utf-8")
a.encode("utf-8", "binary", :undef => :replace)
=> "bad: �( okay"

Yep, that's exactly what I wanted. So turns out this IS built into 1.9 stdlib, it's just undocumented and few people know it (or maybe few people that speak English know it?). Although I saw these arguments used this way on a blog somewhere, so someone else knew it!

Is there a way in ruby 1.9 to remove invalid byte sequences from strings?

"€foo\xA0".chars.select(&:valid_encoding?).join

Ruby 1.9.3 Why does \x03.force_encoding(UTF-8) get \u0003 ,but \x03.force_encoding(UTF-16) gets \x03

Because "\x03" is not a valid code point in UTF-16, but a valid one in UTF-8 (ASCII 03, ETX, end of text). You have to use at least two bytes to represent a unicode code point in UTF-16.

That's why "\x03" can be treated as unicode \u0003 in UTF-8 but not in UTF-16.

To represent "\u0003" in UTF-16, you have to use two byte, either 00 03 or 03 00, depending on the byte order. That's why we need to specify byte order in UTF-16. For the big-endian version, the byte sequence should be

FE FF 00 03

For the little-endian, the byte sequence should be

FF FE 03 00

The byte order mark should appear at the beginning of a string, or at the beginning of a file.

Starting from Ruby 1.9, String is just a byte sequence with a specific encoding as a tag. force_encoding is a method to change the encoding tag, it won't affect the byte sequence. You can verify that by inspecting "\x03".force_encoding("UTF-8").bytes.

If you see "\u0003", that doesn't mean you got a String which is represented in two bytes 00 03, but some byte(s) that represents the Unicode code point 0003 under the specific encoding as carried in that String. It may be:

03              //tagged as UTF-8
FE FF 00 03 //tagged as UTF-16
FF FE 03 00 //tagged as UTF-16
03 //tagged as GBK
03 //tagged as ASCII
00 00 FE FF 00 00 00 03 // tagged as UTF-32
FF FE 00 00 03 00 00 00 // tagged as UTF-32

Can I set the default string encoding on Ruby 1.9?

Don't confuse file encoding with string encoding

The purpose of the #encoding statement at the top of files is to let Ruby know during reading / interpreting your code, and your editor know how to handle any non-ASCII characters while editing / reading the file -- it is only necessary if you have at least one non-ASCII character in the file. e.g. it's necessary in your config/locale files.

To define the encoding in all your files at once, you can use the
magic_encoding gem
, it can insert uft-8 magic comment to all ruby files in your app.

The error you're getting at runtime Encoding::CompatibilityError is an error which happens when you try to concatenate two Strings with different encoding during program execution, and their encodings are incompatible.

This most likely happens when:

  • you are using L10N strings (e.g. UTF-8), and concatenate them to e.g. ASCII string (in your view)

  • the user types in a string in a foreign language (e.g. UTF-8), and your view tries to print it out in some view, along with some fixed string which you pre-defined (ASCII). force_encoding will help there. There's also Encoding::primary_encoding in Rails 1.9 to set the default encoding for new Strings.
    And there is config.encoding in Rails in the config/application.rb file.

  • String which come from your database, and then are combined with other Strings in your view.
    (their encodings could be either way around, and incompatible).

Side-Note: Make sure to specify a default encoding when you create your database!

    create database yourproject  DEFAULT CHARACTER SET utf8;

If you want to use EMOJIs in your strings:

    create database yourproject DEFAULT CHARACTER SET utf8mb4 collate utf8mb4_bin;

and all indexes on string columns which may contain EMOJI need to be 191 characters in length. CHARACTER SET utf8mb4 COLLATE utf8mb4_bin

The reason for this is that normal UTF8 uses up to 3 bytes, whereas EMOJI use 4 bytes storage.

Please check this Yehuda Katz article, which covers this in-depth, and explains it very well:
(there is specifically a section 'Incompatible Encodings')

http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/

http://yehudakatz.com/2010/05/17/encodings-unabridged/

and:

http://zargony.com/2009/07/24/ruby-1-9-and-file-encodings

http://graysoftinc.com/character-encodings

ruby 1.9 wrong file encoding on windows

You're not specifying the encoding when you read the file. You're being very careful to specify it everywhere except there, but then you're reading it with the default encoding.

File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'.force_encoding('iso-8859-1')}
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding }

# => ISO-8859-1

Also note that you probably mean 'fòo'.encode('iso-8859-1') rather than 'fòo'.force_encoding('iso-8859-1'). The latter leaves the bytes unchanged, while the former transcodes the string.

Update: I'll elaborate a bit since I wasn't as clear or thorough as I could have been.

  1. If you don't specify an encoding with File.read(), the file will be read with Encoding.default_external. Since you're not setting that yourself, Ruby is using a value depending on the environment it's run in. In your Windows environment, it's IBM437; in your Cygwin environment, it's UTF-8. So my point above was that of course that's what the encoding is; it has to be, and it has nothing to do with what bytes are contained in the file. Ruby doesn't auto-detect encodings for you.

  2. force_encoding() doesn't change the bytes in a string, it only changes the Encoding attached to those bytes. If you tell Ruby "pretend this string is ISO-8859-1", then it won't transcode them when you tell it "please write this string as ISO-8859-1". encode() transcodes for you, as does writing to the file if you don't trick it into not doing so.

Putting those together, if you have a source file in ISO-8859-1:

# encoding: iso-8859-1

# Write in ISO-8859-1 regardless of default_external
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'}

# Read in ISO-8859-1 regardless of default_external,
# transcoding if necessary to default_internal, if set
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding } # => ISO-8859-1

puts File.read('foo.txt').encoding # -> Whatever is specified by default_external

If you have a source file in UTF-8:

# encoding: utf-8

# Write in ISO-8859-1 regardless of default_external, transcoding from UTF-8
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'}

# Read in ISO-8859-1 regardless of default_external,
# transcoding if necessary to default_internal, if set
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding } # => ISO-8859-1

puts File.read('foo.txt').encoding # -> Whatever is specified by default_external

Update 2, to answer your new questions:

  1. No, the # encoding: iso-8859-1 line does not change Encoding.default_external, it only tells Ruby that the source file itself is encoded in ISO-8859-1. Simply add

    Encoding.default_external = "iso-8859-1"

    if you expect all files that your read to be stored in that encoding.

  2. No, I don't personally think Ruby should auto-detect encodings, but reasonable people can disagree on that one, and a discussion of "should it be so" seems off-topic here.

  3. Personally, I use UTF-8 for everything, and in the rare circumstances that I can't control encoding, I manually set the encoding when I read the file, as demonstrated above. My source files are always in UTF-8. If you're dealing with files that you can't control and don't know the encoding of, the charguess gem or similar would be useful.

String.force_encoding() in Ruby 1.8.7 (or Rails 2.x)

The only thing force_encoding does in 1.9 is that it changes the encoding field of the string, it does not actually modify the string's bytes.

Ruby 1.8 doesn't have the concept of string encodings, so force_encoding would be a no-op. You can add it yourself like this if you want to be able to run the same code in 1.8 and 1.9:

class String
def force_encoding(enc)
self
end
end

There will of course be other things that you have to do to make encodings work the same across 1.8 and 1.9, since they handle this issue very differently.

Ruby invalid multibyte char error (Sep 2019)

I could reproduce your issue by saving the file with ISO-8859-1 encoding.

Running your code with the file in this non UTF8-encoding the error popped up. My solution was to save the file as UTF-8.

I am using Sublime as text editor and there is the option 'file > save with encoding'. I have chosen 'UTF-8' and was able to run the script.

Using puts line.encoding showed me UTF-8 then and no error anymore.

I suggest to re-check the encoding of your saved script file again.

Character encoding with Ruby 1.9.3 and the mail gem

After playing a bit, I found this:

body.decoded.force_encoding("ISO-8859-1").encode("UTF-8") # => "This reply has accents: Résumé..."
message.parts.map { |part| part.decoded.force_encoding("ISO-8859-1").encode(part.charset) } # multi-part

You can extract the charset from the message like so.

message.charset #=> for simple, non-multipart
message.parts.map { |part| part.charset } #=> for multipart, each part can have its own charset

Be careful with non-multipart, as the following can cause trouble:

body.charset #=> returns "US-ASCII" which is WRONG!
body.force_encoding(body.charset).encode("UTF-8") #=> Conversion error...

body.force_encoding(message.charset).encode("UTF-8") #=> Correct conversion :)

Character encoding with Ruby 1.9.3 and the mail gem

After playing a bit, I found this:

body.decoded.force_encoding("ISO-8859-1").encode("UTF-8") # => "This reply has accents: Résumé..."
message.parts.map { |part| part.decoded.force_encoding("ISO-8859-1").encode(part.charset) } # multi-part

You can extract the charset from the message like so.

message.charset #=> for simple, non-multipart
message.parts.map { |part| part.charset } #=> for multipart, each part can have its own charset

Be careful with non-multipart, as the following can cause trouble:

body.charset #=> returns "US-ASCII" which is WRONG!
body.force_encoding(body.charset).encode("UTF-8") #=> Conversion error...

body.force_encoding(message.charset).encode("UTF-8") #=> Correct conversion :)


Related Topics



Leave a reply



Submit