Difference between #encode and #force_encoding in Ruby

What is the difference between #encode and #force_encoding in ruby?

The difference is pretty big. force_encoding sets the string's encoding label, but does not change the string itself, i.e. it does not change its representation in memory:

'łał'.bytes #=> [197, 130, 97, 197, 130]
'łał'.force_encoding('ASCII').bytes #=> [197, 130, 97, 197, 130]
'łał'.force_encoding('ASCII') #=> "\xC5\x82a\xC5\x82"
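You can confirm that the relabeled string is now mislabeled: String#valid_encoding? reports whether the bytes are actually legal in the declared encoding:

```ruby
s = 'łał'.force_encoding('ASCII')
s.encoding.name    # => "US-ASCII" ('ASCII' is an alias)
s.valid_encoding?  # => false: the UTF-8 bytes are not valid US-ASCII
```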

encode assumes that the current encoding is correct and rewrites the bytes so that the string reads the same way in the target encoding:

'łał'.encode('UTF-16') #=> 'łał'
'łał'.encode('UTF-16').bytes #=> [254, 255, 1, 66, 0, 97, 1, 66]

In short: force_encoding changes how the existing bytes are interpreted, while encode rewrites the bytes so that the text itself stays the same (where possible).
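The contrast in one snippet (using UTF-16BE here so the byte order is explicit and no BOM is added):

```ruby
utf8 = 'łał'  # UTF-8 source string

# force_encoding: same bytes, new label (here the label is wrong)
mislabeled = utf8.dup.force_encoding('ISO-8859-1')
mislabeled.bytes   # => [197, 130, 97, 197, 130] (unchanged)

# encode: new bytes, so the text still reads 'łał' in the target encoding
converted = utf8.encode('UTF-16BE')
converted.bytes    # => [1, 66, 0, 97, 1, 66]
```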

ruby 1.9, force_encoding, but check

(update: see https://github.com/jrochkind/scrub_rb)

So I coded up a solution to what I needed here: https://github.com/jrochkind/ensure_valid_encoding/blob/master/lib/ensure_valid_encoding.rb

But only much more recently did I realize this actually IS built into the stdlib, you just need to, somewhat counter-intuitively, pass 'binary' as the "source encoding":

a = "bad: \xc3\x28 okay".force_encoding("utf-8")
a.encode("utf-8", "binary", :undef => :replace)
=> "bad: �( okay"

Yep, that's exactly what I wanted. So turns out this IS built into 1.9 stdlib, it's just undocumented and few people know it (or maybe few people that speak English know it?). Although I saw these arguments used this way on a blog somewhere, so someone else knew it!
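For what it's worth, Ruby 2.1 later made this cleanup official as String#scrub (the scrub_rb gem linked above backports it to 1.9/2.0):

```ruby
a = "bad: \xc3\x28 okay".force_encoding("utf-8")
a.valid_encoding?  # => false
a.scrub            # => "bad: �( okay" (invalid bytes become U+FFFD)
```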

Force strings to UTF-8 from any encoding

Ruby 1.9

"Forcing" an encoding is easy; however, it won't convert the characters, just change the encoding label:

str = str.force_encoding('UTF-8')

str.encoding.name # => 'UTF-8'

If you want to perform a conversion, use encode:

begin
  str.encode("UTF-8")
rescue Encoding::UndefinedConversionError
  # ...
end
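If you would rather substitute problem characters than rescue, encode also accepts :invalid, :undef, and :replace options (the sample string below, Latin-1 bytes mislabeled as UTF-8, is just an illustration):

```ruby
str = "r\xE9sum\xE9"  # invalid as UTF-8: \xE9 is a bare Latin-1 byte
str.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
# => "r?sum?"
```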

I would definitely read the following post for more information:

http://graysoftinc.com/character-encodings/ruby-19s-string

Ruby converting string encoding from ISO-8859-1 to UTF-8 not working

You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.

string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# 1
string.bytes
# [195, 164]

Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. This does not contain ä any more. It contains two characters, Ã and ¤.

string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# 2
string.bytes
# [195, 164]

Then you translate that into UTF-8. Since this is not reinterpretation but translation, you keep the two characters, but now encoded in UTF-8:

string = string.encode('utf-8')
# => "ä"
string.length
# 2
string.bytes
# [195, 131, 194, 164]

What you are missing is the fact that you originally don't have an ISO-8859-1 string, as you would from your Web-service - you have gibberish. Fortunately, this is all in your console tests; if you read the response of the website using the proper input encoding, it should all work okay.

For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:

string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"

EDIT For your specific problem, this should work:

require 'net/https'
uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
  :use_ssl => uri.scheme == 'https',
  :verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
  https.request(Net::HTTP::Get.new(uri.path))
end
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')
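Rather than hard-coding ISO-8859-1, you could also read the charset the server declares in its Content-Type header. This is a hypothetical sketch: the helper name and the regex are assumptions, and a real version should handle quoted charset values too:

```ruby
# content_type is assumed to look like "text/xml; charset=ISO-8859-1";
# fall back to ISO-8859-1 when no charset is declared.
def body_to_utf8(body, content_type)
  charset = content_type[/charset=([^\s;]+)/i, 1] || 'ISO-8859-1'
  body.force_encoding(charset).encode('UTF-8')
end

body_to_utf8("Norrlandsv\xE4gen".b, "text/xml; charset=ISO-8859-1")
# => "Norrlandsvägen"
```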

In Ruby, how to UTF-8 encode this weird character?

I had this problem with Fixing Incorrect String Encoding From MySQL. You need to set the proper encoding and then force it back.

fallback = {
  "\u0081" => "\x81".force_encoding("CP1252"),
  "\u008D" => "\x8D".force_encoding("CP1252"),
  "\u008F" => "\x8F".force_encoding("CP1252"),
  "\u0090" => "\x90".force_encoding("CP1252"),
  "\u009D" => "\x9D".force_encoding("CP1252")
}

str.encode('CP1252', fallback: fallback).force_encoding('UTF-8')

The fallback may not be necessary depending on your data, but it ensures that it won't raise an error by handling the five bytes which are undefined in CP1252.

Handling encoding in ruby

I suspect your problem is double-encoded strings. This is very bad for various reasons, but the tl;dr here is it's not fully fixable, and you should instead fix the root problem of strings being double-encoded if at all possible.

This produces a double-encoded string with UTF-8 characters:

> str = "汉语 / 漢語"
=> "汉语 / 漢語"
> str.force_encoding("iso-8859-1")
=> "\xE6\xB1\x89\xE8\xAF\xAD / \xE6\xBC\xA2\xE8\xAA\x9E"
> bad = str.force_encoding("iso-8859-1").encode("utf-8")
=> "æ±\u0089语 / æ¼¢èª\u009E"

You can then fix it by reinterpreting the double-encoded UTF-8 as ISO-8859-1 and then declaring the encoding to actually be UTF-8:

> bad.encode("iso-8859-1").force_encoding("utf-8")
=> "汉语 / 漢語"

But you can't convert the actual UTF-8 string into ISO-8859-1, since there are codepoints in UTF-8 which ISO-8859-1 has no unambiguous means of encoding:

> str.encode("iso-8859-1")
Encoding::UndefinedConversionError: "\xE6\xB1\x89" from UTF-8 to ISO-8859-1

Now, you can't actually detect and fix this all the time because "there's no way to tell whether the result is from incorrectly double-encoding one character, or correctly single-encoding 2 characters."

So, the best you're left with is a heuristic. Borshuno's suggestion won't work here because it will actually destroy unconvertable bytes:

> str.encode( "iso-8859-1", fallback: lambda{|c| c.force_encoding("utf-8")} )
=> " / "

The best course of action, if at all possible, is to fix your double-encoding issue so that it doesn't happen at all. The next best course of action is to add BOM bytes to your UTF-8 strings if you suspect they may get double-encoded, since you could then check for those bytes and determine whether your string has been re-encoded or not.

> str_bom = "\xEF\xBB\xBF" + str
=> "汉语 / 漢語"
> str_bom.start_with?("\xEF\xBB\xBF")
=> true
> str_bom.force_encoding("iso-8859-1").encode("utf-8").start_with?("\xEF\xBB\xBF")
=> false

If you can presume that the BOM is in your "proper" string, then you can check for double-encoding by checking whether the BOM is present. If it's not (i.e., the string has been re-encoded), then you can perform your decoding routine:

> str_bom.force_encoding("iso-8859-1").encode("utf-8").encode("iso-8859-1").force_encoding("utf-8").start_with?("\xEF\xBB\xBF")
=> true

If you can't be assured of the BOM, then you could use a heuristic to guess whether a string is "bad" or not, by counting unprintable characters, or characters which fall outside of your normal expected result set (your string looks like it's dealing with Hebrew; you could say that any string which consists of >50% non-Hebrew letters is double-encoded, for example), so you could then attempt to decode it.
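A crude version of that heuristic might look like this. The expected script (Hebrew plus ASCII), the threshold, and the function name are all assumptions you would tune for your own data:

```ruby
# Guess whether a UTF-8 string is mojibake by measuring how much of it
# falls outside the script we expect (Hebrew plus ASCII, as an example).
def probably_double_encoded?(str, threshold: 0.5)
  chars = str.chars
  return false if chars.empty?
  suspicious = chars.count { |c| c !~ /[\p{Hebrew}\p{ASCII}]/ }
  suspicious.fdiv(chars.size) > threshold
end

good = "שלום world"
bad  = good.b.force_encoding('ISO-8859-1').encode('UTF-8')  # simulate double-encoding
probably_double_encoded?(good)  # => false
probably_double_encoded?(bad)   # => true
```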

Finally, you would have to fall back to exception handling and hope that you know which encoding the string was purportedly declared as when it was double-encoded:

str = "汉语 / 漢語"
begin
  str.encode("iso-8859-1").encode("utf-8")
rescue Encoding::UndefinedConversionError
  str
end

However, even if you know that a string is double-encoded, if you don't know the encoding that it was improperly declared as when it was converted to UTF-8, you can't do the reverse operation:

> bad_str = str.force_encoding("windows-1252").encode("utf-8")
=> "汉语 / 漢語"
> bad_str.encode("iso-8859-1").force_encoding("utf-8")
Encoding::UndefinedConversionError: "\xE2\x80\xB0" from UTF-8 to ISO-8859-1

Since the string itself doesn't carry any information about the encoding it was incorrectly encoded from, you don't have enough information to reliably solve it, and are left with iterating through a list of most-likely encodings and heuristically checking the result of each successful re-encode with your Hebrew heuristic.
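Such an iteration could be sketched like this. The candidate list and the names here are assumptions, and checking valid_encoding? alone is a weak test; a real version would also apply a script heuristic like the one described above before accepting a result:

```ruby
# Candidate source encodings, ordered by likelihood for your data.
CANDIDATE_ENCODINGS = ['ISO-8859-1', 'Windows-1252', 'ISO-8859-8']

# Round-trip the string through each candidate and keep the first
# result that is valid UTF-8; fall back to the original if none work.
def try_fix_double_encoding(str)
  CANDIDATE_ENCODINGS.each do |enc|
    begin
      fixed = str.encode(enc).force_encoding('UTF-8')
      return fixed if fixed.valid_encoding?
    rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
      next
    end
  end
  str
end
```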

To echo the post I linked: character encodings are hard.

Ruby encoding ASCII_8BIT and extended ASCII

String literals are (usually) UTF-8 encoded regardless of whether or not the bytes are valid UTF-8. Hence this:

"\x8f".encoding

saying UTF-8 even though the string isn't valid UTF-8. You should be safe using String#force_encoding, but if you really want to work with raw bytes, you might be better off working with arrays of integers and using Array#pack to mash them into strings:

[ 0x8f, 0x11, 0x06, 0x23, 0xff, 0x00 ].pack('C*')
# "\x8F\x11\x06#\xFF\x00"
[ 0x8f, 0x11, 0x06, 0x23, 0xff, 0x00 ].pack('C*').encoding
# #<Encoding:ASCII-8BIT>
[ 0x8f, 0x11, 0x06, 0x23, 0xff, 0x00 ].pack('C*').bytes
# [143, 17, 6, 35, 255, 0]

The results should be the same but, IMO, this is explicitly working with binary data (i.e. raw bytes), makes your intent clear, and should avoid any encoding problems.

There's also String#unpack if there is a known structure to the bytes you're reading and you want to crack it open.
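For example, pulling a few fixed-width fields out of a binary string (the packet layout here is made up for illustration):

```ruby
# Suppose the first byte is a flag, the next two are a big-endian
# 16-bit length, and the rest is the payload.
packet = [0x8f, 0x00, 0x05].pack('C*') + 'hello'
flag, length, payload = packet.unpack('Cna*')
flag     # => 143
length   # => 5
payload  # => "hello"
```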

Ruby String encoding changed over versions

If you look at the differences between the 2.0 and 2.1 documentation, you will see that the following text disappeared in 2.1:

Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

So this behaviour, where 2.0 and lower did not modify the string when the source and target encodings were the same but 2.1+ does, appears to be an intentional change.

I'm not 100% sure what your code is trying to do, but if it's trying to clean up the string from invalid UTF-8 byte sequence, you can use valid_encoding? and scrub as of Ruby 2.1:

irb(main):055:0* content = "Is your pl\xFFace available?"
=> "Is your pl\xFFace available?"
irb(main):056:0> content.valid_encoding?
=> false
irb(main):057:0> new = content.scrub
=> "Is your pl�ace available?"
irb(main):059:0> new.valid_encoding?
=> true
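On 2.1+, the same cleanup also works through encode itself, because same-encoding conversion now honours invalid: :replace:

```ruby
content = "Is your pl\xFFace available?"
content.encode('UTF-8', invalid: :replace)
# Ruby 2.1+: => "Is your pl�ace available?"
```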

EDIT:

If you look through the 2.0 source code, you will see the str_transcode0 function exits immediately if senc (source encode) is the same as denc (destination encode):

if (senc && senc == denc) {
    return NIL_P(arg2) ? -1 : dencidx;
}

In 2.1 it scrubs the data when the encodings are the same and you explicitly asked to replace invalid sequences:

if (senc && senc == denc) {
    ...
    if ((ecflags & ECONV_INVALID_MASK) && explicitly_invalid_replace) {
        dest = rb_str_scrub(str, rep);
    }
    ...
}

