What is the difference between #encode and #force_encoding in ruby?
The difference is significant. force_encoding sets the string's encoding label but does not change the string itself, i.e. it does not change its representation in memory:
'łał'.bytes #=> [197, 130, 97, 197, 130]
'łał'.force_encoding('ASCII').bytes #=> [197, 130, 97, 197, 130]
'łał'.force_encoding('ASCII') #=> "\xC5\x82a\xC5\x82"
encode assumes that the current encoding is correct and tries to convert the string so that it reads the same way in the target encoding:
'łał'.encode('UTF-16') #=> 'łał'
'łał'.encode('UTF-16').bytes #=> [254, 255, 1, 65, 0, 97, 1, 66]
In short, force_encoding changes how the string is read from its bytes, while encode changes how the string is written (its bytes) without changing what it reads as, where possible.
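To see both behaviors side by side, here is a minimal sketch starting from bytes that are really UTF-8 but arrived mislabeled as ISO-8859-1:

```ruby
# Bytes of "łał" in UTF-8, but mislabeled as ISO-8859-1.
s = "\xC5\x82a\xC5\x82".force_encoding("ISO-8859-1")

# force_encoding only relabels: the bytes stay the same,
# but the string now reads as three UTF-8 characters.
fixed = s.dup.force_encoding("UTF-8")
fixed        # => "łał"
fixed.bytes  # => [197, 130, 97, 197, 130]

# encode transcodes: the label was wrong here, so each
# ISO-8859-1 "character" gets re-encoded into mojibake.
mangled = s.encode("UTF-8")
mangled.bytes  # => [195, 133, 194, 130, 97, 195, 133, 194, 130]
```

The rule of thumb: use force_encoding when the bytes are right but the label is wrong, and encode when the label is right and you want different bytes.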
ruby 1.9, force_encoding, but check
(update: see https://github.com/jrochkind/scrub_rb)
So I coded up a solution to what I needed here: https://github.com/jrochkind/ensure_valid_encoding/blob/master/lib/ensure_valid_encoding.rb
But only much more recently did I realize this actually IS built into the stdlib, you just need to, somewhat counter-intuitively, pass 'binary' as the "source encoding":
a = "bad: \xc3\x28 okay".force_encoding("utf-8")
a.encode("utf-8", "binary", :undef => :replace)
=> "bad: �( okay"
Yep, that's exactly what I wanted. So it turns out this IS built into the 1.9 stdlib; it's just undocumented and few people know about it. Although I saw these arguments used this way on a blog somewhere, so someone else knew it!
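For what it's worth, on Ruby 2.1+ the same cleanup is available directly as String#scrub (which is what the scrub_rb gem above backports); a quick sketch:

```ruby
# Ruby 2.1+: scrub replaces invalid byte sequences directly,
# without the 'binary' source-encoding trick.
a = "bad: \xc3\x28 okay".force_encoding("utf-8")
a.valid_encoding?  # => false
a.scrub            # => "bad: �( okay"
a.scrub("?")       # => "bad: ?( okay"
```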
Force strings to UTF-8 from any encoding
Ruby 1.9
"Forcing" an encoding is easy; however, it won't convert the characters, just change the encoding label:
str = str.force_encoding('UTF-8')
str.encoding.name # => 'UTF-8'
If you want to perform a conversion, use encode:
begin
  str.encode("UTF-8")
rescue Encoding::UndefinedConversionError
  # handle characters that have no equivalent in UTF-8
end
I would definitely read the following post for more information:
http://graysoftinc.com/character-encodings/ruby-19s-string
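If you would rather substitute a replacement character than rescue, encode also accepts :invalid and :undef options (note that on Ruby 2.1+ this works even when source and target encodings are the same); a sketch:

```ruby
# Replace byte sequences that are invalid in the source encoding,
# and characters undefined in the target, instead of raising.
str = "caf\xE9".force_encoding("UTF-8")  # \xE9 is not valid UTF-8
cleaned = str.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
cleaned  # => "caf?"
```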
Ruby converting string encoding from ISO-8859-1 to UTF-8 not working
You assign a string in UTF-8. It contains ä. UTF-8 represents ä with two bytes.
string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# 1
string.bytes
# [195, 164]
Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. The string no longer contains ä. It contains two characters, Ã and ¤.
string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# 2
string.bytes
# [195, 164]
Then you translate that into UTF-8. Since this is translation, not reinterpretation, you keep the two characters, but now encoded in UTF-8:
string = string.encode('utf-8')
# => "ä"
string.length
# 2
string.bytes
# [195, 131, 194, 164]
What you are missing is that you don't actually have an ISO-8859-1 string to begin with, as you would from your web service; you have gibberish. Fortunately, this is all in your console tests: if you read the website's response using the proper input encoding, it should all work fine.
For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:
string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"
EDIT: For your specific problem, this should work:
require 'net/https'

uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
  :use_ssl => uri.scheme == 'https',
  :verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
  https.request(Net::HTTP::Get.new(uri.path))
end
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')
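Rather than hard-coding ISO-8859-1, you could read the charset the server declares in its Content-Type header; a rough sketch (the declared_charset helper and its regex are assumptions, not part of the original answer):

```ruby
# Pick the encoding out of a Content-Type header such as
# "text/xml; charset=ISO-8859-1", falling back to a default.
def declared_charset(content_type, default = 'ISO-8859-1')
  content_type.to_s[/charset=([^;\s]+)/i, 1] || default
end

charset = declared_charset('text/xml; charset=ISO-8859-1')
# body = response.body.force_encoding(charset).encode('UTF-8')
```

With Net::HTTP you would pass response['Content-Type'] to the helper; servers that omit the charset parameter fall back to the default.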
In Ruby, how to UTF-8 encode this weird character?
I had this problem with Fixing Incorrect String Encoding From MySQL. You need to encode the string back to CP1252 to recover the original bytes, then force the encoding back to UTF-8.
fallback = {
  "\u0081" => "\x81".force_encoding("CP1252"),
  "\u008D" => "\x8D".force_encoding("CP1252"),
  "\u008F" => "\x8F".force_encoding("CP1252"),
  "\u0090" => "\x90".force_encoding("CP1252"),
  "\u009D" => "\x9D".force_encoding("CP1252")
}
str.encode('CP1252', fallback: fallback).force_encoding('UTF-8')
The fallback may not be necessary depending on your data, but it ensures that it won't raise an error by handling the five bytes which are undefined in CP1252.
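As a quick sanity check of that round trip, using é as a stand-in for data stored as UTF-8 bytes but read back as CP1252:

```ruby
# "é" stored as UTF-8 bytes (C3 A9) but read back as CP1252
# comes out as the two characters "Ã©". Transcoding to CP1252
# recovers the original bytes; relabeling reads them as UTF-8.
mojibake = "Ã©"
fixed = mojibake.encode("CP1252").force_encoding("UTF-8")
fixed  # => "é"
```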
Handling encoding in ruby
I suspect your problem is double-encoded strings. This is very bad for various reasons, but the tl;dr here is it's not fully fixable, and you should instead fix the root problem of strings being double-encoded if at all possible.
This produces a double-encoded string with UTF-8 characters:
> str = "汉语 / 漢語"
=> "汉语 / 漢語"
> str.force_encoding("iso-8859-1")
=> "\xE6\xB1\x89\xE8\xAF\xAD / \xE6\xBC\xA2\xE8\xAA\x9E"
> bad = str.force_encoding("iso-8859-1").encode("utf-8")
=> "æ±\u0089è¯ / æ¼¢èª\u009E"
You can then fix it by reinterpreting the double-encoded UTF-8 as ISO-8859-1 and then declaring the encoding to actually be UTF-8:
> bad.encode("iso-8859-1").force_encoding("utf-8")
=> "汉语 / 漢語"
But you can't convert the actual UTF-8 string into ISO-8859-1, since there are codepoints in UTF-8 which ISO-8859-1 doesn't have any unambiguous means of encoding:
> str.encode("iso-8859-1")
Encoding::UndefinedConversionError: "\xE6\xB1\x89" from UTF-8 to ISO-8859-1
Now, you can't actually detect and fix this all the time because "there's no way to tell whether the result is from incorrectly double-encoding one character, or correctly single-encoding 2 characters."
So, the best you're left with is a heuristic. Borshuno's suggestion won't work here because it will actually destroy unconvertable bytes:
> str.encode( "iso-8859-1", fallback: lambda{|c| c.force_encoding("utf-8")} )
=> " / "
The best course of action, if at all possible, is to fix your double-encoding issue so that it doesn't happen at all. The next best course of action is to add BOM bytes to your UTF-8 strings if you suspect they may get double-encoded, since you could then check for those bytes and determine whether your string has been re-encoded or not.
> str_bom = "\xEF\xBB\xBF" + str
=> "汉语 / 漢語"
> str_bom.start_with?("\xEF\xBB\xBF")
=> true
> str_bom.force_encoding("iso-8859-1").encode("utf-8").start_with?("\xEF\xBB\xBF")
=> false
If you can presume that the BOM is in your "proper" string, then you can check for double-encoding by checking if the BOM is present. If it's not (ie, it's been re-encoded) then you can perform your decoding routine:
> str_bom.force_encoding("iso-8859-1").encode("utf-8").encode("iso-8859-1").force_encoding("utf-8").start_with?("\xEF\xBB\xBF")
=> true
If you can't be assured of the BOM, then you could use a heuristic to guess whether a string is "bad" or not, by counting unprintable characters, or characters which fall outside of your normal expected result set (your string looks like it's dealing with Hebrew; you could say that any string which consists of >50% non-Hebrew letters is double-encoded, for example), so you could then attempt to decode it.
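One possible shape for such a heuristic (the probably_double_encoded? helper, its threshold, and the character range are assumptions to be tuned to your data):

```ruby
# Guess "double-encoded" if most characters fall in the Latin-1
# range (U+0080..U+00FF), which is where UTF-8 bytes land when
# they are misread as ISO-8859-1 characters.
def probably_double_encoded?(str, threshold = 0.5)
  return false if str.empty?
  suspicious = str.chars.count { |c| (0x80..0xFF).cover?(c.ord) }
  suspicious.fdiv(str.length) > threshold
end

good = "汉语 / 漢語"
bad  = good.dup.force_encoding("iso-8859-1").encode("utf-8")
probably_double_encoded?(good)  # => false
probably_double_encoded?(bad)   # => true
```

This is only a guess: legitimate text that is mostly Latin-1 accented characters would be flagged too, which is exactly the ambiguity described above.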
Finally, you would have to fall back to exception handling and hope that you know which encoding the string was purportedly declared as when it was double-encoded:
str = "汉语 / 漢語"
begin
str.encode("iso-8859-1").encode("utf-8")
rescue Encoding::UndefinedConversionError
str
end
However, even if you know that a string is double-encoded, if you don't know the encoding that it was improperly declared as when it was converted to UTF-8, you can't do the reverse operation:
> bad_str = str.force_encoding("windows-1252").encode("utf-8")
=> "æ±‰è¯ / 漢語"
> bad_str.encode("iso-8859-1").force_encoding("utf-8")
Encoding::UndefinedConversionError: "\xE2\x80\xB0" from UTF-8 to ISO-8859-1
Since the string itself doesn't carry any information about the encoding it was incorrectly encoded from, you don't have enough information to reliably solve it, and are left with iterating through a list of most-likely encodings and heuristically checking the result of each successful re-encode with your Hebrew heuristic.
To echo the post I linked: character encodings are hard.
Ruby encoding ASCII_8BIT and extended ASCII
String literals are (usually) UTF-8 encoded regardless of whether or not the bytes are valid UTF-8. Hence this:
"\x8f".encoding
saying UTF-8 even though the string isn't valid UTF-8. You should be safe using String#force_encoding, but if you really want to work with raw bytes, you might be better off working with arrays of integers and using Array#pack to mash them into strings:
[ 0x8f, 0x11, 0x06, 0x23, 0xff, 0x00 ].pack('C*')
# "\x8F\x11\x06#\xFF\x00"
[ 0x8f, 0x11, 0x06, 0x23, 0xff, 0x00 ].pack('C*').encoding
# #<Encoding:ASCII-8BIT>
[ 0x8f, 0x11, 0x06, 0x23, 0xff, 0x00 ].pack('C*').bytes
# [143, 17, 6, 35, 255, 0]
The results should be the same but, IMO, this is explicitly working with binary data (i.e. raw bytes), makes your intent clear, and should avoid any encoding problems.
There's also String#unpack if there is a known structure to the bytes you're reading and you want to crack it open.
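For the reverse direction, a quick sketch:

```ruby
# unpack cracks a binary string back into integers.
bytes = [0x8f, 0x11, 0x06, 0x23, 0xff, 0x00].pack('C*')
bytes.unpack('C*')  # => [143, 17, 6, 35, 255, 0]

# Or pull structured fields out, e.g. two big-endian 16-bit ints:
"\x01\x02\x03\x04".b.unpack('n2')  # => [258, 772]
```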
Ruby String encoding changed over versions
If you compare the documentation between 2.0 and 2.1, you will see the following text disappeared in 2.1:
Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.
So this behaviour, where 2.0 and lower did not modify the string when the source and target encodings were the same but 2.1+ does, appears to be an intended change.
I'm not 100% sure what your code is trying to do, but if it's trying to clean invalid UTF-8 byte sequences out of the string, you can use valid_encoding? and scrub as of Ruby 2.1:
irb(main):055:0* content = "Is your pl\xFFace available?"
=> "Is your pl\xFFace available?"
irb(main):056:0> content.valid_encoding?
=> false
irb(main):057:0> new = content.scrub
=> "Is your pl�ace available?"
irb(main):059:0> new.valid_encoding?
=> true
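scrub also accepts a replacement string or a block if the default � is not what you want; a sketch:

```ruby
content = "Is your pl\xFFace available?"
content.scrub("?")
# => "Is your pl?ace available?"

# The block receives each invalid byte sequence as a binary string.
content.scrub { |b| "<#{b.unpack('H*').first}>" }
# => "Is your pl<ff>ace available?"
```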
EDIT:
If you look through the 2.0 source code, you will see the str_transcode0 function exits immediately if senc (the source encoding) is the same as denc (the destination encoding):
if (senc && senc == denc) {
    return NIL_P(arg2) ? -1 : dencidx;
}
In 2.1 it scrubs the data when the encodings are the same and you explicitly asked to replace invalid sequences:
if (senc && senc == denc) {
    ...
    if ((ecflags & ECONV_INVALID_MASK) && explicitly_invalid_replace) {
        dest = rb_str_scrub(str, rep);
    }
    ...
}