Ruby Converting String Encoding from ISO-8859-1 to UTF-8 Not Working

You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.

string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# 1
string.bytes
# [195, 164]

Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. This does not contain ä any more. It contains two characters, Ã and ¤.

string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# 2
string.bytes
# [195, 164]

Then you translate that into UTF-8. Since this is not reinterpretation but translation, you keep the two characters, but now encoded in UTF-8:

string = string.encode('utf-8')
# => "Ã¤"
string.length
# 2
string.bytes
# [195, 131, 194, 164]

What you are missing is that you never had an ISO-8859-1 string to begin with, as you would from your web service: by relabelling UTF-8 bytes as ISO-8859-1 you produced gibberish. Fortunately, this only affects your console tests; if you read the website's response using the proper input encoding, it should all work okay.
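As a rough illustration of "reading with the proper input encoding", here is a minimal sketch that uses a file instead of a web response. The helper name `read_latin1_as_utf8` and the temporary file are assumptions made for the example; the `encoding: 'external:internal'` option of `File.read` is what does the work.

```ruby
require 'tempfile'

# Hypothetical helper: read a file stored as ISO-8859-1 and return UTF-8.
def read_latin1_as_utf8(path)
  # "external:internal" asks Ruby to decode ISO-8859-1 and transcode to UTF-8.
  File.read(path, encoding: 'ISO-8859-1:UTF-8')
end

street = nil
Tempfile.create('latin1') do |file|
  file.binmode
  file.write("Norrlandsv\xE4gen".b)  # 0xE4 is ä in ISO-8859-1
  file.flush
  street = read_latin1_as_utf8(file.path)
end

puts street.encoding
puts street
```

Because the bytes are declared as ISO-8859-1 at the moment they are read, no `force_encoding` call is needed afterwards.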

For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:

string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"
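If you are not sure whether the bytes you hold are already UTF-8, one common heuristic is to check validity first and only then assume ISO-8859-1. The sketch below is a guess, not a detection: the helper name `to_utf8` is made up for this example, and the ISO-8859-1 fallback is an assumption about your data.

```ruby
# Hypothetical helper: keep bytes that already form valid UTF-8,
# otherwise assume they are ISO-8859-1 (an assumption, not a detection).
def to_utf8(raw)
  candidate = raw.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

latin1 = "Norrlandsv\xE4gen".b      # ä as the single byte 0xE4
utf8   = "Norrlandsv\xC3\xA4gen".b  # ä as the two bytes 0xC3 0xA4

puts to_utf8(latin1)
puts to_utf8(utf8)
```

Both calls should print the same street name, because the first input is transcoded and the second is only relabelled.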

EDIT For your specific problem, this should work:

require 'net/https'
uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
  :use_ssl => uri.scheme == 'https',
  # VERIFY_NONE disables certificate checking; acceptable only for a quick test
  :verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
  https.request(Net::HTTP::Get.new(uri.path))
end
# The service sends ISO-8859-1 bytes: label them as such, then transcode to UTF-8
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')
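Hard-coding ISO-8859-1 works for this service, but a slightly more general sketch could take the charset from the Content-Type header when one is present. Everything here is an assumption for illustration: the helper name `decode_body`, the idea that the body arrives as raw bytes, and the ISO-8859-1 default.

```ruby
# Hypothetical helper: decode a response body using the charset named in a
# Content-Type header value, falling back to ISO-8859-1 (an assumption).
def decode_body(body, content_type, default: Encoding::ISO_8859_1)
  name = content_type.to_s[/charset=["']?([^;"'\s]+)/i, 1]
  source = begin
    name ? Encoding.find(name) : default
  rescue ArgumentError
    default # the header named a charset Ruby does not know
  end
  body.dup.force_encoding(source).encode(Encoding::UTF_8)
end

raw = "Norrlandsv\xE4gen".b
puts decode_body(raw, 'text/xml; charset=iso-8859-1')
puts decode_body(raw, 'text/xml')
```

With a real response you would pass `response.body` and `response['Content-Type']` instead of the literals used above.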

How to convert UTF-8 to ISO-8859-1 in Ruby 2.0?

The encode method does work.

Let's create a string with U+00FC (ü):

uuml_utf8 = "\u00FC"       #=> "ü"

Ruby encodes this string in UTF-8:

uuml_utf8.encoding         #=> #<Encoding:UTF-8>

In UTF-8, ü is represented as 195 188 (decimal):

uuml_utf8.bytes            #=> [195, 188]

Now let's convert the string to ISO-8859-1:

uuml_latin1 = uuml_utf8.encode("ISO-8859-1")

uuml_latin1.encoding #=> #<Encoding:ISO-8859-1>

In ISO-8859-1, ü is represented as 252 (decimal):

uuml_latin1.bytes          #=> [252]

In UTF-8, however, the lone byte 252 (0xFC) is not a valid sequence. That's why your terminal/console displays the replacement character "�" (U+FFFD) or no character at all.

In order to display ISO-8859-1 encoded characters, you'll have to switch your terminal/console to that encoding, too.
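If switching the terminal is not an option, you can check for the problem in code and transcode back before printing. This is a small sketch of that idea; the variable `as_utf8` is introduced only for the example, and `scrub` is used to imitate roughly what a UTF-8 terminal does with the stray byte.

```ruby
uuml_latin1 = "\u00FC".encode('ISO-8859-1')

# Relabel the same single byte as UTF-8: 0xFC cannot appear in valid UTF-8.
as_utf8 = uuml_latin1.dup.force_encoding('UTF-8')
p as_utf8.valid_encoding?  # false

# scrub swaps invalid bytes for a replacement string.
p as_utf8.scrub('?')       # "?"

# Transcoding back to UTF-8 before printing restores the character.
puts uuml_latin1.encode('UTF-8')
```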

UTF-8 conversion not working with String#encode but Iconv

In your call to String#encode you don't specify a source encoding. Ruby uses the string's current encoding as the source, which appears to be UTF-8, and according to the docs:

Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

In other words the call has no effect, and leaves the bytes in the string as they are, encoded as ISO-8859-1. The next call to gsub then tries to interpret these bytes as UTF-8, and since they are invalid (they are unchanged from ISO-8859-1) you get the error you see.
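A minimal sketch of the difference, using a single ISO-8859-1 byte that has been mislabelled as UTF-8 (the variable names are made up for this example):

```ruby
# The byte for "ä" in ISO-8859-1, wrongly labelled as UTF-8.
mislabelled = "\xE4".b.force_encoding('UTF-8')

same  = mislabelled.encode('UTF-8')                # source == target: nothing happens
fixed = mislabelled.encode('UTF-8', 'ISO-8859-1')  # explicit source: real conversion

p same.bytes            # [228]
p same.valid_encoding?  # false
p fixed.bytes           # [195, 164]
```

The first call returns the bytes untouched and still invalid; only the second call produces the two-byte UTF-8 form.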

String#encode has a form that accepts the source encoding as the second argument, so you can specify it explicitly, similar to what you are doing with Iconv. Try this:

git_log = git_log.encode(Encoding::UTF_8,
                         Encoding::ISO_8859_1,
                         :invalid => :replace,
                         :undef => :replace,
                         :replace => '')

You could also use the ! form in this case, which has the same effect:

git_log.encode!(Encoding::UTF_8,
                Encoding::ISO_8859_1,
                :invalid => :replace,
                :undef => :replace,
                :replace => '')
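To see what the two replacement options do, here is a small sketch with made-up sample strings. It uses other encoding pairs than the git log case, because those make each option visible: `:undef` covers characters the target encoding lacks, and `:invalid` covers bytes that are malformed in the source encoding.

```ruby
text = "caf\u00E9 \u2713"  # "café ✓"; U+2713 has no ISO-8859-1 equivalent

# :undef => :replace substitutes characters the target encoding lacks.
latin1 = text.encode(Encoding::ISO_8859_1, Encoding::UTF_8,
                     :undef => :replace, :replace => '?')
p latin1.bytes  # [99, 97, 102, 233, 32, 63]

# :invalid => :replace substitutes bytes that are malformed in the source.
broken = "caf\xE9!".b.force_encoding(Encoding::UTF_8)
ascii = broken.encode(Encoding::US_ASCII, Encoding::UTF_8,
                      :invalid => :replace, :replace => '?')
p ascii         # "caf?!"
```

With `:replace => ''`, as in the answer above, the offending characters are dropped instead of being shown as a question mark.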

How can I transform the utf8 chars to iso8859-1

~ UPDATE ~

ruby-iconv has been superseded from Ruby 1.9.3 onwards by the encode method. See Jörg W Mittag's answer for details, but in short:

utf8string = "èàòppè"
iso_string = utf8string.encode('ISO-8859-1')

I agree with Williham Totlandt in thinking that this type of conversion might not be the smartest idea ever, but anyway: use ruby-iconv :)

require 'iconv'  # no longer in the standard library since Ruby 2.0; install the iconv gem
utf8string = "èàòppè"
iso_string = Iconv.conv 'iso8859-1', 'UTF-8', utf8string
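With String#encode, characters that ISO-8859-1 cannot represent raise an error unless you say what to do with them. One option is the `fallback:` argument. The sketch below is only an illustration: the helper name `to_latin1` and the substitution table are assumptions, so adjust them to whatever makes sense for your data.

```ruby
# Hypothetical helper: convert to ISO-8859-1, substituting characters it
# cannot hold. The default table is an example, not a standard mapping.
def to_latin1(utf8string, substitutes: { "\u20AC" => 'EUR', "\u2026" => '...' })
  utf8string.encode('ISO-8859-1',
                    fallback: ->(char) { substitutes.fetch(char, '?') })
end

iso_string = to_latin1("èàòppè 5\u20AC\u2026")
p iso_string.encoding          # #<Encoding:ISO-8859-1>
p iso_string.encode('UTF-8')   # "èàòppè 5EUR..."
```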

Encoding in ruby utf-8 error

You have to encode the string to Windows-1252 first and then force its encoding to UTF-8.

"Nous travaillons Ã¡ rendre".encode("Windows-1252").force_encoding("utf-8")

Result:

"Nous travaillons á rendre"
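Applying this blindly would damage strings that were fine to begin with, so a guarded version may be safer. This is a sketch under assumptions: the helper name `undo_cp1252_mojibake` is invented here, and it presumes the damage came from exactly one Windows-1252 misreading.

```ruby
# Hypothetical helper: undo one round of Windows-1252 mojibake, keeping the
# input when the result would not be valid UTF-8.
def undo_cp1252_mojibake(text)
  repaired = text.encode('Windows-1252').force_encoding('UTF-8')
  repaired.valid_encoding? ? repaired : text
rescue Encoding::UndefinedConversionError
  text # contains characters Windows-1252 cannot hold, so leave it alone
end

garbled = "Nous travaillons \u00C3\u00A1 rendre"  # "Ã¡" where "á" was meant
p undo_cp1252_mojibake(garbled)                      # "Nous travaillons á rendre"
p undo_cp1252_mojibake('Nous travaillons á rendre')  # unchanged
```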

Handling encoding in ruby

I suspect your problem is double-encoded strings. This is very bad for various reasons, but the tl;dr here is it's not fully fixable, and you should instead fix the root problem of strings being double-encoded if at all possible.

This produces a double-encoded string with UTF-8 characters:

> str = "汉语 / 漢語"
=> "汉语 / 漢語"
> str.dup.force_encoding("iso-8859-1")
=> "\xE6\xB1\x89\xE8\xAF\xAD / \xE6\xBC\xA2\xE8\xAA\x9E"
> bad = str.dup.force_encoding("iso-8859-1").encode("utf-8")
=> "æ±\u0089è¯\u00AD / æ¼¢èª\u009E"

You can then fix it by transcoding the double-encoded UTF-8 back to ISO-8859-1 and then declaring the resulting bytes to be UTF-8:

> bad.encode("iso-8859-1").force_encoding("utf-8")
=> "汉语 / 漢語"

But you can't convert the actual UTF-8 string into ISO-8859-1, since it contains code points that ISO-8859-1 has no way to encode:

> str.encode("iso-8859-1")
Encoding::UndefinedConversionError: "\xE6\xB1\x89" from UTF-8 to ISO-8859-1

Now, you can't actually detect and fix this all the time because "there's no way to tell whether the result is from incorrectly double-encoding one character, or correctly single-encoding 2 characters."

So, the best you're left with is a heuristic. Borshuno's suggestion won't work here because it will actually destroy unconvertible bytes:

> str.encode( "iso-8859-1", fallback: lambda{|c| c.force_encoding("utf-8")} )
=> " / "

The best course of action, if at all possible, is to fix your double-encoding issue so that it doesn't happen at all. The next best course of action is to add BOM bytes to your UTF-8 strings if you suspect they may get double-encoded, since you could then check for those bytes and determine whether your string has been re-encoded or not.

> str_bom = "\xEF\xBB\xBF" + str
=> "汉语 / 漢語"
> str_bom.start_with?("\xEF\xBB\xBF")
=> true
> str_bom.force_encoding("iso-8859-1").encode("utf-8").start_with?("\xEF\xBB\xBF")
=> false

If you can presume that the BOM is in your "proper" string, then you can check for double-encoding by checking whether the BOM is present. If it's not (i.e., it's been re-encoded), then you can perform your decoding routine:

> str_bom.force_encoding("iso-8859-1").encode("utf-8").encode("iso-8859-1").force_encoding("utf-8").start_with?("\xEF\xBB\xBF")
=> true
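Wrapped up as a function, the BOM check might look like the sketch below. It rests on two assumptions: every "good" string was tagged with a BOM beforehand, and any damage came from an ISO-8859-1 misreading. The helper name `repair_with_bom` is invented for this example.

```ruby
# Hypothetical helper: a string that still starts with the BOM is taken as
# intact; anything else is assumed to be double-encoded via ISO-8859-1.
def repair_with_bom(text, bom: "\uFEFF")
  return text if text.start_with?(bom)

  text.encode('ISO-8859-1').force_encoding('UTF-8')
end

good = "\uFEFF" + '汉语 / 漢語'  # U+FEFF is EF BB BF in UTF-8
bad  = good.dup.force_encoding('ISO-8859-1').encode('UTF-8')

p repair_with_bom(good) == good  # true
p repair_with_bom(bad) == good   # true
```

A string that never had a BOM and is not double-encoded would still raise `Encoding::UndefinedConversionError` here, which is why the BOM has to be guaranteed for this approach to be safe.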

If you can't be assured of the BOM, then you could use a heuristic to guess whether a string is "bad" or not, for example by counting unprintable characters or characters that fall outside your normal expected result set. Your string looks like it's dealing with Hebrew, so you could say that any string consisting of more than 50% non-Hebrew letters is double-encoded and then attempt to decode it.
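A rough sketch of such a script-based heuristic follows. The helper name `suspect_double_encoding?` and the 0.5 threshold are assumptions for illustration, and the sample text is written with escapes only to keep the code readable in left-to-right editors.

```ruby
# Hypothetical heuristic: flag text as suspect when fewer than half of its
# non-space characters are Hebrew. The 0.5 threshold is an assumption.
def suspect_double_encoding?(text, threshold: 0.5)
  visible = text.scan(/\S/)
  return false if visible.empty?

  hebrew = visible.count { |char| char.match?(/\p{Hebrew}/) }
  hebrew.fdiv(visible.size) < threshold
end

good = "\u05E9\u05DC\u05D5\u05DD \u05E2\u05D5\u05DC\u05DD"  # two Hebrew words
bad  = good.dup.force_encoding('ISO-8859-1').encode('UTF-8')

p suspect_double_encoding?(good)  # false
p suspect_double_encoding?(bad)   # true
```

Mixed-language text or strings full of digits and punctuation will trip a check like this, so treat a positive result as a hint to try decoding, not as proof.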

Finally, you would have to fall back to exception handling and hope that you know which encoding the string was purportedly declared as when it was double-encoded:

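str = "汉语 / 漢語"
begin
  # Attempt the repair; this raises if str holds characters ISO-8859-1 lacks
  str.encode("iso-8859-1").force_encoding("utf-8")
rescue Encoding::UndefinedConversionError
  # Not double-encoded via ISO-8859-1, so keep the string as it is
  str
end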

However, even if you know that a string is double-encoded, if you don't know the encoding that it was improperly declared as when it was converted to UTF-8, you can't do the reverse operation:

> bad_str = str.dup.force_encoding("windows-1252").encode("utf-8")
=> "æ±‰è¯\u00AD / æ¼¢èªž"
> bad_str.encode("iso-8859-1").force_encoding("utf-8")
Encoding::UndefinedConversionError: "\xE2\x80\xB0" from UTF-8 to ISO-8859-1

Since the string itself doesn't carry any information about the encoding it was incorrectly decoded from, you don't have enough information to solve it reliably. You are left with iterating through a list of the most likely encodings and checking the result of each successful re-encode against your Hebrew heuristic.
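The iteration could be sketched roughly like this. The candidate list and the helper name `guess_repair` are assumptions; a real list would reflect the encodings your data is likely to have passed through, and the result should still be checked against a content heuristic.

```ruby
# Hypothetical helper: try each candidate encoding in turn and return
# [name, repaired_text] for the first one that yields valid UTF-8, or nil.
def guess_repair(text, candidates: %w[ISO-8859-1 Windows-1252])
  candidates.each do |name|
    begin
      repaired = text.encode(name).force_encoding('UTF-8')
      return [name, repaired] if repaired.valid_encoding?
    rescue Encoding::UndefinedConversionError
      next # text holds characters this candidate cannot represent
    end
  end
  nil
end

original = '汉语 / 漢語'
bad_str  = original.dup.force_encoding('Windows-1252').encode('UTF-8')

p guess_repair(bad_str)   # ["Windows-1252", "汉语 / 漢語"]
p guess_repair(original)  # nil
```

Here ISO-8859-1 is rejected because the damaged string contains characters it cannot hold, so the loop moves on to Windows-1252. Valid UTF-8 output alone does not prove the guess was right.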

To echo the post I linked: character encodings are hard.


