How to Convert Utf8 Combined Characters into Single Utf8 Characters in Ruby

How to convert UTF8 combined characters into single UTF8 characters in Ruby?

Generally, you use Unicode Normalization to do this.

UnicodeUtils.nfkc from the unicode_utils gem (https://github.com/lang/unicode_utils) should give you the specific behavior you're asking for. Unicode Normalization Form KC (NFKC) applies a compatibility decomposition and then recomposes the string into precomposed characters where they exist, which is basically what your example asks for. You may also get close to what you want with Normalization Form C (NFC).

The question "How to replace the Unicode gem on Ruby 1.9?" has additional details.

In Ruby 1.8.7, you'd need to run gem install unicode, which provides a similar function.

Edited to add: The main reason you'll probably want Normalization Form KC rather than just Normalization Form C is that ligatures (characters that are joined together for historical or typographical reasons) are first decomposed into their individual characters, which is sometimes desirable if you're doing lexicographic ordering or searching.
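
On Ruby 2.2 and later, String#unicode_normalize is built in, so no gem is needed. The following is a minimal sketch rather than the answer's original method; the sample strings are chosen only for illustration and show how NFC and NFKC differ on a combining sequence and on a ligature:

```ruby
# Assumes Ruby 2.2+, where String#unicode_normalize is built in.
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT
ligature   = "\uFB01"    # U+FB01 LATIN SMALL LIGATURE FI

# NFC composes the pair into the single code point U+00E9.
composed = decomposed.unicode_normalize(:nfc)
puts composed.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")

# NFC leaves the ligature alone; NFKC decomposes it into "f" and "i".
nfc_ligature  = ligature.unicode_normalize(:nfc)
nfkc_ligature = ligature.unicode_normalize(:nfkc)
puts nfc_ligature == ligature
puts nfkc_ligature
```

If searching or sorting should treat the ligature like the plain letters, NFKC is the form to reach for; otherwise NFC changes less of the input.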

In Ruby, how to UTF-8 encode this weird character?

I had this problem; see "Fixing Incorrect String Encoding From MySQL". You need to encode the string as CP1252, which recovers the original bytes, and then force the encoding back to UTF-8.

fallback = {
  "\u0081" => "\x81".force_encoding("CP1252"),
  "\u008D" => "\x8D".force_encoding("CP1252"),
  "\u008F" => "\x8F".force_encoding("CP1252"),
  "\u0090" => "\x90".force_encoding("CP1252"),
  "\u009D" => "\x9D".force_encoding("CP1252")
}

str.encode('CP1252', fallback: fallback).force_encoding('UTF-8')

The fallback may not be necessary depending on your data, but it ensures that it won't raise an error by handling the five bytes which are undefined in CP1252.
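
As a rough illustration of the round trip, the sketch below wraps the same calls in a helper and applies it to a doubly encoded string. The method name fix_double_encoding and the sample text are hypothetical, chosen only for this example:

```ruby
# Bytes that CP1252 leaves undefined; map the matching C1 control
# characters straight to those raw bytes so encode does not raise.
FALLBACK = {
  "\u0081" => "\x81".force_encoding("CP1252"),
  "\u008D" => "\x8D".force_encoding("CP1252"),
  "\u008F" => "\x8F".force_encoding("CP1252"),
  "\u0090" => "\x90".force_encoding("CP1252"),
  "\u009D" => "\x9D".force_encoding("CP1252")
}.freeze

# Hypothetical helper: undo text whose UTF-8 bytes were misread as CP1252.
def fix_double_encoding(str)
  str.encode("CP1252", fallback: FALLBACK).force_encoding("UTF-8")
end

# "cafÃ©": the UTF-8 bytes of "café" (C3 A9) decoded as CP1252.
garbled = "caf\u00C3\u00A9"
fixed = fix_double_encoding(garbled)
puts fixed
puts fixed.valid_encoding?
```

Checking valid_encoding? afterwards is a cheap way to notice strings that were not actually double-encoded.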

Ruby converting string encoding from ISO-8859-1 to UTF-8 not working

You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.

string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# => 1
string.bytes
# => [195, 164]

Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. The string no longer contains ä; it now contains two characters, Ã and ¤.

string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# => 2
string.bytes
# => [195, 164]

Then you translate that into UTF-8. Since this is not reinterpretation but translation, you keep the two characters, but now encoded in UTF-8:

string = string.encode('utf-8')
# => "Ã¤"
string.length
# => 2
string.bytes
# => [195, 131, 194, 164]

What you are missing is that you don't start out with an ISO-8859-1 string, as you would get from your web service; you have gibberish. Fortunately, this only affects your console tests; if you read the website's response using the proper input encoding, it should all work.

For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:

string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"

EDIT: For your specific problem, this should work:

require 'net/https'
uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
  :use_ssl => uri.scheme == 'https',
  # VERIFY_NONE disables certificate checks; acceptable for a quick test only.
  :verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
  https.request(Net::HTTP::Get.new(uri.path))
end
# The body's bytes are ISO-8859-1: label them as such, then transcode to UTF-8.
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')
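
The difference between relabelling and transcoding can be checked without a network call. In this sketch the byte string stands in for a response body; it assumes, as is typical for Net::HTTP, that the body arrives tagged as binary (ASCII-8BIT) and that the server really sent ISO-8859-1:

```ruby
# Stand-in for response.body: ISO-8859-1 bytes tagged as binary.
raw = "Norrlandsv\xE4gen".b

# The lone 0xE4 byte is not valid UTF-8, so relabelling as UTF-8 is not enough.
relabelled_ok = raw.dup.force_encoding("UTF-8").valid_encoding?
puts relabelled_ok

# Label the bytes with their real encoding, then transcode to UTF-8.
body = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
puts body
puts body.valid_encoding?
```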

Ruby: Generate a utf-8 character from code point as string

(This works in Ruby 1.9 or newer.)

#irb -E utf-8
irb(main):032:0> s=""
=> ""
irb(main):033:0> i=0x328e
=> 12942
irb(main):034:0> s<<i
=> "㊎"
irb(main):036:0> s<<0x5363
=> "㊎卣"

For your case:

my_char_codes = ["5363","328E"]
s = ""
my_char_codes.each{ |c| s << c.to_i(16) }

# now s contains "卣㊎" (characters appear in the order of the codes)
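
Two alternatives that build the string in a single expression are Array#pack with the U directive and Integer#chr with an explicit encoding. A small sketch using the same hex codes:

```ruby
my_char_codes = ["5363", "328E"]
code_points = my_char_codes.map { |hex| hex.to_i(16) }

# "U*" packs every integer as one UTF-8 encoded character.
packed = code_points.pack("U*")

# Integer#chr needs the target encoding for values above 255.
joined = code_points.map { |cp| cp.chr(Encoding::UTF_8) }.join

puts packed
puts packed == joined
```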

Convert UTF-8 to CP1252 ruby 2.2

UTF-8 can encode every Unicode code point, but CP1252 covers only a small subset of them. This means some characters can be encoded in UTF-8 but not in CP1252, which is the problem you are facing.

In your example it looks as if the string only contains characters that CP1252 can represent, but clearly it does not.

The character in the error message, U+0327 (COMBINING CEDILLA), is a combining character and is not representable in CP1252. It combines with the preceding c to produce ç. However, ç can also be represented as a single character (U+00E7), which is representable in CP1252.

One option might be normalisation, which will convert the string into a form that is representable in CP1252.

file = 'd:/1 descrição.txt'.unicode_normalize(:nfc)
puts file.encode('cp1252')

(It appears that Stack Overflow is normalizing the string when displaying your question, which is probably why copying the code from the question and running it doesn’t produce any errors.)

This will avoid the error, but note that it is not necessarily possible to reverse the process to get the original string unless the original is in a known normalized form already.
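
The failure and the fix can be reproduced with escapes, which sidesteps any normalization applied by an editor or web page. This sketch assumes Ruby 2.2+ for unicode_normalize; the short sample string is illustrative only:

```ruby
# "c" + U+0327 COMBINING CEDILLA and "a" + U+0303 COMBINING TILDE
decomposed = "descric\u0327a\u0303o"

failure = nil
begin
  decomposed.encode("cp1252")
rescue Encoding::UndefinedConversionError => e
  failure = e
  puts "encode failed: #{e.message}"
end

# NFC turns each pair into a single precomposed character (U+00E7, U+00E3).
composed = decomposed.unicode_normalize(:nfc)
encoded = composed.encode("cp1252")
puts encoded.bytesize
```

Note that normalizing back to NFD afterwards yields the decomposed form again only because this sample started out fully decomposed.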

Effectively UTF-8 encode a string

"J\u00E9r\u00E9my".encoding
#=> #<Encoding:UTF-8>
"J\u00E9r\u00E9my".each_codepoint.to_a
#=> [74, 233, 114, 233, 109, 121]

The strings are perfectly fine. They contain the correct bytes and have the correct encoding.

They are printed this way because your external encoding is set to (or recognised as) US-ASCII:

Encoding.default_external
#=> #<Encoding:US-ASCII>

Ruby assumes that your terminal can only render ASCII characters and therefore prints non-ASCII characters as escape sequences when you use p (that is, String#inspect).

The external encoding is usually determined automatically based on your locale:

$ LANG=C            ruby -e 'p Encoding.default_external'
#<Encoding:US-ASCII>

$ LANG=en_US.UTF-8 ruby -e 'p Encoding.default_external'
#<Encoding:UTF-8>

Setting your terminal's or system's encoding / locale to UTF-8 should fix the problem.
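
To confirm that the escapes are only a display matter, you can check the string's own properties, which do not depend on the terminal. A short sketch:

```ruby
name = "J\u00E9r\u00E9my"

puts name.encoding          # the string's own encoding
puts name.valid_encoding?   # true when the bytes are well-formed for that encoding
puts name.length            # number of characters
puts name.bytesize          # number of bytes; each "é" takes two in UTF-8

# What p / inspect shows depends on this setting, not on the string itself.
puts Encoding.default_external
```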

How to convert with Ruby accented characters in HTML special entities

I had to set $KCODE explicitly to make your example work ($KCODE applies to Ruby 1.8; it has no effect in Ruby 1.9 and later). Also, make sure your source file is actually encoded as UTF-8!

# coding: utf-8
require 'rubygems'
require 'htmlentities'
require 'unicode'
$KCODE = 'UTF-8'
coder = HTMLEntities.new
string = "Scròfina"
puts coder.encode(string, :named)
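
On Ruby 1.9 and later, strings carry their own encoding, so a dependency-free variation is possible. The sketch below is not the htmlentities gem's API: it is a hypothetical helper (to_numeric_entities) that emits decimal numeric references instead of named entities, and it does not escape &, < or >:

```ruby
# Hypothetical helper: replace every non-ASCII character with a
# decimal numeric character reference such as &#242;.
def to_numeric_entities(str)
  str.gsub(/[^\x00-\x7F]/) { |ch| format("&#%d;", ch.ord) }
end

encoded = to_numeric_entities("Scr\u00F2fina")
puts encoded
```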

