Convert Unicode Codepoint to String Character in Ruby

Convert unicode codepoint to string character in Ruby

How about:

# Using pack
puts ["2B71F".hex].pack("U")

# Using chr
puts 0x2B71F.chr(Encoding::UTF_8)

In Ruby 1.9+ you can also do:

puts "\u{2B71F}"

That is, the \u{} escape sequence lets you write a Unicode code point directly in a double-quoted string literal.
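
As a quick check that the three approaches above agree, here is a minimal sketch that builds the same character each way and then converts it back with String#ord. The variable names are illustrative only.

```ruby
# The same code point, U+2B71F, produced three ways.
codepoint = 0x2B71F

via_pack   = [codepoint].pack("U")           # Array#pack with the "U" (UTF-8) directive
via_chr    = codepoint.chr(Encoding::UTF_8)  # Integer#chr with an explicit encoding
via_escape = "\u{2B71F}"                     # \u{} escape in a double-quoted literal

all_equal = (via_pack == via_chr) && (via_chr == via_escape)
puts all_equal

# String#ord goes the other way: character back to its integer code point.
puts via_chr.ord.to_s(16).upcase
```

Integer#chr needs the explicit encoding argument here; without it, values above 255 typically raise a RangeError unless a default internal encoding is set.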

Use Ruby to generate hex codepoints for Unicode values

Use String#rjust:

[97, 127016].map { |i| "U+" << i.to_s(16).upcase.rjust(4, '0') }
#⇒ ["U+0061", "U+1F028"]

For other operations:

"U+0061"[/(?<=\AU\+).*/].to_i(16)
#⇒ 97
"U+0061"[/(?<=\AU\+).*/].prepend('0x')
#⇒ "0x0061"

NB: the 0x61 form can only be kept as a string. As integer literals, 0x61 and 97 are the same value, so both are stored (and printed) as 97.
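
The two directions above can be wrapped into a pair of small helpers. This is only a sketch: the method names to_u_plus and from_u_plus are hypothetical, and it uses Kernel#format instead of rjust, which gives the same zero padding.

```ruby
# Hypothetical helpers (names are not from the original answer).

# Integer code point -> "U+XXXX" (at least four uppercase hex digits).
def to_u_plus(codepoint)
  format("U+%04X", codepoint)
end

# "U+XXXX" -> Integer code point; raises ArgumentError on malformed input.
def from_u_plus(notation)
  match = /\AU\+(\h{4,6})\z/.match(notation)
  raise ArgumentError, "not U+ notation: #{notation.inspect}" unless match

  match[1].to_i(16)
end

p [97, 127016].map { |i| to_u_plus(i) }
p from_u_plus("U+1F028")
```

Validating the input before calling to_i(16) matters because String#to_i silently returns 0 for text it cannot parse.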

convert unicode into character with ruby

[22269].pack('U*') #=> "国" or "\345\233\275"

Edit: Works in 1.8.6+ (verified in 1.8.6, 1.8.7, and 1.9.2). In 1.8.x you get a three-byte string representing the single Unicode character, but using puts on that causes the correct Chinese character to appear in the terminal.
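
To make the "three-byte string" remark concrete, this short sketch (for Ruby 1.9 or later, where strings carry an encoding) inspects the packed result and then reverses it with String#unpack.

```ruby
char = [22269].pack("U*")

p char.encoding      # the packed string is tagged as UTF-8
p char.length        # counted as one character
p char.bytes         # the three underlying bytes
p char.unpack("U*")  # back to an array of code points
```

The byte values correspond to the octal escapes \345\233\275 shown in the answer.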

How to convert a unicode string to its symbol characters in Ruby?

In Ruby 1.9:

"\u041D\u043E\u0432\u0438\u043D\u0438".encode("UTF-8")
=> "Новини"
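
Going the other way can help when checking such a literal. The sketch below lists the code points of the same string and rebuilds the escaped form as plain text. The variable names are illustrative.

```ruby
word = "\u041D\u043E\u0432\u0438\u043D\u0438"

# A literal containing \u escapes is already UTF-8 encoded.
puts word.encoding

# String#codepoints returns the integer code points; format each as \uXXXX text.
escaped = word.codepoints.map { |cp| format("\\u%04X", cp) }.join
puts escaped
```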

Convert unicode to characters in a file using Ruby

I tried to reproduce your problem and, using your solution, got the same result you described.

I noticed that \u003B (for example) is the Unicode escape for the semicolon character. So I scanned the string for each \uXXXX escape using the regex /\\u(.{4})/, which captures the four characters after \u as a hexadecimal code point. I then used gsub! and Array#pack to convert each escape and substitute the resulting character.

[$1.to_i(16)].pack('U') # => "\n", "\n", "<", "&", "\n", "=" ...etc.

And finally wrote the result to a file. So, my final approach looks like this:

code = File.read('code.txt')

code.gsub!(/\\u(.{4})/) do |match|
  [$1.to_i(16)].pack('U')
end

# Use gsub here, not gsub!: gsub! returns nil when nothing is replaced.
File.open('solution.cpp', 'w') { |f| f.puts code.gsub(/\A"|"\Z/, '') }

Also note that I use gsub again at the end to find a leading or trailing quote and replace it with an empty string when writing to the file.
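
The same idea can be tried without touching the file system. This is a sketch under assumptions: the helper name unescape_unicode and the sample string are made up for illustration, and the pattern is tightened to hex digits only (\h), which is a variation on the approach above rather than the original code.

```ruby
# Hypothetical helper; the name and the in-memory sample are assumptions.
def unescape_unicode(text)
  # \h matches a single hex digit, so something like "\uZZZZ" is left untouched.
  text.gsub(/\\u(\h{4})/) { [$1.to_i(16)].pack("U") }
end

# Single quotes keep the backslashes literal, as they would be in the file.
sample  = '"int x \u003D 1\u003B\u000A"'
decoded = unescape_unicode(sample)

# gsub (not gsub!) always returns a String, even when no quote is found.
puts decoded.gsub(/\A"|"\z/, "")
```

The non-mutating gsub is the safer choice when the result is passed straight to puts, because gsub! returns nil if the pattern does not match.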

Replacing %uXXXX to the corresponding Unicode codepoint in Ruby

Try this code:

string.gsub(/%u([0-9A-F]{4})/i){[$1.hex].pack("U")}

In the comments, cremno suggests a faster solution:

string.gsub(/%u([0-9A-F]{4})/i){$1.hex.chr(Encoding::UTF_8)}

In the comments, bobince adds important restrictions, worth reading in full.
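
Those comments are not reproduced here. One limitation that follows from the code itself is that both one-liners convert each %uXXXX group on its own. If the input encodes a character outside the Basic Multilingual Plane as a UTF-16 surrogate pair (as JavaScript's escape() does), the two halves would need to be combined first. A minimal sketch of one way to handle that, assuming the groups are UTF-16 code units; the helper name decode_percent_u is hypothetical:

```ruby
# Hypothetical helper: decodes each run of %uXXXX groups as UTF-16 code units,
# so a surrogate pair combines into a single character.
def decode_percent_u(str)
  str.gsub(/(?:%u\h{4})+/i) do |run|
    units = run.scan(/%u(\h{4})/i).flatten.map(&:hex)
    # Pack as big-endian 16-bit units, label the bytes UTF-16BE, transcode to UTF-8.
    # An unpaired surrogate is expected to raise an encoding error here.
    units.pack("n*").force_encoding(Encoding::UTF_16BE).encode(Encoding::UTF_8)
  end
end

puts decode_percent_u("%u4F60%u597D")
puts decode_percent_u("smile: %uD83D%uDE00")
```

This does not touch plain %XX byte escapes; those would need separate handling.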

converting Unicode code point numbers to Unicode characters

What you may want to look at is the raw_unicode_escape encoding.

>>> len(b'\\uffff')
6
>>> b'\\uffff'.decode('raw_unicode_escape')
'\uffff'
>>> len(b'\\uffff'.decode('raw_unicode_escape'))
1

So, the function would be:

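def ParseString2Unicode(sInString):
    try:
        # raw_unicode_escape reads non-escape bytes as Latin-1, so any
        # non-ASCII characters already in sInString will not round-trip.
        encoded = sInString.encode('utf-8')
        return encoded.decode('raw_unicode_escape')
    except UnicodeError:
        return sInString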

This, however, also matches other unicode escape sequences, like \Uxxxxxxxx. If you just want to match \uxxxx, use a regex, like so:

import re

escape_sequence_re = re.compile(r'\\u[0-9a-fA-F]{4}')

def _escape_sequence_to_char(match):
    return chr(int(match[0][2:], 16))

def ParseString2Unicode(sInString):
    return re.sub(escape_sequence_re, _escape_sequence_to_char, sInString)

