Print Unicode Escape Codes from Variable

Print Unicode escape codes from variable

One cannot use \u together with string interpolation, because the \u escape is processed before the interpolation happens. What one can do instead is convert the hex strings to integers and Array#pack them:

▶ data.map { |e| e.to_i(16) }.pack 'U*'
#⇒ (a String made of the characters with those code points)

How to print out strings with unicode escape characters correctly

In Python source code, \u00e0 inside a string literal is an escape sequence that Python converts to the single character 'à'. When the same text is read from another file, no escape processing takes place, so it arrives as the six separate characters \, u, 0, 0, e, 0, which Python displays as '\\u00e0'.
A solution is to find each '\\u00e0'-style sequence in the string and replace it with the character it stands for.

Here is some code that converts each '\\u00e0' in the string into the character it is supposed to be.

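def special_char_fix(string):
    # Work on a list of characters so each escape can be edited in place.
    string = list(string)
    for pl, char in enumerate(string):
        # A literal escape is a backslash and 'u', assumed to be followed by four hex digits.
        if char == '\\' and ''.join(string[pl + 1:pl + 2]) == 'u':
            val = ''.join(string[pl + 2:pl + 6])
            # Drop the backslash, 'u', and the first three digits ...
            del string[pl:pl + 5]
            # ... then overwrite the last digit with the decoded character.
            string[pl] = chr(int(val, 16))
    return ''.join(string)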
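
The same replacement can also be sketched with a regular expression, which avoids editing a list while looping over it. This is an alternative to the function above, not the original author's code; the name unescape_u is made up for the example, and it only handles the four-digit \uXXXX form.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def unescape_u(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

raw = 'voil\\u00e0'          # what you get when the text is read from a file
fixed = unescape_u(raw)      # the escape is now the single character U+00E0
print(fixed)
```

Backslashes that are not followed by u and four hex digits are left untouched, so stray backslashes in the input do not raise an error.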

How to print a variable that contains a unicode character?

In Python 2, the \u escape is not meaningful inside a non-unicode (byte) string literal. You need to use a unicode literal instead: a = u'\u0D05'.

If the string comes from somewhere else and you need to interpret the Unicode escapes in it, use print a.decode('unicode-escape').
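
In Python 3, str has no decode method, so the call above needs an extra step. A minimal sketch, assuming the input contains only ASCII characters plus escape sequences (the helper name decode_escapes is hypothetical):

```python
def decode_escapes(text):
    """Interpret backslash escapes such as \\u0D05 in an ASCII-only str."""
    # encode() turns the str into bytes; the unicode_escape codec then
    # processes the escape sequences while decoding back to str.
    return text.encode('ascii').decode('unicode_escape')

a = '\\u0D05'               # six literal characters: backslash, u, 0, D, 0, 5
print(decode_escapes(a))    # prints the single character U+0D05
```

Input that already contains non-ASCII characters would fail at the encode step with a UnicodeEncodeError, which is why the ASCII-only assumption matters here.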

How to print Unicode like “u{variable}” in Python 2.7?

This is probably not a great way, but it's a start:

>>> import struct
>>> x = '00e4'
>>> print unicode(struct.pack("!I", int(x, 16)), 'utf_32_be')
ä

First, we get the integer represented by the hexadecimal string x. We pack that into a byte string, which we can then decode using the utf_32_be encoding.

Since you are doing this a lot, you can precompile the struct:

import struct

int2bytes = struct.Struct("!I").pack
with open("someFileWithAListOfUnicodeCodePoints") as fh:
    for code_point in fh:
        # int() ignores the trailing newline on each line
        print unicode(int2bytes(int(code_point, 16)), 'utf_32_be')

If you think it's clearer, you can also use the decode method instead of the unicode type directly:

>>> print int2bytes(int('00e4', 16)).decode('utf_32_be')
ä
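
On Python 3, where unicode and the print statement no longer exist, the same loop can be sketched with the built-in chr(), which takes a code point directly. The example below also checks that the struct round trip gives the same result; io.StringIO stands in for the file, and its contents are invented for illustration.

```python
import io
import struct

int2bytes = struct.Struct("!I").pack

def char_via_struct(hex_code):
    """Pack the code point into 4 big-endian bytes and decode as UTF-32."""
    return int2bytes(int(hex_code, 16)).decode('utf_32_be')

def char_via_chr(hex_code):
    """Convert the code point straight to a one-character str."""
    return chr(int(hex_code, 16))

# Stand-in for a file with one hexadecimal code point per line.
fh = io.StringIO('00e4\n00f6\n00fc\n')
for code_point in fh:
    # int() tolerates the trailing newline, so no strip() is needed.
    assert char_via_struct(code_point) == char_via_chr(code_point)
    print(char_via_chr(code_point))
```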

Python 3 added a to_bytes method to the int class that lets you bypass the struct module:

>>> str(int('00e4', 16).to_bytes(4, 'big'), 'utf_32_be')
'ä'

How to get the Unicode character from a code point variable?

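All you need is a \ before u05e2, so that the literal contains a real escape sequence. To print a Unicode character, the string must contain the character itself, and the \u escape in a literal provides exactly that. (In Python 3 the u prefix on the format string is optional.)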

a = '\u05e2'
print(u'{}'.format(a))

#Output
ע

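When you try the other approach and put the \ inside the format string, it does not give the desired result. Escape sequences are resolved when the literal is compiled, before format() runs, and '\{' is not a recognized escape, so the backslash stays a literal backslash.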

a = 'u05e2'
print(u'\{}'.format(a))

#Output
\u05e2

One way to check that a string really holds a single Unicode character is the built-in ord() function. It returns the Unicode code point (an integer) of the character passed to it, and it only accepts a string of length one.

a = '\u05e2'
print(ord(a)) #1506, the Unicode code point for the Unicode string stored in a

To print the Unicode character for the code point above (1506), use the c presentation type in the format specification. This is explained in the Python docs.

print('{0:c}'.format(1506))

#Output
ע

If we pass the plain string 'u05e2' to ord(), we get an error, because it is a five-character string rather than a single character.

a = 'u05e2'
print(ord(a))

#Error
TypeError: ord() expected a character, but string of length 5 found
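
If the variable holds the text 'u05e2' rather than a real escape, one option is to strip the leading u and convert the remaining hex digits with chr(). A minimal sketch; the function name and the accepted input forms are assumptions made for illustration.

```python
def char_from_code(text):
    """Turn 'u05e2', a literal backslash plus 'u05e2', or '05e2' into one character."""
    # Hex digits never include a backslash or 'u', so stripping them is safe.
    hex_digits = text.lstrip('\\').lstrip('uU')
    return chr(int(hex_digits, 16))

a = 'u05e2'
ch = char_from_code(a)
print(ch)
print(ord(ch))   # 1506
```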

Printing escaped Unicode in Python

>>> s='auszuschließen…'
>>> s
'auszuschließen…'
>>> print(s)
auszuschließen…
>>> b=s.encode('ascii','xmlcharrefreplace')
>>> b
b'auszuschlie&#223;en&#8230;'
>>> print(b)
b'auszuschlie&#223;en&#8230;'
>>> b.decode()
'auszuschlie&#223;en&#8230;'
>>> print(b.decode())
auszuschlie&#223;en&#8230;

You start out with a Unicode string. Encoding it to ASCII with xmlcharrefreplace creates a bytes object in which each non-ASCII character is replaced by an XML character reference. print() shows a bytes object using its repr, which is where the b prefix and the quotes come from. Calling decode() explicitly converts it back to a str; the default encoding is UTF-8, and since the bytes consist only of ASCII, which is a subset of UTF-8, this is guaranteed to work.
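
If backslash escapes are wanted instead of XML character references, the same encode/decode pattern should also work with the backslashreplace error handler. A small sketch along the lines of the session above (the variable names are illustrative):

```python
s = 'auszuschließen…'

# Non-ASCII characters become &#NNN; references (decimal code points).
xml_escaped = s.encode('ascii', 'xmlcharrefreplace').decode('ascii')

# Non-ASCII characters become \xNN, \uNNNN or \UNNNNNNNN escapes.
backslash_escaped = s.encode('ascii', 'backslashreplace').decode('ascii')

print(xml_escaped)
print(backslash_escaped)
```

Note that backslashreplace picks the shortest escape form for each character, so ß comes out as a \x escape rather than a \u escape.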

Unicode characters printed as escape sequences inside python object

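In Python 2, when you print a list, you end up printing the repr of each item in that list, and the repr of a unicode string escapes non-ASCII characters.

In Python 3, a string's repr keeps printable non-ASCII characters as they are, so apart from the surrounding quotes it looks the same as its str value. You can observe this below: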

Python2

>>> val = "אבג".decode('utf-8')
>>> val # displays repr value
u'\u05d0\u05d1\u05d2'
>>> print val # displays str value
אבג

And, as mentioned,

>>> print [val]
[u'\u05d0\u05d1\u05d2']

In contrast, Python 3 str objects do not have a decode method; they are already decoded.

>>> val = "אבג"
>>> val
'אבג'
>>> print(val)
אבג
>>> print([val])
['אבג']

This is why printing the list shows readable characters in Python 3.

For your problem, if you want to see the character itself rather than its escape, print the value instead of the whole dict:

print dict['LOCID']

Side note: do not use dict as a variable name, since it shadows the built-in dict class you are using.
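
The two kinds of display can be reproduced side by side in Python 3 with repr() and ascii(); ascii() escapes non-ASCII characters much like the Python 2 repr did. A brief sketch (the variable names are illustrative, and the Hebrew letters are written as escapes only to keep the source ASCII):

```python
val = '\u05d0\u05d1\u05d2'   # the three Hebrew letters alef, bet, gimel
items = [val]

# str() of a list uses repr() for each element, in both Python 2 and 3.
shown = str(items)

# Python 3 repr keeps printable non-ASCII characters as they are.
assert shown == "['" + val + "']"

# ascii() escapes them instead, similar to the Python 2 display.
escaped = ascii(items)

print(shown)
print(escaped)
```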

Print unicode from formatted string

"\u{}" throws that error because the \unnnn escape is not meant to work with variables; it is part of the literal and is resolved when the source is compiled, before format() runs. In the same way, you cannot do x = 't'; print ('a\{}b'.format(x)) and expect a tab between a and b.

To print any Unicode character, either enter its literal code immediately into the string itself:

 print ('Hello \u2665 world')

result:

Hello ♥ world

Note that you don't need the u prefix on the string itself; that is a Python 2.x-ism. Or, if you want to provide the character value in a variable:

print ('Hello {:c} world'.format(0x2665))

where (1) the :c forces a character representation of the value, and (2) you need to indicate that the value itself is in hex. (As the string representation \unnnn is always in hex.)
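
Because \u escapes are resolved when the source is compiled, the failing form can be observed without crashing a script by handing it to compile(). A hedged sketch; the source string and variable names are made up for the example.

```python
# The text of a program that tries to combine \u with str.format().
bad_source = r'"\u{}".format("2665")'

try:
    compile(bad_source, '<example>', 'eval')
    failed = False
except SyntaxError:
    # The literal is rejected before format() could ever run.
    failed = True

code = '2665'                                  # hex digits held in a variable
good = 'Hello {:c} world'.format(int(code, 16))
print(failed, good)
```

When the code point arrives as a hex string, as it does here, int(code, 16) performs the hex conversion that the 0x prefix performs in a literal.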

How to escape unicode special chars in string and write it to UTF encoded file

Another solution, not relying on the built-in repr() but rather implementing it from scratch:

import re

orig = 'Bitte überprüfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.'

enc = re.sub('[^ -~]', lambda m: '\\u%04X' % ord(m[0]), orig)

print(enc)

Differences:

  • Encodes only using \u, never any other sequence, whereas repr() also uses forms such as \n, \t and \xNN (so for example the BEL character will be encoded as \u0007 rather than \x07)
  • Upper-case encoding, as specified (\u00FC rather than \u00fc)
  • Does not handle unicode characters outside plane 0 (could be extended easily, given a spec for how those should be represented)
  • It does not take care of any pre-existing \u sequences, whereas repr() turns those into \\u; could be extended, perhaps to encode \ as \u005C:
    enc = re.sub(r'[^ -[\]-~]', lambda m: '\\u%04X' % ord(m[0]), orig)
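
For characters outside plane 0, a target format has to be chosen first. The sketch below assumes the convention used by JSON and Java source files, where such a character is written as a UTF-16 surrogate pair of two \uXXXX escapes. That convention is an assumption, not something the answer specifies, and the function name is hypothetical.

```python
import re

def escape_non_ascii(text):
    """Escape everything outside printable ASCII as upper-case \\uXXXX."""
    def repl(match):
        cp = ord(match[0])
        if cp <= 0xFFFF:
            return '\\u%04X' % cp
        # Assumed convention: split into a UTF-16 surrogate pair.
        cp -= 0x10000
        high = 0xD800 + (cp >> 10)
        low = 0xDC00 + (cp & 0x3FF)
        return '\\u%04X\\u%04X' % (high, low)
    return re.sub('[^ -~]', repl, text)

print(escape_non_ascii('über 😀'))
```

If the consumer of the file expects a different form for these characters, for example \UXXXXXXXX, only the last return statement of repl would need to change.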

