Print Unicode escape codes from variable
One cannot use \u together with string interpolation, since the \u escape takes precedence. What one can do instead is Array#pack an array of integers:
▶ data.map { |e| e.to_i(16) }.pack 'U*'
How to print out strings with unicode escape characters correctly
The \u00e0 in a Python string literal is stored as a single Unicode code point, so it prints as 'à'. When you read the same text from a file, it arrives as plain characters, that is, as '\\u00e0': a backslash, a u, and four hex digits, each stored as a separate character.
A solution is to find each '\\u00e0' in the string and replace it with '\u00e0'.
Here is some code that converts every '\\uXXXX' sequence in the string into the character it is supposed to be.
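def special_char_fix(string):
    string = list(string)
    for pl, char in enumerate(string):
        # Only treat a backslash followed by 'u' as the start of an escape.
        if char == '\\' and string[pl + 1:pl + 2] == ['u']:
            # The four hex digits follow the backslash and the 'u'.
            val = ''.join(string[pl + 2:pl + 6])
            # Drop five of the six characters; the remaining slot is reused below.
            for k in range(5):
                string.pop(pl)
            string[pl] = chr(int(val, 16))
    return ''.join(string)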
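The same replacement can also be sketched with a regular expression, which avoids editing the list while looping over it. This is a minimal sketch rather than the answer's own method; the name decode_u_escapes is hypothetical, and it assumes every escape has exactly four hex digits.

```python
import re

# Matches a literal backslash, a 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def decode_u_escapes(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

print(decode_u_escapes('voil\\u00e0'))  # the input holds a literal backslash
```

Backslashes that do not start a four-digit \u escape are left untouched.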
How to print a variable that contains a unicode character?
The \u escape is not meaningful inside a non-unicode (byte) string in Python 2. You need to write a = u'\u0D05'.
If you are getting the string from somewhere else and need to interpret the Unicode escapes in it, use print a.decode('unicode-escape').
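In Python 3, str has no decode method, so the same idea needs a detour through bytes. A minimal sketch, assuming the text contains only ASCII characters besides the escapes (the variable names are illustrative):

```python
raw = '\\u0D05'  # six characters: a backslash, 'u', and four hex digits

# unicode_escape is a bytes-to-str codec, so encode to ASCII bytes first.
# Assumption: raw is pure ASCII; encode('ascii') raises UnicodeEncodeError otherwise.
decoded = raw.encode('ascii').decode('unicode_escape')

print(len(raw), len(decoded))
print(decoded)
```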
How to print Unicode like “u{variable}” in Python 2.7?
This is probably not a great way, but it's a start:
>>> import struct
>>> x = '00e4'
>>> print unicode(struct.pack("!I", int(x, 16)), 'utf_32_be')
ä
First, we get the integer represented by the hexadecimal string x. We pack that into a byte string, which we can then decode using the utf_32_be encoding.
Since you are doing this a lot, you can precompile the struct:
import struct

int2bytes = struct.Struct("!I").pack
with open("someFileWithAListOfUnicodeCodePoints") as fh:
    for code_point in fh:
        print unicode(int2bytes(int(code_point, 16)), 'utf_32_be')
If you think it's clearer, you can also use the decode method instead of the unicode type directly:
>>> print int2bytes(int('00e4', 16)).decode('utf_32_be')
ä
Python 3 added a to_bytes method to the int class that lets you bypass the struct module:
>>> str(int('00e4', 16).to_bytes(4, 'big'), 'utf_32_be')
'ä'
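In Python 3 the built-in chr() maps a code point straight to a character, so the round trip through bytes is optional. A minimal sketch, where char_from_hex is a hypothetical helper name and an in-memory list stands in for the file of code points:

```python
def char_from_hex(code_point):
    """Return the character for a hexadecimal code point string such as '00e4'."""
    # int() tolerates surrounding whitespace, so lines read from a file work too.
    return chr(int(code_point, 16))

# Stand-in for lines read from a file of code points (illustrative data).
lines = ['00e4\n', '05e2\n', '2665\n']
for line in lines:
    print(char_from_hex(line))
```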
How to get the Unicode character from a code point variable?
All you need is a \ before u05e2. To print a Unicode character, you must provide a Unicode format string.
a = '\u05e2'
print(u'{}'.format(a))
#Output
ע
With the other approach, where the \ is written inside the format string passed to print(), Python treats the \ as a literal backslash before the value is substituted, so the result is the escape text rather than the character.
a = 'u05e2'
print(u'\{}'.format(a))
#Output
\u05e2
A way to check that a string holds a single Unicode character is the built-in ord() function. It returns the Unicode code point (an integer) of the character passed to it, and it accepts only a string of length one.
a = '\u05e2'
print(ord(a)) #1506, the Unicode code point for the Unicode string stored in a
To print the Unicode character for the above code point (1506), use the c presentation type in the format specification. This is explained in the Python docs.
print('{0:c}'.format(1506))
#Output
ע
If we pass a string that merely spells out the code, such as 'u05e2', to ord(), we get an error, because it is five separate characters rather than a single Unicode character.
a = 'u05e2'
print(ord(a))
#Error
TypeError: ord() expected a character, but string of length 5 found
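To inspect a string of unknown length without triggering that error, ord() can be applied to each character in turn. A small sketch (code_points is a hypothetical helper):

```python
def code_points(text):
    """Return the code point of every character in text as 'U+XXXX' strings."""
    return ['U+%04X' % ord(ch) for ch in text]

print(code_points('\u05e2'))  # one character
print(code_points('u05e2'))   # five ordinary ASCII characters
```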
Printing escaped Unicode in Python
>>> s='auszuschließen…'
>>> s
'auszuschließen…'
>>> print(s)
auszuschließen…
>>> b=s.encode('ascii','xmlcharrefreplace')
>>> b
b'auszuschlie&#223;en&#8230;'
>>> print(b)
b'auszuschlie&#223;en&#8230;'
>>> b.decode()
'auszuschlie&#223;en&#8230;'
>>> print(b.decode())
auszuschlie&#223;en&#8230;
You start out with a Unicode string. Encoding it to ascii with xmlcharrefreplace creates a bytes object in which every non-ASCII character is replaced by a numeric character reference. Printing a bytes object shows its repr, which is why the b prefix and the quotes appear. Calling decode explicitly converts it back to a string; the default encoding is utf-8, and since your bytes consist only of ascii, which is a subset of utf-8, it is guaranteed to work.
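If the goal is to get the original characters back from the escaped text, the numeric character references can be reversed with html.unescape from the standard library. A minimal sketch of the round trip:

```python
import html

s = 'auszuschlie\u00dfen\u2026'  # the same text as above, written with escapes

# Replace every non-ASCII character by a numeric character reference.
escaped = s.encode('ascii', 'xmlcharrefreplace').decode('ascii')
print(escaped)

# html.unescape turns the references back into the original characters.
restored = html.unescape(escaped)
print(restored == s)
```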
Unicode characters printed as escape sequences inside python object
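In Python 2, when you print a list, you end up printing the repr of each item in that list, and the repr of a unicode string escapes non-ASCII characters.
In Python 3, a string's repr keeps printable non-ASCII characters as they are, so it shows the same characters as its str value (plus quotes). You can observe this below:
Python 2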
>>> val = "אבג".decode('utf-8')
>>> val # displays repr value
u'\u05d0\u05d1\u05d2'
>>> print val # displays str value
אבג
And, as mentioned,
>>> print [val]
[u'\u05d0\u05d1\u05d2']
Contrast this with Python 3, where str objects do not have a decode method because they are already decoded.
>>> val = "אבג"
>>> val
'אבג'
>>> print(val)
אבג
>>> print([val])
['אבג']
You can see this is why it works now.
For your problem, if you want to view the character as it is when you print the dict, you can do this:
print dict['LOCID']
Side note: do not use dict as a variable name, since it shadows the very important built-in class you are using.
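In Python 3 the escaped form is still available on request through the built-in ascii(), which escapes non-ASCII text much as Python 2's repr did. A short sketch:

```python
val = '\u05d0\u05d1\u05d2'  # the same three Hebrew letters as above

print(repr(val))     # printable non-ASCII characters are kept as they are
print(ascii(val))    # non-ASCII characters are shown as \u escapes
print(ascii([val]))  # works for containers as well
```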
Print unicode from formatted string
"\u{}" throws that error because the string representation \unnnn is not supposed to work with variables; it's a literal, immediate value. Much like you cannot do x = 't'; print ('a\{}b'.format(x)) and expect a tab between a and b.
To print any Unicode character, either enter its literal code immediately into the string itself:
print ('Hello \u2665 world')
result:
Hello ♥ world
Note that you don't need the u prefix on the string itself; that's a Python 2.x-ism. Or, if you want to provide the character value in a variable:
print ('Hello {:c} world'.format(0x2665))
where (1) the :c forces a character representation of the value, and (2) you need to indicate that the value itself is in hex (as the string representation \unnnn is always in hex).
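When the code arrives as a hex string rather than an integer, it can be converted with int(..., 16) before formatting. A small sketch; the variable names are illustrative:

```python
code = '2665'  # hex digits only, without the \u prefix

# int(code, 16) turns the hex string into the integer that :c expects.
print('Hello {:c} world'.format(int(code, 16)))

# The same conversion works inside an f-string (Python 3.6+).
message = f'Hello {int(code, 16):c} world'
print(message)
```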
How to escape unicode special chars in string and write it to UTF encoded file
Another solution, not relying on the built-in repr() but rather implementing it from scratch:
import re

orig = 'Bitte überprüfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.'
enc = re.sub('[^ -~]', lambda m: '\\u%04X' % ord(m[0]), orig)
print(enc)
Differences:
- Encodes only using \u, never any other sequence, whereas repr() uses several different escape forms (so for example the BEL character will be encoded as \u0007 rather than \x07)
- Upper-case encoding, as specified (\u00FC rather than \u00fc)
- Does not handle unicode characters outside plane 0 (could be extended easily, given a spec for how those should be represented)
- It does not take care of any pre-existing \u sequences, whereas repr() turns those into \\u; could be extended, perhaps to encode \ as \u005C:
enc = re.sub(r'[^ -[\]-~]', lambda m: '\\u%04X' % ord(m[0]), orig)
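The plane 0 limitation could be lifted along these lines. The sketch below assumes, purely for illustration, that characters above U+FFFF should be written as \UXXXXXXXX (Python's own eight-digit form); the answer leaves that representation unspecified, and escape_non_ascii is a hypothetical name.

```python
import re

def escape_non_ascii(text):
    """Escape everything outside printable ASCII as \\uXXXX or \\UXXXXXXXX."""
    def repl(match):
        cp = ord(match[0])
        # Assumption: code points beyond plane 0 use the eight-digit form.
        if cp > 0xFFFF:
            return '\\U%08X' % cp
        return '\\u%04X' % cp
    return re.sub(r'[^ -~]', repl, text)

print(escape_non_ascii('gr\u00fc\u00dfe \U0001F600'))
```

A different convention, such as UTF-16 surrogate pairs, would only require changing the branch for code points above 0xFFFF.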