How to Read Unicode Input and Compare Unicode Strings in Python

How can I compare a unicode type to a string in Python?

You must be looping over the wrong data set; just loop directly over the JSON-loaded dictionary, there is no need to call .keys() first:

data = json.loads(response)
myList = [item for item in data if item == "number1"]

You may want to use u"number1" to avoid implicit conversions between Unicode and byte strings:

data = json.loads(response)
myList = [item for item in data if item == u"number1"]

Both versions work fine:

>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']

Note that in your first example, us is not a UTF-8 string; it is Unicode data that the json library has already decoded for you. A UTF-8 string, on the other hand, is a sequence of encoded bytes. You may want to read up on Unicode and Python to understand the difference:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • The Python Unicode HOWTO

  • Pragmatic Unicode by Ned Batchelder
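To make the difference between decoded text and encoded bytes concrete, here is a minimal sketch. It assumes Python 3, where text is str and encoded data is bytes; the sample word is only an illustration.

```python
# Decoded text: a sequence of Unicode code points.
text = "caf\u00e9"

# Encoded bytes: the same text serialized as UTF-8.
encoded = text.encode("utf-8")

# The text has 4 characters, but the UTF-8 form needs 5 bytes,
# because U+00E9 takes two bytes in UTF-8.
text_length = len(text)
byte_length = len(encoded)

# Decoding with the matching codec restores the original text.
round_trip = encoded.decode("utf-8")
```

Comparisons are only meaningful once both sides are the same kind of object, which is why decoding at the input boundary is usually the simplest approach.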

On Python 2, your expectation that the test returns True is correct, so you must be doing something else wrong:

>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>

There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:

myComp = [elem for elem in json_data if elem == u"MyString"]
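For readers on Python 3, the same check can be sketched as follows. This assumes Python 3, where json.loads returns str keys and a str never compares equal to a bytes object; the payload is the sample used earlier.

```python
import json

data = json.loads('{"number1": "first", "number2": "second"}')

# Keys are already decoded text, so a plain string literal matches.
text_matches = [key for key in data if key == "number1"]

# A bytes literal never equals a str in Python 3, so nothing matches.
bytes_matches = [key for key in data if key == b"number1"]

# Decode the bytes first if that is what you have to compare against.
decoded_matches = [key for key in data if key == b"number1".decode("utf-8")]
```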

Compare unicode string with byte string

This is probably much easier in Python 3 due to a change in how strings are handled.

Try opening your file with the encoding specified, then pass the file-like object to the csv library (see the examples in the csv documentation):

import csv

with open('some.csv', newline='', encoding='UTF-16LE') as fh:
    reader = csv.reader(fh)
    for row in reader:  # reader is iterable
        print(row)  # work with row
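The approach above can be tried end to end without an existing file. This is only a sketch: the helper name read_csv_rows and the file name sample.csv are made up for illustration, and the file is written first so that the example is self-contained.

```python
import csv
import os
import tempfile


def read_csv_rows(path, encoding):
    """Return all rows of a CSV file stored in the given text encoding."""
    with open(path, newline="", encoding=encoding) as fh:
        return list(csv.reader(fh))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")  # hypothetical file name

    # Write a small UTF-16LE file to read back.
    with open(path, "w", newline="", encoding="utf-16-le") as fh:
        csv.writer(fh).writerows([["name", "price"], ["caf\u00e9", "3"]])

    rows = read_csv_rows(path, "utf-16-le")

# Every cell is already a str, so it compares directly with string literals.
has_cafe = any(row[0] == "caf\u00e9" for row in rows)
```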

After some comments, it turned out that the file is read from an FTP server.

Switching the transfer to FTP binary mode and reading the result through an io.TextIOWrapper() may work:

Out now with even more context managers!:

import io
import csv
from ftplib import FTP

with FTP("ftp.example.org") as ftp:
    ftp.login()  # anonymous login; pass credentials if the server requires them
    with io.BytesIO() as binary_buffer:
        # read all of products.csv into a binary buffer
        ftp.retrbinary("RETR products.csv", binary_buffer.write)
        binary_buffer.seek(0)  # rewind file pointer
        # create a text wrapper to associate an encoding with the file-like for reading
        with io.TextIOWrapper(binary_buffer, encoding="UTF-16LE", newline="") as csv_string:
            for row in csv.reader(csv_string):
                print(row)  # work with row
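The buffer-and-wrapper part can be checked without an FTP server. In this sketch the raw bytes are made-up sample data, and the binary_buffer.write call stands in for the callback that ftp.retrbinary would invoke for each chunk.

```python
import csv
import io

# Made-up payload standing in for the bytes of a UTF-16LE encoded CSV file.
raw = "id,name\r\n1,caf\u00e9\r\n".encode("utf-16-le")

binary_buffer = io.BytesIO()
binary_buffer.write(raw)  # what the retrbinary callback would do per chunk
binary_buffer.seek(0)     # rewind before reading

# The wrapper decodes bytes to text lazily as the csv reader consumes it.
with io.TextIOWrapper(binary_buffer, encoding="utf-16-le", newline="") as text_stream:
    rows = list(csv.reader(text_stream))
```

Closing the TextIOWrapper also closes the underlying buffer, so collect the rows before leaving the with block.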

How do I compare a Unicode string that has different bytes, but the same value?

Unicode normalization will get you there for this one:

>>> import unicodedata
>>> unicodedata.normalize("NFC", "\uf9fb") == "\u7099"
True

Use unicodedata.normalize on both of your strings before comparing them with == to check for canonical Unicode equivalence.

Character U+F9FB is a "CJK Compatibility" character. These characters decompose into one or more regular CJK characters when normalized.
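A small helper can wrap this pattern. The sketch below assumes NFC is the form you want; the function name canonically_equal is made up for illustration.

```python
import unicodedata


def canonically_equal(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# "e with acute" as one code point versus "e" plus a combining accent.
composed = "\u00e9"
decomposed = "e\u0301"

plain_equal = composed == decomposed
normalized_equal = canonically_equal(composed, decomposed)

# The pair from the example above, checked through the same helper.
same_cjk = canonically_equal("\uf9fb", "\u7099")
```

Normalizing both sides matters: normalizing only one string works in the example above only because the other one is already in normal form.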

How do I check if a string is unicode or ascii?

In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which using code something like this:

def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist purely of characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.
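A Python 3 counterpart might look like the sketch below. The function names describe and is_ascii_only are made up for illustration; the type check and the content check are kept separate because they answer different questions.

```python
def describe(value):
    """Report the Python type category of a value (Python 3)."""
    if isinstance(value, str):
        return "text (str)"
    if isinstance(value, (bytes, bytearray)):
        return "bytes"
    return "not a string"


def is_ascii_only(text):
    """Return True if every character of a str is in the ASCII range."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True
```

For example, describe("abc") and describe("caf\u00e9") give the same answer because both are str, while is_ascii_only tells them apart.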

How to get Unicode input from user in Python?

\u is an escape sequence recognized in string literals:

Escape sequences only recognized in string literals are:

Escape Sequence   Meaning                                        Notes
\N{name}          Character named name in the Unicode database   (4)
\uxxxx            Character with 16-bit hex value xxxx           (5)
\Uxxxxxxxx        Character with 32-bit hex value xxxxxxxx       (6)

Notes:

  4. Changed in version 3.3: Support for name aliases has been added.
  5. Exactly four hex digits are required.
  6. Any Unicode character can be encoded this way. Exactly eight hex digits are required.
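As a quick illustration of these escapes, the sketch below writes the same character three ways. It also shows why text typed by a user is not converted: the escape is processed only when Python reads a string literal in source code, and a raw literal is used here to mimic typed input.

```python
# Three spellings of the same one-character string literal.
by_name = "\N{LATIN SMALL LETTER E WITH ACUTE}"
by_u4 = "\u00e9"
by_u8 = "\U000000e9"

all_same = by_name == by_u4 == by_u8

# A raw literal keeps the backslash, like text returned by input().
typed = r"\u00e9"
typed_length = len(typed)  # six separate characters, not one
```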

Use

varUnicode = input('\tEnter your Unicode\n\t>')
print('\\u{}'.format(varUnicode.zfill(4)).encode('raw_unicode_escape').decode('unicode_escape'))

or (maybe better)

varUnicode = input('\tEnter your Unicode\n\t>')
print('\\U{}'.format(varUnicode.zfill(8)).encode('raw_unicode_escape').decode('unicode_escape'))
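If the user types a hexadecimal code point, an alternative that avoids the escape round trip is to convert the number directly with chr. This is only a sketch; the helper name codepoint_to_char and the accepted prefixes are assumptions made for illustration.

```python
def codepoint_to_char(hex_digits):
    """Convert a hex code point such as '41' or 'U+1F600' to a character."""
    cleaned = hex_digits.strip()
    # Tolerate a few common two-character prefixes: U+, \u, \U and 0x.
    if cleaned.lower().startswith(("u+", "\\u", "0x")):
        cleaned = cleaned[2:]
    # int() raises ValueError for non-hex input, and chr() raises
    # ValueError for values outside the Unicode range.
    return chr(int(cleaned, 16))
```

With input(), this could be called as print(codepoint_to_char(input('\tEnter your Unicode\n\t>'))).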

