How can I compare a unicode type to a string in python?
You must be looping over the wrong data set. Just loop directly over the JSON-loaded dictionary; there is no need to call .keys() first:
data = json.loads(response)
myList = [item for item in data if item == "number1"]
You may want to use u"number1" to avoid implicit conversions between Unicode and byte strings:
data = json.loads(response)
myList = [item for item in data if item == u"number1"]
Both versions work fine:
>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']
Note that in your first example, us is not a UTF-8 string; it is Unicode data that the json library has already decoded for you. A UTF-8 string, on the other hand, is a sequence of encoded bytes. You may want to read up on Unicode and Python to understand the difference:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
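To make the distinction concrete, here is a small Python 3 sketch, where the text type is str and the byte type is bytes. The sample word is only an illustration and does not come from the question:

```python
# Python 3: str holds Unicode code points, bytes holds encoded data.
text = "café"                     # 4 code points
encoded = text.encode("utf-8")    # 5 bytes: é takes two bytes in UTF-8

print(len(text), len(encoded))    # 4 5
print(encoded)                    # b'caf\xc3\xa9'

# Decoding the bytes gives back the original text.
decoded = encoded.decode("utf-8")
print(decoded == text)            # True
```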
On Python 2, your expectation that your test returns True would be correct; you are doing something else wrong:
>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>
There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:
myComp = [elem for elem in json_data if elem == u"MyString"]
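On Python 3 the situation differs: json.loads returns str keys, and a str never compares equal to a bytes object, so byte strings have to be decoded before comparing. A minimal sketch, assuming the byte strings are UTF-8 encoded; same_text is a hypothetical helper name, not something from the json library:

```python
import json

def same_text(a, b, encoding="utf-8"):
    """Compare two values as text, decoding bytes first (the encoding is an assumption)."""
    if isinstance(a, bytes):
        a = a.decode(encoding)
    if isinstance(b, bytes):
        b = b.decode(encoding)
    return a == b

data = json.loads('{"number1": "first", "number2": "second"}')

# On Python 3, str == bytes is simply False; no implicit conversion happens.
print([key for key in data if key == b"number1"])            # []
print([key for key in data if same_text(key, b"number1")])   # ['number1']
```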
Compare unicode string with byte string
This is probably much easier in Python 3 due to a change in how strings are handled.
Try opening your file with the encoding specified and pass the file-like object to the csv library. See the csv Examples:
import csv

# newline='' lets the csv module handle line endings itself
with open('some.csv', newline='', encoding='UTF-16LE') as fh:
    reader = csv.reader(fh)
    for row in reader:  # reader is iterable
        # work with row
        print(row)
After some comments, it turns out the read attempt comes from an FTP server.
Switching the read to FTP binary mode and reading through an io.TextIOWrapper() may work out.
Out now with even more context managers!:
import io
import csv
from ftplib import FTP

with FTP("ftp.example.org") as ftp:
    ftp.login()  # anonymous login; pass credentials if the server requires them
    with io.BytesIO() as binary_buffer:
        # read all of products.csv into a binary buffer
        ftp.retrbinary("RETR products.csv", binary_buffer.write)
        binary_buffer.seek(0)  # rewind file pointer
        # create a text wrapper to associate an encoding with the file-like for reading
        with io.TextIOWrapper(binary_buffer, encoding="UTF-16LE", newline="") as csv_string:
            for row in csv.reader(csv_string):
                # work with row
                print(row)
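The decoding step can be tried without an FTP server. The sketch below replaces the download with an in-memory byte string; the sample rows are made up for illustration, and the assumption is that the real file is UTF-16LE without a byte order mark:

```python
import csv
import io

# Stand-in for the bytes that retrbinary() would deliver (made-up sample data).
raw = "id,name\r\n1,café\r\n".encode("utf-16-le")

with io.BytesIO(raw) as binary_buffer:
    # newline="" lets the csv module handle line endings itself
    with io.TextIOWrapper(binary_buffer, encoding="utf-16-le", newline="") as text_stream:
        rows = list(csv.reader(text_stream))

print(rows)  # [['id', 'name'], ['1', 'café']]
```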
How do I compare a Unicode string that has different bytes, but the same value?
Unicode normalization will get you there for this one:
>>> import unicodedata
>>> unicodedata.normalize("NFC", "\uf9fb") == "\u7099"
True
Use unicodedata.normalize on both of your strings before comparing them with == to check for canonical Unicode equivalence.
Character U+F9FB is a "CJK Compatibility" character. Characters like this one decompose into their regular CJK equivalents when normalized.
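The same check can be wrapped in a small function. This is a sketch; canonically_equal is a hypothetical name, and NFC is used here although NFD would serve equally well as long as both sides use the same form:

```python
import unicodedata

def canonically_equal(a, b):
    """Return True if a and b are canonically equivalent Unicode strings."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "\u00e9"       # é as a single code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                    # False
print(canonically_equal(composed, decomposed))   # True
print(canonically_equal("\uf9fb", "\u7099"))     # True
```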
How do I check if a string is unicode or ascii?
In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.
In Python 2, a string may be of type str or of type unicode. You can tell which using code something like this:
def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"
This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist of purely characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.
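For Python 3, a rough equivalent might look like the sketch below. It reports the type and, separately, whether the content is limited to the ASCII range. describe is a hypothetical name, and isascii() requires Python 3.7 or later:

```python
def describe(value):
    """Classify a value by its Python 3 type and whether it is ASCII-only."""
    if isinstance(value, str):
        kind = "text (str)"
    elif isinstance(value, (bytes, bytearray)):
        kind = "bytes"
    else:
        return "not a string"
    # isascii() exists on str, bytes and bytearray since Python 3.7
    return kind + (", ASCII only" if value.isascii() else ", contains non-ASCII")

print(describe("hello"))                   # text (str), ASCII only
print(describe("héllo"))                   # text (str), contains non-ASCII
print(describe("héllo".encode("utf-8")))   # bytes, contains non-ASCII
print(describe(42))                        # not a string
```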
How to get Unicode input from user in Python?
\u is an escape sequence recognized in string literals:
Escape sequences only recognized in string literals are:

Escape Sequence   Meaning                                         Notes
\N{name}          Character named name in the Unicode database    (4)
\uxxxx            Character with 16-bit hex value xxxx            (5)
\Uxxxxxxxx        Character with 32-bit hex value xxxxxxxx        (6)

Notes:
(4) Changed in version 3.3: Support for name aliases has been added.
(5) Exactly four hex digits are required.
(6) Any Unicode character can be encoded this way. Exactly eight hex digits are required.
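Text returned by input() is not a string literal, so escape sequences in it are not interpreted. To turn hex digits entered by the user into a character, use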
varUnicode = input('\tEnter your Unicode\n\t>')
print('\\u{}'.format(varUnicode.zfill(4)).encode('raw_unicode_escape').decode('unicode_escape'))
or (maybe better)
varUnicode = input('\tEnter your Unicode\n\t>')
print('\\U{}'.format(varUnicode.zfill(8)).encode('raw_unicode_escape').decode('unicode_escape'))
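An alternative that avoids the escape-sequence round trip (not part of the answer above) is to parse the hex digits as an integer and pass the result to chr(). A minimal sketch, assuming the user types only hex digits such as 00e9; char_from_hex is a hypothetical name:

```python
def char_from_hex(hex_digits):
    """Turn a string of hex digits such as '00e9' into the matching character."""
    return chr(int(hex_digits, 16))

print(char_from_hex("00e9"))    # é
print(char_from_hex("1F600"))   # 😀
```

With input(), this would be used as char_from_hex(input('\tEnter your Unicode\n\t>')). Invalid digits or a value above 10FFFF raise ValueError.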