Writing UTF-8 String to MySQL with Python
I found the solution to my problem. Decoding the string with .decode('unicode_escape').encode('iso8859-1').decode('utf8') finally worked, and everything is now inserted as it should be. The full solution can be found here: Working with unicode encoded Strings from Active Directory via python-ldap
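To make the three steps of that chain concrete, here is a minimal Python 3 sketch. The input value is a hypothetical example (UTF-8 bytes that arrived as literal backslash escapes), not data from the original question, and the variable names are illustrative only.

```python
# Hypothetical input: UTF-8 bytes that were delivered as literal "\xNN" escapes.
raw = b"Bient\\xc3\\xb4t l'\\xc3\\xa9t\\xc3\\xa9"

# 1. Interpret the backslash escapes: "\xc3" becomes the code point U+00C3, etc.
step1 = raw.decode("unicode_escape")

# 2. Map every code point below 256 back to the single byte with the same value.
step2 = step1.encode("iso8859-1")

# 3. Read the resulting bytes as the UTF-8 they always were.
fixed = step2.decode("utf8")

# fixed == "Bientôt l'été"
```

The chain only makes sense when the data really was escaped this way; applied to text that is already correctly decoded, it would corrupt the text or raise an error.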
How to fetch and print utf-8 data from mysql DB using Python?
Your problem is with your terminal (sys.stdout) encoding (cf http://en.wikipedia.org/wiki/Code_page_862), which depends on your system's settings. The best solution (as explained here: https://stackoverflow.com/a/15740694/41316) is to explicitly encode your unicode data before printing it to sys.stdout.
If you can't use a more usable encoding (utf-8 comes to mind, as it has been designed to handle all unicode characters), you can at least use an alternative error handling like "replace" (replaces non-encodable characters with '?') or "ignore" (suppress non-encodable characters).
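The difference between the error handlers can be seen in a few lines. This is a small Python 3 sketch; the sample text and the helper name encode_for_output are made up for illustration, and ASCII stands in for any output encoding that cannot represent every character.

```python
# Sample text: "café" followed by a snowman (U+2603); neither é nor the
# snowman exists in ASCII.
text = "caf\u00e9 \u2603"


def encode_for_output(value, encoding, on_error):
    """Encode text for a terminal, using the given error handler."""
    return value.encode(encoding, on_error)


# 'replace' substitutes '?' for each character that cannot be encoded.
replaced = encode_for_output(text, "ascii", "replace")   # b'caf? ?'

# 'ignore' silently drops those characters.
ignored = encode_for_output(text, "ascii", "ignore")     # b'caf '

# 'strict' (the default) raises UnicodeEncodeError instead.
try:
    encode_for_output(text, "ascii", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# UTF-8 can represent every character, so no handler is ever triggered.
utf8 = encode_for_output(text, "utf-8", "strict")
```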
Here's a corrected version of your code; you can play with the encoding and on_error settings to find out what solution works for you:
import sys
import MySQLdb
# set desired output encoding here
# it looks like your default encoding is "cp862"
# but you may want to try 'utf-8' first
# encoding = "cp862"
encoding = "utf-8"
# what to do when we can't encode to the desired output encoding
# options are:
# - 'strict' : raises a UnicodeEncodeError (default)
# - 'replace': replaces missing characters with '?'
# - 'ignore' : suppresses missing characters
on_error = "replace"
db = MySQLdb.connect(
    "localhost", "matan", "pass", "youtube",
    charset='utf8',
    use_unicode=True
)
cursor = db.cursor()
sql = "SELECT * FROM VIDEOS"
try:
    cursor.execute(sql)
    for i, row in enumerate(cursor):
        try:
            # encode unicode data to the desired output encoding
            title = row[0].encode(encoding, on_error)
            link = row[1].encode(encoding, on_error)
        except UnicodeEncodeError as e:
            # only if on_error='strict'
            print >> sys.stderr, "failed to encode row #%s - %s" % (i, e)
        else:
            print "title=%s\nlink=%s\n\n" % (title, link)
finally:
    cursor.close()
    db.close()
NB: you may also want to read this (especially the comments): http://drj11.wordpress.com/2007/05/14/python-how-is-sysstdoutencoding-chosen/ for more on Python, strings, unicode, encoding, sys.stdout and terminal issues.
How to correctly insert utf-8 characters into a MySQL table using python
Did you try running this query first: set names utf8;
#!/usr/bin/python
# -*- coding: utf-8 -*-
import MySQLdb
mystring = "Bientôt l'été"
myinsert = [{"name": mystring.encode("utf-8").strip()[:65535], "id": 1}]
con = MySQLdb.connect('localhost', 'abc', 'def', 'ghi')
cur = con.cursor()
cur.execute("set names utf8;")  # <--- add this line
# table and column names are quoted with backticks, not single quotes
sql = "INSERT INTO `MyTable` (`my_id`, `my_name`) VALUES (%(id)s, %(name)s)"
cur.executemany(sql, myinsert)
con.commit()
if con: con.close()
UnicodeEncodeError when trying to insert UTF-8 data into Mysql
The major issue is that you're calling str() on Unicode objects. Depending on many factors, this may result in Python trying to encode the Unicode into ASCII, which is not possible with non-ASCII chars.
You should try to keep Unicode objects as Unicode objects for as long as possible in your code and only convert when it's totally necessary. Fortunately, the MySQL driver is Unicode compliant, so you can pass it Unicode strings and it will encode internally. The only thing you need to do is to tell the driver to use UTF-8. Feedparser is also Unicode compliant and is decoding the rss feed automatically to Unicode strings (strings without encoding).
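As a rough illustration of "keep text as Unicode and convert only at the edges", here is a Python 3 sketch. The byte string stands in for data read from a feed and is purely hypothetical; the explicit encode("ascii") call mimics what Python 2's str() attempts implicitly on a unicode object when the default codec is ASCII.

```python
# Stand-in for bytes received from the network (hypothetical sample data).
raw_feed_bytes = "Z\u00fcrich".encode("utf-8")

# Decode once, at the input boundary.
title = raw_feed_bytes.decode("utf-8")

# Work with text everywhere in between.
label = "Title: " + title

# Forcing the text into ASCII is what fails for non-ASCII characters.
try:
    label.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encode once, at the output boundary, with an encoding that can hold the text.
wire_bytes = label.encode("utf-8")
```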
There are also some parts of your code that would benefit from using Python's built-in features like for each in something:, str.format(), and triple quotes (""") for long pieces of text.
Pulling this all together looks like:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import MySQLdb
import feedparser
db = MySQLdb.connect(host="127.0.0.1",
                     user="root",
                     passwd="",
                     db="FeedStuff",
                     charset='UTF8')
urllistnzz = ['international', 'wirtschaft', 'sport']
urllistbernerz = ['kultur', 'wissen', 'leben']
cur = db.cursor()
for uri in urllistbernerz:
    urlbernerz = feedparser.parse('http://www.bernerzeitung.ch/{uri}/rss.html'.format(uri=uri))
    for entry in urlbernerz.entries:
        insert_sql = u"""INSERT INTO articles (title, description, date, category,
                      link, source) VALUES ("{e.title}", "{e.description}",
                      "{e.published}", "{e.category}", "{e.link}", "Berner Zeitung")
                      """.format(e=entry)
        cur.execute(insert_sql)
for uri in urllistnzz:
    urlnzz = feedparser.parse('http://www.nzz.ch/{uri}.rss'.format(uri=uri))
    for entry in urlnzz.entries:
        insert_sql = u"""INSERT INTO articles (title, description, date, category,
                      link, source) VALUES ("{e.title}", "{e.description}",
                      "{e.published}", "{e.category}", "{e.link}", "NZZ")
                      """.format(e=entry)
        cur.execute(insert_sql)
db.commit()
cur.close()
db.close()
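One caveat with formatting values directly into the SQL text is that a title containing a double quote breaks the statement. A hedged alternative is to pass the values separately and let the driver quote them. The sketch below assumes the same articles table and entry attributes as the code above; the helper name insert_entries is hypothetical, and cur is any DB-API cursor that uses the %s placeholder style, as MySQLdb does.

```python
# Placeholders only; the driver fills in and escapes the actual values.
INSERT_SQL = """INSERT INTO articles
    (title, description, date, category, link, source)
    VALUES (%s, %s, %s, %s, %s, %s)"""


def insert_entries(cur, entries, source):
    """Insert feed entries through query parameters (hypothetical helper).

    `entries` is an iterable of objects with title, description, published,
    category and link attributes, such as feedparser entries.
    """
    rows = [
        (e.title, e.description, e.published, e.category, e.link, source)
        for e in entries
    ]
    # One parameterized statement, executed once per row.
    cur.executemany(INSERT_SQL, rows)
    return rows
```

With this helper, each inner loop above would shrink to a single call such as insert_entries(cur, urlnzz.entries, "NZZ").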
How to avoid b' and UTF-8 literals in MySQL using Python 3
The problem is that you're explicitly encoding your string into UTF-8 bytes, and then turning those bytes into their string representation.
That's what this code means:
str(row[3].encode("utf-8"))
If you don't want to do that, just don't do that:
row[3]
Here's an example that shows what you're doing:
>>> s = 'à'
>>> s
'à'
>>> s.encode('utf-8')
b'\xc3\xa0'
>>> str(s.encode('utf-8'))
"b'\\xc3\\xa0'"
What you want here is the first one.
More generally, calling str on a bytes is almost never useful. If you unavoidably have a bytes and you need a str, you get it by calling the decode method. But in this case, you don't unavoidably have a bytes. (I mean, you could write row[3].encode("utf-8").decode("utf-8"), but that would obviously be pretty silly.)
As a side note, but a very important one: you should not be trying to str.format your values into the SQL string. Just use query parameters. Here's the obligatory xkcd link that explains the security/safety problem; on top of that, you're making your code much more complicated, and even less efficient.
In other words, instead of doing this:
"VALUES ({:d}, \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\")".format(row[0], urlparse(row[1]).netloc, row[1], row[2].replace("\"", "'"), article_content, datetime.fromtimestamp(row[4]).strftime("%Y-%m-%d"), updated)
… just do this:
"VALUES (%s, %s, %s, %s, %s, %s, %s)"
And then, when you later execute the query, pass the values as the arguments to execute. Pass them as-is, without all that complicated converting to strings, quoting, and replacing of embedded quotes.
db.execute(q_i, (
    row[0], urlparse(row[1]).netloc, row[1], row[2], article_content,
    datetime.fromtimestamp(row[4]).strftime("%Y-%m-%d"), updated))
In fact, if your next to last column is, or could be, a DATETIME column rather than a CHAR/VARCHAR/TEXT/whatever, you don't even need that strftime; just pass the datetime object.
And notice that this means that you don't need to do anything at all to article_content. The quote stuff is neither necessary nor a good idea (unless you have some other, app-specific reason that you need to avoid " characters in articles), and the encoding stuff is not solving any problem, but only causing a new one.
How to encode (utf8mb4) in Python
MySQL's utf8mb4 encoding is just standard UTF-8.
They had to add that name, however, to distinguish it from MySQL's older, broken utf8 character set, which only supports BMP characters.
In other words, from the Python side you should always encode to UTF-8 when talking to MySQL, but take into account that the database may not be able to handle Unicode codepoints beyond U+FFFF, unless you use utf8mb4 on the MySQL side.
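To check whether a given string actually contains such codepoints, a small Python 3 sketch like the following can help. The function name needs_utf8mb4 is made up for this example.

```python
def needs_utf8mb4(text):
    """Return True if text contains a code point beyond U+FFFF (outside the BMP)."""
    return any(ord(ch) > 0xFFFF for ch in text)


bmp_text = "caf\u00e9"            # "café": every character is inside the BMP
emoji_text = "ok \U0001F600"      # contains U+1F600, which is outside the BMP

# Characters beyond U+FFFF take four bytes in UTF-8.
emoji_bytes = "\U0001F600".encode("utf-8")
```

Here needs_utf8mb4(bmp_text) is False and needs_utf8mb4(emoji_text) is True, while len(emoji_bytes) is 4.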
However, generally speaking, you want to avoid manually encoding and decoding, and instead leave it to MySQLdb to worry about this. You do this by configuring your connection and your collations to handle Unicode text transparently. For MySQLdb, that means setting charset='utf8mb4':
database = MySQLdb.connect(
    host=hostname,
    user=username,
    passwd=password,
    db=databasename,
    charset="utf8mb4"
)
Then use normal Python 3 str strings; leave the use_unicode option set to its default True*.
Note: this handles SET NAMES and SET character_set_connection for you; there is no need to issue those manually.
* Unless you still use Python 2, in which case the default is False. Set it to True and use u'...' unicode strings.
Insert string dumped from a dictionary, which includes utf8 characters, to mysql, in Python
You need to set ensure_ascii=False when calling json.dumps().
For example,
mydict_str = json.dumps(mydict, ensure_ascii=False)
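The effect of the flag is easy to see side by side. This is a short Python 3 sketch; the dictionary content is sample data chosen for illustration.

```python
import json

# Sample data: "Bientôt l'été", written with escapes to keep the source ASCII.
mydict = {"name": "Bient\u00f4t l'\u00e9t\u00e9"}

# Default: every non-ASCII character is written as a \uXXXX escape sequence.
escaped = json.dumps(mydict)

# ensure_ascii=False: the characters are kept as-is in the output string.
literal = json.dumps(mydict, ensure_ascii=False)

# Both forms decode back to the same dictionary.
same = json.loads(escaped) == json.loads(literal) == mydict
```

The escaped form stores the six-character sequence \u00f4 in the database column, while the literal form stores the actual character ô.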