Writing UTF-8 String to MySQL with Python

I found the solution to my problem. Decoding the string with .decode('unicode_escape').encode('iso8859-1').decode('utf8') finally worked, and everything is now inserted as it should be. The full solution is described here: Working with unicode encoded Strings from Active Directory via python-ldap
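
For readers on Python 3, where str has no .decode method, the same chain might be sketched as below. This assumes the input is text containing literal backslash escapes of UTF-8 bytes (such as \xc3\xb4), which is the situation the chain appears to repair. The function name unescape_utf8 and the sample string are made up for this illustration.

```python
def unescape_utf8(escaped):
    """Turn text holding literal backslash-x escapes of UTF-8 bytes into real text."""
    # Step 1: interpret the escapes; each \xNN becomes the code point U+00NN.
    as_latin1_text = escaped.encode("ascii").decode("unicode_escape")
    # Step 2: map those code points back to the raw bytes they stand for.
    raw_bytes = as_latin1_text.encode("iso8859-1")
    # Step 3: decode the raw bytes as UTF-8 to get the intended characters.
    return raw_bytes.decode("utf8")


# Hypothetical input: the backslashes are literal characters in the string.
fixed = unescape_utf8("Bient\\xc3\\xb4t l'\\xc3\\xa9t\\xc3\\xa9")
print(fixed)
```

The encode("ascii") step in this sketch only accepts ASCII input, so it fits strings that are fully escaped, not strings that already contain some decoded non-ASCII characters.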

How to fetch and print utf-8 data from mysql DB using Python?

Your problem is with your terminal (sys.stdout) encoding (cf. http://en.wikipedia.org/wiki/Code_page_862), which depends on your system's settings. The best solution (as explained here: https://stackoverflow.com/a/15740694/41316) is to explicitly encode your unicode data before printing it to sys.stdout.

If you can't use a more capable encoding (utf-8 comes to mind, as it has been designed to handle all unicode characters), you can at least use an alternative error handler such as "replace" (replaces non-encodable characters with '?') or "ignore" (drops non-encodable characters).
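
To make the difference between the error handlers concrete, here is a small Python 3 sketch. The sample text is made up, and cp862 is used only because it is the code page mentioned above; the two CJK characters in the sample are not part of that code page.

```python
text = "title: \u4e2d\u6587"  # two CJK characters that cp862 cannot represent

replaced = text.encode("cp862", "replace")  # unencodable characters become "?"
ignored = text.encode("cp862", "ignore")    # unencodable characters are dropped

try:
    text.encode("cp862", "strict")          # the default: raise on the first failure
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(replaced, ignored, strict_failed)
```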

Here's a corrected version of your code; you can play with the encoding and on_error settings to find out which solution works for you:

# Python 2 code (print statement, unicode objects)
import sys
import MySQLdb

# set the desired output encoding here
# it looks like your default encoding is "cp862"
# but you may want to try 'utf-8' first
# encoding = "cp862"
encoding = "utf-8"

# what to do when we can't encode to the desired output encoding
# options are:
# - 'strict' : raises a UnicodeEncodeError (default)
# - 'replace': replaces missing characters with '?'
# - 'ignore' : drops missing characters
on_error = "replace"

db = MySQLdb.connect(
    "localhost", "matan", "pass", "youtube",
    charset='utf8',
    use_unicode=True
)
cursor = db.cursor()
sql = "SELECT * FROM VIDEOS"
try:
    cursor.execute(sql)
    for i, row in enumerate(cursor):
        try:
            # encode unicode data to the desired output encoding
            title = row[0].encode(encoding, on_error)
            link = row[1].encode(encoding, on_error)
        except UnicodeEncodeError as e:
            # only happens if on_error='strict'
            print >> sys.stderr, "failed to encode row #%s - %s" % (i, e)
        else:
            print "title=%s\nlink=%s\n\n" % (title, link)
finally:
    cursor.close()
    db.close()

NB: you may also want to read this (especially the comments): http://drj11.wordpress.com/2007/05/14/python-how-is-sysstdoutencoding-chosen/ for more on Python, strings, unicode, encoding, sys.stdout and terminal issues.
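
On Python 3, print() encodes text itself, so the same idea can be expressed by wrapping the output byte stream with an explicit encoding and error handler. The following is a minimal sketch, not a drop-in fix: make_text_writer is a hypothetical helper, and an in-memory io.BytesIO stands in for the terminal so the result can be inspected. On a real terminal one would presumably wrap sys.stdout.buffer instead.

```python
import io


def make_text_writer(raw_stream, encoding="utf-8", errors="replace"):
    """Wrap a binary stream so that print() encodes text with the given settings."""
    return io.TextIOWrapper(raw_stream, encoding=encoding, errors=errors, newline="\n")


raw = io.BytesIO()                      # stands in for the terminal's byte stream
out = make_text_writer(raw, encoding="cp862")
print("title=\u4e2d\u6587", file=out)   # does not raise despite unencodable text
out.flush()
captured = raw.getvalue()
```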

How to correctly insert utf-8 characters into a MySQL table using python

Did you try running the query set names utf8; before inserting?

#!/usr/bin/python
# -*- coding: utf-8 -*-

import MySQLdb

# unicode literal, so that .encode("utf-8") below does not trigger
# an implicit ASCII decode on Python 2
mystring = u"Bientôt l'été"

myinsert = [{"name": mystring.encode("utf-8").strip()[:65535], "id": 1}]

con = MySQLdb.connect('localhost', 'abc', 'def', 'ghi')
cur = con.cursor()

cur.execute("set names utf8;")  # <--- add this line

# identifiers are quoted with backticks in MySQL, not single quotes
sql = "INSERT INTO `MyTable` ( `my_id`, `my_name` ) VALUES ( %(id)s, %(name)s )"
cur.executemany(sql, myinsert)
con.commit()
con.close()
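
One detail in the snippet above: [:65535] slices the encoded bytes, which can cut a multi-byte UTF-8 character in half. Below is a minimal Python 3 sketch of a byte-limited truncation that avoids this. The helper name truncate_utf8 is hypothetical, and the sketch assumes the limit is meant in bytes.

```python
def truncate_utf8(text, max_bytes):
    """Shorten text so that its UTF-8 encoding fits in max_bytes, keeping characters whole."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cut at the byte limit, then drop any incomplete trailing character.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


short = truncate_utf8("\u00e9t\u00e9", 4)  # "été" is 5 bytes in UTF-8
print(short)
```

The function returns text, not bytes, so it would be applied before handing the value to the driver.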

UnicodeEncodeError when trying to insert UTF-8 data into Mysql

The major issue is that you're calling str() on Unicode objects. Depending on many factors, this may result in Python trying to encode the Unicode into ASCII, which is not possible with non-ASCII chars.

You should try to keep Unicode objects as Unicode objects for as long as possible in your code and only convert when it's absolutely necessary. Fortunately, the MySQL driver is Unicode compliant, so you can pass it Unicode strings and it will encode them internally. The only thing you need to do is tell the driver to use UTF-8. Feedparser is also Unicode compliant and automatically decodes the RSS feed to Unicode strings (strings without an encoding).

Some parts of your code would also benefit from Python's built-in features such as for each in something:, str.format(), and triple quotes (""") for long pieces of text.

Pulling this all together looks like:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import MySQLdb
import feedparser

db = MySQLdb.connect(host="127.0.0.1",
                     user="root",
                     passwd="",
                     db="FeedStuff",
                     charset='UTF8')

urllistnzz = ['international', 'wirtschaft', 'sport']
urllistbernerz = ['kultur', 'wissen', 'leben']

cur = db.cursor()

for uri in urllistbernerz:
    urlbernerz = feedparser.parse('http://www.bernerzeitung.ch/{uri}/rss.html'.format(uri=uri))

    for entry in urlbernerz.entries:
        insert_sql = u"""INSERT INTO articles (title, description, date, category,
                      link, source) VALUES ("{e.title}", "{e.description}",
                      "{e.published}", "{e.category}", "{e.link}", "Berner Zeitung")
                      """.format(e=entry)

        cur.execute(insert_sql)

for uri in urllistnzz:
    urlnzz = feedparser.parse('http://www.nzz.ch/{uri}.rss'.format(uri=uri))

    for entry in urlnzz.entries:
        insert_sql = u"""INSERT INTO articles (title, description, date, category,
                      link, source) VALUES ("{e.title}", "{e.description}",
                      "{e.published}", "{e.category}", "{e.link}", "NZZ")
                      """.format(e=entry)

        cur.execute(insert_sql)

db.commit()

cur.close()
db.close()
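
Note that the code above interpolates feed values directly into the SQL text, which breaks as soon as a title contains a double quote. Here is a hedged sketch of the same insert using query parameters with the %s placeholder style that MySQLdb uses. The helper names entry_params and insert_entries are hypothetical, and RecordingCursor and Entry are stand-ins so that the sketch runs without a database or a feed.

```python
INSERT_SQL = (
    "INSERT INTO articles (title, description, date, category, link, source) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def entry_params(entry, source):
    """Collect one feed entry's fields as a parameter tuple (no quoting needed)."""
    return (entry.title, entry.description, entry.published,
            entry.category, entry.link, source)


def insert_entries(cur, entries, source):
    """Send all entries through a DB-API cursor; return the number of rows sent."""
    rows = [entry_params(entry, source) for entry in entries]
    if rows:
        cur.executemany(INSERT_SQL, rows)
    return len(rows)


class RecordingCursor:
    """Stand-in for a MySQLdb cursor so the sketch runs without a database."""

    def __init__(self):
        self.calls = []

    def executemany(self, sql, rows):
        self.calls.append((sql, rows))


class Entry:
    """Stand-in for a feedparser entry (hypothetical sample data)."""
    title = 'Z\u00fcrich "heute"'
    description = "d"
    published = "2014-01-01"
    category = "news"
    link = "http://example.com/a"


cur = RecordingCursor()
sent = insert_entries(cur, [Entry()], "NZZ")
```

With a real connection, the call would look like insert_entries(cur, urlnzz.entries, "NZZ"), followed by db.commit().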

How to avoid b' and UTF-8 literals in MySQL using Python 3

The problem is that you're explicitly encoding your string into UTF-8 bytes, and then turning those bytes into their string representation.

That's what this code means:

str(row[3].encode("utf-8"))

If you don't want to do that, just don't do that:

row[3]

Here's an example that shows what you're doing:

>>> s = 'à'
>>> s
'à'
>>> s.encode('utf-8')
b'\xc3\xa0'
>>> str(s.encode('utf-8'))
"b'\\xc3\\xa0'"

What you want here is the first one.

More generally, calling str on a bytes is almost never useful. If you unavoidably have a bytes and you need a str, you get it by calling the decode method. But in this case, you don't unavoidably have a bytes. (I mean, you could write row[3].encode("utf-8").decode("utf-8"), but that would obviously be pretty silly.)


As a side note—but a very important one—you should not be trying to str.format your values into the SQL string. Just use query parameters. Here's the obligatory xkcd link that explains the security/safety problem, and on top of that, you're making your code much more complicated, and even less efficient.

In other words, instead of doing this:

"VALUES ({:d}, \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\")".format(row[0], urlparse(row[1]).netloc, row[1], row[2].replace("\"", "'"), article_content, datetime.fromtimestamp(row[4]).strftime("%Y-%m-%d"), updated)

… just do this:

"VALUES (%s, %s, %s, %s, %s, %s, %s)"

And then, when you later execute the query, pass the arguments—without all that complicated converting to strings and quoting and replacing embedded quotes, just the values as-is—as the arguments to execute.

db.execute(q_i, (
    row[0], urlparse(row[1]).netloc, row[1], row[2], article_content,
    datetime.fromtimestamp(row[4]).strftime("%Y-%m-%d"), updated))

In fact, if your next to last column is—or could be—a DATETIME column rather than a CHAR/VARCHAR/TEXT/whatever, you don't even need that strftime; just pass the datetime object.

And notice that this means that you don't need to do anything at all to article_content. The quote stuff is neither necessary nor a good idea (unless you have some other, app-specific reason that you need to avoid " characters in articles), and the encoding stuff is not solving any problem, but only causing a new one.
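
To see that parameters really do carry quotes and non-ASCII text through unchanged, here is a small runnable sketch. It uses the standard-library sqlite3 module purely as a stand-in so that no MySQL server is needed. This is an assumption of the example and not part of the original code, and sqlite3 writes placeholders as ? where MySQLdb uses %s. The table and sample value are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE articles (id INTEGER, content TEXT)")

# Contains double quotes, a single quote, non-ASCII text and SQL-looking debris.
article_content = 'He said "Bient\u00f4t l\'\u00e9t\u00e9"; DROP TABLE articles; --'

# The value is passed separately from the SQL text: no quoting, no encoding.
con.execute("INSERT INTO articles (id, content) VALUES (?, ?)", (1, article_content))
con.commit()

stored = con.execute("SELECT content FROM articles WHERE id = ?", (1,)).fetchone()[0]
con.close()
```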

How to encode (utf8mb4) in Python

MySQL's utf8mb4 encoding is just standard UTF-8.

They had to add that name, however, to distinguish it from MySQL's older, broken utf8 character set, which only supports BMP characters.

In other words, from the Python side you should always encode to UTF-8 when talking to MySQL, but take into account that the database may not be able to handle Unicode codepoints beyond U+FFFF, unless you use utf8mb4 on the MySQL side.
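
As a quick illustration of which strings are affected, the following Python 3 sketch checks for code points above U+FFFF, which take four bytes in UTF-8. The helper name needs_utf8mb4 and the sample strings are made up for this example.

```python
def needs_utf8mb4(text):
    """Return True if text contains a code point above U+FFFF (outside the BMP)."""
    return any(ord(ch) > 0xFFFF for ch in text)


bmp_text = "Bient\u00f4t"        # every character is inside the BMP
emoji_text = "ok \U0001F600"     # U+1F600 lies outside the BMP

print(needs_utf8mb4(bmp_text), needs_utf8mb4(emoji_text))
print(len("\U0001F600".encode("utf-8")))  # number of UTF-8 bytes for that character
```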

However, generally speaking, you want to avoid manually encoding and decoding, and instead leave it to MySQLdb to worry about this. You do this by configuring your connection and your collations to handle Unicode text transparently. For MySQLdb, that means setting charset='utf8mb4':

database = MySQLdb.connect(
    host=hostname,
    user=username,
    passwd=password,
    db=databasename,
    charset="utf8mb4"
)

Then use normal Python 3 str strings; leave the use_unicode option set to its default, True*.

Note: this handles SET NAMES (and SET character_set_connection) for you; there is no need to issue those manually.


* Unless you still use Python 2, in which case the default is False. Set it to True and use u'...' unicode strings.

Insert string dumped from a dictionary, which includes utf8 characters, to mysql, in Python

You need to set ensure_ascii=False when calling json.dumps().

For example,

mydict_str = json.dumps(mydict, ensure_ascii=False)
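
A short sketch of the difference, using a made-up dictionary: by default json.dumps() escapes every non-ASCII character as \uXXXX, while ensure_ascii=False keeps the characters as they are. Both strings decode back to the same data.

```python
import json

mydict = {"name": "Bient\u00f4t l'\u00e9t\u00e9"}  # hypothetical sample data

escaped = json.dumps(mydict)                       # non-ASCII written as \uXXXX escapes
readable = json.dumps(mydict, ensure_ascii=False)  # non-ASCII characters kept as-is

print(escaped)
print(readable)
```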

