Urllib2 Read to Unicode

After the operations you performed, you'll see:

>>> req.headers['content-type']
'text/html; charset=windows-1251'

and so:

>>> encoding=req.headers['content-type'].split('charset=')[-1]
>>> ucontent = unicode(content, encoding)

ucontent is now a Unicode string (of 140655 characters) -- so, for example, to display a part of it if your terminal is UTF-8:

>>> print ucontent[76:110].encode('utf-8')
<title>Lenta.ru: Главное: </title>

and you can search, etc, etc.
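
If it helps to see those fragments joined up, here is a rough end-to-end sketch in Python 2 (the Lenta.ru URL is just the example from the question; any page that declares a charset in its Content-Type header works the same way):

import urllib2

req = urllib2.urlopen('http://lenta.ru/')                     # example URL from the question
content = req.read()                                          # raw bytes as served
encoding = req.headers['content-type'].split('charset=')[-1]  # e.g. 'windows-1251'
ucontent = unicode(content, encoding)                         # decoded Unicode string
print ucontent[76:110].encode('utf-8')                        # assuming a UTF-8 terminal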

Edit: Unicode I/O is usually tricky (and may be what's holding up the original asker), but I'm going to bypass the difficult problem of entering Unicode strings at an interactive Python interpreter (it's completely unrelated to the original question) and show that, once a Unicode string IS correctly input (here I build it by codepoints -- goofy but not tricky ;-), searching it is an absolute no-brainer (and thus hopefully the original question has been thoroughly answered). Again assuming a UTF-8 terminal:

>>> x=u'\u0413\u043b\u0430\u0432\u043d\u043e\u0435'
>>> print x.encode('utf-8')
Главное
>>> x in ucontent
True
>>> ucontent.find(x)
93

Note: Keep in mind that this method may not work for all sites, since some sites only specify character encoding inside the served documents (using http-equiv meta tags, for example).
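
For those sites, a rough fallback (just a sketch, not bulletproof) is to look for the charset inside the served HTML itself, for example with a regular expression, before decoding:

import re
import urllib2

response = urllib2.urlopen('http://example.com/')       # placeholder URL
content = response.read()
encoding = response.headers.getparam('charset')          # None if the header carries no charset
if not encoding:
    # crude fallback: look for <meta charset="..."> or charset=... in http-equiv tags
    match = re.search(r'charset=["\']?([\w-]+)', content)
    encoding = match.group(1) if match else 'utf-8'
ucontent = unicode(content, encoding, 'replace')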

Python urllib2 and urlopen with utf-8 signs

Try something along these lines: in b you will then find a UTF-8 byte string suitable for urllib2 (you still have to complete the URL with a meaningful location, though...). By the way, printing the decoded b will show you the § character:

import urllib
import urllib2

a = 'investments-%C2%A7-73g-legal.html'
b = urllib.unquote(a)            # percent-decoded: a UTF-8 byte string

print(b.decode('utf8'))          # decoding it shows the § character

urllib2.urlopen('http://localhost/' + b)
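
In Python 3 the same idea looks roughly like this (urllib.parse.unquote already returns a str, so there is no separate decode step; the localhost URL is only a placeholder as above):

from urllib.parse import unquote, quote
from urllib.request import urlopen

a = 'investments-%C2%A7-73g-legal.html'
b = unquote(a)          # a str containing the § character
print(b)

# Python 3's urlopen only accepts ASCII request targets, so percent-encode again:
# urlopen('http://localhost/' + quote(b))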

Unicode String in urllib.request

Use urllib.parse.quote:

>>> urllib.parse.quote('bär')
'b%C3%A4r'

>>> urllib.parse.urljoin('https://d7mj4aqfscim2.cloudfront.net/tts/de/token/',
... urllib.parse.quote('bär'))
'https://d7mj4aqfscim2.cloudfront.net/tts/de/token/b%C3%A4r'
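
The joined URL can then be fetched as usual; a minimal sketch using the same example endpoint (which may of course no longer serve anything):

from urllib.parse import quote, urljoin
from urllib.request import urlopen

# quote() makes the path segment pure ASCII before it reaches the request.
url = urljoin('https://d7mj4aqfscim2.cloudfront.net/tts/de/token/', quote('bär'))
response = urlopen(url)
data = response.read()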

In Python how to encode/decode unicode characters such as ö

response.read() returns a bytestring. Python shouldn't die while printing a bytestring, because no character conversion occurs: the bytes are printed as-is.

You could try to print Unicode instead:

page = response.read()    # bytes
text = page.decode(response.info().getparam('charset') or 'utf-8')
print text
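
In Python 3 the same idea can lean on get_content_charset(); a minimal sketch, with example.com standing in for whatever URL you are fetching:

from urllib.request import urlopen

response = urlopen('http://example.com/')                # placeholder URL
page = response.read()                                   # bytes
charset = response.headers.get_content_charset() or 'utf-8'
print(page.decode(charset))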

Encoding error when reading url with urllib

Apparently, urllib can only handle ASCII requests, and converting your URL to ASCII raises an error on your special character.
Replacing ø with %C3%B8, the proper way to percent-encode this character in an HTTP URL, seems to do the trick. However, I can't find a method to do this automatically the way your browser does.

example:

>>> f="https://no.wikipedia.org/wiki/Jonas_Gahr_St%C3%B8re"
>>> import urllib.request
>>> g=urllib.request.urlopen(f)
>>> text=g.read()
>>> text[:100]
b'<!DOCTYPE html>\n<html class="client-nojs" lang="nb" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title'

The answer above doesn't work here, because it encodes the text after the request has been processed, whereas the error occurs while the request itself is being processed.
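
One way to automate that step (a sketch, assuming only the path needs escaping and not the query string; quote_path is just a made-up helper name) is to split the URL and percent-encode its path before opening it:

from urllib.parse import urlsplit, urlunsplit, quote
from urllib.request import urlopen

def quote_path(url):
    # Percent-encode non-ASCII characters in the path, leaving scheme/host alone.
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=quote(parts.path)))

f = quote_path('https://no.wikipedia.org/wiki/Jonas_Gahr_Støre')
text = urlopen(f).read()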

Reading JSON from urllib2.urlopen and UTF-8 characters

>>> import json
>>> d = {u'readme': u'Caf\xe9'}
>>> json.dumps(d)
'{"readme": "Caf\\u00e9"}'
>>> json.dumps(d, ensure_ascii=False)
'{"readme": "Café"}'

How to deal with unicode string in URL in python3?

You could use urllib.parse.quote() to encode the path section of the URL.

#!/usr/bin/env python3
from urllib.parse import quote
from urllib.request import urlopen

url = 'http://zh.wikipedia.org/wiki/' + quote("毛泽东")
content = urlopen(url).read()
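
Note that content is still bytes at that point; if you want a str, keep the response object around and decode it with the charset the server reports, continuing from the snippet above:

response = urlopen(url)
charset = response.headers.get_content_charset() or 'utf-8'
text = response.read().decode(charset)     # str, safe to print or search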

