urllib2 read to Unicode
After the operations you performed, you'll see:
>>> req.headers['content-type']
'text/html; charset=windows-1251'
and so:
>>> encoding=req.headers['content-type'].split('charset=')[-1]
>>> ucontent = unicode(content, encoding)
ucontent is now a Unicode string (of 140655 characters), so for example, to display part of it, if your terminal is UTF-8:
>>> print ucontent[76:110].encode('utf-8')
<title>Lenta.ru: Главное: </title>
and you can search, etc, etc.
Edit: Unicode I/O is usually tricky (this may be what's holding up the original asker), but I'm going to bypass the difficult problem of inputting Unicode strings into an interactive Python interpreter, which is unrelated to the original question. Once a Unicode string is correctly input (I'm doing it by codepoints, goofy but not tricky ;-), searching is a no-brainer. Again assuming a UTF-8 terminal:
>>> x=u'\u0413\u043b\u0430\u0432\u043d\u043e\u0435'
>>> print x.encode('utf-8')
Главное
>>> x in ucontent
True
>>> ucontent.find(x)
93
Note: Keep in mind that this method may not work for all sites, since some sites only specify character encoding inside the served documents (using http-equiv meta tags, for example).
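For sites that declare the encoding only inside the document, one possible fallback is to look for a charset declaration in the first few kilobytes of the raw bytes. The following is a minimal Python 3 sketch, not a full HTML parser: the names sniff_meta_charset and pick_encoding, the 4096-byte limit, and the UTF-8 default are all assumptions made for illustration.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET_RE = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_meta_charset(body, limit=4096):
    """Return the charset named in a meta tag near the top of body, or None."""
    match = META_CHARSET_RE.search(body[:limit])
    return match.group(1).decode("ascii").lower() if match else None


def pick_encoding(header_charset, body, default="utf-8"):
    """Prefer the HTTP header, then the meta tag, then an assumed default."""
    return header_charset or sniff_meta_charset(body) or default


# Stand-in for bytes read from a server that sent no charset in its header.
sample = (
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">'
    "<title>Главное</title>"
).encode("cp1251")

encoding = pick_encoding(None, sample)
text = sample.decode(encoding)
```

A regular expression like this can be fooled by unusual markup, so a real HTML parser is the safer choice when accuracy matters.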
Python urllib2 and urlopen with utf-8 signs
Try something along these lines. After unquoting, b holds a UTF-8 byte string suitable for urllib2 (you still have to complete the URL with a meaningful location). Incidentally, printing the decoded b shows the § character:
import urllib
import urllib2
a='investments-%C2%A7-73g-legal.html'
b=urllib.unquote(a)
print (b.decode('utf8'))
urllib2.urlopen('http://localhost/' + b)
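The snippet above is Python 2. For comparison, here is a small sketch of the same unquoting step in Python 3, where the functions live in urllib.parse and unquote returns a str directly (decoding the percent-escapes as UTF-8 by default). The variable names are illustrative only.

```python
from urllib.parse import quote, unquote

encoded = 'investments-%C2%A7-73g-legal.html'

# unquote() decodes the percent-escapes as UTF-8 and returns text, not bytes.
decoded = unquote(encoded)
print(decoded)

# quote() goes the other way and produces an ASCII-only path segment again.
reencoded = quote(decoded)
print(reencoded)
```

In Python 3 it is the percent-encoded form, not the decoded one, that should be passed to urllib.request.urlopen.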
Unicode String in urllib.request
Use urllib.parse.quote:
>>> urllib.parse.quote('bär')
'b%C3%A4r'
>>> urllib.parse.urljoin('https://d7mj4aqfscim2.cloudfront.net/tts/de/token/',
... urllib.parse.quote('bär'))
'https://d7mj4aqfscim2.cloudfront.net/tts/de/token/b%C3%A4r'
In Python how to encode/decode unicode characters such as ö
response.read() returns a bytestring. Python shouldn't die while printing a bytestring, because no character conversion occurs; the bytes are printed as is.
You could try to print Unicode instead:
page = response.read()
text = page.decode(response.info().getparam('charset') or 'utf-8')
print text
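The lines above use the Python 2 API. In Python 3 the response headers are an email.message.Message subclass, so the charset can be read with get_content_charset(). The sketch below assumes that setup; decode_body is a hypothetical helper, and a plain Message object stands in for the headers of a real response so that the example runs without a network connection.

```python
from email.message import Message


def decode_body(body, headers, default="utf-8"):
    """Decode body using the charset from the Content-Type header, if any."""
    charset = headers.get_content_charset() or default
    # errors="replace" is an assumption: it avoids an exception on stray bytes.
    return body.decode(charset, errors="replace")


# Stand-ins for response.headers and response.read() from urllib.request.urlopen().
headers = Message()
headers["Content-Type"] = "text/html; charset=windows-1251"
body = "Главное".encode("cp1251")

text = decode_body(body, headers)
print(text)
```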
Encoding error when reading url with urllib
Apparently, urllib can only handle ASCII requests, and converting your URL to ASCII raises an error on the special character.
Replacing ø with %C3%B8, the proper way to encode this character in a URL, seems to do the trick. Your browser does this automatically; in Python, urllib.parse.quote() can apply the same encoding to the path.
Example:
>>> f="https://no.wikipedia.org/wiki/Jonas_Gahr_St%C3%B8re"
>>> import urllib.request
>>> g=urllib.request.urlopen(f)
>>> text=g.read()
>>> text[:100]
b'<!DOCTYPE html>\n<html class="client-nojs" lang="nb" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title'
The answer above doesn't work because it encodes after the request has been processed, whereas the error occurs while the request is being made.
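One way to apply the percent-encoding before the request is made is to split the URL and quote only the path and query. This is a rough Python 3 sketch: quote_url is a hypothetical helper name, and it assumes the host name is already ASCII (internationalized domain names are not handled).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url(url):
    """Percent-encode non-ASCII characters in the path and query of url."""
    parts = urlsplit(url)
    # '%' is kept safe so that already-encoded URLs are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


fixed = quote_url("https://no.wikipedia.org/wiki/Jonas_Gahr_Støre")
print(fixed)
```

The result can then be passed to urllib.request.urlopen(). Keeping '%' in the safe set is a trade-off: a literal percent sign in the input is left unescaped.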
Reading JSON from urllib2.open and UTF8 characters
>>> import json
>>> d = {u'readme': u'Caf\xe9'}
>>> json.dumps(d)
'{"readme": "Caf\\u00e9"}'
>>> json.dumps(d, ensure_ascii=False)
'{"readme": "Café"}'
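The session above shows only the serializing side. For the reading side, a minimal Python 3 sketch follows, assuming the server sends UTF-8 encoded JSON; the raw bytes here are a stand-in for what urlopen(...).read() would return.

```python
import json

# Stand-in for the bytes returned by urlopen(...).read().
raw = '{"readme": "Café"}'.encode("utf-8")

# Decode the bytes to text first, then parse the JSON.
data = json.loads(raw.decode("utf-8"))
print(data["readme"])

# ensure_ascii=False keeps non-ASCII characters instead of \uXXXX escapes.
escaped = json.dumps(data)
readable = json.dumps(data, ensure_ascii=False)
```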
How to deal with unicode string in URL in python3?
You could use urllib.parse.quote() to encode the path section of the URL.
#!/usr/bin/env python3
from urllib.parse import quote
from urllib.request import urlopen
url = 'http://zh.wikipedia.org/wiki/' + quote("毛泽东")
content = urlopen(url).read()