How to handle response encoding from urllib.request.urlopen(), to avoid TypeError: can't use a string pattern on a bytes-like object
You just need to decode the response body, using the charset from the Content-Type header (typically its last parameter). There is an example of this in the tutorial too.
output = response.read().decode('utf-8')
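As a sketch of that idea, you can read the charset out of the response headers instead of hard-coding utf-8; the helper name and the commented-out URL below are illustrative, not from the original answer:

```python
from urllib.request import urlopen

def read_text(url, fallback='utf-8'):
    """Fetch url and decode the body using the Content-Type charset."""
    with urlopen(url) as response:
        # headers.get_content_charset() parses e.g.
        # "Content-Type: text/html; charset=iso-8859-1" -> 'iso-8859-1'
        charset = response.headers.get_content_charset() or fallback
        return response.read().decode(charset)

# text = read_text('https://example.com/')  # hypothetical usage
```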
TypeError: can't use a string pattern on a bytes-like object in re.findall()
You want to convert html (a bytes-like object) into a string using .decode, e.g. html = response.read().decode('utf-8').
See Convert bytes to a Python String
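A minimal, self-contained sketch of the fix (the response body here is a canned stand-in for what urlopen(...).read() would return):

```python
import re

# Simulated response body: urlopen(url).read() returns bytes like this.
response_body = b'<html><title>Example</title></html>'

# re.findall with a str pattern on these bytes would raise:
# TypeError: can't use a string pattern on a bytes-like object
html = response_body.decode('utf-8')  # bytes -> str
print(re.findall(r'<title>(.*?)</title>', html))  # ['Example']
```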
urllib.request.urlopen returns bytes, but I cannot decode it
The content is compressed with gzip. You need to decompress it:
import gzip
from urllib.request import Request, urlopen
req = Request(url)
html = gzip.decompress(urlopen(req).read()).decode('utf-8')
If you use requests, it will decompress automatically for you:
import requests
html = requests.get(url).text  # => str, not bytes
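A defensive variant of the urllib approach (a sketch; the helper name and commented-out URL are illustrative) checks the Content-Encoding header before decompressing, since not every server gzips the body:

```python
import gzip
from urllib.request import Request, urlopen

def fetch_html(url, encoding='utf-8'):
    """Fetch url, gunzipping only if the server says the body is gzipped."""
    req = Request(url, headers={'Accept-Encoding': 'gzip'})
    with urlopen(req) as resp:
        raw = resp.read()
        # Only decompress when the server actually compressed the body.
        if resp.headers.get('Content-Encoding') == 'gzip':
            raw = gzip.decompress(raw)
    return raw.decode(encoding)

# html = fetch_html('https://example.com/')  # hypothetical usage
```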
TypeError: cannot use a string pattern on a bytes-like object (python3.6)
pexpect can give you Unicode strings, unless you ask it to give you bytes.
If you want it to give you bytes (e.g., because you don't know the encoding the telnet server is expecting you to use), that's fine, but then you have to deal with them as bytes. That means using bytes patterns, not string patterns, in re:
for eol in [b'\r\n', b'\r', b'\n']:
    content = re.sub(b'%s$' % eol, b'', content)
But if you didn't want bytes, it's better to get everything decoded to str, and then your existing code would just work:
content = pexpect.run('ls -l', encoding='utf-8')
for eol in ['\r\n', '\r', '\n']:
    content = re.sub('%s$' % eol, '', content)
As a side note, if you're just trying to remove a final newline on the last line, it's a lot easier to do that without a regex:
content = content.rstrip('\r\n')
Or, if you're trying to do something different, like remove blank lines, even that might be better written explicitly:
content = '\n'.join(line for line in content.splitlines() if line)
… but that still leaves you with the same problem of needing to use b'\n' or '\n' appropriately, of course.
re.search() TypeError: cannot use a string pattern on a bytes-like object
re needs bytes patterns (not string patterns) to search bytes-like objects. Prefix your search pattern with b, like so: b'<title>(.*?)</title>'
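A short, self-contained illustration of the bytes-pattern fix (the HTML here is a canned stand-in for a downloaded page):

```python
import re

# bytes, e.g. what urlopen(url).read() returns
data = b'<html><title>Hello</title></html>'

# A bytes pattern (note the b prefix) works on bytes; groups are bytes too.
match = re.search(b'<title>(.*?)</title>', data)
print(match.group(1))           # b'Hello'
print(match.group(1).decode())  # 'Hello'
```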
urllib request.urlopen(url).read() with SQLAlchemy is storing a hex string instead of HTML
Ok, it looks like request.urlopen(url).read() is returning a bytes object (see Methods of File Objects). This needs to be converted to a string with .decode('utf-8'):
html = request.urlopen(url).read()
html_string = html.decode('utf-8')
Also see Convert bytes to a string?
Python JSON decoding error TypeError: can't use a string pattern on a bytes-like object
In Python 3, you need to decode the bytes
return value from urllib.request.urlopen()
to a unicode string:
decoded = json.loads(json_input.decode('utf8'))
This makes the assumption that the web service you are using is using the default JSON encoding of UTF-8. You could check the response for a character set if you don't want to assume:
f = urllib.request.urlopen(url)
charset = f.info().get_param('charset', 'utf8')
data = f.read()
decoded = json.loads(data.decode(charset))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 251: invalid start byte
The header implies content.decode('gb2312', errors='ignore') should work.
>>> content.find(b'charset')
226
>>> content[226:226 + 20]
b'charset=gb2312">\r\n<t'
However, your regex certainly will NOT work. You have front instead of font. Perhaps you wanted the following:
>>> pat = re.compile(r'<div class="title_all"><h1><font color=#008800>.*?</a>></font></h1></div>'
...                  + r'(.*?)<td height="25" align="center" bgcolor="#F4FAE2"> ', re.S)
This catches the table stuff between the two pieces, as far as I can tell.
>>> txt = ''.join(pat.findall(content.decode('gb2312', errors='ignore')))
>>> print(txt[:500])
<div class="co_content8">
<ul>
<td height="220" valign="top"> <table width="100%" border="0" cellspacing="0" cellpadding="0" class="tbspan" style="margin-top:6px">
<tr>
<td height="1" colspan="2" background="/templets/img/dot_hor.gif"></td>
</tr>
<tr>
<td width="5%" height="26" align="center"><img src="/templets/img/item.gif" width="18" height="17"></td>
<td height="26">
<b>
<a href="/html/gndy/dyzz/20160920/52002.html" class="ulink">2016年井柏然杨颖《微微一笑很倾城》HD国语中字</a>
</b>
<
>>> pat.pattern
'<div class="title_all"><h1><font color=#008800>.*?</a>></font></h1></div>(.*?)<td height="25" align="center" bgcolor="#F4FAE2"> '
>>>
urlretrieve returning typeerror
It appears that urlretrieve doesn't allow the sending of headers, and the error you're getting is because urlretrieve is expecting a URL there, not a Request object.
Since it's your own webserver you're sending the request to, maybe you can modify its configuration to accept those urlretrieve requests without headers.
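Alternatively, if you do need the headers, one workaround is to skip urlretrieve and write the body to disk yourself via urlopen, which does accept a Request object. A sketch (the helper name, URL, and header values below are hypothetical):

```python
import shutil
from urllib.request import Request, urlopen

def retrieve_with_headers(url, filename, headers):
    """Download url to filename, sending custom headers
    (which urlretrieve cannot do)."""
    req = Request(url, headers=headers)
    with urlopen(req) as response, open(filename, 'wb') as out:
        shutil.copyfileobj(response, out)  # stream the body to disk

# retrieve_with_headers('https://example.com/f.bin', 'f.bin',
#                       {'User-Agent': 'my-script/1.0'})  # hypothetical usage
```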
Best of luck.