How to handle response encoding from urllib.request.urlopen() to avoid TypeError: can't use a string pattern on a bytes-like object

How to handle response encoding from urllib.request.urlopen() , to avoid TypeError: can't use a string pattern on a bytes-like object

You just need to decode the response using the charset named in the Content-Type header, which is typically the last parameter of that header's value. There is an example given in the tutorial too.

output = response.decode('utf-8')  # response must already be bytes, e.g. from urlopen(...).read()
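As a rough sketch of picking the charset from the Content-Type value instead of hard-coding it: the helper name decode_body and the sample header below are made up for illustration, and the fallback to UTF-8 is an assumption rather than something the server guarantees.

```python
from email.message import Message


def decode_body(body: bytes, content_type: str, default: str = "utf-8") -> str:
    """Decode body using the charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or `default`
    # when the header carries no charset parameter.
    charset = msg.get_content_charset(default)
    # errors="replace" keeps going on stray bytes instead of raising.
    return body.decode(charset, errors="replace")


# Stand-in for response.read() and response.headers["Content-Type"].
raw = "café".encode("iso-8859-1")
text = decode_body(raw, "text/html; charset=ISO-8859-1")
print(text)
```

With a live response the same lookup is available directly as response.headers.get_content_charset(), since the headers object is an email-style message. An unknown charset name would still raise LookupError in decode().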

TypeError: can't use a string pattern on a bytes-like object in re.findall()

You want to convert html (a bytes-like object) into a string using .decode(), e.g. html = response.read().decode('utf-8').
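A small sketch that reproduces the error and the fix without touching the network. The html_bytes literal is only a stand-in for what response.read() would return, and UTF-8 is assumed as the encoding.

```python
import re

# Stand-in for response.read(); a real page would come from urlopen().
html_bytes = b"<a href='/one'>1</a> <a href='/two'>2</a>"

try:
    # A str pattern applied to bytes raises TypeError.
    re.findall(r"href='(.*?)'", html_bytes)
except TypeError as exc:
    error = str(exc)

# Decode first, then the str pattern works as expected.
html = html_bytes.decode("utf-8")
links = re.findall(r"href='(.*?)'", html)
print(error)
print(links)
```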

See Convert bytes to a Python String

urllib.request.urlopen returns bytes, but I cannot decode it

The content is compressed with gzip. You need to decompress it:

import gzip
from urllib.request import Request, urlopen

req = Request(url)
html = gzip.decompress(urlopen(req).read()).decode('utf-8')

If you use requests, it will decompress the response automatically for you:

import requests
html = requests.get(url).text  # => str, not bytes
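If you stay with urllib and are not sure whether a given response is compressed, one option is to check for the gzip magic bytes before decompressing. This is only a sketch: read_text is a made-up helper name, and checking the Content-Encoding response header would be an equally reasonable test.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decompress raw if it looks like gzip, then decode it to str."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Stand-ins for urlopen(req).read() with and without compression.
plain = "<html>hello</html>".encode("utf-8")
compressed = gzip.compress(plain)

print(read_text(compressed))
print(read_text(plain))
```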

TypeError: cannot use a string pattern on a bytes-like object `python3.6`

pexpect gives you bytes by default; it can give you Unicode strings instead if you ask for them by passing an encoding.

If you want it to give you bytes—e.g., because you don't know the encoding the telnet server is expecting you to use—that's fine, but then you have to deal with it as bytes. That means using bytes patterns, not string patterns, in re:

for eol in [b'\r\n', b'\r', b'\n']:
    content = re.sub(b'%s$' % eol, b'', content)

But if you didn't want bytes, it's better to get everything decoded to str, and then your existing code would just work:

content = pexpect.run('ls -l', encoding='utf-8')
for eol in ['\r\n', '\r', '\n']:
    content = re.sub('%s$' % eol, '', content)

As a side note, if you're just trying to remove a final newline on the last line, it's a lot easier to do that without a regex:

content = content.rstrip('\r\n')

Or, if you're trying to do something different, like remove blank lines, even that might be better written explicitly:

content = '\n'.join(line for line in content.splitlines() if line)

… but that still leaves you with the same problem of needing to use b'\n' or '\n' appropriately, of course.
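To illustrate that last point, here is a sketch of a helper that picks the matching newline characters for whichever type it receives. The name strip_trailing_newlines is invented for this example, and the sample strings stand in for pexpect output.

```python
def strip_trailing_newlines(content):
    """Strip trailing CR/LF characters from either bytes or str."""
    # The argument to rstrip() must have the same type as the content.
    chars = b"\r\n" if isinstance(content, (bytes, bytearray)) else "\r\n"
    # Note: rstrip() removes every trailing \r and \n, not just one.
    return content.rstrip(chars)


print(strip_trailing_newlines(b"total 0\r\n"))
print(strip_trailing_newlines("total 0\r\n"))
```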

re.search().TypeError: cannot use a string pattern on a bytes-like object

re needs bytes patterns (not string patterns) to search bytes-like objects. Prefix your search pattern with b, like so: b'<title>(.*?)</title>'
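A short sketch of the bytes-pattern route. Keep in mind that the captured groups are bytes too, so they still need decoding before you treat them as text. The html_bytes value is a stand-in for a downloaded page, and UTF-8 is assumed.

```python
import re

# Stand-in for the bytes returned by urlopen(url).read().
html_bytes = b"<html><head><title>Example page</title></head></html>"

# rb"..." is a raw bytes pattern, so it can search bytes directly.
match = re.search(rb"<title>(.*?)</title>", html_bytes)
title_bytes = match.group(1) if match else b""

# The match is bytes; decode it once you need a str.
title = title_bytes.decode("utf-8")
print(title_bytes, title)
```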

urllib request.urlopen(url).read() with SQLAlchemy is storing a hex string instead of HTML

Ok, it looks like request.urlopen(url).read() is returning a bytes object (see Methods of File Objects). This needs to be converted to a string with .decode('utf-8'):

html = request.urlopen(url).read()
html_string = html.decode('utf-8')

Also see: Convert bytes to a string?

Python JSON decoding error TypeError: can't use a string pattern on a bytes-like object

In Python 3, you need to decode the bytes returned by urllib.request.urlopen(...).read() to a Unicode string (str):

decoded = json.loads(json_input.decode('utf8'))

This assumes that the web service you are using sends UTF-8, the default JSON encoding.

You could check the response for a character set if you don't want to assume:

f = urllib.request.urlopen(url)
charset = f.info().get_param('charset', 'utf8')
data = f.read()
decoded = json.loads(data.decode(charset))
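For what it's worth, since Python 3.6 json.loads() also accepts bytes directly, provided the payload is UTF-8, UTF-16 or UTF-32. A minimal sketch, with a made-up payload standing in for f.read():

```python
import json

# Stand-in for the bytes returned by f.read().
payload = '{"name": "Zoë", "ok": true}'.encode("utf-8")

# Explicit decode, as in the answer above.
decoded = json.loads(payload.decode("utf-8"))

# Python 3.6+: bytes are accepted as long as they are UTF-8/16/32.
direct = json.loads(payload)

print(decoded == direct)
```

For any other charset you still need the explicit decode step.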

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 251: invalid start byte

The charset declared in the page's HTML header implies that content.decode('gb2312', errors='ignore') should work.

>>> content.find(b'charset')
226
>>> content[226:226 + 20]
b'charset=gb2312">\r\n<t'

However, your regex certainly will NOT work. You have front instead of font. Perhaps you wanted the following:

>>> pat = re.compile(r'<div class="title_all"><h1><font color=#008800>.*?</a>></font></h1></div>'+ r'(.*?)<td height="25" align="center" bgcolor="#F4FAE2"> ',re.S)

This catches the table stuff between the two pieces, as far as I can tell.

>>> txt = ''.join(pat.findall(content.decode('gb2312',errors='ignore')))
>>> print(txt[:500])

<div class="co_content8">
<ul>

<td height="220" valign="top"> <table width="100%" border="0" cellspacing="0" cellpadding="0" class="tbspan" style="margin-top:6px">
<tr> 
<td height="1" colspan="2" background="/templets/img/dot_hor.gif"></td>
</tr>
<tr> 
<td width="5%" height="26" align="center"><img src="/templets/img/item.gif" width="18" height="17"></td>
<td height="26">
    <b>

        <a href="/html/gndy/dyzz/20160920/52002.html" class="ulink">2016年井柏然杨颖《微微一笑很倾城》HD国语中字</a>
    </b>
<
>>> pat.pattern
'<div class="title_all"><h1><font color=#008800>.*?</a>></font></h1></div>(.*?)<td height="25" align="center" bgcolor="#F4FAE2"> '
>>>
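The manual content.find(b'charset') lookup at the start of this answer could be wrapped up roughly like this. It is only a sketch: sniff_charset and the sample page are invented for illustration, the regex is a loose guess at how the charset is declared, and a real HTML parser would be more robust.

```python
import re

# Loose bytes pattern for charset=... inside a <meta> tag (an assumption
# about the markup, not a full HTML parse).
META_CHARSET = re.compile(rb"charset=[\"']?([A-Za-z0-9_.:-]+)", re.I)


def sniff_charset(content: bytes, default: str = "utf-8") -> str:
    """Return the charset declared near the top of the page, if any."""
    match = META_CHARSET.search(content[:2048])
    return match.group(1).decode("ascii") if match else default


# Stand-in for the downloaded page bytes.
page = (
    '<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
    "\r\n<title>电影</title>"
).encode("gb2312")

charset = sniff_charset(page)
text = page.decode(charset, errors="ignore")
print(charset)
```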

urlretrieve returning typeerror

It appears that urlretrieve doesn't allow the sending of headers.

And the error you're getting is because urlretrieve is expecting a URL there and not a Request object.

Since it's your own webserver you're sending the request to, maybe you can modify its configuration to accept those urlretrieve requests without headers.
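If changing the server is not an option, another route is to skip urlretrieve and copy the response to a file yourself, since urlopen does accept a Request object. A sketch, assuming a hypothetical download helper and a made-up User-Agent value:

```python
import shutil
from urllib.request import Request, urlopen


def download(url, dest, headers=None):
    """Fetch url with optional request headers and save the body to dest."""
    req = Request(url, headers=headers or {})
    # Stream the response body straight into the file in binary mode.
    with urlopen(req) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest
```

You would call it as download(url, 'page.html', headers={'User-Agent': 'example-agent/0.1'}), where both the file name and the header value are placeholders.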

Best of luck.
