You want to convert html (a bytes-like object) into a string using .decode, e.g. html = response.read().decode('utf-8').

Python 3 distinguishes "bytes" and "string" types; this is especially important for Unicode strings, where each character may be more than one byte, depending on the character and the encoding.
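To make the byte/character difference concrete, here is a small sketch. The sample character and the two encodings are just illustrative choices, not something from the question:

```python
# "\u00e9" is the single character e-acute; written as an escape so the
# example does not depend on the source file's encoding.
text = "\u00e9"

utf8_bytes = text.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = text.encode("latin-1")  # one byte in Latin-1

# One character, but a different number of bytes per encoding.
print(len(text), len(utf8_bytes), len(latin1_bytes))

# Decoding with the matching encoding gives back the original string.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == text)
```

Decoding with the wrong encoding would either raise an error or produce different characters, which is why you need to know what encoding your data is in.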

Regular expressions can work on either, but the pattern and the data must be of the same type: searching for bytes within bytes, or strings within strings.
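A minimal way to reproduce the mismatch, assuming a made-up bytes value in place of your real data:

```python
import re

# Hypothetical sample; in the question this would be the bytes you read.
data = b"Profile : default"

error_message = None
try:
    # str pattern against bytes data: the two types do not match.
    re.search(r"Profile", data)
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The TypeError is raised by re itself as soon as it sees that the pattern and the data are of different types.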

Depending on what you need, there are two solutions:

  • Decode the output variable before searching in it; for instance, with: output_text = output.decode('utf-8')

    This depends on the encoding that you are using; UTF-8 is the most common these days.

    The matched group will be a string.

  • Search with bytes by adding a b prefix to the regular expression. A regular expression should also use the r prefix, so it becomes: re.search(br"(Profile\s*:\s)(.*)", output)

    The matched group will be a bytes object.
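Both options side by side, as a sketch. The contents of output are invented here and only stand in for whatever your command actually returned:

```python
import re

# Hypothetical sample standing in for the real output bytes.
output = b"Profile : work\nOther : value\n"

# Option 1: decode first, then search with a str pattern.
output_text = output.decode("utf-8")
text_match = re.search(r"(Profile\s*:\s)(.*)", output_text)

# Option 2: keep the bytes and search with a bytes pattern.
bytes_match = re.search(br"(Profile\s*:\s)(.*)", output)

print(type(text_match.group(2)), text_match.group(2))
print(type(bytes_match.group(2)), bytes_match.group(2))
```

Which one to pick mostly depends on what you do with the result afterwards: if it ends up printed or compared against normal strings, decoding once up front is usually less hassle.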

The problem is that you're mixing bytes and text strings. You should either decode your data into a text string (unicode), e.g. data.decode('utf-8'), or use a bytes object for the pattern, e.g. re.findall(b"[A-Za-z]", data) (note the leading b before the string literal).

re needs bytes patterns (not str) to search bytes-like objects. Prefix your search pattern with a b, like so: b'<title>(.*?)</title>'

On Python 3, sys.argv is a list of str. However, Agent.request accepts a value of type bytes as its 2nd argument. Since sys.argv[1] is a value of type str, something goes wrong somewhere in the implementation and you get this obscure exception.

If you encode sys.argv[1] to bytes (e.g. sys.argv[1].encode("ascii")) and pass the result to agent.request, then you'll get past this error.
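A rough sketch of the conversion step on its own, without Twisted so it runs anywhere. The helper name to_bytes_url and the example URL are made up for illustration:

```python
def to_bytes_url(arg: str) -> bytes:
    """Encode a command-line argument (str) to bytes.

    Hypothetical helper. "ascii" raises UnicodeEncodeError if the
    argument contains non-ASCII characters.
    """
    return arg.encode("ascii")


# Stand-in for sys.argv[1]; a real script would pass sys.argv[1] here.
url_bytes = to_bytes_url("http://example.com/")
print(type(url_bytes), url_bytes)

# The bytes value is what you would then hand to agent.request as the
# second argument (not executed here, since it needs Twisted installed).
```

If your URLs can contain non-ASCII characters, you would have to decide how to handle those before encoding; plain "ascii" will just refuse them.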

response.text will give you a str, not bytes, whereas response.content will give you bytes.

Choose the type you want to use and use it consistently.

re will handle bytes if the regular expression is bytes as well.

return re.findall('(?:href=")(.*?)"', response.content)

response.content in this case is of type bytes, while the pattern is a str. So either use response.text, which gives you plain text that you can process as you planned, or, if you want to continue down the binary road, search with a bytes pattern instead.
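Here is a sketch of the two roads for the href pattern. The content value is an invented bytes payload standing in for response.content, and decoding it as UTF-8 is an assumption (response.text picks the encoding for you):

```python
import re

# Hypothetical bytes payload standing in for response.content.
content = b'<a href="/home">Home</a> <a href="/about">About</a>'

# Text road: decode, then use a str pattern (roughly what you get
# when working with response.text instead).
links_text = re.findall(r'(?:href=")(.*?)"', content.decode("utf-8"))

# Binary road: keep the bytes and give re a bytes pattern.
links_bytes = re.findall(rb'(?:href=")(.*?)"', content)

print(links_text)
print(links_bytes)
```

The pattern itself is unchanged between the two; only the type of the pattern literal and of the data differs, and the results come back as str or bytes accordingly.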


