URL Decode UTF-8 in Python

URL decode UTF-8 in Python

The data is UTF-8 encoded bytes escaped with URL quoting, so decode it with urllib.parse.unquote(), which transparently converts the percent-encoded data to UTF-8 bytes and then to text:

from urllib.parse import unquote

url = unquote(url)

Demo:

>>> from urllib.parse import unquote
>>> url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
>>> unquote(url)
'example.com?title=правовая+защита'
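
The + in the output above is left alone because unquote() only handles %xx escapes. If the query string came from form-style encoding, where + stands for a space (an assumption about how this URL was produced), urllib.parse.unquote_plus() is probably the better fit. A minimal sketch reusing the URL from the demo:

```python
from urllib.parse import unquote, unquote_plus

url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'

kept_plus = unquote(url)      # '+' stays a literal plus sign
as_space = unquote_plus(url)  # '+' is treated as an encoded space

print(kept_plus)  # example.com?title=правовая+защита
print(as_space)   # example.com?title=правовая защита
```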

The Python 2 equivalent is urllib.unquote(), but this returns a bytestring, so you'd have to decode manually:

from urllib import unquote

url = unquote(url).decode('utf8')

Decoding UTF-8 to URL with Python

After some tests, I can confirm that the server accepts the URL in different formats:

  • raw utf8 encoded URL:

    url_output = url_input.encode('utf8')
  • %encoded latin1 URL

    url_output = urllib.quote_plus(url_input.encode('latin1'), '/:')
  • %encoded utf8 URL

    url_output = urllib.quote_plus(url_input.encode('utf8'), '/:')

As raw latin1 is not accepted and leads to an incorrect URL error, and as passing non-ASCII characters in a URL may not be safe, my advice is to use the third option. It gives:

    print url_output

https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-%C2%A3250pw-all-bills-included-/1174092955
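
The snippets above use the Python 2 urllib.quote_plus(). A rough Python 3 equivalent is sketched below with urllib.parse.quote_plus(), which accepts text directly and encodes it as UTF-8 by default. The url_input value is an assumption, reconstructed from the output shown above.

```python
from urllib.parse import quote_plus

# Assumed input: the listing URL with a raw pound sign in it.
url_input = ('https://www.gumtree.com//p/uk-holiday-rentals/'
             '1bedroon-flat-£250pw-all-bills-included-/1174092955')

# Keep '/' and ':' unescaped so the URL structure survives.
url_utf8 = quote_plus(url_input, safe='/:')                       # third option
url_latin1 = quote_plus(url_input, safe='/:', encoding='latin1')  # second option

print(url_utf8)    # ...1bedroon-flat-%C2%A3250pw-...
print(url_latin1)  # ...1bedroon-flat-%A3250pw-...
```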

Decoding UTF-8 URL in Python

Update: If the output file is a YAML document, you can ignore the \u0163 in it. Unicode escapes are valid in YAML documents.

#!/usr/bin/env python3
import json

# json produces a subset of yaml
print(json.dumps('pe toţi mai')) # -> "pe to\u0163i mai"
print(json.dumps('pe toţi mai', ensure_ascii=False)) # -> "pe toţi mai"

Note: no \u in the last case. Both lines represent the same Python string.

yaml.dump() has a similar option: allow_unicode. Set it to True to avoid Unicode escapes.


The url is correct. You don't need to do anything with it:

#!/usr/bin/env python3
from urllib.parse import unquote

url = "pe%20to%C5%A3i%20mai"
text = unquote(url)

with open('some_file', 'w', encoding='utf-8') as file:
    def p(line):
        print(line, file=file) # write line to file

    p(text) # -> pe toţi mai
    p(repr(text)) # -> 'pe toţi mai'
    p(ascii(text)) # -> 'pe to\u0163i mai'

    p("pe to\u0163i mai") # -> pe toţi mai
    p(r"pe to\u0163i mai") # -> pe to\u0163i mai
    #NOTE: r'' prefix

The \u0163 sequence might be introduced by a character encoding error handler:

with open('some_other_file', 'wb') as file: # write bytes
    file.write(text.encode('ascii', 'backslashreplace')) # -> pe to\u0163i mai

Or:

with open('another', 'w', encoding='ascii', errors='backslashreplace') as file:
    file.write(text) # -> pe to\u0163i mai

More examples:

# introduce some more \u escapes
b = r"pe to\u0163i mai ţţţ".encode('ascii', 'backslashreplace') # bytes
print(b.decode('ascii')) # -> pe to\u0163i mai \u0163\u0163\u0163
# remove unicode escapes
print(b.decode('unicode-escape')) # -> pe toţi mai ţţţ

Decode escaped characters in URL

Using the urllib package (import urllib):

Python 2.7

From the official documentation:

urllib.unquote(string)

Replace %xx escapes by their single-character equivalent.

Example: unquote('/%7Econnolly/') yields '/~connolly/'.

Python 3

From the official documentation:

urllib.parse.unquote(string, encoding='utf-8', errors='replace')

[…]

Example: unquote('/El%20Ni%C3%B1o/') yields '/El Niño/'.
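
The encoding and errors parameters in the Python 3 signature matter when the escaped bytes are not UTF-8. A small sketch, using a Latin-1 variant of the documentation example (the %F1 form is an illustration, not taken from the docs):

```python
from urllib.parse import unquote

utf8_text = unquote('/El%20Ni%C3%B1o/')                     # default encoding='utf-8'
latin1_text = unquote('/El%20Ni%F1o/', encoding='latin-1')  # same name, Latin-1 bytes
mismatch = unquote('/El%20Ni%F1o/')                         # Latin-1 bytes read as UTF-8

print(utf8_text)        # /El Niño/
print(latin1_text)      # /El Niño/
print(ascii(mismatch))  # '/El Ni\ufffdo/'
```

With the default errors='replace', the byte that is not valid UTF-8 becomes U+FFFD instead of raising an exception.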

How to decode a (doubly) 'url-encoded' string in python

Your input is double-encoded. Using Python 3:

urllib.parse.unquote(urllib.parse.unquote(some_string))

Output:

'FireShot3+(2).png'

Now only the + remains, because unquote() leaves + signs untouched.

Edit:

Using Python 2.7, it would need to be:

urllib.unquote(urllib.unquote('FireShot3%2B%25282%2529.png'))
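
To make the two passes visible, here is a Python 3 sketch that keeps the intermediate value. The last line assumes the + originally stood for a space (form-style encoding), which the input alone cannot confirm:

```python
from urllib.parse import unquote, unquote_plus

encoded = 'FireShot3%2B%25282%2529.png'

once = unquote(encoded)          # first layer removed
twice = unquote(once)            # second layer removed, '+' kept
with_space = unquote_plus(once)  # second layer removed, '+' read as a space

print(once)        # FireShot3+%282%29.png
print(twice)       # FireShot3+(2).png
print(with_space)  # FireShot3 (2).png
```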

invalid start byte while decoding url to utf-8-sig

The start of the file you show does appear to be the UTF-8 encoding of the Unicode byte order mark, so your decoding approach is correct. Apparently the rest of the file contains invalid UTF-8 somewhere. Since you don't control the quality of the input you are scraping, you could suppress the error like this so you can carry on:

text = urlopen(req).read().decode("utf-8-sig", errors="replace")

This will replace problem areas with a special symbol, so you can see where the problem arose. Or use errors="ignore" to make them just go away.
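
The difference between the error handlers can be seen without any network access. The bytes below are made up for illustration: a UTF-8 BOM, some valid text, and one stray 0xFF byte.

```python
raw = b'\xef\xbb\xbfcaf\xc3\xa9 \xff ok'

try:
    raw.decode('utf-8-sig')  # strict mode stops at the bad byte
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

replaced = raw.decode('utf-8-sig', errors='replace')  # bad byte becomes U+FFFD
ignored = raw.decode('utf-8-sig', errors='ignore')    # bad byte is dropped

print(strict_failed)    # True
print(ascii(replaced))  # 'caf\xe9 \ufffd ok'
print(ascii(ignored))   # 'caf\xe9  ok'
```

In both cases the BOM is stripped by the utf-8-sig codec and does not appear in the result.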

Fetching URL and converting to UTF-8 Python

The data you read from a urlopen() response is encoded data. You'd need to first decode that data using the right encoding.

You appear to have downloaded UTF-8 data; you have to decode that data before you have text:

test = page.read().decode('utf8')

However, it is up to the server to tell you what data was received. Check for a character set in the headers (getparam() is the Python 2 API; in Python 3 use page.info().get_content_charset()):

encoding = page.info().getparam('charset')

This can still be None; many data formats include the encoding as part of the format. XML for example is UTF-8 by default but the XML declaration at the start can contain information about what codec was used for that document. An XML parser would extract that information to ensure you get properly decoded Unicode text when parsing.
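
A sketch of that header lookup with a fallback, written so it runs without a network connection. The decode_body helper and its default are hypothetical; it parses a Content-Type value with email.message.Message, the same base class that the Python 3 response headers object derives from.

```python
from email.message import Message

def decode_body(body, content_type, default='utf-8'):
    """Decode bytes using the charset named in a Content-Type header value."""
    headers = Message()
    if content_type is not None:
        headers['Content-Type'] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    charset = headers.get_content_charset() or default
    return body.decode(charset, errors='replace')

declared = decode_body('café'.encode('latin-1'), 'text/html; charset=ISO-8859-1')
fallback = decode_body('café'.encode('utf-8'), 'text/html')

print(declared)  # café
print(fallback)  # café
```

Falling back to UTF-8 is only a guess; formats such as XML or HTML may declare their own encoding inside the document.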

You may not be able to print that data; the 852 codepage can only handle 256 different codepoints, while the Unicode standard is far larger.

Decoding string to UTF-8 (URL/Percent-encoding strings)

These strings are percent-encoded. Use the urllib.parse module to decode them:

import urllib.parse

s = "%C5%93ufs"
s = urllib.parse.unquote(s)
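
To see what unquote() does under the hood, the same result can be built in two explicit steps with urllib.parse.unquote_to_bytes(). This is only an illustration of the percent-escapes to bytes to text path, not a replacement for unquote():

```python
from urllib.parse import unquote, unquote_to_bytes

s = "%C5%93ufs"

raw = unquote_to_bytes(s)   # percent escapes -> bytes
text = raw.decode('utf-8')  # UTF-8 bytes -> text

print(raw)                 # b'\xc5\x93ufs'
print(text)                # œufs
print(text == unquote(s))  # True
```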

UTF-8 decoding doesn't decode special characters in python

\u00e1 is another way of representing the á character when displaying the contents of a Python string.

If you open a Python interactive session and run print({"Product" : "T\u00e1bua 21X40"}) you'll see output of {'Product': 'Tábua 21X40'}. The \u00e1 doesn't exist in the string as those individual characters.

The \u escape sequence indicates that the following numbers specify a Unicode character.

Attempting to replace \u00e1 with á won't achieve anything because that's what it already is. Additionally, replace("\\u00e1", "á") is attempting to replace the individual characters of a backslash, a u, etc. and, as mentioned, they don't actually exist in the string in that way.
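
A short check makes the distinction concrete. The raw-string case and the json.loads() step are additions for illustration: they cover the separate situation where the text really does contain a literal backslash followed by u, for example JSON that was read but never parsed.

```python
import json

s = "T\u00e1bua 21X40"         # escape processed by Python: one character á
literal = r"T\u00e1bua 21X40"  # raw string: backslash, u, 0, 0, e, 1 kept as-is

same = (s == "Tábua 21X40")
unchanged = (s.replace("\\u00e1", "á") == s)  # nothing to replace in s
decoded = json.loads('"' + literal + '"')     # parse the literal escape as JSON

print(same, unchanged)       # True True
print(len(s), len(literal))  # 11 16
print(decoded)               # Tábua 21X40
```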

If you explain the problem you're encountering further then we may be able to help more, but currently it sounds like the string has the correct content but is just being displayed differently than you expect.


