Url decode UTF-8 in Python
The data is UTF-8 encoded bytes escaped with URL quoting, so you want to decode it with urllib.parse.unquote(), which handles decoding from percent-encoded data to UTF-8 bytes and then to text, transparently:
from urllib.parse import unquote
url = unquote(url)
Demo:
>>> from urllib.parse import unquote
>>> url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
>>> unquote(url)
'example.com?title=правовая+защита'
The Python 2 equivalent is urllib.unquote(), but this returns a bytestring, so you'd have to decode manually:
from urllib import unquote
url = unquote(url).decode('utf8')
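Note that the demo output above still contains a literal +, because unquote() leaves plus signs alone. If the + in the query string stands for a space (as it does in form-encoded data), urllib.parse.unquote_plus() may be the better fit. A minimal Python 3 sketch, reusing the URL from the demo:

```python
from urllib.parse import unquote, unquote_plus

url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'

kept_plus = unquote(url)       # percent escapes decoded, '+' left untouched
as_spaces = unquote_plus(url)  # percent escapes decoded, '+' turned into a space

print(kept_plus)
print(as_spaces)
```

Which one is right depends on how the URL was produced, so treat the choice as an assumption about the input.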
Decoding UTF-8 to URL with Python
After some tests, I can confirm that the server accepts the URL in different formats:
raw utf8 encoded URL:
url_output = url_input.encode('utf8')
%encoded latin1 URL:
url_output = urllib.quote_plus(url_input.encode('latin1'), '/:')
%encoded utf8 URL:
url_output = urllib.quote_plus(url_input.encode('utf8'), '/:')
As the raw latin1 form is not accepted and leads to an incorrect URL error, and as passing non-ASCII characters in a URL may not be safe, my advice is to use this third way. It gives:
print url_output
https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-%C2%A3250pw-all-bills-included-/1174092955
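The snippets above are Python 2. A rough Python 3 equivalent might look like the sketch below. The value of url_input is an assumption, reconstructed from the output shown above with a literal pound sign in place of %C2%A3.

```python
from urllib.parse import quote_plus

# Assumed input: the URL from the output above, with a literal pound sign.
url_input = 'https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-\u00a3250pw-all-bills-included-/1174092955'

# In Python 3, quote_plus() takes str and encodes it as UTF-8 by default.
# '/' and ':' are passed as safe characters so the URL structure survives.
url_output = quote_plus(url_input, safe='/:')

# The latin1 variant only needs a different encoding argument.
latin1_output = quote_plus(url_input, safe='/:', encoding='latin-1')

print(url_output)
print(latin1_output)
```

No manual .encode() call is needed here, since quote_plus() handles the text-to-bytes step itself.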
Decoding UTF-8 URL in Python
Update: If the output file is a YAML document then you could ignore \u0163 in it. Unicode escapes are valid in YAML documents.
#!/usr/bin/env python3
import json
# json produces a subset of yaml
print(json.dumps('pe toţi mai')) # -> "pe to\u0163i mai"
print(json.dumps('pe toţi mai', ensure_ascii=False)) # -> "pe toţi mai"
Note: no \u in the last case. Both lines represent the same Python string.
yaml.dump() has a similar option: allow_unicode. Set it to True to avoid Unicode escapes.
The url is correct. You don't need to do anything with it:
#!/usr/bin/env python3
from urllib.parse import unquote
url = "pe%20to%C5%A3i%20mai"
text = unquote(url)
with open('some_file', 'w', encoding='utf-8') as file:
    def p(line):
        print(line, file=file) # write line to file
    p(text) # -> pe toţi mai
    p(repr(text)) # -> 'pe toţi mai'
    p(ascii(text)) # -> 'pe to\u0163i mai'
    p("pe to\u0163i mai") # -> pe toţi mai
    p(r"pe to\u0163i mai") # -> pe to\u0163i mai
    # NOTE: r'' prefix
The \u0163 sequence might be introduced by a character encoding error handler:
with open('some_other_file', 'wb') as file: # write bytes
    file.write(text.encode('ascii', 'backslashreplace')) # -> pe to\u0163i mai
Or:
with open('another', 'w', encoding='ascii', errors='backslashreplace') as file:
    file.write(text) # -> pe to\u0163i mai
More examples:
# introduce some more \u escapes
b = r"pe to\u0163i mai ţţţ".encode('ascii', 'backslashreplace') # bytes
print(b.decode('ascii')) # -> pe to\u0163i mai \u0163\u0163\u0163
# remove unicode escapes
print(b.decode('unicode-escape')) # -> pe toţi mai ţţţ
Decode escaped characters in URL
Using the urllib package (import urllib):
Python 2.7
From the official documentation:
urllib.unquote(string)
Replace %xx escapes by their single-character equivalent.
Example: unquote('/%7Econnolly/') yields '/~connolly/'.
Python 3
From the official documentation:
urllib.parse.unquote(string, encoding='utf-8', errors='replace')
[…]
Example: unquote('/El%20Ni%C3%B1o/') yields '/El Niño/'.
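The encoding and errors parameters in the Python 3 signature matter when the escaped bytes are not UTF-8. As an illustration (the input string here is made up), a lone %E9 is 'é' in Latin-1 but is not valid UTF-8 on its own:

```python
from urllib.parse import unquote

# Made-up input: '%E9' is the Latin-1 byte for 'é', not a valid UTF-8 sequence.
s = 'caf%E9'

default = unquote(s)                     # UTF-8 with errors='replace' (the defaults)
latin1 = unquote(s, encoding='latin-1')  # interpret the escaped byte as Latin-1

print(repr(default))  # the bad byte becomes U+FFFD, the replacement character
print(repr(latin1))
```

With the defaults the call does not raise; it silently substitutes the replacement character, so a stray U+FFFD in the result is a hint that the wrong encoding was assumed.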
How to decode a (doubly) 'url-encoded' string in python
Your input is double-encoded. Using Python 3:
urllib.parse.unquote(urllib.parse.unquote(some_string))
Output:
'FireShot3+(2).png'
Now you have the + left.
Edit:
Using Python 2.7, it would need to be:
urllib.unquote(urllib.unquote('FireShot3%2B%25282%2529.png'))
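If the number of encoding layers is not known in advance, one option is to keep unquoting until the string stops changing. The helper below is a hypothetical sketch (the name unquote_fully and the max_rounds limit are not from any library), shown for Python 3:

```python
from urllib.parse import unquote

def unquote_fully(value, max_rounds=5):
    """Unquote repeatedly until the result is stable (hypothetical helper)."""
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            break  # nothing left to decode
        value = decoded
    return value

print(unquote_fully('FireShot3%2B%25282%2529.png'))  # -> FireShot3+(2).png
```

Be careful with this approach: if the original text legitimately contained something that looks like a %xx escape, repeated decoding will change it too.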
invalid start byte while decoding url to utf-8-sig
The start of the file you show does appear to be a UTF-8 version of the unicode byte order mark, so your decoding approach is correct. Apparently the rest of the file contains invalid utf-8 somewhere. Since you don't control the quality of the input you are scraping, you could suppress the error like this so you can carry on:
text = urlopen(req).read().decode("utf-8-sig", errors="replace")
This will replace problem areas with a special symbol, so you can see where the problem arose. Or use errors="ignore" to make them just go away.
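To see the difference between the two error handlers without fetching anything, here is a small sketch on made-up bytes: a UTF-8 byte order mark, some valid UTF-8, and one stray 0xFF byte that cannot start a UTF-8 sequence.

```python
# Made-up sample: BOM + 'café ' + an invalid byte + ' menu'
raw = b'\xef\xbb\xbfcaf\xc3\xa9 \xff menu'

replaced = raw.decode('utf-8-sig', errors='replace')  # bad byte -> U+FFFD
ignored = raw.decode('utf-8-sig', errors='ignore')    # bad byte dropped

print(repr(replaced))
print(repr(ignored))

try:
    raw.decode('utf-8-sig')  # default errors='strict'
except UnicodeDecodeError as exc:
    print('strict decoding failed:', exc.reason)
```

In both lenient cases the BOM is stripped by the utf-8-sig codec; only the handling of the invalid byte differs.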
Fetching URL and converting to UTF-8 Python
The data you read from a urlopen() response is encoded data. You'd need to first decode that data using the right encoding.
You appear to have downloaded UTF-8 data; you'd have to decode that data first to get text:
test = page.read().decode('utf8')
However, it is up to the server to tell you how the data it sent is encoded. Check for a character set in the headers:
encoding = page.info().getparam('charset')
This can still be None; many data formats include the encoding as part of the format. XML for example is UTF-8 by default but the XML declaration at the start can contain information about what codec was used for that document. An XML parser would extract that information to ensure you get properly decoded Unicode text when parsing.
You may not be able to print that data; the 852 codepage can only handle 256 different codepoints, while the Unicode standard is far larger.
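The getparam() call above is Python 2. In Python 3 the response headers are an email.message.Message subclass, so page.headers.get_content_charset() plays the same role. The sketch below shows the idea offline; decode_body and its utf-8 fallback are hypothetical choices, and a plain Message stands in for real response headers.

```python
from email.message import Message

def decode_body(raw, headers, fallback='utf-8'):
    """Decode raw bytes using the charset from the headers (hypothetical helper)."""
    # get_content_charset() returns None when no charset parameter is present.
    charset = headers.get_content_charset() or fallback
    return raw.decode(charset, errors='replace')

# Stand-in for page.headers from a real urlopen() response.
headers = Message()
headers['Content-Type'] = 'text/html; charset=ISO-8859-2'
body = '\u0141\u00f3d\u017a'.encode('iso-8859-2')  # the city name Lodz, with Polish diacritics

print(headers.get_content_charset())  # -> iso-8859-2
text = decode_body(body, headers)
print(text == '\u0141\u00f3d\u017a')  # -> True
```

Falling back to UTF-8 is only a guess; for HTML or XML the document itself may declare a different encoding, as described above.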
Decoding string to UTF-8 (URL/Percent-encoding strings)
These strings are percent-encoded. Use the urllib.parse module to decode them:
import urllib.parse
s = "%C5%93ufs"
s = urllib.parse.unquote(s) # -> 'œufs'
UTF-8 decoding doesn't decode special characters in python
\u00e1 is another way of representing the á character when displaying the contents of a Python string.
If you open a Python interactive session and run print({"Product" : "T\u00e1bua 21X40"}) you'll see output of {'Product': 'Tábua 21X40'}. The \u00e1 doesn't exist in the string as those individual characters.
The \u escape sequence indicates that the following four hexadecimal digits specify a Unicode character.
Attempting to replace \u00e1 with á won't achieve anything because that's what it already is. Additionally, replace("\\u00e1", "á") is attempting to replace the individual characters of a backslash, a u, etc. and, as mentioned, they don't actually exist in the string in that way.
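A quick way to convince yourself of this is to inspect the string directly. The sketch below uses the product name from the example above; json.dumps() is used only to show one place where the escaped form shows up as a display format.

```python
import json

s = "T\u00e1bua 21X40"

print(s)               # the escape is already a single character in the string
print(len("\u00e1"))   # -> 1
print("\\u00e1" in s)  # -> False: no backslash, 'u', '0', ... in the string

# Replacing the six literal characters is therefore a no-op.
unchanged = s.replace("\\u00e1", "\u00e1")
print(unchanged == s)  # -> True

# The escaped form is only a way of writing the string out, e.g. as JSON.
print(json.dumps(s))   # -> "T\u00e1bua 21X40"
```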
If you explain the problem you're encountering further then we may be able to help more, but currently it sounds like the string has the correct content but is just being displayed differently than you expect.