How to Fetch a Non-ASCII URL with urlopen

How to fetch a non-ascii url with urlopen?

Strictly speaking URIs can't contain non-ASCII characters; what you have there is an IRI.

To convert an IRI to a plain ASCII URI:

  • non-ASCII characters in the hostname part of the address have to be encoded using the Punycode-based IDNA algorithm;

  • non-ASCII characters in the path, and most of the other parts of the address have to be encoded using UTF-8 and %-encoding, as per Ignacio's answer.

So:

import re, urlparse

def urlEncodeNonAscii(b):
    return re.sub('[\x80-\xFF]', lambda c: '%%%02x' % ord(c.group(0)), b)

def iriToUri(iri):
    parts = urlparse.urlparse(iri)
    return urlparse.urlunparse(
        part.encode('idna') if parti == 1 else urlEncodeNonAscii(part.encode('utf-8'))
        for parti, part in enumerate(parts)
    )

>>> iriToUri(u'http://www.a\u0131b.com/a\u0131b')
'http://www.xn--ab-hpa.com/a%c4%b1b'

(Technically this still isn't quite good enough in the general case because urlparse doesn't split away any user:pass@ prefix or :port suffix on the hostname. Only the hostname part should be IDNA encoded. It's easier to encode using normal urllib.quote and .encode('idna') at the time you're constructing a URL than to have to pull an IRI apart.)
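The caveat above can be addressed in Python 3 with urllib.parse.urlsplit, which exposes the username, password, hostname and port separately. What follows is a minimal sketch, not the original answer's code: the name iri_to_uri and the sets of "safe" characters are assumptions made for illustration, and IPv6 literals are not handled.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when quoting. '%' is kept so that existing
# percent-escapes are not encoded a second time (an assumption of this sketch).
SAFE = "/?:@!$&'()*+,;=%~"


def iri_to_uri(iri):
    """Convert an IRI (str) to an ASCII-only URI (str)."""
    parts = urlsplit(iri)

    # IDNA-encode the hostname only, not the userinfo or the port.
    host = (parts.hostname or "").encode("idna").decode("ascii")

    netloc = host
    if parts.port is not None:
        netloc += ":%d" % parts.port
    if parts.username is not None:
        userinfo = quote(parts.username, safe="%")
        if parts.password is not None:
            userinfo += ":" + quote(parts.password, safe="%")
        netloc = userinfo + "@" + netloc

    # Everything after the authority is UTF-8 encoded and percent-escaped.
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=SAFE),
        quote(parts.query, safe=SAFE),
        quote(parts.fragment, safe=SAFE),
    ))


print(iri_to_uri("http://user:pw@bücher.de:8080/a\u0131b?q=ü"))
```

Because only parts.hostname goes through the idna codec, a user:pass@ prefix or :port suffix is carried over unchanged.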

Unable to parse Non ASCII URL on Python?

I think you skipped over a little bit of the docs. Try this instead:

# coding=UTF-8

import json
import urllib

service_url = "https://www.googleapis.com/freebase/v1/mqlread"
query = [{
    '/sports/pro_athlete/teams': [
        {
            'to': None,
            'optional': True,
            'mid': None,
            'team': None
        }
    ],
    'name': 'José Mourinho'
}]

url = service_url + '?' + urllib.urlencode({'query': json.dumps(query)})
response = json.loads(urllib.urlopen(url).read())

print response

Rather than building the query string yourself, use json.dumps and urllib.urlencode to create it for you. They're good at this.

Note: if you can use the requests package, that last bit could be:

import requests
response = requests.get(service_url, params={'query': json.dumps(query)})

Then you get to skip the URL construction and escaping altogether!
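The snippet above is written for Python 2. Assuming Python 3, where urlencode lives in urllib.parse, the URL construction could be sketched as follows. The shortened query dictionary is a stand-in for illustration, and no request is actually sent.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

service_url = "https://www.googleapis.com/freebase/v1/mqlread"
query = [{"name": "José Mourinho", "mid": None}]  # shortened, illustrative query

# json.dumps escapes non-ASCII characters by default, and urlencode
# percent-encodes whatever is left, so the resulting URL is pure ASCII.
url = service_url + "?" + urlencode({"query": json.dumps(query)})
print(url)

# Round trip: decoding the query string gives back the original structure.
decoded = json.loads(parse_qs(urlsplit(url).query)["query"][0])
print(decoded == query)
```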

Passing a non ascii character in steam url

Your code seems to work fine:

>>> base_url = 'http://steamcommunity.com/market/priceoverview/?appid=730&currency=1&market_hash_name={}'
>>> urllib.request.urlopen(base_url.format(quote("StatTrak™ Dual Berettas | Panther (Factory New)"))).read()
b'{"success":true,"lowest_price":"$5.80","volume":"3","median_price":"$4.53"}'

I believe your issue arises because of double quoting. The %2520 in your "Output URL" means you quoted a single space (%20) twice (' ' -> %20 -> %2520).

Your code however seems to quote only once, which is good. You must have passed the item already quoted to the function.
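The double-quoting effect is easy to reproduce without touching the network. The sketch below uses a shortened item name, and the helper quote_once is a hypothetical name introduced here. It normalizes by unquoting first, which assumes the raw name never contains a literal percent sign.

```python
from urllib.parse import quote, unquote

name = "Dual Berettas | Panther (Factory New)"

once = quote(name)    # ' ' -> %20
twice = quote(once)   # '%' -> %25, so %20 -> %2520

print(once)
print(twice)


def quote_once(value):
    """Quote a value that may or may not be quoted already (hypothetical helper)."""
    return quote(unquote(value))


# Whether the caller passes the raw or the quoted name, the result is the same.
print(quote_once(name) == quote_once(once))
```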

How to request a URL with non-ASCII characters in the main domain name (not params) in Python?

What you need is u"http://www.besondere-raumdüfte.de/".encode('idna'). Please note how the source string is a Unicode constant (the u prefix).

The result is a URL usable with urlopen().

If you have a domain name with non-ASCII characters and the rest of the URL contains non-ASCII characters, you need to .encode('idna') the domain part and iri2uri() the rest.
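The answer above targets Python 2, where u"..." literals are needed. In Python 3 every str is Unicode and .encode('idna') returns bytes, so one more decode step is required. A minimal sketch that encodes only the host name, with variable names chosen for illustration:

```python
host = "www.besondere-raumdüfte.de"

# In Python 3, str.encode('idna') returns bytes; decode to get an ASCII str.
ascii_host = host.encode("idna").decode("ascii")
url = "http://" + ascii_host + "/"
print(url)

# The codec is reversible, which is handy for displaying the name again.
round_trip = ascii_host.encode("ascii").decode("idna")
print(round_trip)
```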

Fetching URL and converting to UTF-8 Python

The data you read from a urlopen() response is encoded data. You'd need to first decode that data using the right encoding.

You appear to have downloaded UTF-8 data; you'd have to decode that data first before you had text:

test = page.read().decode('utf8')

However, it is up to the server to tell you which encoding was used. Check for a character set in the headers:

encoding = page.info().getparam('charset')

This can still be None; many data formats include the encoding as part of the format. XML for example is UTF-8 by default but the XML declaration at the start can contain information about what codec was used for that document. An XML parser would extract that information to ensure you get properly decoded Unicode text when parsing.

You may not be able to print that data; the 852 codepage can only handle 256 different codepoints, while the Unicode standard is far larger.
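In Python 3 the getparam() call above is gone; the response headers are an email.message.Message subclass, which offers get_content_charset() instead. The sketch below wraps that in a hypothetical helper, decode_body, and uses a hand-built Message in place of a real response so that it runs offline. The UTF-8 fallback is an assumption, not something the server guarantees.

```python
from email.message import Message


def decode_body(raw, headers, fallback="utf-8"):
    """Decode response bytes using the charset from Content-Type, if any."""
    charset = headers.get_content_charset() or fallback
    return raw.decode(charset, errors="replace")


# Stand-in for the headers of a real response (response.headers with urlopen).
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-2"

print(decode_body("ő".encode("iso-8859-2"), headers))

# No Content-Type header at all: the fallback encoding is used.
print(decode_body("ő".encode("utf-8"), Message()))
```

With a live response the call would be decode_body(page.read(), page.headers).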

Can't open Unicode URL with Python

Per the applicable standard, RFC 1738, URLs can only contain ASCII characters. Good explanation here, and I quote:

"...Only alphanumerics [0-9a-zA-Z],
the special characters "$-_.+!*'(),"
[not including the quotes - ed], and
reserved characters used for their
reserved purposes may be used
unencoded within a URL."

As the URLs I've given explain, this probably means you'll have to replace that "lowercase i with acute accent" with '%ED' (its Latin-1 byte value, percent-encoded).
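Which escape is right depends on the encoding the server expects. As a small illustration, assuming Python 3, urllib.parse.quote defaults to UTF-8 and accepts an encoding argument for the Latin-1 form:

```python
from urllib.parse import quote, unquote

utf8_form = quote("í")                       # UTF-8 bytes C3 AD
latin1_form = quote("í", encoding="latin-1")  # single Latin-1 byte ED

print(utf8_form)    # %C3%AD
print(latin1_form)  # %ED

# Decoding needs the matching encoding to recover the character.
print(unquote(latin1_form, encoding="latin-1"))
```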

How i can solve problem with encoding python file

The error is caused by the non-ASCII characters in the URL. Use urllib.parse.quote to percent-encode the non-ASCII part before appending it to the URL:

from urllib.request import urlopen
from bs4 import BeautifulSoup

from urllib.parse import quote

html_doc = urlopen("https://yandex.ru/images/search?from=tabbar&text=" + quote("яблоко"))
soup = BeautifulSoup(html_doc, "html.parser")  # name the parser explicitly
print(html_doc)

for img in soup.find_all('img'):
    print(img.get("src"))

Output:

//im0-tub-ru.yandex.net/i?id=9550e470e4d75936eaab6bc78263d930&n=13
//im0-tub-ru.yandex.net/i?id=360c840e56e79e44037ee00e38d6c284&n=13
//im0-tub-ru.yandex.net/i?id=edae1e226592942278a0f7896ce98bdb&n=13
//im0-tub-ru.yandex.net/i?id=488807d3cee9a40c4c354ae733aa6c6a&n=13
//im0-tub-ru.yandex.net/i?id=806e10cc5c196e54a91e26d905b58636&n=13
//im0-tub-ru.yandex.net/i?id=66d4f6b8993a267504cc23ec9426f226&n=13
//im0-tub-ru.yandex.net/i?id=7da7b1f80ecc6f9f7137fcf0b61683c8&n=13
//im0-tub-ru.yandex.net/i?id=73b4d42a5f5e66be1ad5d0f599fbaf7c&n=13
//im0-tub-ru.yandex.net/i?id=ec3ee01852df27476594dfefc6364883&n=13
//im0-tub-ru.yandex.net/i?id=b1a08c606f5078f9cf4baad4fe8e459a&n=13
//im0-tub-ru.yandex.net/i?id=8fe235f57b55688b95f9d38d04fcb5d7&n=13
//im0-tub-ru.yandex.net/i?id=77235497f940c7d0e4d799319c8df5b1&n=13
//im0-tub-ru.yandex.net/i?id=19b1e58076c8ce51fd029eae0d1d7e7a&n=13
//im0-tub-ru.yandex.net/i?id=3adb56db0e22ae7fb5038633de5318b0&n=13
//im0-tub-ru.yandex.net/i?id=8e0acb9ca7b4e78ad97f8f01343129a2&n=13
//im0-tub-ru.yandex.net/i?id=fabac207fce051cd562bfcadc118d602&n=13
//im0-tub-ru.yandex.net/i?id=2bd88b1eb70d70508417439d3538d10d&n=13
//im0-tub-ru.yandex.net/i?id=47d3c4339d9a17392317b2de98a9ae23&n=13
//im0-tub-ru.yandex.net/i?id=50a21edf30c7736d44f5f3a327111ae0&n=13
//im0-tub-ru.yandex.net/i?id=1db8cd36e1333d7376bf91c8f797ce8e&n=13
//im0-tub-ru.yandex.net/i?id=c48ae350238da4cdb758dd1f1b7b0c9d&n=13
//im0-tub-ru.yandex.net/i?id=bfbf39648f7379ee23147a4d42a506fb&n=13
//im0-tub-ru.yandex.net/i?id=d9bade6482e5c015a28e85ca544d07fb&n=13
//im0-tub-ru.yandex.net/i?id=2126465dbbf5b405f50cdead47fe4ac8&n=13
//im0-tub-ru.yandex.net/i?id=10ae5a46ea6efaa4ca35040ba948df7c&n=13
//im0-tub-ru.yandex.net/i?id=d3869cd0412cf274954a8297c088002e&n=13
//im0-tub-ru.yandex.net/i?id=bbeadc9712b4978dfce7658f49692d5c&n=13
//im0-tub-ru.yandex.net/i?id=8ffdeec38792161950574eb16efc546f&n=13
//im0-tub-ru.yandex.net/i?id=65c6cd6b1094055c69b89251b0cbf150&n=13
//im0-tub-ru.yandex.net/i?id=a8c22542d8eadb335222d4b037cc7b74&n=13
//im0-tub-ru.yandex.net/i?id=b640fc2198c8731a03a435a504517b9f&n=11&ref=rq
//im0-tub-ru.yandex.net/i?id=93eb2b406a2e1e12f76484d147b19bfc&n=11&ref=rq
//im0-tub-ru.yandex.net/i?id=2eea3544a66acd4c96736cffe49d6252&n=11&ref=rq
//im0-tub-ru.yandex.net/i?id=123d340cf753b45d18050b28904ece08&n=11&ref=rq
//im0-tub-ru.yandex.net/i?id=016dcb736320a4367d8234fa746e06bb&n=11&ref=rq
//im0-tub-ru.yandex.net/i?id=b2813050d610631d93c8426ac3e9fc62&n=11&ref=rq
//im0-tub-ru.yandex.net/i?id=456ca4948973b44ded169b1ae2d3a888&n=11&ref=rq
//im0-tub-ru.yandex.net/i?id=d3f6660fca5da9413b013cbb92c91ab6&n=11&ref=rq
//im0-tub-ru.yandex.net/i?id=34e1a56bb94e872e05909c64e8d298c7&n=11&ref=rq
//im0-tub-ru.yandex.net/i?id=fe0ffea6ca299e5ba8bb435552866508&n=11&ref=rq
//im0-tub-ru.yandex.net/i?id=bd4994eecf708012fe37cb5e78411a96&n=11&ref=rq
//im0-tub-ru.yandex.net/i?id=2b6893a9f6b99d4fbeb5ca92b292af8a&n=11&ref=rq
//mc.yandex.ru/watch/722889

problem of urlretrieve cannot get image from url contains unicode string

The URL contains a non-ASCII character (a Cyrillic letter that looks like a Latin "c").

Escape this character using the urllib.parse.quote function:

import urllib.parse
import urllib.request

url = 'https://uploads0.wikiart.org' + urllib.parse.quote('/images/albrecht-durer/watermill-at-the-montaсa.jpg')
urllib.request.urlretrieve(url, '/tmp/watermill.jpg')

Don't put the entire URL in the quote function, otherwise it would escape the colon (":") in "https://".
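The difference is easy to see by quoting both ways without downloading anything. In this sketch the Cyrillic letter is written as the escape \u0441 so it cannot be mistaken for a Latin "c". The safe=":/" variant is an alternative that only suits simple URLs like this one.

```python
from urllib.parse import quote

base = "https://uploads0.wikiart.org"
path = "/images/albrecht-durer/watermill-at-the-monta\u0441a.jpg"  # \u0441 is Cyrillic

good = base + quote(path)   # only the path is escaped
bad = quote(base + path)    # the colon after "https" is escaped too

print(good)
print(bad)

# Alternative: quote the whole URL but declare ':' and '/' as safe.
also_ok = quote(base + path, safe=":/")
print(also_ok == good)
```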

in python, UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-18: ordinal not in range(128)

If you want a simple fix, use the requests module instead of urllib. It implicitly transforms Unicode URLs, so you don't have to.

from bs4 import BeautifulSoup as soup
import requests

page_url = "url.txt"  # text file with one URL per line

with open(page_url, "r") as fr:
    for url in map(lambda x: x.strip(), fr.readlines()):
        print(url)
        response = requests.get(url)
        page_soup = soup(response.text, "html.parser")

