How to Normalize a URL in Python

Canonicalize / normalize a URL?

Building on that good start, I composed a function that covers most of the cases commonly found on the web.

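from urllib.parse import urljoin, urlsplit, urlunsplit

def urlnorm(base, link=''):
    '''Normalize a URL, or a link relative to a base URL. URLs that point to the same resource will return the same string.'''
    # Note: lower() also lowercases the path and query, which some servers treat as case-sensitive.
    new = urlsplit(urljoin(base, link).lower())
    return urlunsplit((
        new.scheme,
        # Make the port explicit when none was given (":80" assumes plain HTTP).
        new.netloc if new.port is not None else new.hostname + ':80',
        new.path,
        new.query,
        ''))  # the fragment is dropped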
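
For comparison, here is a more conservative sketch that lowercases only the scheme and host, drops a port that matches the scheme's default, and leaves the case of the path and query alone. The name urlnorm_conservative and the DEFAULT_PORTS table are assumptions made for this illustration, not part of the original answer. Userinfo and IPv6 literals are not handled.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Assumption: only http and https defaults matter for this sketch.
DEFAULT_PORTS = {'http': 80, 'https': 443}

def urlnorm_conservative(base, link=''):
    """Hypothetical variant of urlnorm that keeps path and query case intact."""
    parts = urlsplit(urljoin(base, link))
    scheme = parts.scheme.lower()
    host = (parts.hostname or '').lower()  # .hostname drops userinfo and port
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host = '%s:%d' % (host, port)
    # Treat an empty path as "/".
    path = parts.path or '/'
    # The fragment is dropped, as in the function above.
    return urlunsplit((scheme, host, path, parts.query, ''))

print(urlnorm_conservative('HTTP://Example.COM:80/a/b.html', '../Index.html#top'))
# http://example.com/Index.html
```

Whether two URLs really name the same resource depends on the server, so treat this as one possible policy rather than a definitive rule.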

Function in Python to clean up and normalize a URL

Take a look at urlparse.urlparse(). I've had good success with it.

Note: This answer is from 2011 and is specific to Python 2. In Python 3 the urlparse module has been renamed to urllib.parse. The corresponding Python 3 documentation for urllib.parse can be found here:

https://docs.python.org/3/library/urllib.parse.html
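
As a quick illustration of what the parser gives you in Python 3, here is a small sketch. The example URL is made up.

```python
from urllib.parse import urlparse

parts = urlparse('HTTP://www.Example.com:8080/a/b.html?x=1#top')

print(parts.scheme)    # http  (the scheme is lowercased for you)
print(parts.netloc)    # www.Example.com:8080  (kept as written)
print(parts.hostname)  # www.example.com  (lowercased, without the port)
print(parts.port)      # 8080
print(parts.path)      # /a/b.html
print(parts.query)     # x=1
print(parts.fragment)  # top
```

The split-out pieces can be cleaned up individually and reassembled with urlunparse.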

How can I normalize/collapse paths or URLs in Python in OS independent way?

Here is how to do it:

>>> from urllib.parse import urljoin  # Python 2: from urlparse import urljoin
>>> urljoin("ftp://domain.com/a/b/c/d/", "../..")
'ftp://domain.com/a/b/'
>>> urljoin("ftp://domain.com/a/b/c/d/e.txt", "../..")
'ftp://domain.com/a/b/'

Remember that urljoin treats everything up to the last / as the path (directory); whatever follows it is the filename, if any.

Also, do not add a leading / to the second parameter: that makes it an absolute path, which replaces the base path entirely instead of being resolved relative to it.

The os.path module is platform dependent, but for file paths that use only forward slashes (and are not URLs) you could use posixpath.normpath.
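
A short sketch of both points, using made-up paths: posixpath.normpath collapsing dot segments regardless of platform, and the effect of a leading slash on urljoin.

```python
import posixpath
from urllib.parse import urljoin

# Collapse "." and ".." segments in a slash-separated path on any OS.
collapsed = posixpath.normpath('/a/b/c/../../d/./e.txt')
print(collapsed)   # /a/d/e.txt

# A leading "/" in the second argument replaces the whole base path.
absolute = urljoin('ftp://domain.com/a/b/c/d/', '/x/y')
print(absolute)    # ftp://domain.com/x/y

# Without it, the reference is resolved relative to the base directory.
relative = urljoin('ftp://domain.com/a/b/c/d/', 'x/y')
print(relative)    # ftp://domain.com/a/b/c/d/x/y
```

Keep in mind that normpath also removes a trailing slash, which can matter for URLs.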

How to normalize URL and disregard anything after the slash?

You can use the urllib3 library in Python:

from urllib3.util import parse_url

urls = ['foo.com','www.foo.com/','foo.com/us','foo.com/ca/example-test/']
for url in urls:
    host = parse_url(url).host
    # Slice the prefix off; lstrip('www.') would strip any leading 'w' and '.' characters.
    if host.startswith('www.'):
        host = host[len('www.'):]
    print(host)

Each URL prints foo.com, as you expected.
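
A note on removing the www. prefix: str.lstrip takes a set of characters, not a prefix, so it can eat into the host name itself. A minimal sketch of the difference, where strip_www is a hypothetical helper and www.web.com a made-up host:

```python
def strip_www(host):
    """Hypothetical helper: remove exactly one leading 'www.' if present."""
    prefix = 'www.'
    return host[len(prefix):] if host.startswith(prefix) else host

host = 'www.web.com'

# lstrip removes every leading character that is 'w' or '.'.
print(host.lstrip('www.'))   # eb.com

# Slicing removes the prefix and nothing else.
print(strip_www(host))       # web.com
```

On Python 3.9 and later, host.removeprefix('www.') does the same job as the helper.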

Cleaning URLs and saving them to txt file Python3

That is because you are removing items from the list while iterating over it, which is a bad thing to do. You could either create another list and append the new values to it, or modify the list in place using indexing. You could also just use a list comprehension for this task:

content = [item if item.startswith(url_format) else re.sub(r'.*google', url_format, item) for item in content]

Or, using another list:

new_content = []

for item in content:
    if item.startswith(url_format):
        new_content.append(item)
    else:
        new_content.append(re.sub(r'.*google', url_format, item))

Or, modifying the list in-place, using indexing:

for i, item in enumerate(content):
    if not item.startswith(url_format):
        content[i] = re.sub(r'.*google', url_format, item)
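
To see why removing during iteration goes wrong, here is a minimal sketch with a made-up list. The loop skips the element that slides into the slot just vacated.

```python
items = ['a', 'b', 'b', 'c']
for item in items:
    if item == 'b':
        # Removing shifts the later elements left, so the second 'b'
        # moves into the position the loop has already visited.
        items.remove(item)
print(items)   # ['a', 'b', 'c']  (one 'b' survived)

# Building a new list avoids the problem.
kept = [item for item in ['a', 'b', 'b', 'c'] if item != 'b']
print(kept)    # ['a', 'c']
```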

Python 3 clean and normalize URL list

Since strings are immutable in Python, we can't change characters in them, only build new strings, hence the slight complication. First we switch every http:// prefix to https://. Then we check whether www is present in the link. If it is not, we replace the country code (two letters) with www.

list1 = ['http://www.google.com/images', 'https://ca.google.com/images','https://www.google.com/images','http://uk.google.com/images',
'https://www.google.com/images']
list1 = [item.replace('http://', 'https://', 1) for item in list1]
# Build a new list rather than appending to and removing from list1 while looping over it.
normalized = []
for item in list1:
    if 'www' not in item:
        # Characters 8 and 9 come right after 'https://' and hold the country code.
        item = item[:8] + 'www' + item[10:]
    normalized.append(item)
list1 = normalized

print(list1)

Output:
['https://www.google.com/images', 'https://www.google.com/images', 'https://www.google.com/images', 'https://www.google.com/images', 'https://www.google.com/images']
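
An alternative sketch that works on parsed URL components instead of character positions. The helper force_www_https is hypothetical, and it assumes every host starts with a subdomain label that should become www, with no port or userinfo in the URL. A bare google.com would be mangled.

```python
from urllib.parse import urlsplit, urlunsplit

def force_www_https(url):
    """Hypothetical helper: force https and replace the first host label with www."""
    parts = urlsplit(url)
    labels = parts.netloc.split('.')
    labels[0] = 'www'   # assumption: the first label is always a subdomain
    return urlunsplit(('https', '.'.join(labels), parts.path,
                       parts.query, parts.fragment))

urls = ['http://www.google.com/images',
        'https://ca.google.com/images',
        'http://uk.google.com/images']
cleaned = [force_www_https(u) for u in urls]
print(cleaned)
# ['https://www.google.com/images', 'https://www.google.com/images', 'https://www.google.com/images']
```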

URL parsing in Python - normalizing double-slash in paths

If you only want to get the URL without the query part, I would skip the urlparse module and just do:

testUrl.split('?', 1)

The URL will be at index 0 of the returned list and the query, if there is one, at index 1.

A literal '?' is allowed inside the query string, so splitting only at the first '?' keeps the rest of the query in one piece. A fragment (#...) is not removed by this.
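
For the double-slash case named in the title, here is a sketch that goes through urllib.parse instead: it drops the query and fragment and collapses repeated slashes in the path only. The helper collapse_slashes and the example URL are made up, and whether // and / really mean the same thing is up to the server.

```python
import re
from urllib.parse import urlsplit, urlunsplit

def collapse_slashes(url):
    """Hypothetical helper: drop query and fragment, collapse '//' runs in the path."""
    parts = urlsplit(url)
    # Only the path is touched, so the '//' after the scheme is left alone.
    path = re.sub(r'/{2,}', '/', parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, '', ''))

print(collapse_slashes('http://example.com//a///b/c.html?x=1#frag'))
# http://example.com/a/b/c.html
```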

Normalizing HTML content when requesting URL with Requests

Use response.text instead of response.content – as noted in the Requests documentation quoted below, this will decode the response content to a Unicode string, using the encoding information provided by the HTTP response:

content

Content of the response, in bytes.

text

Content of the response, in unicode.

If Response.encoding is None, encoding will be guessed using chardet.

The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.
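
To see why the encoding matters, here is a sketch with plain bytes and no network request. It only mimics the bytes-versus-text distinction; it does not call Requests, and the sample string is made up.

```python
raw = 'café'.encode('utf-8')   # bytes, analogous to response.content
print(raw)                     # b'caf\xc3\xa9'

right = raw.decode('utf-8')    # decoded with the correct encoding
print(right)                   # café

wrong = raw.decode('latin-1')  # decoded with a wrong guess
print(wrong)                   # cafÃ©
```

A wrong encoding does not raise an error here. It silently produces mojibake, which is why setting r.encoding yourself can help when the headers are misleading.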

Example:

import requests
url = 'https://www.nytimes.com/2017/03/17/world/europe/trump-britain-obama-wiretap-gchq.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.text)

Output:

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--> <html lang="en" class="no-js section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" itemid="https://www.nytimes.com/2017/03/17/world/europe/trump-britain-obama-wiretap-gchq.html" itemtype="http://schema.org/NewsArticle"  itemscope xmlns:og="http://opengraphprotocol.org/schema/"> <!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<!--[if IE 8]> <html lang="en" class="no-js ie8 lt-ie10 lt-ie9 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<!--[if (lt IE 8)]> <html lang="en" class="no-js lt-ie10 lt-ie9 lt-ie8 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<head>
    <title>Trump Offers No Apology for Claim on British Spying - The New York Times</title>
      <!-- etc ... -->
</body>
</html>