Canonicalize / normalize a URL?
Following the good start, I composed a method that fits most of the cases commonly found in the web.
def urlnorm(base, link=''):
'''Normalizes an URL or a link relative to a base url. URLs that point to the same resource will return the same string.'''
new = urlparse(urljoin(base, url).lower())
return urlunsplit((
new.scheme,
(new.port == None) and (new.hostname + ":80") or new.netloc,
new.path,
new.query,
''))
Function in Python to clean up and normalize a URL
Take a look at urlparse.urlparse()
. I've had good success with it.
note: This answer is from 2011 and is specific to Python2. In Python3 the urlparse
module has been named to urllib.parse
. The corresponding Python3 documentation for urllib.parse
can be found here:
https://docs.python.org/3/library/urllib.parse.html
How can I normalize/collapse paths or URLs in Python in OS independent way?
Here is how to do it
>>> import urlparse
>>> urlparse.urljoin("ftp://domain.com/a/b/c/d/", "../..")
'ftp://domain.com/a/b/'
>>> urlparse.urljoin("ftp://domain.com/a/b/c/d/e.txt", "../..")
'ftp://domain.com/a/b/'
Remember that urljoin
consider a path/directory all until the last /
- after this is the filename, if any.
Also, do not add a leading /
to the second parameter, otherwise you will not get the expected result.
os.path
module is platform dependent but for file paths using only slashes but not-URLs you could use posixpath,normpath
.
How to normalize URL and disregard anything after the slash?
You can use URLlib module in python
from urllib3.util import parse_url
urls = ['foo.com','www.foo.com/','foo.com/us','foo.com/ca/example-test/']
for url in urls:
parsed_url = parse_url(url)
host = parsed_url.host if not parsed_url.host.startswith('www.') else parsed_url.host.lstrip('www.')
Output will be as you expected.
Cleaning URLs and saving them to txt file Python3
That is because you removing items from the list while iterating over it, which is a bad thing to do, you could either create another list that has the new values and append to it, or modify the list in-place using indexing, you could also just use a list comprehension for this task:
content = [item if item.startswith(url_format) else re.sub(r'.*google', url_format, item) for item in content]
Or, using another list:
new_content = []
for item in content:
if item.startswith(url_format):
new_content.append(item)
else:
new_content.append(re.sub(r'.*google', url_format, item))
Or, modifying the list in-place, using indexing:
for i, item in enumerate(content):
if not item.startswith(url_format):
content[i] = re.sub(r'.*google', url_format, item)
Python 3 clean and normalize URL list
Since strings are immutable in python we can't change alphabets in them but make new strings, hence the slight complication. First we remove the http
elements. Then we check if www
is present in the link or not. If not we replace the country code(two alphabets) with www
list1 = ['http://www.google.com/images', 'https://ca.google.com/images','https://www.google.com/images','http://uk.google.com/images',
'https://www.google.com/images']
list1 = [item.replace('http://', 'https://') for item in list1]
for item in list1:
if not 'www' in item:
old_item = item
v = str(item[8:10])
new_item = item.replace(v, 'www')
list1.append(new_item)
list1.remove(old_item)
print(list1)
Output:['https://www.google.com/images', 'https://www.google.com/images', 'https://www.google.com/images', 'https://www.google.com/images', 'https://www.google.com/images']
URL parsing in Python - normalizing double-slash in paths
If you only want to get the url without the query part, I would skip the urlparse module and just do:
testUrl.rsplit('?')
The url will be at index 0 of the list returned and the query at index 1.
It is not possible to have two '?' in an url so it should work for all urls.
Normalizing HTML content when requesting URL with Requests
Use response.text
instead of response.content
– as noted in the Requests documentation quoted below, this will decode the response content to a Unicode string, using the encoding information provided by the HTTP response:
content
Content of the response, in bytes.
text
Content of the response, in unicode.
If Response.encoding is None, encoding will be guessed using chardet.
The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.
Example:
import requests
url = 'https://www.nytimes.com/2017/03/17/world/europe/trump-britain-obama-wiretap-gchq.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.text)
Output:
<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--> <html lang="en" class="no-js section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" itemid="https://www.nytimes.com/2017/03/17/world/europe/trump-britain-obama-wiretap-gchq.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"> <!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<!--[if IE 8]> <html lang="en" class="no-js ie8 lt-ie10 lt-ie9 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<!--[if (lt IE 8)]> <html lang="en" class="no-js lt-ie10 lt-ie9 lt-ie8 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<head>
<title>Trump Offers No Apology for Claim on British Spying - The New York Times</title>
<!-- etc ... -->
</body>
</html>
Related Topics
How to Get Dict from SQLite Query
How Do Chained Comparisons in Python Actually Work
When Should I Be Using Classes in Python
Access Data in Package Subdirectory
Is Generator.Next() Visible in Python 3
Add Column with Number of Days Between Dates in Dataframe Pandas
Django What Is Reverse Relationship
Replace Column Values Based on Another Dataframe Python Pandas - Better Way
Convert Pandas Series to Dataframe
Reset Color Cycle in Matplotlib
Why Is 'Self' in Python Objects Immutable
Pyinstaller Is Not Recognized as Internal or External Command
Convert Dictionary Entries into Variables
How to Melt 2 Columns at the Same Time
How to Run Function in a Subprocess Without Threading or Writing a Separate File/Script