How to Validate a Url in Python? (Malformed or Not)


Django's URL validation regex (adapted from the Django source):

import re
regex = re.compile(
r'^(?:http|ftp)s?://' # http:// or https://
r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
r'localhost|' #localhost...
r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
r'(?::\d+)?' # optional port
r'(?:/?|[/?]\S+)$', re.IGNORECASE)

print(re.match(regex, "http://www.example.com") is not None) # True
print(re.match(regex, "example.com") is not None) # False

How to check if a URL exists or not?

There are a few ways to do this. One is to use the validators module. It would look something like this:

import validators

valid = validators.url('https://codespeedy.com/')
if valid:
    print("URL is valid")
else:
    print("Invalid URL")

Another way would be to use the requests module, like this:

import requests

try:
    response = requests.get("http://www.google.com/")
    print("URL is valid and exists on the internet")
except requests.ConnectionError:
    print("URL does not exist on the internet")
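A lighter-weight variant of the same idea is to send a HEAD request instead of a GET, so the response body is never downloaded, and to add a timeout so the check cannot hang. This is a sketch, assuming the target server responds to HEAD requests; the function name and timeout value are illustrative:

```python
import requests

def url_exists(url, timeout=5):
    """Return True if the URL responds with a non-error status.

    A HEAD request avoids downloading the body; any request-level
    failure (bad scheme, DNS error, timeout) counts as non-existent.
    """
    try:
        response = requests.head(url, allow_redirects=True, timeout=timeout)
        return response.status_code < 400
    except requests.RequestException:
        return False
```

Note that `requests.RequestException` is the base class for all requests errors, so this also catches malformed URLs (e.g. a missing scheme), not just connection failures.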

How do you validate a URL with a regular expression in Python?

An easy way to parse (and validate) URLs is the urlparse module (urllib.parse in Python 3).

A regex is too much work.


There's no "validate" method because almost anything is a valid URL. There are some punctuation rules for splitting it up. Absent any punctuation, you still have a valid URL.

Check the RFC carefully and see if you can construct an "invalid" URL. The rules are very flexible.

For example ::::: is a valid URL. The path is ":::::". A pretty stupid filename, but a valid filename.

Also, ///// is a valid URL. The netloc ("hostname") is "". The path is "///". Again, stupid. Also valid. This URL normalizes to "///", which is equivalent.

Something like "bad://///worse/////" is perfectly valid. Dumb but valid.

Bottom Line. Parse it, and look at the pieces to see if they're displeasing in some way.

Do you want the scheme to always be "http"? Do you want the netloc to always be "www.somename.somedomain"? Do you want the path to look unix-like? Or windows-like? Do you want to remove the query string? Or preserve it?

These are not RFC-specified validations. These are validations unique to your application.
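The approach above can be sketched as follows: parse first, then apply your own application-specific rules to the pieces. This is a minimal example, assuming the application only wants http/https URLs with a non-empty hostname; the function name and the exact rules are illustrative:

```python
from urllib.parse import urlparse

def looks_like_web_url(url):
    """Application-specific check: parse, then require a scheme of
    http/https and a non-empty netloc. These rules are ours, not
    the RFC's -- the RFC happily accepts strings like ':::::'."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(looks_like_web_url("http://www.example.com/path"))  # True
print(looks_like_web_url(":::::"))  # parses fine, but fails our rules: False
```

The parse itself almost never rejects anything; all the rejection happens in the rules you write against the parsed pieces.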

How to check if a string inputted by the user is a URL or plain text in Python

from django.core.validators import URLValidator
from django.core.exceptions import ValidationError

validate = URLValidator()

try:
    validate("http://www.avalidurl.com/")
    print("String is a valid URL")
except ValidationError:
    print("String is not a valid URL")


How to check if a dataframe is in the URL format or not?

You can use the validators package.

After writing a function that returns whether a URL is valid, you can use df.apply() to apply it to all URLs in the dataframe, returning True/False for each. The function can also print a warning whenever it finds an invalid URL.

import validators

def is_url_valid(url):
    return bool(validators.url(url))

df['isURLValid'] = df['website'].apply(is_url_valid)

Output:

                      website  isURLValid
0  https://stackoverflow.com/        True
1                          no       False

Lastly, if you don't want to add the results as a column in the dataframe, you can loop through df['website'].tolist(), call the function for each value, and print a warning inside the function.

how can I use this regular expression to validate a valid url in python

For starters, you don't need the first and last slash: those are regex delimiters from other languages, not part of Python's syntax. Secondly, you need to compile your expression into an re pattern object:

import re
expr = re.compile(r'((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+(:[0-9]+)?|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-]*)?\??(?:[-\+=&;%@.\w]*)#?(?:[\w]*))?)')

After that you can use the match method to see whether the expression matches:

url = "www.google.com"

if expr.match(url):
    print("It is valid")

Verify simple human readable urls in Python - validate domain name offline

It seems that you haven't tried everything yet! Look at the example below:

import re
match_cases = ['http://www.example.de', 'https://example.de/more', 'www.sub.example.de', 'example.de']

URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:de|com)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:de|com)\b/?(?!@)))"""

for text in match_cases:
    url = re.findall(URL_REGEX, text)
    print(url)

Refer to the links below for more detail:

Liberal Regex Pattern for Web URLs
and a few similar answers:
https://stackoverflow.com/a/44645567/7664524

https://stackoverflow.com/a/44645124/7664524


Update

In your updated question you used url.replace("https://", ""). This replaces every occurrence of "https://" in the string, so a URL that contains a reference to another URL (for example in a query parameter) will be mangled too.
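The difference can be shown with a small example. On Python 3.9+, str.removeprefix strips the scheme only when it appears at the start of the string, which is usually what is intended (the example URL here is illustrative):

```python
url = "https://example.com/redirect?to=https://other.com/"

# str.replace removes EVERY occurrence, corrupting the embedded URL:
print(url.replace("https://", ""))
# example.com/redirect?to=other.com/

# str.removeprefix (Python 3.9+) strips only a leading scheme:
print(url.removeprefix("https://"))
# example.com/redirect?to=https://other.com/
```

On older Python versions the same effect can be had with a startswith check followed by slicing.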

selenium says invalid url though it is not an invalid url

You need to use src = str(sheet.cell(row, 2).value)

sheet.cell(row, 2) returns a Cell object, not a string.


