How to Validate a URL with a Regular Expression in Python

How do you validate a URL with a regular expression in Python?

An easy way to parse (and validate) URLs is the urlparse module (urlparse in Python 2, urllib.parse in Python 3).

A regex is too much work.


There's no "validate" method because almost anything is a valid URL. There are some punctuation rules for splitting it up. Absent any punctuation, you still have a valid URL.

Check the RFC carefully and see if you can construct an "invalid" URL. The rules are very flexible.

For example ::::: is a valid URL. The path is ":::::". A pretty stupid filename, but a valid filename.

Also, ///// is a valid URL. The netloc ("hostname") is "". The path is "///". Again, stupid. Also valid. This URL normalizes to "///", which is equivalent.

Something like "bad://///worse/////" is perfectly valid. Dumb but valid.
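
As a quick check, the three odd examples above can be run through urllib.parse.urlparse (Python 3). This is only a sketch; split_url is a helper name made up for illustration and is not part of the standard library.

```python
from urllib.parse import urlparse


def split_url(candidate):
    """Return (scheme, netloc, path) as reported by urlparse."""
    parts = urlparse(candidate)
    return parts.scheme, parts.netloc, parts.path


# None of these raise an error; each one simply parses into pieces.
for candidate in (":::::", "/////", "bad://///worse/////"):
    print(repr(candidate), "->", split_url(candidate))
```

The parser accepts all three strings, which is the point: parsing succeeds, and any judgment about the pieces is left to you.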

Bottom line: parse it, and look at the pieces to see if they're displeasing in some way.

Do you want the scheme to always be "http"? Do you want the netloc to always be "www.somename.somedomain"? Do you want the path to look unix-like? Or windows-like? Do you want to remove the query string? Or preserve it?

These are not RFC-specified validations. These are validations unique to your application.
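
A minimal sketch of that approach, assuming (purely for illustration) that the application only accepts http or https URLs with a non-empty hostname. The names ALLOWED_SCHEMES and is_acceptable_url are hypothetical; your own rules will differ.

```python
from urllib.parse import urlparse

# Assumption: this application only wants web URLs.
ALLOWED_SCHEMES = {"http", "https"}


def is_acceptable_url(candidate):
    """Parse the URL, then apply application-specific rules to the pieces."""
    try:
        parts = urlparse(candidate)
    except ValueError:
        # urlparse raises ValueError for some inputs, e.g. unbalanced IPv6 brackets.
        return False
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    if not parts.hostname:
        # Missing or empty host, as in "http:///path".
        return False
    return True


print(is_acceptable_url("http://www.example.com/path?q=1"))
print(is_acceptable_url("bad://///worse/////"))
```

Further checks (path shape, query string handling) can be added as extra conditions on parts.path and parts.query in the same way.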

Use a custom regex to validate a LinkedIn URL

Based only on the samples you've shown, try the following regex:

^http[s]?:\/\/www\.linkedin\.com\/(in|pub|public-profile\/in|public-profile\/pub)\/([\w]{6}-[\w]{1,}-[\w]+)$

Explanation: a detailed breakdown of the regex above.

^http[s]?:                    ## Check that the URL starts with http or https.
\/\/www\.linkedin\.com\/      ## Then check that the domain is www.linkedin.com.
(in|pub|public-profile\/in|public-profile\/pub) ## Then check that it is followed by in, pub, public-profile/in, or public-profile/pub.
\/([\w]{6}-[\w]{1,}-[\w]+)$   ## Then check for a /, exactly 6 word characters, a hyphen, 1 or more word characters, another hyphen, and 1 or more word characters up to the end of the string.

NOTE: If the URL should only start with https, change ^http[s]? to ^https in the regex above.

NOTE2: The regex above creates 2 capturing groups. If you don't want any capturing groups, try the following:

^http[s]?:\/\/www\.linkedin\.com\/(?:in|pub|public-profile\/in|public-profile\/pub)\/(?:[\w]{6}-[\w]{1,}-[\w]+)$
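
For use from Python, the non-capturing pattern can be compiled as a raw string. This is a sketch only: the function name is_linkedin_profile and the sample URLs are made up for illustration, and the pattern still reflects only the samples from the question.

```python
import re

# The non-capturing pattern, copied unchanged. The escaped slashes (\/) are
# not required in Python but are harmless.
LINKEDIN_PROFILE = re.compile(
    r"^http[s]?:\/\/www\.linkedin\.com\/"
    r"(?:in|pub|public-profile\/in|public-profile\/pub)\/"
    r"(?:[\w]{6}-[\w]{1,}-[\w]+)$"
)


def is_linkedin_profile(url):
    """Return True if url matches the LinkedIn profile pattern."""
    return LINKEDIN_PROFILE.match(url) is not None


# Hypothetical sample URLs, not taken from the original question.
print(is_linkedin_profile("https://www.linkedin.com/in/abcdef-ghi-123"))
print(is_linkedin_profile("https://linkedin.com/in/abcdef-ghi-123"))
```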

How to validate a complete and valid url using Regex

This is pretty much a FAQ. You could simply try a search with [regex] +validate +url, or just look at this answer: "What is the best regular expression to check if a string is a valid URL".

How to validate a url in Python? (Malformed or not)

Django URL validation regex (source):

import re
regex = re.compile(
    r'^(?:http|ftp)s?://'  # http://, https://, ftp:// or ftps://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or IP
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)

print(re.match(regex, "http://www.example.com") is not None) # True
print(re.match(regex, "example.com") is not None) # False

Regex: validate a URL path with no query params

The regex you've defined is a character class. Instead, try:

^\/[/.a-zA-Z0-9-]+$
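
A small sketch of how this pattern behaves in Python. The helper name is_plain_path and the sample paths are assumptions made for illustration.

```python
import re

# The suggested pattern: a leading slash followed by one or more of
# "/", ".", ASCII letters, digits, or "-". A "?" is not in the class,
# so anything carrying a query string is rejected.
PATH_ONLY = re.compile(r"^\/[/.a-zA-Z0-9-]+$")


def is_plain_path(path):
    """Return True if path is a URL path with no query parameters."""
    return PATH_ONLY.match(path) is not None


print(is_plain_path("/docs/index.html"))
print(is_plain_path("/docs/index.html?page=2"))
```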

How can I match my regex for URL (example.net/directory) without HTTP, HTTPS and WWW?

Don't use a regex if you can avoid it; see if you can parse the URL with a dedicated library.

This will also help with other TLDs, such as .net, .org, .club.

>>> import urllib.parse
>>> urls = ("https://www.example.com/directory", "www.example.com/directory", "example.com/directory")
>>> for url in urls:
...     print(urllib.parse.urlparse("http://" + url.split("//")[-1]))
...
ParseResult(scheme='http', netloc='www.example.com', path='/directory', params='', query='', fragment='')
ParseResult(scheme='http', netloc='www.example.com', path='/directory', params='', query='', fragment='')
ParseResult(scheme='http', netloc='example.com', path='/directory', params='', query='', fragment='')

To get just the top- and second-level domains, you could just split() the netloc:

>>> urllib.parse.urlparse("http://whatever.example.com").netloc.split(".")[-2:]
['example', 'com']

Regular Expression for URL in Python

One simple fix would be to just replace anything matching the pattern https?://\S+ with an empty string:

import re

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
output = re.sub(r'https?://\S+', '', article_example)
print(output)

This prints:

眼影盤長這樣  說真的 很不好拍

My pattern assumes that any non-whitespace characters that follow http:// or https:// are part of the URL.


