How do you validate a URL with a regular expression in Python?
An easy way to parse (and validate) URLs is the urlparse module (urlparse in Python 2, urllib.parse in Python 3).
A regex is too much work.
There's no "validate" method because almost anything is a valid URL. There are some punctuation rules for splitting it up. Absent any punctuation, you still have a valid URL.
Check the RFC carefully and see if you can construct an "invalid" URL. The rules are very flexible.
For example, ::::: is a valid URL. The path is ":::::". A pretty stupid filename, but a valid filename.
Also, ///// is a valid URL. The netloc ("hostname") is "". The path is "///". Again, stupid. Also valid. This URL normalizes to "///", which is equivalent.
Something like "bad://///worse/////" is perfectly valid. Dumb but valid.
Bottom Line. Parse it, and look at the pieces to see if they're displeasing in some way.
Do you want the scheme to always be "http"? Do you want the netloc to always be "www.somename.somedomain"? Do you want the path to look unix-like? Or windows-like? Do you want to remove the query string? Or preserve it?
These are not RFC-specified validations. These are validations unique to your application.
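As a minimal sketch of "parse it, and look at the pieces", the snippet below runs the example strings above through urllib.parse.urlparse and applies two application-specific checks. The helper name check_url and the rule set (allowed schemes, non-empty host) are assumptions chosen for illustration, not requirements from any RFC.

```python
from urllib.parse import urlparse


def check_url(url, allowed_schemes=("http", "https")):
    """Return a list of application-specific complaints (empty if none).

    Hypothetical helper: the rules here are examples, not RFC rules.
    """
    parts = urlparse(url)
    problems = []
    if parts.scheme not in allowed_schemes:
        problems.append(f"scheme {parts.scheme!r} not allowed")
    if not parts.netloc:
        problems.append("missing host")
    return problems


# Every string parses without error; only our own rules reject some of them.
for candidate in (":::::", "/////", "bad://///worse/////",
                  "http://www.example.com/a?b=1"):
    scheme, netloc, path = urlparse(candidate)[:3]
    print(repr(candidate), (scheme, netloc, path), check_url(candidate))
```

Which checks belong in such a helper depends entirely on what your application considers acceptable.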
Use custom regex to validate LinkedIn URL
Based only on the samples you've shown, try the following regex:
^http[s]?:\/\/www\.linkedin\.com\/(in|pub|public-profile\/in|public-profile\/pub)\/([\w]{6}-[\w]{1,}-[\w]+)$
Explanation: Adding detailed explanation for above.
^http[s]?: ##Checking if URL starts from http OR https.
\/\/www\.linkedin\.com\/ ##Then checking if domain is www.linkedin.com
(in|pub|public-profile\/in|public-profile\/pub) ##Then checking if its followed by in OR pub OR public-profile/in OR public-profile/pub
\/([\w]{6}-[\w]{1,}-[\w]+)$ ##Checking if above is followed by / [\w] with 6 occurrences - [\w] with 1 or more occurrences and then [\w] with 1 or more occurrences.
NOTE: If the URL should only start with https, change ^http[s]? to ^https in the above regex.
NOTE2: The above creates 2 capturing groups. If you don't want any capturing groups, try the following.
^http[s]?:\/\/www\.linkedin\.com\/(?:in|pub|public-profile\/in|public-profile\/pub)\/(?:[\w]{6}-[\w]{1,}-[\w]+)$
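For use from Python, the non-capturing pattern could be wrapped roughly as follows. This is a sketch: the pattern is an equivalent spelling of the regex above (forward slashes need no escaping in Python, and [\w]{1,} is the same as \w+), while the function name is_linkedin_profile and the sample URLs are made up for illustration.

```python
import re

# Equivalent spelling of the non-capturing regex above.
LINKEDIN_PROFILE = re.compile(
    r"^https?://www\.linkedin\.com/"
    r"(?:in|pub|public-profile/in|public-profile/pub)/"
    r"(?:\w{6}-\w+-\w+)$"
)


def is_linkedin_profile(url):
    """Hypothetical helper: True if url fits the profile pattern."""
    return LINKEDIN_PROFILE.match(url) is not None


# Made-up sample URLs, not real profiles.
print(is_linkedin_profile("https://www.linkedin.com/in/abcdef-gh-123"))
print(is_linkedin_profile("https://linkedin.com/in/abcdef-gh-123"))
print(is_linkedin_profile("https://www.linkedin.com/in/abc-gh-123"))
```

The second URL lacks the www. prefix and the third has fewer than six characters before the first hyphen, so the pattern rejects both.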
How to validate a complete and valid url using Regex
This is pretty much a FAQ. You could simply try a search with [regex] +validate +url or just look at this answer: "What is the best regular expression to check if a string is a valid URL?"
How to validate a url in Python? (Malformed or not)
Django URL validation regex (source):
import re
regex = re.compile(
r'^(?:http|ftp)s?://' # http://, https://, ftp:// or ftps://
r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
r'localhost|' #localhost...
r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
r'(?::\d+)?' # optional port
r'(?:/?|[/?]\S+)$', re.IGNORECASE)
print(re.match(regex, "http://www.example.com") is not None) # True
print(re.match(regex, "example.com") is not None) # False
Regex: validate a URL path with no query params
The regex you've defined is a character class. Instead, try:
^\/[/.a-zA-Z0-9-]+$
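A quick way to try the suggested pattern from Python is sketched below. The sample paths are assumptions picked to show one accepted path and a few rejected ones (a query string, a missing leading slash, and a bare slash).

```python
import re

# The suggested pattern: a leading slash followed by one or more
# slashes, dots, ASCII letters, digits, or hyphens.
PATH_ONLY = re.compile(r"^/[/.a-zA-Z0-9-]+$")

for path in ("/docs/index.html", "/docs/index.html?page=2",
             "docs/index.html", "/"):
    print(path, bool(PATH_ONLY.match(path)))
```

Only the first path matches: the "?" is outside the character class, the third path has no leading slash, and a lone "/" has nothing for the "+" to consume.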
How can I match my regex for URL (example.net/directory) without HTTP, HTTPS and WWW?
Don't use a regex if you can avoid it; see if you can parse the URL with a dedicated library instead.
This will also help with other TLDs, such as .net, .org, .club.
>>> import urllib.parse
>>> urls = ("https://www.example.com/directory", "www.example.com/directory", "example.com/directory")
>>> for url in urls:
... print(urllib.parse.urlparse("http://" + url.split("//")[-1]))
...
ParseResult(scheme='http', netloc='www.example.com', path='/directory', params='', query='', fragment='')
ParseResult(scheme='http', netloc='www.example.com', path='/directory', params='', query='', fragment='')
ParseResult(scheme='http', netloc='example.com', path='/directory', params='', query='', fragment='')
To get just the top and second-level domain, you could just split() the netloc:
>>> urllib.parse.urlparse("http://whatever.example.com").netloc.split(".")[-2:]
['example', 'com']
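Putting the pieces together, one possible way to reduce a URL to the example.net/directory form from the question is sketched here. The function name strip_scheme_and_www is hypothetical, and dropping only a literal leading "www." is an assumption about what the asker wants.

```python
from urllib.parse import urlparse


def strip_scheme_and_www(url):
    """Hypothetical helper: return host + path without scheme or 'www.'."""
    # Same trick as above: force a scheme so urlparse fills in netloc.
    parsed = urlparse("http://" + url.split("//")[-1])
    host = parsed.netloc
    if host.startswith("www."):
        host = host[len("www."):]
    return host + parsed.path


for url in ("https://www.example.net/directory",
            "www.example.net/directory",
            "example.net/directory"):
    print(strip_scheme_and_www(url))
```

All three inputs reduce to the same string, which can then be compared or matched without caring how the original was written.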
Regular Expression for URL in python
One simple fix would be to just replace the pattern https?://\S+ with an empty string:
import re

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
# Remove every http:// or https:// URL up to the next whitespace character
output = re.sub(r'https?://\S+', '', article_example)
print(output)
This prints (two spaces remain where the URL was removed):
眼影盤長這樣  說真的 很不好拍
My pattern assumes that any non-whitespace characters which follow http:// or https:// are part of the URL.
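One consequence of that assumption is that punctuation directly after a URL gets swallowed too. The sketch below uses the same pattern with re.findall to show this, plus an optional cleanup step. The helper extract_urls and its default set of trailing characters are assumptions, and a real URL may legitimately end in one of them.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(text, trailing=".,;:!?)"):
    """Hypothetical helper: find URLs, then trim trailing punctuation."""
    return [match.rstrip(trailing) for match in URL_PATTERN.findall(text)]


sample = "See https://example.com/page, then http://example.org/a."
print(URL_PATTERN.findall(sample))  # raw matches keep the ',' and '.'
print(extract_urls(sample))         # trimmed matches
```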