How to Extract a Url from a String Using Python

Extracting a URL in Python

In response to the OP's edit, I hijacked Find Hyperlinks in Text using Python (twitter related) and came up with this:

import re

myString = "This is my tweet check it out http://example.com/blah"

print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))

How do you extract a url from a string using python?

There may be a few ways to do this, but the cleanest is to use a regex:

>>> import re
>>> myString = "This is a link http://www.google.com"
>>> print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
http://www.google.com

If there can be multiple links, you can use re.findall:

>>> myString = "These are the links http://www.google.com  and http://stackoverflow.com/questions/839994/extracting-a-url-in-python"
>>> print(re.findall(r'(https?://[^\s]+)', myString))
['http://www.google.com', 'http://stackoverflow.com/questions/839994/extracting-a-url-in-python']
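Note that re.search returns None when the string contains no URL, so calling .group() directly raises an AttributeError. A small guarded helper (first_url is an illustrative name, not from the original answers):

```python
import re

def first_url(text):
    """Return the first http(s) URL in text, or None if there is none."""
    match = re.search(r"(?P<url>https?://\S+)", text)
    return match.group("url") if match else None

print(first_url("check it out http://example.com/blah"))  # http://example.com/blah
print(first_url("no links here"))                          # None
```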

Extract all urls in a string with python3

Apart from what others mentioned, since you've asked for something that already exists, you might want to try URLExtract.

Apparently it tries to find any occurrence of a TLD in the given text. When a TLD is found, it expands the boundaries in both directions from that position, searching for a "stop character" (usually whitespace, a comma, or a single or double quote).

Here is a quick example:

from urlextract import URLExtract

extractor = URLExtract()
urls = extractor.find_urls("Let's have URL youfellasleepwhilewritingyourtitle.com as an example.")
print(urls)  # prints: ['youfellasleepwhilewritingyourtitle.com']

It seems that this module also has an update() method, which lets you refresh the TLD list cache file.

However, if that doesn't fit your specific requirements, you can manually run some checks after you've extracted the URLs with the module above (or any other way of parsing them). For example, say you get a list of URLs:

results = ['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org', 'http://news.bbc.co.uk']

You can then build other lists that hold the allowed protocols / TLDs / domains, etc.:

allowed_protocols = ['protocol_1', 'protocol_2']
allowed_tlds = ['tld_1', 'tld_2', 'tld_3']
allowed_domains = ['domain_1']

for each_url in results:
    ...  # here, check each url against your rules
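A minimal sketch of such a filter using only the standard library's urllib.parse; the allowed_protocols and allowed_tlds values here are illustrative, not from the original:

```python
from urllib.parse import urlparse

results = ['https://www.lorem.com/ipsum.php?q=suas',
           'https://www.lorem.org',
           'http://news.bbc.co.uk']

allowed_protocols = {'https'}
allowed_tlds = {'com', 'org'}

filtered = []
for each_url in results:
    parts = urlparse(each_url)
    tld = parts.hostname.rsplit('.', 1)[-1]  # naive TLD: text after the last dot
    if parts.scheme in allowed_protocols and tld in allowed_tlds:
        filtered.append(each_url)

print(filtered)  # ['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org']
```

The TLD check here is deliberately naive (it won't handle multi-part suffixes like co.uk); for real use, a public-suffix-aware library would be more robust.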

how to extract url from a string and save to a list

Are you just after the "next page" or do you want all the links?

so do you want just:

/jobs?q=software+engineer+&l=Kerala&start=10

or are you after all of these?

/jobs?q=software+engineer+&l=Kerala&start=10
/jobs?q=software+engineer+&l=Kerala&start=20
/jobs?q=software+engineer+&l=Kerala&start=30
/jobs?q=software+engineer+&l=Kerala&start=40
/jobs?q=software+engineer+&l=Kerala&start=10

A few issues:

  1. Links1 is a list of elements, and you are then calling .find('a') on that list, which won't work.
  2. Since you want href attributes, consider using find('a', href=True).

So here's how I would go about it:

import requests
from bs4 import BeautifulSoup

url = "https://in.indeed.com/jobs?q=software%20engineer%20&l=Kerala"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find_all("div",{"class":"pagination"})

url = [tag.find('a', href=True)['href'] for tag in Links1]
website = f'https://in.indeed.com{url[0]}'

Output:

print(website)
https://in.indeed.com/jobs?q=software+engineer+&l=Kerala&start=10

To get all those links:

import requests
from bs4 import BeautifulSoup

url = "https://in.indeed.com/jobs?q=software%20engineer%20&l=Kerala"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find("div",{"class":"pagination"})

urls = [tag['href'] for tag in Links1.find_all('a', href=True)]
websites = [f'https://in.indeed.com{u}' for u in urls]
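As a side note, urllib.parse.urljoin is a safer way to turn relative hrefs into absolute URLs than string concatenation; a small sketch with sample paths taken from the output above:

```python
from urllib.parse import urljoin

base = "https://in.indeed.com/jobs?q=software%20engineer%20&l=Kerala"
hrefs = ["/jobs?q=software+engineer+&l=Kerala&start=10",
         "/jobs?q=software+engineer+&l=Kerala&start=20"]

# urljoin resolves each href against the base URL's scheme and host
websites = [urljoin(base, href) for href in hrefs]
print(websites[0])  # https://in.indeed.com/jobs?q=software+engineer+&l=Kerala&start=10
```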

Use RegEx in Python to extract URL and optional query string from web server log data

You can use

^(?P<url>[^?]+)(?P<querystr>\?.*)?$

Details

  • ^ - start of string
  • (?P<url>[^?]+) - Group "url": any one or more chars other than ?
  • (?P<querystr>\?.*)? - an optional Group "querystr": a ? char followed by zero or more chars other than line-break chars, as many as possible
  • $ - end of string.

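Applied in Python (the sample log paths below are illustrative):

```python
import re

pattern = re.compile(r'^(?P<url>[^?]+)(?P<querystr>\?.*)?$')

m = pattern.match('/search/results?q=python&page=2')
print(m.group('url'))       # /search/results
print(m.group('querystr'))  # ?q=python&page=2

m = pattern.match('/index.html')
print(m.group('url'))       # /index.html
print(m.group('querystr'))  # None (the query string group is optional)
```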

Extracting URL link using regular expression re - string matching - Python

re.findall(r'https?://[^\s<>"]+|www\.[^\s<>"]+', STRING)

The [^\s<>"]+ part matches any non-whitespace, non-quote, non-angle-bracket character, to avoid matching strings like:

<a href="http://www.example.com/stuff">
http://www.example.com/stuff</br>
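For instance, on a small HTML snippet the pattern stops at the surrounding quote and angle-bracket characters:

```python
import re

html = '<a href="http://www.example.com/stuff">link</a> and www.example.org</br>'
urls = re.findall(r'https?://[^\s<>"]+|www\.[^\s<>"]+', html)
print(urls)  # ['http://www.example.com/stuff', 'www.example.org']
```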

