Parse All Links That Contain a Specific Word in "Href" Tag

You can do this by checking each link's href against the prefix with a simple condition:

<?php
// $urls is assumed to be a list of DOMElement anchor nodes,
// e.g. from DOMDocument::getElementsByTagName('a').
$lookfor = '/link:';

foreach ($urls as $url) {
    if (substr($url->getAttribute('href'), 0, strlen($lookfor)) == $lookfor) {
        echo "<br> " . $url->getAttribute('href') . " , " . $url->getAttribute('title');
        echo "<hr><br>";
    }
}
?>
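The same prefix check can be sketched in Python with BeautifulSoup; the HTML snippet and the /link: prefix below are illustrative, not from the original question:

```python
from bs4 import BeautifulSoup

html = '''<a href="/link:first" title="First">one</a>
<a href="/other" title="Other">two</a>
<a href="/link:second" title="Second">three</a>'''

lookfor = '/link:'
soup = BeautifulSoup(html, 'html.parser')

# Keep only anchors whose href starts with the prefix.
matches = [(a['href'], a.get('title'))
           for a in soup.find_all('a', href=True)
           if a['href'].startswith(lookfor)]
```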

Regex to extract hyperlink containing a specific word

Try this to match the whole <a> tag:

/<a [^>]*\bhref\s*=\s*"[^"]*SPECIFICWORD.*?<\/a>/

or just for the link (in the first capture group):

/<a [^>]*\bhref\s*=\s*"([^"]*SPECIFICWORD[^"]*)/

If you use PHP, to get just the link:

preg_match_all('/<a [^>]*\bhref\s*=\s*"\K[^"]*SPECIFICWORD[^"]*/', $text, $results);
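If you are in Python instead, the same idea works with re.findall; note that the \K escape is PCRE-only, so a capture group is used here (SPECIFICWORD and the sample HTML are placeholders):

```python
import re

text = '<a href="/docs/SPECIFICWORD/page">doc</a> <a href="/other">x</a>'

# Capture the href value when it contains the word.
links = re.findall(r'<a [^>]*\bhref\s*=\s*"([^"]*SPECIFICWORD[^"]*)"', text)
```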

How to get specific text hyperlinks in the home webpage by BeautifulSoup?

Here is something you can try. Note that there are more links containing the test article on the page you provided, but this shows the idea of how to deal with it.

In this case I just check whether the word article appears in the text of the tag. You could use a regex search there, but for this example it would be overkill.

import requests
from bs4 import BeautifulSoup

url = 'https://www.geeksforgeeks.org/'
res = requests.get(url)

if res.status_code != 200:
    raise RuntimeError('request failed')

soup = BeautifulSoup(res.content, "html.parser")

links_with_article = soup.find_all(lambda tag: tag.name == "a" and "article" in tag.text.lower())

EDIT:

If you know that there is a word in the href, i.e. in the link itself:

soup.select("a[href*=article]")

This will search for the word article in the href of all <a> elements.
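A minimal, self-contained sketch of that selector (the HTML is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '''<a href="/article/one">Article one</a>
<a href="/news/two">News two</a>
<a href="/article/three">Article three</a>'''

soup = BeautifulSoup(html, 'html.parser')

# Select every <a> whose href attribute contains "article".
article_links = soup.select('a[href*=article]')
hrefs = [a['href'] for a in article_links]
```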

Edit: get only href:

hrefs = [link.get('href') for link in links_with_article]

How to extract html links with a matching word from a website using python

You need to search for the word india in the displayed text. To do this you'll need a custom function instead:

from bs4 import BeautifulSoup
import requests

url = "http://www.bbc.com/news/world/asia/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

india_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                           'href' in tag.attrs and
                           'india' in tag.get_text().lower())
results = soup.find_all(india_links)

The india_links lambda finds all tags that are <a> links with an href attribute and contain india (case insensitive) somewhere in the displayed text.

Note that I used the requests response object .content attribute; leave decoding to BeautifulSoup!

Demo:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> url = "http://www.bbc.com/news/world/asia/"
>>> r = requests.get(url)
>>> soup = BeautifulSoup(r.content)
>>> india_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'href' in tag.attrs and 'india' in tag.get_text().lower()
>>> results = soup.find_all(india_links)
>>> from pprint import pprint
>>> pprint(results)
[<a href="/news/world/asia/india/">India</a>,
<a class="story" href="/news/world-asia-india-30647504" rel="published-1420102077277">India scheme to monitor toilet use </a>,
<a class="story" href="/news/world-asia-india-30640444" rel="published-1420022868334">India to scrap tax breaks on cars</a>,
<a class="story" href="/news/world-asia-india-30640436" rel="published-1420012598505">India shock over Dhoni retirement</a>,
<a href="/news/world/asia/india/">India</a>,
<a class="headline-anchor" href="/news/world-asia-india-30630274" rel="published-1419931669523"><img alt="A Delhi police officer with red flag walks amidst morning fog in Delhi, India, Monday, Dec 29, 2014. " src="http://news.bbcimg.co.uk/media/images/79979000/jpg/_79979280_79979240.jpg"/><span class="headline heading-13">India fog continues to cause chaos</span></a>,
<a class="headline-anchor" href="/news/world-asia-india-30632852" rel="published-1419940599384"><span class="headline heading-13">Court boost to India BJP chief</span></a>,
<a class="headline-anchor" href="/sport/0/cricket/30632182" rel="published-1419930930045"><span class="headline heading-13">India captain Dhoni quits Tests</span></a>,
<a class="story" href="http://www.bbc.co.uk/news/world-radio-and-tv-15386555" rel="published-1392018507550"><img alt="A woman riding a scooter waits for a traffic signal along a street in Mumbai February 5, 2014." src="http://news.bbcimg.co.uk/media/images/72866000/jpg/_72866856_020889093.jpg"/>Special report: India Direct</a>,
<a href="/2/hi/south_asia/country_profiles/1154019.stm">India</a>]

Note the http://www.bbc.co.uk/news/world-radio-and-tv-15386555 link here; we had to use the lambda search because a search with a text regular expression would not have found that element; the contained text (Special report: India Direct) is not the only element in the tag and thus would not be found.

A similar problem applies to the /news/world-asia-india-30632852 link; the nested <span> element means that the Court boost to India BJP chief headline text is not a direct child of the link tag.

You can extract just the links with:

from urllib.parse import urljoin

result_links = [urljoin(url, tag['href']) for tag in results]

where all relative URLs are resolved relative to the original URL:

>>> from urllib.parse import urljoin
>>> result_links = [urljoin(url, tag['href']) for tag in results]
>>> pprint(result_links)
['http://www.bbc.com/news/world/asia/india/',
'http://www.bbc.com/news/world-asia-india-30647504',
'http://www.bbc.com/news/world-asia-india-30640444',
'http://www.bbc.com/news/world-asia-india-30640436',
'http://www.bbc.com/news/world/asia/india/',
'http://www.bbc.com/news/world-asia-india-30630274',
'http://www.bbc.com/news/world-asia-india-30632852',
'http://www.bbc.com/sport/0/cricket/30632182',
'http://www.bbc.co.uk/news/world-radio-and-tv-15386555',
'http://www.bbc.com/2/hi/south_asia/country_profiles/1154019.stm']

Finding a link Element in Selenium which contains a specific word in its href with python

You can create an XPath or CSS expression to match web elements whose href contains the string ".exe":

# find_element_by_* was removed in Selenium 4; use the By locators instead
from selenium.webdriver.common.by import By

driver.find_element(By.XPATH, "//*[contains(@href,'.exe')]")
# or
driver.find_element(By.CSS_SELECTOR, "[href*='.exe']")

Beautiful Soup. How to get a link containing a specific word?

There are many ways you can do that. Try a CSS selector:

from bs4 import BeautifulSoup
html='''<div class="slide"><img src="xttps://site.com/files/r_1000,kljg894/43k5j/35h43jkl.jpg"></div>
<div class="slide"> <img src="xttps://site.com/files/r_2000,kljg894/43k5j/35h43jkl.jpg"></div>
<div class="slide"><img src="xttps://site.com/files/r_3000,kljg894/43k5j/35h43jkl.jpg"></div>'''
soup = BeautifulSoup(html, "html.parser")
for item in soup.select("img[src*='r_3000']"):
    print(item['src'])

Get all links that contains a specific word in it

Use

/(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[^\s/]*discord\S*/gi


Explanation

--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
http 'http'
--------------------------------------------------------------------------------
s? 's' (optional (matching the most
amount possible))
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
ftp 'ftp'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
file 'file'
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
: ':'
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
www 'www'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
ftp 'ftp'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
[^\s/]* any character except: whitespace (\n, \r,
\t, \f, and " "), '/' (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
discord 'discord'
--------------------------------------------------------------------------------
\S* non-whitespace (all but \n, \r, \t, \f,
and " ") (0 or more times (matching the
most amount possible))

