Parse All Links That Contain A Specific Word In href Tag
You can do this with a simple string-prefix condition on the href value.
<?php
// $urls is assumed to be a list of <a> DOM elements,
// e.g. from DOMDocument->getElementsByTagName('a').
$lookfor = '/link:';
foreach ($urls as $url) {
    $href = $url->getAttribute('href');
    if (substr($href, 0, strlen($lookfor)) == $lookfor) {
        echo "<br> " . $href . " , " . $url->getAttribute('title');
        echo "<hr><br>";
    }
}
?>
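The same prefix check translates to Python's standard library without any third-party parser; a minimal sketch (the sample HTML and the `/link:` prefix are invented for illustration):

```python
from html.parser import HTMLParser

class PrefixLinkParser(HTMLParser):
    """Collect (href, title) pairs whose href starts with a given prefix."""
    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        d = dict(attrs)
        href = d.get("href", "")
        if href.startswith(self.prefix):
            self.matches.append((href, d.get("title", "")))

html = '<a href="/link:one" title="One">1</a><a href="/other" title="X">2</a>'
parser = PrefixLinkParser("/link:")
parser.feed(html)
print(parser.matches)  # [('/link:one', 'One')]
```

`str.startswith` replaces the `substr(..., 0, strlen(...))` idiom from the PHP version.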
Regex to extract hyperlink containing a specific word
Try this to match the entire a tag:
/<a [^>]*\bhref\s*=\s*"[^"]*SPECIFICWORD.*?<\/a>/
or just for the link (in the first capture group):
/<a [^>]*\bhref\s*=\s*"([^"]*SPECIFICWORD[^"]*)/
If you use PHP, for the link:
preg_match_all('/<a [^>]*\bhref\s*=\s*"\K[^"]*SPECIFICWORD[^"]*/', $text, $results);
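The same extraction works in Python with `re.findall`; a quick sketch (the sample markup and the word SPECIFICWORD are placeholders):

```python
import re

text = '<a href="/docs/SPECIFICWORD/intro">A</a> <a href="/other">B</a>'
# Same idea as the PHP \K pattern: capture only the href value
links = re.findall(r'<a [^>]*\bhref\s*=\s*"([^"]*SPECIFICWORD[^"]*)"', text)
print(links)  # ['/docs/SPECIFICWORD/intro']
```

As with the PHP version, this is a quick-and-dirty approach; a real HTML parser is more robust against unusual markup.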
How to get specific text hyperlinks in the home webpage by BeautifulSoup?
Here is something you can try:
Note that there are more links containing the test word article in the page you provided, but this gives the idea of how you can deal with it. In this case I just checked whether the word article appears in the text of the tag. You could use a regex search there, but for this example it would be overkill.
import requests
from bs4 import BeautifulSoup

url = 'https://www.geeksforgeeks.org/'
res = requests.get(url)
if res.status_code != 200:
    raise RuntimeError('request failed')
soup = BeautifulSoup(res.content, "html.parser")
links_with_article = soup.find_all(lambda tag: tag.name == "a" and "article" in tag.text.lower())
EDIT:
If you know that there is a word in the href, i.e. in the link itself:
soup.select("a[href*=article]")
This will search for the word article in the href attribute of all a elements.
Edit: to get only the href values:
hrefs = [link.get('href') for link in links_with_article]
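The selector variant above can be tried on a small inline document (assuming bs4 is installed; the sample HTML is invented):

```python
from bs4 import BeautifulSoup

html = '''<a href="/article/1">First</a>
<a href="/news/2">Second</a>
<a href="/article/3">Third</a>'''

soup = BeautifulSoup(html, "html.parser")
# *= means "attribute value contains the substring"
hrefs = [a["href"] for a in soup.select("a[href*=article]")]
print(hrefs)  # ['/article/1', '/article/3']
```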
How to extract html links with a matching word from a website using python
You need to search for the word india
in the displayed text. To do this you'll need a custom function instead:
from bs4 import BeautifulSoup
import requests
url = "http://www.bbc.com/news/world/asia/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
india_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
'href' in tag.attrs and
'india' in tag.get_text().lower())
results = soup.find_all(india_links)
The india_links lambda finds all tags that are <a> links with an href attribute and contain india (case-insensitive) somewhere in the displayed text.
Note that I used the requests response object's .content attribute; leave decoding to BeautifulSoup!
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> url = "http://www.bbc.com/news/world/asia/"
>>> r = requests.get(url)
>>> soup = BeautifulSoup(r.content, "html.parser")
>>> india_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'href' in tag.attrs and 'india' in tag.get_text().lower()
>>> results = soup.find_all(india_links)
>>> from pprint import pprint
>>> pprint(results)
[<a href="/news/world/asia/india/">India</a>,
<a class="story" href="/news/world-asia-india-30647504" rel="published-1420102077277">India scheme to monitor toilet use </a>,
<a class="story" href="/news/world-asia-india-30640444" rel="published-1420022868334">India to scrap tax breaks on cars</a>,
<a class="story" href="/news/world-asia-india-30640436" rel="published-1420012598505">India shock over Dhoni retirement</a>,
<a href="/news/world/asia/india/">India</a>,
<a class="headline-anchor" href="/news/world-asia-india-30630274" rel="published-1419931669523"><img alt="A Delhi police officer with red flag walks amidst morning fog in Delhi, India, Monday, Dec 29, 2014. " src="http://news.bbcimg.co.uk/media/images/79979000/jpg/_79979280_79979240.jpg"/><span class="headline heading-13">India fog continues to cause chaos</span></a>,
<a class="headline-anchor" href="/news/world-asia-india-30632852" rel="published-1419940599384"><span class="headline heading-13">Court boost to India BJP chief</span></a>,
<a class="headline-anchor" href="/sport/0/cricket/30632182" rel="published-1419930930045"><span class="headline heading-13">India captain Dhoni quits Tests</span></a>,
<a class="story" href="http://www.bbc.co.uk/news/world-radio-and-tv-15386555" rel="published-1392018507550"><img alt="A woman riding a scooter waits for a traffic signal along a street in Mumbai February 5, 2014." src="http://news.bbcimg.co.uk/media/images/72866000/jpg/_72866856_020889093.jpg"/>Special report: India Direct</a>,
<a href="/2/hi/south_asia/country_profiles/1154019.stm">India</a>]
Note the http://www.bbc.co.uk/news/world-radio-and-tv-15386555 link here; we had to use the lambda search because a search with a text regular expression would not have found that element: the contained text (Special report: India Direct) is not the only child of the tag and thus would not be matched. A similar problem applies to the /news/world-asia-india-30632852 link; the nested <span> element means the Court boost to India BJP chief headline text is not a direct child of the link tag.
You can extract just the links with:
from urllib.parse import urljoin
result_links = [urljoin(url, tag['href']) for tag in results]
where all relative URLs are resolved relative to the original URL:
>>> from urllib.parse import urljoin
>>> result_links = [urljoin(url, tag['href']) for tag in results]
>>> pprint(result_links)
['http://www.bbc.com/news/world/asia/india/',
'http://www.bbc.com/news/world-asia-india-30647504',
'http://www.bbc.com/news/world-asia-india-30640444',
'http://www.bbc.com/news/world-asia-india-30640436',
'http://www.bbc.com/news/world/asia/india/',
'http://www.bbc.com/news/world-asia-india-30630274',
'http://www.bbc.com/news/world-asia-india-30632852',
'http://www.bbc.com/sport/0/cricket/30632182',
'http://www.bbc.co.uk/news/world-radio-and-tv-15386555',
'http://www.bbc.com/2/hi/south_asia/country_profiles/1154019.stm']
Finding a link Element in Selenium which contains a specific word in its href with python
You can create an XPath or CSS selector expression to match web elements whose href contains the string ".exe":
driver.find_element_by_xpath("//*[contains(@href,'.exe')]")
# or
driver.find_element_by_css_selector("[href*='.exe']")
Note that in Selenium 4 and later the find_element_by_* helpers are removed; use driver.find_element(By.XPATH, ...) or driver.find_element(By.CSS_SELECTOR, ...) instead.
Beautiful Soup. How to get a link containing a specific word?
There are many ways you can do that. One option is a CSS selector:
from bs4 import BeautifulSoup

html = '''<div class="slide"><img src="xttps://site.com/files/r_1000,kljg894/43k5j/35h43jkl.jpg"></div>
<div class="slide"> <img src="xttps://site.com/files/r_2000,kljg894/43k5j/35h43jkl.jpg"></div>
<div class="slide"><img src="xttps://site.com/files/r_3000,kljg894/43k5j/35h43jkl.jpg"></div>'''

soup = BeautifulSoup(html, "html.parser")
for item in soup.select("img[src*='r_3000']"):
    print(item['src'])
Get all links that contains a specific word in it
Use
/(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[^\s/]*discord\S*/gi
Explanation
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
http 'http'
--------------------------------------------------------------------------------
s? 's' (optional (matching the most
amount possible))
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
ftp 'ftp'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
file 'file'
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
: ':'
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
www 'www'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
ftp 'ftp'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
[^\s/]* any character except: whitespace (\n, \r,
\t, \f, and " "), '/' (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
discord 'discord'
--------------------------------------------------------------------------------
\S* non-whitespace (all but \n, \r, \t, \f,
and " ") (0 or more times (matching the
most amount possible))
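Applied in Python, the pattern pulls every Discord link out of free text; a minimal sketch (the sample text is invented):

```python
import re

# Same pattern as above, with the /gi flags expressed via re.IGNORECASE
pattern = re.compile(
    r'(?:(?:https?|ftp|file)://|www\.|ftp\.)[^\s/]*discord\S*',
    re.IGNORECASE,
)

text = "Join https://discord.gg/abc123 or see www.discord.com/invite today"
print(pattern.findall(text))  # ['https://discord.gg/abc123', 'www.discord.com/invite']
```

Because all the groups are non-capturing, findall returns the whole match rather than a sub-group.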