Get All Links on HTML Page

How do I get all the links on an HTML page?

I'd look at using the Html Agility Pack.

Here's an example, straight from their examples page, of how to find all the links in a page:

HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(/* url */);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    // Each matched node is an <a> element with an href attribute.
    Console.WriteLine(link.Attributes["href"].Value);
}

How do I get all the links on a web page using requests-html?

You can get all the internal and external links using the code below. The code converts all relative links to absolute links.

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://en.wikipedia.org/wiki')

# absolute_links resolves every relative href against the page URL.
links = response.html.absolute_links
print(links)
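
If you want the hrefs exactly as they appear in the markup, requests-html also exposes them unresolved; a minimal sketch comparing the two properties:

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://en.wikipedia.org/wiki')

# .links returns hrefs as written in the page (possibly relative),
# while .absolute_links resolves each one against the page URL.
print(len(response.html.links), len(response.html.absolute_links))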

How to extract links and titles from a .html page?

Thank you everyone, I GOT IT!

The final code:

$html = file_get_contents('bookmarks.html');

//Create a new DOM document
$dom = new DOMDocument;

//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);

//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');

//Iterate over the extracted links and display their URLs
foreach ($links as $link) {
    //Show the anchor text, then the "href" attribute.
    echo $link->nodeValue;
    echo $link->getAttribute('href'), '<br>';
}

This prints the anchor text and the href for every link in the .html file.

Again, thanks a lot.

Extract all links from web page

You can do this using Ruby's built-in URI class. Look at the extract method.

It's not as smart as what you could write using Nokogiri and looking in anchors, images, scripts, onclick handlers, etc., but it's a good and fast starting point.

For instance, looking at the content of this question's page:

require 'open-uri'
require 'uri'

URI.extract(open('http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page/21069456#21069456').read).grep(/^https?:/)
# => ["http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png?v=fde65a5a78c6",
# "http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page",
# "http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page",
# "https://stackauth.com",
# "http://chat.stackoverflow.com",
# "http://blog.stackexchange.com",
# "http://schema.org/Article",
# "http://stackoverflow.com/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
# "http://i.stack.imgur.com/IgtEd.jpg?s=32&g=1",
# "http://www.ruby-doc.org/stdlib-2.1.0/libdoc/uri/rdoc/URI.html#method-c-extract",
# "https://www.gravatar.com/avatar/71770d043c0f7e3c7bc5f74190015c26?s=32&d=identicon&r=PG",
# "http://stackexchange.com/legal/privacy-policy'",
# "http://stackexchange.com/legal/terms-of-service'",
# "http://superuser.com/questions/698312/if-32-bit-machines-can-only-handle-numbers-up-to-232-why-can-i-write-100000000",
# "http://scifi.stackexchange.com/questions/47868/why-did-smeagol-become-addicted-to-the-ring-when-bilbo-did-not",
# "http://english.stackexchange.com/questions/145672/idiom-for-trying-and-failing-falling-short-and-being-disapproved",
# "http://math.stackexchange.com/questions/634191/are-the-integers-closed-under-addition-really",
# "http://codegolf.stackexchange.com/questions/18254/how-to-write-a-c-program-for-multiplication-without-using-and-operator",
# "http://tex.stackexchange.com/questions/153563/how-to-align-terms-in-alignat-environment",
# "http://rpg.stackexchange.com/questions/31426/how-do-have-interesting-events-happen-after-a-success",
# "http://math.stackexchange.com/questions/630339/pedagogy-how-to-cure-students-of-the-law-of-universal-linearity",
# "http://codegolf.stackexchange.com/questions/17005/produce-the-number-2014-without-any-numbers-in-your-source-code",
# "http://academia.stackexchange.com/questions/15595/why-are-so-many-badly-written-papers-still-published",
# "http://tex.stackexchange.com/questions/153598/how-to-draw-empty-nodes-in-tikz-qtree",
# "http://english.stackexchange.com/questions/145157/a-formal-way-to-say-i-dont-want-to-sound-too-cocky",
# "http://physics.stackexchange.com/questions/93256/is-it-possible-to-split-baryons-and-extract-useable-energy-out-of-it",
# "http://mathematica.stackexchange.com/questions/40213/counting-false-values-at-the-ends-of-a-list",
# "http://electronics.stackexchange.com/questions/96139/difference-between-a-bus-and-a-wire",
# "http://aviation.stackexchange.com/questions/921/why-do-some-aircraft-have-multiple-ailerons-per-wing",
# "http://stackoverflow.com/questions/21052437/are-these-two-lines-the-same-vs",
# "http://biology.stackexchange.com/questions/14414/if-there-are-no-human-races-why-do-human-populations-have-several-distinct-phen",
# "http://programmers.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems",
# "http://codegolf.stackexchange.com/questions/18028/largest-number-printable",
# "http://unix.stackexchange.com/questions/108858/seek-argument-in-command-dd",
# "http://linguistics.stackexchange.com/questions/6375/can-the-chinese-script-be-used-to-record-non-chinese-languages",
# "http://rpg.stackexchange.com/questions/31346/techniques-for-making-undead-scary-again",
# "http://math.stackexchange.com/questions/632705/why-are-mathematical-proofs-that-rely-on-computers-controversial",
# "http://blog.stackexchange.com?blb=1",
# "http://chat.stackoverflow.com",
# "http://data.stackexchange.com",
# "http://stackexchange.com/legal",
# "http://stackexchange.com/legal/privacy-policy",
# "http://stackexchange.com/about/hiring",
# "http://engine.adzerk.net/r?e=eyJhdiI6NDE0LCJhdCI6MjAsImNtIjo5NTQsImNoIjoxMTc4LCJjciI6Mjc3NiwiZG0iOjQsImZjIjoyODYyLCJmbCI6Mjc1MSwibnciOjIyLCJydiI6MCwicHIiOjExNSwic3QiOjAsInVyIjoiaHR0cDovL3N0YWNrb3ZlcmZsb3cuY29tL2Fib3V0L2NvbnRhY3QiLCJyZSI6MX0&s=hRods5B22XvRBwWIwtIMekcyNF8",
# "http://meta.stackoverflow.com",
# "http://stackoverflow.com",
# "http://serverfault.com",
# "http://superuser.com",
# "http://webapps.stackexchange.com",
# "http://askubuntu.com",
# "http://webmasters.stackexchange.com",
# "http://gamedev.stackexchange.com",
# "http://tex.stackexchange.com",
# "http://programmers.stackexchange.com",
# "http://unix.stackexchange.com",
# "http://apple.stackexchange.com",
# "http://wordpress.stackexchange.com",
# "http://gis.stackexchange.com",
# "http://electronics.stackexchange.com",
# "http://android.stackexchange.com",
# "http://security.stackexchange.com",
# "http://dba.stackexchange.com",
# "http://drupal.stackexchange.com",
# "http://sharepoint.stackexchange.com",
# "http://ux.stackexchange.com",
# "http://mathematica.stackexchange.com",
# "http://stackexchange.com/sites#technology",
# "http://photo.stackexchange.com",
# "http://scifi.stackexchange.com",
# "http://cooking.stackexchange.com",
# "http://diy.stackexchange.com",
# "http://stackexchange.com/sites#lifearts",
# "http://english.stackexchange.com",
# "http://skeptics.stackexchange.com",
# "http://judaism.stackexchange.com",
# "http://travel.stackexchange.com",
# "http://christianity.stackexchange.com",
# "http://gaming.stackexchange.com",
# "http://bicycles.stackexchange.com",
# "http://rpg.stackexchange.com",
# "http://stackexchange.com/sites#culturerecreation",
# "http://math.stackexchange.com",
# "http://stats.stackexchange.com",
# "http://cstheory.stackexchange.com",
# "http://physics.stackexchange.com",
# "http://mathoverflow.net",
# "http://stackexchange.com/sites#science",
# "http://stackapps.com",
# "http://meta.stackoverflow.com",
# "http://area51.stackexchange.com",
# "http://careers.stackoverflow.com",
# "http://creativecommons.org/licenses/by-sa/3.0/",
# "http://blog.stackoverflow.com/2009/06/attribution-required/",
# "http://creativecommons.org/licenses/by-sa/3.0/",
# "http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif",
# "https:",
# "https:'==document.location.protocol,",
# "https://ssl",
# "http://www",
# "https://secure",
# "http://edge",
# "https:",
# "https://sb",
# "http://b"]

URI.extract returns a lot of other entries as well; the grep call filters the list down to the entries matching a simple /^https?:/ pattern.

A simple starting point with Nokogiri is:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open('http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page/21069456#21069456').read)
urls = doc.search('a, img').map { |tag|
  case tag.name.downcase
  when 'a'
    tag['href']
  when 'img'
    tag['src']
  end
}

urls
# => ["//stackexchange.com/sites",
# "http://chat.stackoverflow.com",
# "http://blog.stackexchange.com",
# "//stackoverflow.com",
# "//meta.stackoverflow.com",
# "//careers.stackoverflow.com",
# "//stackexchange.com",
# "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%2f21069456",
# "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%2f21069456",
# "/tour",
# "/help",
# "//careers.stackoverflow.com",
# "/",
# "/questions",
# "/tags",
# "/about",
# "/users",
# "/questions/ask",
# "/about",
# nil,
# "/questions/21069348/extract-all-links-from-web-page",
# nil,
# nil,
# "#",
# "http://stackoverflow.com/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
# "/questions/tagged/html",
# "/questions/tagged/ruby-on-rails",
# "/questions/tagged/ruby",
# "/questions/tagged/regex",
# "/questions/tagged/hyperlink",
# "/q/21069348",
# "/posts/21069348/edit",
# "/users/2886945/ivan-denisov",
# "/users/2886945/ivan-denisov",
# "/users/2767755/arup-rakshit",
# "/users/2886945/ivan-denisov",
# nil,
# nil,
# "/questions/21069348/extract-all-links-from-web-page?answertab=active#tab-top",
# "/questions/21069348/extract-all-links-from-web-page?answertab=oldest#tab-top",
# "/questions/21069348/extract-all-links-from-web-page?answertab=votes#tab-top",
# nil,
# nil,
# nil,
# "http://www.ruby-doc.org/stdlib-2.1.0/libdoc/uri/rdoc/URI.html#method-c-extract",
# "/a/21069456",
# "/posts/21069456/revisions",
# "/users/128421/the-tin-man",
# "/users/128421/the-tin-man",
# nil,
# nil,
# nil,
# nil,
# "http://regex101.com/r/hN4dI0",
# "/a/21069536",
# "/users/1214800/r3mus",
# "/users/1214800/r3mus",
# nil,
# nil,
# "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%23new-answer",
# "#",
# "http://stackexchange.com/legal/privacy-policy",
# "http://stackexchange.com/legal/terms-of-service",
# "/questions/tagged/html",
# "/questions/tagged/ruby-on-rails",
# "/questions/tagged/ruby",
# "/questions/tagged/regex",
# "/questions/tagged/hyperlink",
# "/questions/ask",
# "/questions/tagged/html",
# "/questions/tagged/ruby-on-rails",
# "/questions/tagged/ruby",
# "/questions/tagged/regex",
# "/questions/tagged/hyperlink",
# "?lastactivity",
# "/q/21052437",
# "/questions/21052437/are-these-two-lines-the-same-vs",
# "/q/6700367",
# "/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
# "/q/430966",
# "/questions/430966/regex-for-links-in-html-text",
# "/q/3703712",
# "/questions/3703712/extract-all-links-from-a-html-page-exclude-links-from-a-specific-table",
# "/q/5120171",
# "/questions/5120171/extract-links-from-a-web-page",
# "/q/6816138",
# "/questions/6816138/extract-absolute-links-from-a-page-uisng-htmlparser",
# "/q/10177910",
# "/questions/10177910/php-regular-expression-extracting-html-links",
# "/q/10217857",
# "/questions/10217857/extracting-background-images-from-a-web-page-parsing-htmlcss",
# "/q/11300496",
# "/questions/11300496/how-to-extract-a-link-from-head-tag-of-a-remote-page-using-curl",
# "/q/11307491",
# "/questions/11307491/how-to-extract-all-links-on-a-page-using-crawler4j",
# "/q/17712493",
# "/questions/17712493/extract-links-from-bbcode-with-ruby",
# "/q/20290869",
# "/questions/20290869/strip-away-html-tags-from-extracted-links",
# "//stackexchange.com/questions?tab=hot",
# "http://superuser.com/questions/698312/if-32-bit-machines-can-only-handle-numbers-up-to-232-why-can-i-write-100000000",
# "http://scifi.stackexchange.com/questions/47868/why-did-smeagol-become-addicted-to-the-ring-when-bilbo-did-not",
# "http://english.stackexchange.com/questions/145672/idiom-for-trying-and-failing-falling-short-and-being-disapproved",
# "http://math.stackexchange.com/questions/634191/are-the-integers-closed-under-addition-really",
# "http://codegolf.stackexchange.com/questions/18254/how-to-write-a-c-program-for-multiplication-without-using-and-operator",
# "http://tex.stackexchange.com/questions/153563/how-to-align-terms-in-alignat-environment",
# "http://rpg.stackexchange.com/questions/31426/how-do-have-interesting-events-happen-after-a-success",
# "http://math.stackexchange.com/questions/630339/pedagogy-how-to-cure-students-of-the-law-of-universal-linearity",
# "http://codegolf.stackexchange.com/questions/17005/produce-the-number-2014-without-any-numbers-in-your-source-code",
# "http://academia.stackexchange.com/questions/15595/why-are-so-many-badly-written-papers-still-published",
# "http://tex.stackexchange.com/questions/153598/how-to-draw-empty-nodes-in-tikz-qtree",
# "http://english.stackexchange.com/questions/145157/a-formal-way-to-say-i-dont-want-to-sound-too-cocky",
# "http://physics.stackexchange.com/questions/93256/is-it-possible-to-split-baryons-and-extract-useable-energy-out-of-it",
# "http://mathematica.stackexchange.com/questions/40213/counting-false-values-at-the-ends-of-a-list",
# "http://electronics.stackexchange.com/questions/96139/difference-between-a-bus-and-a-wire",
# "http://aviation.stackexchange.com/questions/921/why-do-some-aircraft-have-multiple-ailerons-per-wing",
# "http://stackoverflow.com/questions/21052437/are-these-two-lines-the-same-vs",
# "http://biology.stackexchange.com/questions/14414/if-there-are-no-human-races-why-do-human-populations-have-several-distinct-phen",
# "http://programmers.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems",
# "http://codegolf.stackexchange.com/questions/18028/largest-number-printable",
# "http://unix.stackexchange.com/questions/108858/seek-argument-in-command-dd",
# "http://linguistics.stackexchange.com/questions/6375/can-the-chinese-script-be-used-to-record-non-chinese-languages",
# "http://rpg.stackexchange.com/questions/31346/techniques-for-making-undead-scary-again",
# "http://math.stackexchange.com/questions/632705/why-are-mathematical-proofs-that-rely-on-computers-controversial",
# "#",
# "/feeds/question/21069348",
# "/about",
# "/help",
# "/help/badges",
# "http://blog.stackexchange.com?blb=1",
# "http://chat.stackoverflow.com",
# "http://data.stackexchange.com",
# "http://stackexchange.com/legal",
# "http://stackexchange.com/legal/privacy-policy",
# "http://stackexchange.com/about/hiring",
# "http://engine.adzerk.net/r?e=eyJhdiI6NDE0LCJhdCI6MjAsImNtIjo5NTQsImNoIjoxMTc4LCJjciI6Mjc3NiwiZG0iOjQsImZjIjoyODYyLCJmbCI6Mjc1MSwibnciOjIyLCJydiI6MCwicHIiOjExNSwic3QiOjAsInVyIjoiaHR0cDovL3N0YWNrb3ZlcmZsb3cuY29tL2Fib3V0L2NvbnRhY3QiLCJyZSI6MX0&s=hRods5B22XvRBwWIwtIMekcyNF8",
# nil,
# "/contact",
# "http://meta.stackoverflow.com",
# "http://stackoverflow.com",
# "http://serverfault.com",
# "http://superuser.com",
# "http://webapps.stackexchange.com",
# "http://askubuntu.com",
# "http://webmasters.stackexchange.com",
# "http://gamedev.stackexchange.com",
# "http://tex.stackexchange.com",
# "http://programmers.stackexchange.com",
# "http://unix.stackexchange.com",
# "http://apple.stackexchange.com",
# "http://wordpress.stackexchange.com",
# "http://gis.stackexchange.com",
# "http://electronics.stackexchange.com",
# "http://android.stackexchange.com",
# "http://security.stackexchange.com",
# "http://dba.stackexchange.com",
# "http://drupal.stackexchange.com",
# "http://sharepoint.stackexchange.com",
# "http://ux.stackexchange.com",
# "http://mathematica.stackexchange.com",
# "http://stackexchange.com/sites#technology",
# "http://photo.stackexchange.com",
# "http://scifi.stackexchange.com",
# "http://cooking.stackexchange.com",
# "http://diy.stackexchange.com",
# "http://stackexchange.com/sites#lifearts",
# "http://english.stackexchange.com",
# "http://skeptics.stackexchange.com",
# "http://judaism.stackexchange.com",
# "http://travel.stackexchange.com",
# "http://christianity.stackexchange.com",
# "http://gaming.stackexchange.com",
# "http://bicycles.stackexchange.com",
# "http://rpg.stackexchange.com",
# "http://stackexchange.com/sites#culturerecreation",
# "http://math.stackexchange.com",
# "http://stats.stackexchange.com",
# "http://cstheory.stackexchange.com",
# "http://physics.stackexchange.com",
# "http://mathoverflow.net",
# "http://stackexchange.com/sites#science",
# "http://stackapps.com",
# "http://meta.stackoverflow.com",
# "http://area51.stackexchange.com",
# "http://careers.stackoverflow.com",
# "http://creativecommons.org/licenses/by-sa/3.0/",
# "http://blog.stackoverflow.com/2009/06/attribution-required/",
# "http://creativecommons.org/licenses/by-sa/3.0/",
# "http://i.stack.imgur.com/IgtEd.jpg?s=32&g=1",
# "https://www.gravatar.com/avatar/71770d043c0f7e3c7bc5f74190015c26?s=32&d=identicon&r=PG",
# "http://i.stack.imgur.com/fmgha.jpg?s=32&g=1",
# "/posts/21069348/ivc/8228",
# "http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif"]

That uses a case statement to apply a bit of "smarts": it knows which attribute to read for each particular type of tag. More work would be needed for a complete scrape, since an anchor can carry its target in an onclick handler, and other tags can be wired up through JavaScript events.
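
To illustrate that last point, here is a small sketch (in Python with BeautifulSoup, since the idea is language-independent; the markup is made up for illustration) of digging URLs out of onclick handlers:

from bs4 import BeautifulSoup
import re

html = '<a onclick="window.open(\'https://example.com/popup\')">open</a>'
soup = BeautifulSoup(html, 'html.parser')

# Pull URLs out of inline onclick handlers.
url_re = re.compile(r'https?://[^\'")\s]+')
for tag in soup.find_all(attrs={'onclick': True}):
    print(url_re.findall(tag['onclick']))
# ['https://example.com/popup']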

Get all links from HTML page using regex

You need to use the global modifier /g to get multiple matches with RegExp#exec.

Besides, since your input is HTML code, you need to make sure the pattern does not grab < the way a bare \S would:

/(?:ht|f)tps?:\/\/[-a-zA-Z0-9.]+\.[a-zA-Z]{2,3}(\/[^"<]*)?/g

If for some reason this pattern does not match equal signs, add it as an alternative:

/(?:ht|f)tps?:\/\/[-a-zA-Z0-9.]+\.[a-zA-Z]{2,3}(?:\/(?:[^"<=]|=)*)?/g

In most cases, though, the first pattern should do.
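
The same pattern carries over to other languages. As a quick illustration, here is the first pattern in Python, where re.findall is inherently "global" (the sample HTML is made up for illustration):

import re

# Same pattern as above, with the group made non-capturing so
# re.findall returns whole matches rather than group contents.
LINK_RE = re.compile(r'(?:ht|f)tps?://[-a-zA-Z0-9.]+\.[a-zA-Z]{2,3}(?:/[^"<]*)?')

html = '<a href="https://example.com/page">x</a> <a href="ftp://files.example.org/">y</a>'
print(LINK_RE.findall(html))
# ['https://example.com/page', 'ftp://files.example.org/']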

How to get all links from a website with puppeteer

It is possible to get all links from a URL using only node.js, without puppeteer:

There are two main steps:

  1. Get the source code for the URL.
  2. Parse the source code for links.

Simple implementation in node.js:

// get-links.js

///
/// Step 1: Request the URL's HTML source.
///

const axios = require('axios');
const promise = axios.get('https://www.nike.com');

// Extract the HTML source from the response, then process it:
promise.then(function (response) {
    const htmlSource = response.data;
    getLinksFromHtml(htmlSource);
});

///
/// Step 2: Find links in the HTML source.
///

// This function takes HTML (as a string) and returns all the links within it.
function getLinksFromHtml(htmlString) {
    // Regular expression that matches the syntax of a link (https://stackoverflow.com/a/3809435/117030):
    const LINK_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/gi;

    // Use the regular expression from above to find all the links:
    const matches = htmlString.match(LINK_REGEX);

    // Output to console:
    console.log(matches);

    // Alternatively, return the array of links for further processing:
    return matches;
}

Sample usage:

$ node get-links.js
[
'http://www.w3.org/2000/svg',
...
'https://s3.nikecdn.com/unite/scripts/unite.min.js',
'https://www.nike.com/android-icon-192x192.png',
...
'https://connect.facebook.net/',
... 658 more items
]

Notes:

  • I used the axios library for simplicity and to avoid "access denied" errors from nike.com. It is possible to use any other method to get the HTML source, such as:
    • The native node.js http/https libraries
    • Puppeteer (see the question "Get complete web page source html with puppeteer - but some part always missing")

How to extract all links from a website using python

The site blocks Python bots:

<h1>Access denied</h1>
<p>This website is using a security service to protect itself from online attacks.</p>

You can try adding a user agent to your code, like below:

import requests
from bs4 import BeautifulSoup

# Present a browser-like User-Agent so the request isn't rejected.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
web = requests.get("https://www.jarir.com/", headers=headers)
soup = BeautifulSoup(web.text, 'html.parser')

# href=True skips anchors that have no href attribute.
for link in soup.findAll('a', href=True):
    print(link['href'])

The output is something like:

https://www.jarir.com/wishlist/
https://www.jarir.com/sales/order/history/
https://www.jarir.com/afs/
https://www.jarir.com/contacts/
tel:+966920000089
/cdn-cgi/l/email-protection#6300021106230902110a114d000c0e
https://www.jarir.com/faq/
https://www.jarir.com/warranty_policy/
https://www.jarir.com/return_exchange/
https://www.jarir.com/contacts/
https://www.jarir.com/terms-of-service/
https://www.jarir.com/privacy-policy/
https://www.jarir.com/storelocator/
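
Note that the output mixes absolute URLs with relative paths (/cdn-cgi/...) and tel: links. If you only want absolute http(s) URLs, here is a small follow-up sketch; treating those as the links worth keeping is my assumption, not part of the original answer:

from urllib.parse import urljoin, urlparse

base = 'https://www.jarir.com/'
raw_links = ['https://www.jarir.com/wishlist/', '/cdn-cgi/l/email-protection', 'tel:+966920000089']

absolute = []
for href in raw_links:
    full = urljoin(base, href)  # resolve relative paths against the base URL
    if urlparse(full).scheme in ('http', 'https'):  # drop tel:, mailto:, etc.
        absolute.append(full)

print(absolute)
# ['https://www.jarir.com/wishlist/', 'https://www.jarir.com/cdn-cgi/l/email-protection']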

Not getting all links from webpage

You have to scroll through the page and reach the end in order to run all of its scripts. Opening the website only loads the scripts needed to render the section currently in view, so when you ran your code it could only retrieve data from the scripts that had already been loaded.

This one gave me 160 links:

from time import sleep
from selenium import webdriver

driver = webdriver.Chrome()  # assumes a local Chrome/chromedriver setup
driver.get('https://www.beliani.de/sofas/ledersofa/')
sleep(3)

# Gets the whole height of the document.
height = driver.execute_script('return document.body.scrollHeight')

# Now break the webpage into parts so that each section in the page is scrolled through and loaded.
scroll_height = 0
for i in range(10):
    scroll_height = scroll_height + (height / 10)
    driver.execute_script('window.scrollTo(0, arguments[0]);', scroll_height)
    sleep(2)

# I have used the 'class' locator; you can use anything you want once the loop has completed.
a_tags = driver.find_elements_by_class_name('itemBox')
count = 0
for i in a_tags:
    if i.get_attribute('href') is not None:
        print(i.get_attribute('href'))
        count += 1

print(count)
driver.quit()
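
If the fixed ten-step scroll still misses items, a common variant (my own suggestion, not from the original answer) is to keep scrolling until the page height stops growing:

from time import sleep

def scroll_to_bottom(driver, pause=2, max_rounds=30):
    # Scroll until document.body.scrollHeight stops increasing.
    last_height = driver.execute_script('return document.body.scrollHeight')
    for _ in range(max_rounds):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        sleep(pause)
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height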

