How to Extract Links and Titles from a .HTML Page

How can I extract links and titles from a .html page?

Thank you everyone, I GOT IT!

The final code:

$html = file_get_contents('bookmarks.html');

// Create a new DOM document
$dom = new DOMDocument;

// Parse the HTML. The @ suppresses any parsing errors
// that would be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);

// Get all links. You could also use any other tag name here,
// like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');

// Iterate over the extracted links and display their URLs
foreach ($links as $link) {
    // Show the anchor text, then the "href" attribute.
    echo $link->nodeValue;
    echo $link->getAttribute('href'), '<br>';
}

This prints the anchor text and the href of every link in a .html file.
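The same idea can be sketched in Python using nothing but the standard library's html.parser; the HTML string below is a made-up stand-in for a bookmarks file:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []           # list of (href, text) tuples
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._current_href = dict(attrs).get('href')
            self._text_parts = []

    def handle_data(self, data):
        # Only collect text while inside an <a> tag
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._current_href is not None:
            self.links.append((self._current_href, ''.join(self._text_parts)))
            self._current_href = None

html = '<a href="https://example.com/a">First</a> <a href="https://example.com/b">Second</a>'
parser = LinkExtractor()
parser.feed(html)
for href, text in parser.links:
    print(text, href)
```

For a real bookmarks file you would read the string with `open('bookmarks.html').read()` instead of the inline sample.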

Again, thanks a lot.

Want to extract links and titles from a certain website with lxml and Python but can't

The reason this does not work is that the site you're trying to fetch uses JavaScript to generate its results, which means you need a tool that actually executes JavaScript, such as Selenium, if you want to scrape the rendered HTML. Static fetch-and-parse libraries like lxml and BeautifulSoup simply cannot run JavaScript, so they only ever see the initial, unpopulated markup.

How to extract links from a page using Beautiful soup

link = i.find('a', href=True) does not always return an anchor tag; it may return None. So you need to check whether link is None and, if it is, continue the loop; otherwise read the link's href value.

Scrape links by URL:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.example.com/blog/author/abc")
data = r.content  # raw bytes of the response body
soup = BeautifulSoup(data, "html.parser")

for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])

Scrape links from an HTML string:

from bs4 import BeautifulSoup

html = '''<div class="post-info-wrap"><h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-1st-post/" title="Example of 1st post – Example 1 Post" rel="bookmark">sample post – example 1 post</a></h2><div class="post-meta clearfix">
<div class="post-info-wrap"><h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-2nd-post/" title="Example of 2nd post – Example 2 Post" rel="bookmark">sample post – example 2 post</a></h2><div class="post-meta clearfix">'''

soup = BeautifulSoup(html, "html.parser")

for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])

Update:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get("https://www.example.com/blog/author/abc")

soup = BeautifulSoup(driver.page_source, "html.parser")

for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])

Output:

https://www.example.com/blog/911/article-1/
https://www.example.com/blog/911/article-2/
https://www.example.com/blog/911/article-3/
https://www.example.com/blog/911/article-4/
https://www.example.com/blog/random-blog/article-5/

ChromeDriver downloads for the Chrome browser:

http://chromedriver.chromium.org/downloads

Installing the WebDriver for Chrome:

https://christopher.su/2015/selenium-chromedriver-ubuntu/

Selenium tutorial:

https://selenium-python.readthedocs.io/

Here '/usr/bin/chromedriver' is the path to the Chrome WebDriver executable.

How can I extract URL and link text from HTML in Perl?

Please look at using the WWW::Mechanize module for this. It will fetch your web pages for you, and then give you easy-to-work-with lists of URLs.

use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get( $some_url );

my @links = $mech->links();
for my $link ( @links ) {
    printf "%s, %s\n", $link->text, $link->url;
}

Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.

Mech is basically a browser in an object.
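If you'd rather stay in Python, roughly the same list of (text, url) pairs can be built with BeautifulSoup; the HTML string here is an invented stand-in for a page you would fetch with requests:

```python
from bs4 import BeautifulSoup

# Stand-in for a fetched page; in practice you would download it,
# e.g. with requests.get(some_url).text.
html = '''<p><a href="https://example.com/one">One</a>
<a href="https://example.com/two">Two</a></p>'''

soup = BeautifulSoup(html, "html.parser")

# Collect (anchor text, URL) for every link that has an href
links = [(a.get_text(), a['href']) for a in soup.find_all('a', href=True)]
for text, url in links:
    print(f"{text}, {url}")
```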

Extract Title from html link

With a regular expression, the capture group ([^"]*) will contain the title:

title="([^"]*)"

C#

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string originalString = "<a href=\"/tothepage\" title=\"the page\">The Link</a>.";
        Regex rgx = new Regex("title=\"([^\"]*)\"", RegexOptions.IgnoreCase);
        Match match = rgx.Matches(originalString)[0];
        Console.WriteLine(match.Groups[1]);  // prints: the page
        Console.ReadLine();
    }
}
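The same capture group works unchanged in Python's re module:

```python
import re

original = '<a href="/tothepage" title="the page">The Link</a>.'

# Group 1 captures everything between the quotes of the title attribute
match = re.search(r'title="([^"]*)"', original, re.IGNORECASE)
if match:
    print(match.group(1))  # prints: the page
```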

xpath to extract link or hrefs

The HTML of the individual tiles on the right of the linked page has the following form*:

<div class="details"> 
<a href="/store/apps/details?id=com.imangi.templerun" class="card-click-target"></a>
<a title="Temple Run" href="/store/apps/details?id=com.imangi.templerun" class="title">Temple Run
<span class="paragraph-end"/>
</a>
<div>....</div>
<div>....</div>
</div>

It turns out that the <a> elements with class="title" uniquely identify your target <a> elements on that page, so the XPath can be as simple as:

//a[@class="title"]/@href

Anyway, the problem you noticed seems to be specific to the Chrome XPath evaluator**. Since you mentioned Python, a short Python session proves that the XPath works just fine:

>>> from urllib2 import urlopen
>>> from lxml import html
>>> req = urlopen('https://play.google.com/store/apps/details?id=com.mojang.minecraftpe')
>>> raw = req.read()
>>> root = html.fromstring(raw)
>>> [h for h in root.xpath("//a[@class='title']/@href")]
['/store/apps/details?id=com.imangi.templerun', '/store/apps/details?id=com.lego.superheroes.dccomicsteamup', '/store/apps/details?id=com.turner.freefurall', '/store/apps/details?id=com.mtvn.Nickelodeon.GameOn', '/store/apps/details?id=com.disney.disneycrossyroad_goo', '/store/apps/details?id=com.rovio.angrybirdsstarwars.ads.iap', '/store/apps/details?id=com.rovio.angrybirdstransformers', '/store/apps/details?id=com.disney.dinostampede_goo', '/store/apps/details?id=com.turner.atskisafari', '/store/apps/details?id=com.moose.shopville', '/store/apps/details?id=com.DisneyDigitalBooks.SevenDMineTrain', '/store/apps/details?id=com.turner.copatoon', '/store/apps/details?id=com.turner.wbb2016', '/store/apps/details?id=com.tov.google.ben10Xenodrome', '/store/apps/details?id=com.turner.ggl.gumballrainbowruckus', '/store/apps/details?id=com.lego.starwars.theyodachronicles', '/store/apps/details?id=com.mojang.scrolls']

*) Stripped-down version. You can take this as an example of how to provide a minimal HTML sample.

**) I can reproduce this problem: @href values are printed as empty strings in my Chrome console. The same problem has happened to others as well: Chrome element inspector XPath with @href won't show link text.

How to extract sub <a> tags and print them out

You should not use two loops (the first one has invalid syntax, by the way). You can use XPath to get right to the link nodes by adding /a to the search path:

foreach ($xpath->query("//*[@class='$classname']/a") as $link) {
    echo $link->getAttribute('href');
    echo "<br />";
}
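The same class-plus-child-anchor XPath works from Python via lxml too; the class name and HTML fragment below are invented for illustration:

```python
from lxml import html

fragment = '''<div class="items">
  <a href="/first">First</a>
  <a href="/second">Second</a>
</div>'''

root = html.fromstring(fragment)

# Selecting @href directly returns the attribute values as strings
hrefs = root.xpath("//*[@class='items']/a/@href")
print(hrefs)
```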

