How to extract links and titles from a .html page?
Thank you everyone, I GOT IT!
The final code:
$html = file_get_contents('bookmarks.html');

// Create a new DOM document
$dom = new DOMDocument;

// Parse the HTML. The @ is used to suppress any parsing errors
// that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);

// Get all links. You could also use any other tag name here,
// like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');

// Iterate over the extracted links and display their anchor text and URLs
foreach ($links as $link) {
    // Show the anchor text and the "href" attribute.
    echo $link->nodeValue;
    echo $link->getAttribute('href'), '<br>';
}
This prints the anchor text and the href of every link in the .html file.
Again, thanks a lot.
Want to extract links and titles from a certain website with lxml and Python but can't
The reason this does not work is that the site you're trying to fetch uses JavaScript to generate the results, so a browser-driven tool such as Selenium is your best option if you want to scrape the rendered HTML. Static fetch-and-parse libraries like lxml and BeautifulSoup cannot execute JavaScript, so they never see the content those scripts produce.
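By contrast, for pages served as static HTML, lxml alone is enough. A minimal sketch (the markup below is an illustrative sample, not the original site):

```python
# Parse a static HTML string with lxml and list each link's text and href.
from lxml import html

page = '''<html><body>
<a href="/first">First post</a>
<a href="/second">Second post</a>
</body></html>'''

tree = html.fromstring(page)
links = [(a.text_content(), a.get("href")) for a in tree.xpath("//a")]
print(links)  # [('First post', '/first'), ('Second post', '/second')]
```

Only when the links are injected by JavaScript at runtime does this approach come back empty, which is when a browser-driven tool becomes necessary.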
How to extract links from a page using Beautiful Soup
link = i.find('a', href=True)

does not always return an anchor (a) tag; it may return None. So check whether link is None and, if it is, continue the loop; otherwise read the link's href value.
Scrape links by URL:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.example.com/blog/author/abc")
data = r.content  # raw bytes of the response body
soup = BeautifulSoup(data, "html.parser")

for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
Scrape links from an HTML string:

from bs4 import BeautifulSoup

html = '''<div class="post-info-wrap"><h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-1st-post/" title="Example of 1st post – Example 1 Post" rel="bookmark">sample post – example 1 post</a></h2><div class="post-meta clearfix">
<div class="post-info-wrap"><h2 class="post-title"><a href="https://www.example.com/blog/111/this-is-2nd-post/" title="Example of 2nd post – Example 2 Post" rel="bookmark">sample post – example 2 post</a></h2><div class="post-meta clearfix">'''

soup = BeautifulSoup(html, "html.parser")

for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
Update:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get("https://www.example.com/blog/author/abc")
soup = BeautifulSoup(driver.page_source, "html.parser")

for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
Output:
https://www.example.com/blog/911/article-1/
https://www.example.com/blog/911/article-2/
https://www.example.com/blog/911/article-3/
https://www.example.com/blog/911/article-4/
https://www.example.com/blog/random-blog/article-5/
ChromeDriver downloads for the Chrome browser:
http://chromedriver.chromium.org/downloads
Installing the web driver for Chrome:
https://christopher.su/2015/selenium-chromedriver-ubuntu/
Selenium tutorial:
https://selenium-python.readthedocs.io/
Here '/usr/bin/chromedriver' is the path to the Chrome webdriver.
How can I extract URL and link text from HTML in Perl?
Please look at using the WWW::Mechanize module for this. It will fetch your web pages for you and then give you easy-to-work-with lists of URLs.
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get( $some_url );

my @links = $mech->links();
for my $link ( @links ) {
    printf "%s, %s\n", $link->text, $link->url;
}
Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.
Mech is basically a browser in an object.
Extract Title from html link
With a regular expression, the capturing group ([^"]*) will contain the title:

title="([^"]*)"
C#

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string originalString = "<a href=\" / tothepage\" title=\"the page\">The Link</a>.";
        Regex rgx = new Regex("title=\"([^\"]*)\"", RegexOptions.IgnoreCase);
        Match match = rgx.Matches(originalString)[0];
        Console.WriteLine(match.Groups[1]);  // prints: the page
        Console.ReadLine();
    }
}
xpath to extract link or hrefs
HTML of individual tiles on the right of the linked page is in the following form * :
<div class="details">
<a href="/store/apps/details?id=com.imangi.templerun" class="card-click-target"></a>
<a title="Temple Run" href="/store/apps/details?id=com.imangi.templerun" class="title">Temple Run
<span class="paragraph-end"/>
</a>
<div>....</div>
<div>....</div>
</div>
It turns out that an <a> element with class="title" uniquely identifies your target <a> elements on that page. So the XPath can be as simple as:
//a[@class="title"]/@href
Anyway, the problem you noticed seems to be specific to the Chrome XPath evaluator **. Since you mentioned Python, a simple Python session proves that the XPath works just fine:
>>> from urllib2 import urlopen
>>> from lxml import html
>>> req = urlopen('https://play.google.com/store/apps/details?id=com.mojang.minecraftpe')
>>> raw = req.read()
>>> root = html.fromstring(raw)
>>> [h for h in root.xpath("//a[@class='title']/@href")]
['/store/apps/details?id=com.imangi.templerun', '/store/apps/details?id=com.lego.superheroes.dccomicsteamup', '/store/apps/details?id=com.turner.freefurall', '/store/apps/details?id=com.mtvn.Nickelodeon.GameOn', '/store/apps/details?id=com.disney.disneycrossyroad_goo', '/store/apps/details?id=com.rovio.angrybirdsstarwars.ads.iap', '/store/apps/details?id=com.rovio.angrybirdstransformers', '/store/apps/details?id=com.disney.dinostampede_goo', '/store/apps/details?id=com.turner.atskisafari', '/store/apps/details?id=com.moose.shopville', '/store/apps/details?id=com.DisneyDigitalBooks.SevenDMineTrain', '/store/apps/details?id=com.turner.copatoon', '/store/apps/details?id=com.turner.wbb2016', '/store/apps/details?id=com.tov.google.ben10Xenodrome', '/store/apps/details?id=com.turner.ggl.gumballrainbowruckus', '/store/apps/details?id=com.lego.starwars.theyodachronicles', '/store/apps/details?id=com.mojang.scrolls']
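The same XPath can also be verified offline against the stripped-down tile markup shown above (no network needed):

```python
# Run the //a[@class="title"]/@href XPath against the sample tile markup.
from lxml import html

sample = '''<div class="details">
<a href="/store/apps/details?id=com.imangi.templerun" class="card-click-target"></a>
<a title="Temple Run" href="/store/apps/details?id=com.imangi.templerun" class="title">Temple Run</a>
</div>'''

root = html.fromstring(sample)
hrefs = root.xpath('//a[@class="title"]/@href')
print(hrefs)  # ['/store/apps/details?id=com.imangi.templerun']
```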
*) Stripped down version. You can take this as an example of providing minimal HTML sample.
**) I can reproduce this problem: @href values are printed as empty strings in my Chrome console. The same problem happened to others as well: Chrome element inspector XPath with @href won't show link text
How to extract sub <a> tags and print them out
You should not use two loops (the first one has wrong syntax, BTW). You can use XPath to get right to the link nodes by adding /a to the search path:
foreach ($xpath->query("//*[@class='$classname']/a") as $link) {
    echo $link->getAttribute('href');
    echo "<br />";
}