How can I extract URL and link text from HTML in Perl?
Have a look at the WWW::Mechanize module for this. It fetches your web pages for you and then gives you easy-to-work-with lists of URLs.
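use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get( $some_url );    # $some_url holds the address of the page to fetch
my @links = $mech->links();
for my $link ( @links ) {
    # Each link is an object with accessors for its text and URL
    printf "%s, %s\n", $link->text, $link->url;
}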
Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.
Mech is basically a browser in an object.
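To illustrate the point about navigating, here is a minimal sketch that lists a page's links and then follows one of them. The helper name `list_and_follow`, the script name in the usage comment, and the command-line arguments are assumptions made up for this example; `links`, `find_link`, `follow_link`, and `uri` are WWW::Mechanize methods.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper: list the links on a page, then follow the first
# link whose text matches $text_re. The name and arguments are illustrative.
sub list_and_follow {
    my ( $url, $text_re ) = @_;

    # autocheck => 1 makes failed requests die instead of failing silently.
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($url);

    for my $link ( $mech->links ) {
        # text can be undef (for example, image-only links)
        printf "%s, %s\n", $link->text // '', $link->url_abs;
    }

    # find_link returns undef when nothing matches, so check before following.
    if ( $mech->find_link( text_regex => $text_re ) ) {
        $mech->follow_link( text_regex => $text_re );
        print "Now at: ", $mech->uri, "\n";
    }
    return $mech;
}

# Usage (assumed script name): perl follow.pl http://www.example.com/ download
list_and_follow( $ARGV[0], qr/\Q$ARGV[1]\E/i ) if @ARGV == 2;
```

Checking with `find_link` first avoids an exception from `follow_link` when no link matches and autocheck is on.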
How can I extract URL tags and link text from HTML in Perl?
The simple version, based on your question
- a page that looks like yours (so no obscure HTML that can mess things up)
- the desired output
This might be what you are looking for:
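use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('file:page.html');
foreach my $link ($mech->links) {
    my $text  = $link->text;
    my $url   = $link->url;
    # Not every link has a title attribute; fall back to an empty string
    # so that "use warnings" does not complain about an undefined value.
    my $title = $link->attrs->{title} // '';
    print "$text, $url, $title\n";
}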
Happy coding, TIMTOWTDI
How can I extract the text from links on an HTML page in Perl?
This is done very simply with a regular expression, as shown in the program below. It looks for a string of digits or colons immediately following > (and so looks at the text contents of the element rather than the href attribute value, as yours does) and captures that string into $1.
But I would prefer to see the problem solved from start to finish using a proper HTML parser, such as HTML::TreeBuilder or Mojo::DOM.
use strict;
use warnings;

my @tag = <DATA>;
foreach (@tag) {
    # Skip lines where > is not followed by digits or colons
    next unless />([\d:]+)/;
    print "http://x.download.com/$1\n";
}
__DATA__
href="?Name">Name</a>
href="?Desc">Hourly Details</a>
href="/24x7/2012/11-November/">Data
href="./00:00:00/">00:00:00/</a>
href="./01:00:00/">01:00:00/</a>
href="./02:00:00/">02:00:00/</a>
href="./03:00:00/">03:00:00/</a>
href="./04:00:00/">04:00:00/</a>
href="./05:00:00/">05:00:00/</a>
href="./06:00:00/">06:00:00/</a>
href="./07:00:00/">07:00:00/</a>
href="./08:00:00/">08:00:00/</a>
href="./09:00:00/">09:00:00/</a>
href="./10:00:00/">10:00:00/</a>
output
http://x.download.com/00:00:00
http://x.download.com/01:00:00
http://x.download.com/02:00:00
http://x.download.com/03:00:00
http://x.download.com/04:00:00
http://x.download.com/05:00:00
http://x.download.com/06:00:00
http://x.download.com/07:00:00
http://x.download.com/08:00:00
http://x.download.com/09:00:00
http://x.download.com/10:00:00
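For comparison, here is a minimal sketch of the parser-based approach mentioned above, using Mojo::DOM. It assumes the page contains complete `<a>` elements (the sample data above shows only fragments), and the shortened HTML string is made up for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Assumption: a cut-down directory listing with complete <a> elements.
my $html = <<'HTML';
<a href="?Name">Name</a>
<a href="/24x7/2012/11-November/">Data</a>
<a href="./00:00:00/">00:00:00/</a>
<a href="./01:00:00/">01:00:00/</a>
HTML

my $dom = Mojo::DOM->new($html);

my @urls;
for my $anchor ( $dom->find('a[href]')->each ) {
    # Keep only links whose text is a run of digits and colons,
    # optionally followed by a slash.
    next unless $anchor->text =~ m{^([\d:]+)/?$};
    push @urls, "http://x.download.com/$1";
}
print "$_\n" for @urls;
```

The parser finds the elements, so the regular expression only has to validate the link text rather than cope with the surrounding markup.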
Perl Regex to extract URLs from HTML
Obligatory link explaining why you shouldn't parse HTML using a regular expression.
That being said, try this for a quick and dirty solution:
my $html = '<a href="http://www.facebook.com/">A link!</a>';
my @links = $html =~ /<a[^>]*\shref=['"](https?:\/\/www\.facebook\.com[^"']*)["']/gis;
How can I extract URLs from plain text with Perl?
When I tried URI::Find::Schemeless with the following text:
Here is a URL and one bare URL with
https: https://www.example.com and another with a query
http://example.org/?test=one&another=2 and another with parentheses
http://example.org/(9.3)
Another one that appears in quotation marks "http://www.example.net/s=1;q=5"
etc. A link to an ftp site: ftp://user@example.org/test/me
How about one without a protocol www.example.com?
it messed up http://example.org/(9.3). So, I came up with the following with the help of Regexp::Common:
#!/usr/bin/perl
use strict; use warnings;
use CGI 'escapeHTML';
use Regexp::Common qw/URI/;
use URI::Find::Schemeless;
my $heuristic = URI::Find::Schemeless->schemeless_uri_re;
my $pattern = qr{
    $RE{URI}{HTTP}{-scheme=>'https?'} |
    $RE{URI}{FTP} |
    $heuristic
}x;
local $/ = '';
while ( my $par = <DATA> ) {
    chomp $par;
    # Escape literal < so the paragraph text is safe to embed in HTML
    $par =~ s/</&lt;/g;
    $par =~ s/( $pattern ) / linkify($1) /gex;
    print "<p>$par</p>\n";
}
sub linkify {
    my ($str) = @_;
    # Schemeless matches such as www.example.com get a default scheme
    $str = "http://$str" unless $str =~ /^[fh]t(?:p|tp)/;
    $str = escapeHTML($str);
    sprintf q|<a href="%s">%s</a>|, ($str) x 2;
}
This worked for the input shown. Of course, life is never that easy, as you can see by trying (http://example.org/(9.3)).
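One possible way to soften the parenthesized case is to post-process each match. The sketch below assumes the failure is that a matched URL picks up a closing parenthesis belonging to the surrounding sentence; `trim_unbalanced_parens` is a hypothetical helper, not part of URI::Find or Regexp::Common.

```perl
use strict;
use warnings;

# Hypothetical helper: drop trailing ')' characters that have no
# matching '(' inside the candidate URL.
sub trim_unbalanced_parens {
    my ($url) = @_;
    my $open  = () = $url =~ /\(/g;    # number of '(' in the URL
    my $close = () = $url =~ /\)/g;    # number of ')' in the URL
    while ( $close > $open && $url =~ s/\)\z// ) {
        $close--;
    }
    return $url;
}

for my $candidate (
    'http://example.org/(9.3)',     # balanced: kept whole
    'http://example.org/(9.3))',    # one stray ')' from the surrounding text
) {
    print trim_unbalanced_parens($candidate), "\n";
}
```

This is only a heuristic: a URL that legitimately ends in an unmatched ')' would be shortened as well.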