HTML parsing in Perl
Something like this, quick and easy:
#! /usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
my $html = "Your HTML goes here";
my $dom = Mojo::DOM->new;
$dom->parse($html);
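my $skip;
for my $dd ($dom->find('dd[class*="message"]')->each) {
    # Print the id of every match except the first ($skip is false only once).
    # attr() is the accessor in current Mojo::DOM; older releases called it attrs().
    print $dd->attr('id'), "\n" if $skip++;
}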
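To make the selector and the skip-first behaviour easier to see, here is a minimal self-contained sketch. The sample markup and the helper name message_ids are made up for illustration; only the selector comes from the snippet above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup, not taken from the original question.
my $sample = <<'HTML';
<dl>
  <dd id="m1" class="message first">one</dd>
  <dd id="m2" class="message">two</dd>
  <dd id="n1" class="note">not a message</dd>
  <dd id="m3" class="message">three</dd>
</dl>
HTML

# Return the ids of all dd elements whose class contains "message",
# leaving out the first match.
sub message_ids {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);
    my @ids = map { $_->attr('id') } $dom->find('dd[class*="message"]')->each;
    shift @ids;    # drop the first match
    return @ids;
}

print "$_\n" for message_ids($sample);
```

With this sample the script prints m2 and m3: m1 is the skipped first match and n1 never matches the selector.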
Parsing an HTML table in Perl
You need to use subTree, which limits the search to the elements inside each row.
#!/usr/bin/env perl
use warnings;
use strict;
use HTML::TagParser;
my $html = HTML::TagParser->new( 'foo.html' ); # Change this to your file
my $nrow = 0;
for my $tr ( $html->getElementsByTagName("tr" ) ) {
    my $ncol = 0;
    for my $td ( $tr->subTree->getElementsByTagName("td") ) {
        print "Row [$nrow], Col [" . $ncol++ . "], Value [" . $td->innerText() . "]\n";
    }
    $nrow++;
}
Produces the following output (notice that the th rows are omitted):
Row [1], Col [0], Value [1027]
Row [1], Col [1], Value [21cs_337]
Row [1], Col [2], Value [0]
Row [1], Col [3], Value [catch-all caught]
Row [1], Col [4], Value [reason]
Row [2], Col [0], Value [10288]
Row [2], Col [1], Value [21cs_437]
Row [2], Col [2], Value [0]
Row [2], Col [3], Value [badfetch]
Row [2], Col [4], Value [reason]
Extract text from HTML/XML tags in Perl
Once you've extracted the embedded XML document, you should use a proper XML parser.
use feature qw( say );
use XML::LibXML qw( );
my $xml_doc = XML::LibXML->new->parse_string($xml);
for my $key_node ($xml_doc->findnodes("/localconfig/key")) {
    my $key = $key_node->getAttribute("name");
    my $val = $key_node->findvalue("value/text()");
    say "$key: $val";
}
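For a version you can run without the surrounding HTML, here is a small sketch. The localconfig/key/value layout follows the XPath expressions above; the key names and values are invented, and config_pairs is just a hypothetical helper name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical document shaped like the one the XPath above expects.
my $sample_xml = <<'XML';
<localconfig>
  <key name="ldap_port"><value>389</value></key>
  <key name="mail_host"><value>mail.example.com</value></key>
</localconfig>
XML

# Return a list of [ name, value ] pairs, one per <key> element.
sub config_pairs {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml( string => $xml );
    return map { [ $_->getAttribute('name'), $_->findvalue('value') ] }
        $doc->findnodes('/localconfig/key');
}

printf "%s: %s\n", @$_ for config_pairs($sample_xml);
```

Returning pairs instead of printing inside the loop keeps the parsing separate from the output, which makes it easier to reuse.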
So that leaves us with the question how to extract the XML document.
Option 1: XML::LibXML
You could use XML::LibXML and simply tell it to ignore the error (the spurious </p> tag).
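use Encode qw( encode_utf8 );
# parse_html_fh reads from a file handle, so $html must be an open handle here.
my $html_doc = XML::LibXML->new( recover => 2 )->parse_html_fh($html);
my $xml = encode_utf8( $html_doc->findvalue('/html/body/pre/text()') =~ s/^[^<]*//r );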
Option 2: Regex Match
You could probably get away with using a regex pattern match.
use HTML::Entities qw( decode_entities );
my $xml = decode_entities( ( $html =~ m{<pre>[^&]*(.*?)</pre>}s )[0] );
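Here is a minimal sketch of the regex approach on a made-up page, assuming the XML sits entity-escaped inside a single pre element. The helper name embedded_xml and the sample string are hypothetical. Unlike the one-liner, it returns undef when the pattern does not match instead of passing undef to decode_entities.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Entities qw( decode_entities );

# Hypothetical page: some leading text, then entity-escaped XML inside <pre>.
my $sample_html = '<html><body><p><pre>Current settings: '
    . '&lt;localconfig&gt;&lt;key name=&quot;a&quot;&gt;&lt;value&gt;1&lt;/value&gt;&lt;/key&gt;&lt;/localconfig&gt;'
    . '</pre></p></body></html>';

# Capture everything from the first entity up to </pre>, then decode it.
sub embedded_xml {
    my ($html) = @_;
    my ($escaped) = $html =~ m{<pre>[^&]*(.*?)</pre>}s;
    return defined $escaped ? decode_entities($escaped) : undef;
}

my $xml = embedded_xml($sample_html);
print defined $xml ? "$xml\n" : "no embedded XML found\n";
```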
Option 3: Mojo::DOM
You could use Mojo::DOM to extract the embedded XML document.
use Encode qw( decode encode_utf8 );
use Mojo::DOM qw( );
my $decoded_html = decode($encoding, $html);
my $html_doc = Mojo::DOM->new($decoded_html);
my $xml = encode_utf8( $html_doc->at('html > body > pre')->text =~ s/^[^<]*//r );
The problem with Mojo::DOM is that you must pass it the document already decoded, so you need to know the document's encoding before parsing, yet you need to parse the document in order to extract the encoding from it.
(Of course, you could use Mojo::DOM to parse the XML too.)
Note that the HTML fragment <p><pre></pre></p> means <p></p><pre></pre>, and both XML::LibXML and Mojo::DOM handle this correctly.
Parse HTML using perl regex
I think you'd be much better off using HTML::Parser to parse that HTML simply and reliably. Otherwise you're into the nightmare of parsing HTML with regexps, and you'll find that doesn't work reliably.
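As a rough illustration of what the parser buys you, the sketch below collects link targets with a start-tag handler. The sample markup and the helper name extract_links are invented for this example. Note that the second link uses upper-case names and single quotes, which a naive regex tends to miss.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical input mixing tag case and quoting styles.
my $sample_links = q{<p>See <a href="/docs">docs</a> and <A HREF='/faq' class="x">FAQ</A>.</p>};

# Return the href of every <a> start tag in the document.
sub extract_links {
    my ($html) = @_;
    my @links;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ( $tagname, $attr ) = @_;
                # Tag and attribute names arrive lower-cased by default.
                push @links, $attr->{href}
                    if $tagname eq 'a' && defined $attr->{href};
            },
            'tagname, attr',
        ],
    );
    $p->parse($html);
    $p->eof;
    return @links;
}

print "$_\n" for extract_links($sample_links);
```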
HTML parsing with HTML::TokeParser::Simple
Scan through the tokens and keep track of whether you are inside a script element, so that script content can be handled separately. The loop below is what resolved the issue.
# $p is an HTML::TokeParser::Simple object created earlier.
my $ignore = 0;
while ( my $token = $p->get_token ) {
    if ( $token->is_start_tag('script') ) {
        print $token->as_is, "\n";
        $ignore = 1;
        next;
    }
    if ( $token->is_end_tag('script') ) {
        $ignore = 0;
        print $token->as_is, "\n";
        next;
    }
    if ($ignore) {
        # Everything inside the script tag. Here you can ignore it or print it as is.
        print $token->as_is, "\n";
    }
    else {
        # Everything outside script tags lands here; handle as appropriate.
        next unless $token->is_text;
        print $token->as_is, "\n";
    }
}
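The loop above assumes $p already exists. A self-contained variant might look like the following sketch, which builds the parser from a string reference and keeps only the text outside script elements. The sample page and the helper name visible_text are hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical page with one inline script.
my $sample_page = '<p>Hello</p><script>var greeting = "hi";</script><p>World</p>';

# Return the non-blank text chunks found outside <script> elements.
sub visible_text {
    my ($html) = @_;
    # A reference to a string makes the parser read from that string.
    my $p = HTML::TokeParser::Simple->new( \$html );
    my $in_script = 0;
    my @text;
    while ( my $token = $p->get_token ) {
        if ( $token->is_start_tag('script') ) { $in_script = 1; next }
        if ( $token->is_end_tag('script') )   { $in_script = 0; next }
        next if $in_script or not $token->is_text;
        my $chunk = $token->as_is;
        push @text, $chunk if $chunk =~ /\S/;
    }
    return @text;
}

print "$_\n" for visible_text($sample_page);
```

For the sample page this prints Hello and World; the script body is skipped because the flag is set between the start and end tags.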
What's a simple Perl script to parse an HTML document with custom tags (Perl interpreter)?
The important part of parsing with HTML::Parser is to assign the right handlers with the right argspec. A sample program:
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Parser;
my $html;
sub replace_tagname {
    my ( $tagname, $event ) = @_;
    if ( $tagname eq 'liamslanguage' ) {
        $tagname = 'b';
    }
    if ( $event eq 'start' ) {
        $html .= "<$tagname>";
    }
    elsif ( $event eq 'end' ) {
        $html .= "</$tagname>";
    }
}
my $p = HTML::Parser->new(
    'api_version' => 3,
    'start_h'     => [ \&replace_tagname, 'tagname, event' ],
    'default_h'   => [ sub { $html .= shift }, 'text' ],
    'end_h'       => [ \&replace_tagname, 'tagname, event' ],
);
$p->parse( do { local $/; <DATA> } );
$p->eof();
print $html;
__DATA__
<html>
This is HTML talking
<liamslanguage>say "This is Liams language speaking"</liamslanguage>
</html>
What does this HTML::Parser() code do in Perl?
From the documentation:
$p = HTML::Parser->new(api_version => 3,
text_h => [ sub {...}, "dtext" ]);
This creates a new parser object with a text event handler subroutine that receives the original text with general entities decoded.
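The difference between the "text" and "dtext" argspecs is easiest to see side by side. The sketch below is only an illustration: the sample string and the helper name collect_text are made up, and the same handler is run once with each argspec.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical input containing general entities.
my $sample_entities = '<p>Fish &amp; chips &lt;fresh&gt;</p>';

# Concatenate every text event, reported according to the given argspec.
sub collect_text {
    my ( $html, $argspec ) = @_;
    my $out = '';
    my $p   = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { $out .= shift }, $argspec ],
    );
    $p->parse($html);
    $p->eof;
    return $out;
}

print collect_text( $sample_entities, 'text' ),  "\n";    # original text, entities intact
print collect_text( $sample_entities, 'dtext' ), "\n";    # entities decoded
```

The first line printed keeps &amp;, &lt; and &gt; as written in the source, while the second shows the decoded characters.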
Edit:
use HTML::Parser;
use LWP::Simple;
my $html = get "http://perltraining.stonehenge.com";
HTML::Parser->new(text_h => [\my @accum, "text"])->parse($html);
print map $_->[0], @accum;
Another example:
#!/usr/bin/perl -w
use strict;
use HTML::Parser;
my $text;
my $p = HTML::Parser->new(text_h => [ sub {$text .= shift},
'dtext']);
$p->parse_file('test.html');
print $text;
Which, when used on a file like this:
<html>
<head>
<title>Test</title>
</head>
<body>
<h1>Test Stuff</h1>
<p>This is a test</p>
<ul>
<li>this</li>
<li>is a</li>
<li>list</li>
</ul>
</body>
</html>
produces the following output:
Test
Test Stuff
This is a test
this
is a
list
Does that help?