HTML Parsing in Perl

HTML parsing in perl

Something like this, quick and easy:

#! /usr/bin/perl
use strict;
use warnings;

use Mojo::DOM;

my $html = "Your HTML goes here";

my $dom = Mojo::DOM->new;
$dom->parse($html);
my $skip;
for my $dd ($dom->find('dd[class*="message"]')->each) {
    print $dd->attrs->{id}, "\n" if $skip++;
}

Parsing a HTML table in Perl

You need to use subTree.

#!/usr/bin/env perl
use warnings;
use strict;
use HTML::TagParser;

my $html = HTML::TagParser->new( 'foo.html' ); # Change this to your file

my $nrow = 0;
for my $tr ( $html->getElementsByTagName("tr" ) ) {
    my $ncol = 0;
    for my $td ( $tr->subTree->getElementsByTagName("td") ) {
        print "Row [$nrow], Col [" . $ncol++ . "], Value [" . $td->innerText() . "]\n";
    }
    $nrow++;
}

Produces the following output (notice that the th rows are omitted):

Row [1], Col [0], Value [1027]
Row [1], Col [1], Value [21cs_337]
Row [1], Col [2], Value [0]
Row [1], Col [3], Value [catch-all caught]
Row [1], Col [4], Value [reason]
Row [2], Col [0], Value [10288]
Row [2], Col [1], Value [21cs_437]
Row [2], Col [2], Value [0]
Row [2], Col [3], Value [badfetch]
Row [2], Col [4], Value [reason]

Extract text from HTMl/XML tags in Perl

Once you've extracted the embedded XML document, you should use a proper XML parser.

use XML::LibXML qw( );

my $xml_doc = XML::LibXML->new->parse_string($xml);

for my $key_node ($xml_doc->findnodes("/localconfig/key")) {
   my $key = $key_node->getAttribute("name");
   my $val = $key_node->findvalue("value/text()");
   say "$key: $val";
}

So that leaves us with the question how to extract the XML document.

Option 1: XML::LibXML

You could use XML::LibXML and simply tell it to ignore the error (the spurious  tag).

my $html_doc = XML::LibXML->new( recover => 2 )->parse_html_fh($html);
my $xml = encode_utf8( $html_doc->findvalue('/html/body/pre/text()') =~ s/^[^<]*//r );

Option 2: Regex Match

You could probably get away with using a regex pattern match.

use HTML::Entities qw( decode_entities );

my $xml = decode_entities( ( $html =~ m{<pre>[^&]*(.*?)</pre>}s )[0] );

Option 3: Mojo::DOM

You could use Mojo::DOM to extract the embedded XML document.

use Encode    qw( decode encode_utf8 );
use Mojo::DOM qw( );

my $decoded_html = decode($encoding, $html);
my $html_doc = Mojo::DOM->new($decoded_html);    
my $xml = encode_utf8( $html_doc->at('html > body > pre')->text =~ s/^[^<]*//r );

The problem with Mojo::DOM is that you need to know the encoding of the document before you pass the document to the parser (because you must pass it decoded), but you need to parse the document in order to extract the encoding of the document form the document.

(Of course, you could use Mojo::DOM to parse the XML too.)

Note that the HTML fragment <pre></pre> means <pre></pre>, and both XML::LibXML and Mojo::DOM handle this correctly.

Parse HTML using perl regex

I think you'd be much better off using HTML::Parser to simply/reliably parse that HTML. Otherwise you're into the nightmare of parsing HTML with regexps, and you'll find that doesn't work reliably.

HTML parsing with HTML::TokeParser::Simple

Scan through the token to ignore all open and close script tags. See below as used to resolved the issue.

   my $ignore=0;

   while ( my $token = $p->get_token ) {

      if ( $token->is_start_tag('script') ) {
         print $token->as_is, "\n";
         $ignore = 1;
         next;
      }
      if ( $token->is_end_tag('script') ) {
         $ignore = 0;
         print $token->as_is, "\n";
         next;
      }
      if ($ignore) {
         #Everything inside the script tag. Here you can ignore or print as is
         print $token->as_is, "\n";
      }
      else
      {  
          #Everything excluding scripts falls here handle as appropriate
          next unless $token->is_text;
          print $token->as_is, "\n";
      }
    }

Whats a simple Perl script to parse a HTML document with custom tags(Perl interpreter)?

The important part of parsing with HTML::Parser is to assign the right handlers with the right argspec. A sample program:

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::Parser;

my $html;
sub replace_tagname {
    my ( $tagname, $event ) = @_;

    if ( $tagname eq 'liamslanguage' ) {
        $tagname = 'b';
    }

    if ( $event eq 'start' ) {
        $html .= "<$tagname>";
    }
    elsif ( $event eq 'end' ) {
        $html .= "</$tagname>";
    }
}

my $p = HTML::Parser->new(
    'api_version' => 3,
    'start_h'     => [ \&replace_tagname,      'tagname, event' ],
    'default_h'   => [ sub { $html .= shift }, 'text'           ],
    'end_h'       => [ \&replace_tagname,      'tagname, event' ],
);
$p->parse( do { local $/; <DATA> } );
$p->eof();

print $html;

__DATA__
<html>
This is HTML talking
<liamslanguage>say "This is Liams language speaking"</liamslanguage>
</html>

What does this HTML::Parser() code do in Perl?

From the documentation:

$p = HTML::Parser->new(api_version => 3,
                       text_h => [ sub {...}, "dtext" ]);

This creates a new parser object with a text event handler subroutine that receives the original text with general entities decoded.

Edit:

use HTML::Parser;
use LWP::Simple;
my $html = get "http://perltraining.stonehenge.com";
HTML::Parser->new(text_h => [\my @accum, "text"])->parse($html);
print map $_->[0], @accum;

Another

#!/usr/bin/perl -w
use strict;
use HTML::Parser;
my $text;
my $p = HTML::Parser->new(text_h => [ sub {$text .= shift}, 
                                     'dtext']);
$p->parse_file('test.html');
print $text;

Which, when used on a file like this:

<html>
<head>
<title>Test</title>
</head>
<body>
<h1>Test Stuff</h1>
<p>This is a test</p>
<ul>
<li>this</li>
<li>is a</li>
<li>list</li>
</ul>
</body>
</html>

produces the following output:

Test

Test Stuff
This is a test

this
is a
list

Does that help?

HTML Parsing in Perl