Trouble With Parsing Table Data in Perl

Trouble handling data from parsed file

That is hundreds of lines of code. If you are just trying to extract the table out of the HTML, use HTML::TableExtract.

For example:

#!/usr/bin/perl

use strict; use warnings;
use HTML::TableExtract;
use YAML;

my $te = HTML::TableExtract->new(
    headers => [
        '',
        'novel miRNAs',
        'known miRBase miRNAs',
        '',
        '',
    ],
    slice_columns => 0,
);

$te->parse_file('t.html');

for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        print Dump $row;
    }
}

Parsing a HTML table in Perl

You need to use subTree.

#!/usr/bin/env perl
use warnings;
use strict;
use HTML::TagParser;

my $html = HTML::TagParser->new( 'foo.html' ); # Change this to your file

my $nrow = 0;
for my $tr ( $html->getElementsByTagName("tr") ) {
    my $ncol = 0;
    for my $td ( $tr->subTree->getElementsByTagName("td") ) {
        print "Row [$nrow], Col [" . $ncol++ . "], Value [" . $td->innerText() . "]\n";
    }
    $nrow++;
}

This produces the following output. Notice that the th rows are omitted: the header row has no td cells, so nothing is printed for row 0.

Row [1], Col [0], Value [1027]
Row [1], Col [1], Value [21cs_337]
Row [1], Col [2], Value [0]
Row [1], Col [3], Value [catch-all caught]
Row [1], Col [4], Value [reason]
Row [2], Col [0], Value [10288]
Row [2], Col [1], Value [21cs_437]
Row [2], Col [2], Value [0]
Row [2], Col [3], Value [badfetch]
Row [2], Col [4], Value [reason]

Parsing data with Perl: capturing a range of text

This is my final solution. In this particular case I'm searching for all switchports whose port-security maximum is not 1. This is just an example; the search pattern can be swapped for any other configuration line. I'm also excluding certain interfaces from the results, even if that configuration is actually applied to them.

#!/usr/bin/perl
use strict;
use warnings;

my $MDIR = '/currentConfig';

# Patterns for interfaces to leave out of the output
my @omit = (
    'MANAGEMENT.PORT',
    'sup.mgmt',
    'Internal.EtherSwitch',
    'Router',
    'ip address \d',
    'STRA',
);
# Join with '|' to form a single alternation regex
my $dontwant = join '|', @omit;

# Search criteria: a port-security maximum other than exactly 1.
# The lookahead is used instead of [^1], which would miss values such as 10.
my $search = 'switchport port-security maximum (?!1\b)\d+';

opendir(my $dh, $MDIR) or die "Cannot open $MDIR: $!";
# Keep plain files only; this also skips '.' and '..'
my @dirContents = grep { -f "$MDIR/$_" } readdir $dh;
closedir $dh;

my @finalint;
foreach my $file (@dirContents) {
    open(my $in, '<', "$MDIR/$file") or die "Cannot open $MDIR/$file: $!";
    # Read '!'-separated records; local restores $/ when the block exits
    my @inFile = do { local $/ = '!'; <$in> };
    close $in;
    # A record that follows a '!' starts with a newline, so match
    # "\ninterface" rather than anchoring with '^'
    my @ints = grep { /\ninterface/i } @inFile;
    foreach my $int (@ints) {
        if ( $int =~ m/$search/i && $int !~ m/$dontwant/ ) {
            push @finalint, $int;
        }
    }
}
# Just list the interfaces found; I'll use this to make a comma-separated list
print for @finalint;
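
The record-separator approach can be tried without any files on disk. The following is a minimal sketch, not the original script: the sample configuration, the interface names, and the helper `flagged_interfaces` are all made up for illustration, and only a single omit pattern (`Router`) is applied. It uses a negative lookahead in place of `[^1]` so that values such as 10 would also be caught.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample config; real files would come from the config directory
my $config = <<'END';
!
interface Gi0/1
 switchport port-security maximum 1
!
interface Gi0/2
 switchport port-security maximum 3
!
interface Gi0/3
 description Router uplink
 switchport port-security maximum 5
!
END

# Return the names of interfaces whose port-security maximum is not 1,
# skipping any record that matches the omit pattern.
sub flagged_interfaces {
    my ($text) = @_;
    open(my $fh, '<', \$text) or die "Cannot read string: $!";
    local $/ = '!';            # one record per '!'-delimited block
    my @records = <$fh>;
    close $fh;

    my @names;
    for my $rec (@records) {
        next unless $rec =~ /\ninterface\s+(\S+)/i;
        my $name = $1;         # save before the next match resets $1
        next unless $rec =~ /switchport port-security maximum (?!1\b)\d+/i;
        next if $rec =~ /Router/;
        push @names, $name;
    }
    return @names;
}

print join(',', flagged_interfaces($config)), "\n";
```

Capturing the interface name while filtering makes the comma-separated list a single `join` at the end.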

How can I extract HTML table data using Perl?

HTML::TableExtract sounds exactly like what you are looking for.

Is there a parser for the output from Perl's Text::Table?

It seems somewhat convoluted. If you had the information before converting it into a table, then why try to parse it from its presentation form? It's like having a text file, converting it to LaTeX, then to PostScript, and then trying to get the text back from the PostScript file.

I'm sure there's a way to parse the output of Text::Table, but it seems that your workflow is flawed. I'd aim at writing the data in a simpler format such as YAML (in addition to the Text::Table output, if you really have to have it that way), which can then be trivially restored to the original data structure.
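
As a rough illustration of that round trip, here is a minimal sketch assuming the YAML module from CPAN is installed. The `@rows` data is invented for the example; in practice it would be whatever structure was fed to Text::Table.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use YAML qw(Dump Load);

# Hypothetical data that would otherwise only exist as a rendered table
my @rows = (
    { name => 'alpha', count => 3 },
    { name => 'beta',  count => 7 },
);

# Serialize the structure; this string could be written to a file
my $yaml = Dump(\@rows);

# Later, restore the original structure from the YAML text
my $restored = Load($yaml);

print "$_->{name}: $_->{count}\n" for @$restored;
```

The restored value is an array reference with the same hashes, so no table parsing is needed.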

How to use the Perl TableExtract rows method when there are duplicate Header fields

I found the answer: add the "slice_columns => 0" attribute to the HTML::TableExtract constructor.

I'm not exactly sure why this is necessary. The help for TableExtract at CPAN says "Columns that are not beneath one of the provided headers will be ignored unless slice_columns was set to 0. Columns will, by default, be rearranged into the same order as the headers you provide (see the automap parameter for more information) unless slice_columns is 0."

In my table, every column is under a provided header. There must be an interaction in the case where headers are not unique, and setting slice_columns to 0 avoids the issue.

my $te = HTML::TableExtract->new(
    headers       => \@headers,
    slice_columns => 0,
);

