How to Parse HTML Table Using PHP

How to parse HTML table using PHP?

For tidy HTML, one parsing approach is the DOM. The DOM turns your HTML into a tree of objects, so you can select the element you want and read its value, tag name, and so on.

The official documentation of PHP HTML DOM parsing is available at http://php.net/manual/en/book.dom.php
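As a minimal sketch of what "turning HTML into objects" looks like in practice, the snippet below loads a small inline table and lists each cell's tag name and text. The sample markup and the variable names are made up for illustration only.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<table><tr><th>Name</th><th>Age</th></tr>'
      . '<tr><td>Ann</td><td>31</td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$cells = [];
foreach ($dom->getElementsByTagName('tr') as $tr) {
    foreach ($tr->childNodes as $child) {
        // Each cell is a DOMElement exposing its tag name and text content
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $cells[] = $child->nodeName . ': ' . trim($child->textContent);
        }
    }
}
print_r($cells);
```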

To read the values of the second column of the given table, the DOM can be used as follows:

<?php
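$data = file_get_contents('http://mytemporalbucket.s3.amazonaws.com/code.txt');

$dom = new DOMDocument;
// Must be set before loading to have any effect
$dom->preserveWhiteSpace = false;

// Collect warnings about malformed markup instead of silencing them with @
libxml_use_internal_errors(true);
$dom->loadHTML($data);
libxml_clear_errors();

$tables = $dom->getElementsByTagName('table');

// item() is zero-based, so item(1) is the second table in the document
$rows = $tables->item(1)->getElementsByTagName('tr');

foreach ($rows as $row) {
    $cols = $row->getElementsByTagName('td');
    // The second column is index 1; skip rows without one (e.g. header rows)
    if ($cols->length > 1) {
        echo $cols->item(1)->nodeValue, "\n";
    }
}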

?>

Reference: the code is adapted from the answer to "How to parse this table and extract data from it?" to fit this question.
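The same column lookup can be wrapped in a small reusable function. This is only a sketch: tableColumn() is a hypothetical helper name, and the inline table stands in for the real page. Note that getElementsByTagName() also returns rows of nested tables, so it assumes a flat table.

```php
<?php
/**
 * Hypothetical helper: returns the trimmed text of one column (0-based)
 * for every row of the given table that has such a cell.
 */
function tableColumn(DOMElement $table, int $index): array
{
    $values = [];
    foreach ($table->getElementsByTagName('tr') as $row) {
        $cell = $row->getElementsByTagName('td')->item($index);
        if ($cell !== null) {   // header rows (th only) and short rows are skipped
            $values[] = trim($cell->textContent);
        }
    }
    return $values;
}

// Made-up sample markup for illustration.
$html = '<table><tr><th>Item</th><th>Price</th></tr>'
      . '<tr><td>Apple</td><td>1.20</td></tr>'
      . '<tr><td>Pear</td><td>0.90</td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$table  = $dom->getElementsByTagName('table')->item(0);
$prices = tableColumn($table, 1);   // second column
print_r($prices);
```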

PHP DOM Parser parse html table

// $doc is assumed to be a DOMDocument that has already loaded the page's HTML
$field_names = ['username', 'phone', 'status'];
$result = [];

// Search for div tags having tbl-process-mobile class
$containers = $doc->getElementsByTagName('div');
foreach ($containers as $container) {
    // getAttribute() returns an empty string when the attribute is missing
    if (false === strpos($container->getAttribute('class'), 'tbl-process-mobile'))
        continue;

    // Assume that tbody tags are required
    $tbodies = $container->getElementsByTagName('tbody');

    // Get the first tbody (there should not be more)
    if (!$tbodies->length || !$tbody = $tbodies->item(0))
        continue;

    foreach ($tbody->getElementsByTagName('tr') as $tr) {
        $i = 0;
        $row = [];
        $cells = $tr->getElementsByTagName('td');

        // Collect the first count($field_names) cell values as maximum
        foreach ($field_names as $name) {
            if (!$td = $cells->item($i++))
                break;
            $row[$name] = trim($td->textContent);
        }

        if ($row)
            $result[] = $row;
    }
}

var_dump($result);

Sample Output

array(2) {
  [0]=>
  array(3) {
    ["username"]=>
    string(14) "randomusername"
    ["phone"]=>
    string(10) "0123456789"
    ["status"]=>
    string(6) "active"
  }
  [1]=>
  array(3) {
    ["username"]=>
    string(15) "randomusername2"
    ["phone"]=>
    string(10) "0987654321"
    ["status"]=>
    string(6) "active"
  }
}

No further explanation is given, as the code is largely self-explanatory.

P.S.: from a parsing point of view, the HTML structure leaves a lot to be desired.
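The container lookup can also be expressed as a single XPath query instead of a loop over every div. The sketch below assumes markup shaped like the sample output (a div with class tbl-process-mobile wrapping a table with a tbody); the inline HTML is invented for illustration. Unlike the strpos() check, the query matches the class as a whole token rather than as a substring.

```php
<?php
// Hypothetical markup modelled on the sample output above.
$html = '<div class="tbl-process-mobile"><table><tbody>'
      . '<tr><td>randomusername</td><td>0123456789</td><td>active</td></tr>'
      . '<tr><td>randomusername2</td><td>0987654321</td><td>active</td></tr>'
      . '</tbody></table></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath      = new DOMXPath($doc);
$fieldNames = ['username', 'phone', 'status'];
$result     = [];

// Rows inside any div whose class list contains the token tbl-process-mobile
$query = '//div[contains(concat(" ", normalize-space(@class), " "), '
       . '" tbl-process-mobile ")]//tbody/tr';

foreach ($xpath->query($query) as $tr) {
    $cells = $tr->getElementsByTagName('td');
    $row   = [];
    foreach ($fieldNames as $i => $name) {
        $td = $cells->item($i);
        if ($td === null) {
            break;              // fewer cells than field names
        }
        $row[$name] = trim($td->textContent);
    }
    if ($row) {
        $result[] = $row;
    }
}
var_dump($result);
```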

How to extract data from an HTML table using PHP

You can use XPath to select every cell with query('//td') and get each cell's HTML with C14N(), something like:

// $html is assumed to hold the page's HTML as a string
$dom = new DOMDocument();
$dom->loadHTML($html);
$x = new DOMXPath($dom);
foreach ($x->query('//td') as $td) {
    echo $td->C14N();
    // if you just need the text, use:
    // echo $td->textContent;
}

Output:

<td><b>Job Title:</b></td>
<td>Job Example </td>
<td><b>Job ID:</b></td>
...

C14N();

Returns canonicalized nodes as a string or FALSE on failure


Update:

Another question, how can I grab individual Table Data? For example,
just grab, Job ID

Use XPath contains, i.e.:

foreach ($x->query('//td[contains(., "Job ID:")]') as $td) {
    echo $td->textContent;
}

Update V2:

How can I get the next Table Data after that (to actually get the Job
Id) ?

Use following-sibling::*[1], i.e.:

echo $x->query('//td[contains(*, "Job ID:")]/following-sibling::*[1]')->item(0)->textContent;
//23992
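Putting both updates together, the label cells and their following siblings can be collected into a label => value array. This is a sketch only: it assumes each label sits in a td containing a b element with the value in the next td, and the inline markup is a made-up reconstruction based on the output shown earlier.

```php
<?php
// Hypothetical markup in the label/value layout suggested by the output above.
$html = '<table>'
      . '<tr><td><b>Job Title:</b></td><td>Job Example </td></tr>'
      . '<tr><td><b>Job ID:</b></td><td>23992</td></tr>'
      . '</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$x = new DOMXPath($dom);

$fields = [];
// Every td that wraps a <b> label, paired with the td right after it
foreach ($x->query('//td[b]') as $labelCell) {
    // The second argument makes the query relative to the label cell
    $valueCell = $x->query('following-sibling::td[1]', $labelCell)->item(0);
    if ($valueCell === null) {
        continue;               // label without a value cell
    }
    $label          = rtrim(trim($labelCell->textContent), ':');
    $fields[$label] = trim($valueCell->textContent);
}
print_r($fields);
```

Checking item(0) against null before reading textContent avoids an error when a label has no sibling cell.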

Parse html table using file_get_contents to php array

Don't cripple yourself parsing HTML with regexps! Instead, let an HTML parser library worry about the structure of the markup for you.

I suggest checking out Simple HTML DOM (http://simplehtmldom.sourceforge.net/). It is a library written specifically to help with this kind of web scraping problem in PHP. With such a library, you can write your scraper in far fewer lines of code, without having to craft working regexps.

In principle, with Simple HTML DOM you just write something like:

$html = file_get_html('http://flow935.com/playlist/flowhis.HTM');
foreach ($html->find('tr') as $row) {
    // Parse table row here
}

This can be then extended to capture your data in some format, for instance to create an array of artists and corresponding titles as:

<?php
require('simple_html_dom.php');

$table = array();

$html = file_get_html('http://flow935.com/playlist/flowhis.HTM');
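foreach ($html->find('tr') as $row) {
    // Skip rows that lack the three expected cells (e.g. header rows)
    if (count($row->find('td')) < 3) {
        continue;
    }
    $time   = $row->find('td', 0)->plaintext;
    $artist = $row->find('td', 1)->plaintext;
    $title  = $row->find('td', 2)->plaintext;

    $table[$artist][$title] = true;
}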

echo '<pre>';
print_r($table);
echo '</pre>';

?>

We can see that this code can be (trivially) changed to reformat the data in any other way as well.
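As one example of such a reformatting, the rows can be turned into a list of records keyed by the header cells. Note that this sketch swaps in PHP's bundled DOM extension rather than Simple HTML DOM so that it runs without extra files. The playlist markup and column names are assumptions, not the real page.

```php
<?php
// Hypothetical playlist markup; the column names are assumptions.
$html = '<table>'
      . '<tr><th>Time</th><th>Artist</th><th>Title</th></tr>'
      . '<tr><td>10:01</td><td>Artist A</td><td>Song One</td></tr>'
      . '<tr><td>10:05</td><td>Artist B</td><td>Song Two</td></tr>'
      . '</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$headers = [];
$records = [];
foreach ($dom->getElementsByTagName('tr') as $tr) {
    $ths = $tr->getElementsByTagName('th');
    if ($ths->length > 0) {
        // Header row: remember the lower-cased column names
        foreach ($ths as $th) {
            $headers[] = strtolower(trim($th->textContent));
        }
        continue;
    }

    $values = [];
    foreach ($tr->getElementsByTagName('td') as $td) {
        $values[] = trim($td->textContent);
    }

    // Keep only rows whose cell count matches the header
    if ($headers && count($values) === count($headers)) {
        $records[] = array_combine($headers, $values);
    }
}
print_r($records);
```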

Parse HTML Table - PHP

As you're prepared to look beyond PHP, Nokogiri (Ruby) and Beautiful Soup (Python) are well-established libraries that parse HTML very well.

That doesn't imply that there are no suitable PHP libraries.

Parse HTML table in php

Try iterating over the child nodes of the P elements:

// $rows is assumed to be the list of tr elements
foreach ($rows as $row)
{
    $paragraphs = $row->getElementsByTagName('p');
    // skip rows that contain no p element
    if (!$paragraphs->length) {
        continue;
    }
    //ensure that all the text between <br> is in one text node
    $paragraphs->item(0)->normalize();
    foreach ($paragraphs->item(0)->childNodes as $node) {
        if ($node->nodeType == XML_TEXT_NODE) {
            echo $node->nodeValue . '<br/>';
        }
    }
}

It is important to call normalize() on the p element so that each run of text between br elements ends up in a single text node. For example, <p>Calories (kcal)<br>Energy (kj)<br>...</p> then yields the text nodes Calories (kcal) and Energy (kj), rather than fragments such as Cal, ories ( and kcal), which can occur without normalizing.
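The effect of normalize() can be seen in isolation with a p element built by hand. This is a sketch: the adjacent text pieces are created deliberately to mimic a fragmented node, which a real parser may or may not produce.

```php
<?php
$dom = new DOMDocument();
$p   = $dom->createElement('p');
$dom->appendChild($p);

// Build <p>Calories (kcal)<br>Energy (kj)</p>, adding the first label in three pieces
foreach (['Cal', 'ories (', 'kcal)'] as $piece) {
    $p->appendChild($dom->createTextNode($piece));
}
$p->appendChild($dom->createElement('br'));
$p->appendChild($dom->createTextNode('Energy (kj)'));

// Merge adjacent text nodes so each label is a single node
$p->normalize();

$texts = [];
foreach ($p->childNodes as $node) {
    if ($node->nodeType === XML_TEXT_NODE) {
        $texts[] = $node->nodeValue;
    }
}
print_r($texts);
```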


