How to parse HTML table using PHP?
For tidy HTML, one parsing approach is the DOM extension. It turns your HTML into a tree of objects and lets you look up the elements you want and read their values, tag names, and so on.
The official documentation for PHP's DOM extension is available at http://php.net/manual/en/book.dom.php
To find the values of the second column of the given table, the following DOM implementation can be used:
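<?php
$data = file_get_contents('http://mytemporalbucket.s3.amazonaws.com/code.txt');
$dom = new DOMDocument;
// Set parser options before loading; setting them afterwards has no effect
$dom->preserveWhiteSpace = false;
// @ suppresses the warnings libxml emits for malformed HTML
@$dom->loadHTML($data);
$tables = $dom->getElementsByTagName('table');
// Indexes are zero-based: item(1) is the second table in the document
$rows = $tables->item(1)->getElementsByTagName('tr');
foreach ($rows as $row) {
    $cols = $row->getElementsByTagName('td');
    // item(2) is the third <td> of the row; adjust the index to the column you need.
    // Rows without that cell (e.g. header rows using <th>) are skipped.
    $col = $cols->item(2);
    if ($col !== null) {
        // A DOMElement cannot be echoed directly, so print its text content
        echo trim($col->textContent), "\n";
    }
}
?>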
Reference: adapted from the code provided at "How to parse this table and extract data from it?" to fit this question.
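The same idea can be wrapped in a small helper and tried against an inline fragment. This is only a sketch: the function name table_column() and the sample markup are made up for illustration, and both indexes are zero-based, just like item().

```php
<?php
// Hypothetical helper: returns the trimmed text of one column of one table.
// $tableIndex and $colIndex are zero-based, like DOMNodeList::item().
function table_column(string $html, int $tableIndex, int $colIndex): array
{
    $dom = new DOMDocument();
    // Collect libxml's complaints about sloppy HTML instead of printing warnings
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $table = $dom->getElementsByTagName('table')->item($tableIndex);
    if ($table === null) {
        return []; // no table at that position
    }

    $values = [];
    foreach ($table->getElementsByTagName('tr') as $row) {
        $cell = $row->getElementsByTagName('td')->item($colIndex);
        // Header rows built from <th> have no <td> and are skipped
        if ($cell !== null) {
            $values[] = trim($cell->textContent);
        }
    }
    return $values;
}

// Made-up sample markup
$html = '<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
  <tr><td>Pear</td><td>0.90</td></tr>
</table>';

// First table (index 0), second column (index 1)
print_r(table_column($html, 0, 1));
```

Returning an array instead of echoing inside the loop keeps the extraction separate from the output, which makes it easier to reuse the values afterwards.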
PHP DOM Parser parse html table
// $doc is assumed to be a DOMDocument already loaded with the page's HTML
$field_names = ['username', 'phone', 'status'];
$result = [];
// Search for div tags having tbl-process-mobile class
$containers = $doc->getElementsByTagName('div');
foreach ($containers as $container) {
    // getAttribute() returns '' when the class attribute is missing
    if (false === strpos($container->getAttribute('class'), 'tbl-process-mobile'))
        continue;
    // Assume that tbody tags are required;
    // get the first tbody (there should not be more)
    $tbodies = $container->getElementsByTagName('tbody');
    if (!$tbodies->length || !$tbody = $tbodies->item(0))
        continue;
    foreach ($tbody->getElementsByTagName('tr') as $tr) {
        $i = 0;
        $row = [];
        $cells = $tr->getElementsByTagName('td');
        // Collect the first count($field_names) cell values as maximum
        foreach ($field_names as $name) {
            if (!$td = $cells->item($i++))
                break;
            $row[$name] = trim($td->textContent);
        }
        if ($row)
            $result[] = $row;
    }
}
var_dump($result);
Sample Output
array(2) {
[0]=>
array(3) {
["username"]=>
string(14) "randomusername"
["phone"]=>
string(10) "0123456789"
["status"]=>
string(6) "active"
}
[1]=>
array(3) {
["username"]=>
string(15) "randomusername2"
["phone"]=>
string(10) "0987654321"
["status"]=>
string(6) "active"
}
}
No comments, as the code is self-explanatory.
P.S.: from a parsing point of view, the HTML structure leaves a lot to be desired.
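For comparison, the class lookup and the row walk can also be expressed with DOMXPath. The sketch below is not the answer's method; it runs against made-up markup, and it differs in two ways: it matches tbl-process-mobile as a whole class name rather than as a substring, and it visits the rows of every tbody inside the div instead of only the first one.

```php
<?php
// Made-up markup: a div with the target class wrapping a table
$html = '<div class="box tbl-process-mobile">
  <table><tbody>
    <tr><td> randomusername </td><td>0123456789</td><td>active</td></tr>
    <tr><td>randomusername2</td><td>0987654321</td><td>active</td></tr>
  </tbody></table>
</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match the class as a whole word inside the space-separated class attribute
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " tbl-process-mobile ")]//tbody/tr';

$field_names = ['username', 'phone', 'status'];
$result = [];
foreach ($xpath->query($query) as $tr) {
    $i = 0;
    $row = [];
    // "td" is evaluated relative to the current row
    foreach ($xpath->query('td', $tr) as $td) {
        if (!isset($field_names[$i])) {
            break; // more cells than field names
        }
        $row[$field_names[$i++]] = trim($td->textContent);
    }
    if ($row) {
        $result[] = $row;
    }
}
var_dump($result);
```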
How to extract data from an HTML table using PHP
You can use XPath to query('//td') and retrieve the td HTML using C14N(), something like:
$dom = new DOMDocument();
$dom->loadHTML($html);
$x = new DOMXPath($dom);
foreach ($x->query('//td') as $td) {
    echo $td->C14N();
    // if you just need the text, use:
    // echo $td->textContent;
}
Output:
<td><b>Job Title:</b></td>
<td>Job Example </td>
<td><b>Job ID:</b></td>
...
C14N() returns the canonicalized nodes as a string, or FALSE on failure.
Update:
"Another question, how can I grab individual Table Data? For example, just grab Job ID"
Use the XPath contains() function, i.e.:
foreach($x->query('//td[contains(., "Job ID:")]') as $td){
echo $td->textContent;
}
Update V2:
"How can I get the next Table Data after that (to actually get the Job Id)?"
Use following-sibling::*[1], i.e.:
echo $x->query('//td[contains(*, "Job ID:")]/following-sibling::*[1]')->item(0)->textContent;
//23992
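To try the label lookup end to end, here is a minimal sketch against a made-up fragment modelled on the output shown earlier. The helper name value_after_label() is hypothetical, and it uses following-sibling::td[1] instead of *[1] so that only td siblings are considered.

```php
<?php
// Made-up fragment modelled on the output shown earlier
$html = '<table>
  <tr><td><b>Job Title:</b></td><td>Job Example </td></tr>
  <tr><td><b>Job ID:</b></td><td>23992</td></tr>
</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();
$x = new DOMXPath($dom);

// Hypothetical helper: text of the cell that follows the cell containing $label
function value_after_label(DOMXPath $x, string $label): ?string
{
    // The label is embedded in the expression, so it must not contain double quotes
    $nodes = $x->query('//td[contains(., "' . $label . '")]/following-sibling::td[1]');
    if ($nodes === false || $nodes->length === 0) {
        return null; // invalid expression or no match
    }
    return trim($nodes->item(0)->textContent);
}

var_dump(value_after_label($x, 'Job ID:')); // string(5) "23992"
var_dump(value_after_label($x, 'Salary:')); // NULL
```

Checking the node list before calling item(0) avoids an error when the label is not present on the page.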
Parse html table using file_get_contents to php array
Don't cripple yourself parsing HTML with regexps! Instead, let an HTML parser library worry about the structure of the markup for you.
I suggest you check out Simple HTML DOM (http://simplehtmldom.sourceforge.net/). It is a library written specifically to help with this kind of web scraping problem in PHP. With such a library, you can write your scraper in far fewer lines of code, without having to come up with working regexps.
In principle, with Simple HTML DOM you just write something like:
$html = file_get_html('http://flow935.com/playlist/flowhis.HTM');
foreach($html->find('tr') as $row) {
// Parse table row here
}
This can then be extended to capture your data in some format, for instance to build an array of artists and their corresponding titles:
<?php
require('simple_html_dom.php');
$table = array();
$html = file_get_html('http://flow935.com/playlist/flowhis.HTM');
foreach($html->find('tr') as $row) {
    // Skip rows that do not have three <td> cells (e.g. header rows)
    if (!$row->find('td',2)) {
        continue;
    }
    $time = $row->find('td',0)->plaintext;
    $artist = $row->find('td',1)->plaintext;
    $title = $row->find('td',2)->plaintext;
    // Group titles by artist; $time is read but not stored here
    $table[$artist][$title] = true;
}
echo '<pre>';
print_r($table);
echo '</pre>';
?>
We can see that this code can be (trivially) changed to reformat the data in any other way as well.
Parse HTML Table - PHP
As you're prepared to look beyond PHP, Nokogiri (Ruby) and Beautiful Soup (Python) are well-established libraries that parse HTML very well.
That doesn't imply that there are no suitable PHP libraries.
Parse HTML table in php
Try iterating over the child nodes of the p elements:
foreach ($rows as $row)
{
    $paragraphs = $row->getElementsByTagName('p');
    // ensure that all the text between <br> is in one text node
    $paragraphs->item(0)->normalize();
    foreach ($paragraphs->item(0)->childNodes as $node) {
        if ($node->nodeType == XML_TEXT_NODE) {
            echo $node->nodeValue . '<br/>';
        }
    }
}
It is important to call normalize() on the p element, to ensure that the text between br elements ends up in one text node each rather than being split up. For example, <p>Calories (kcal)<br>Energy (kj)<br>...</p> will yield the text nodes Calories (kcal) and Energy (kj), not Cal, ories (, kcal) and so on, which it might without normalizing.
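The effect of normalize() can be observed on a hand-built element. Whether a parser actually produces fragmented text nodes depends on the input, so this sketch creates the fragments explicitly; the element and the text pieces are made up for illustration.

```php
<?php
$doc = new DOMDocument();
$p = $doc->createElement('p');
$doc->appendChild($p);

// Build <p>Calories (kcal)<br>Energy (kj)</p>, deliberately adding the
// first label as three separate text fragments
foreach (['Cal', 'ories (', 'kcal)'] as $piece) {
    $p->appendChild($doc->createTextNode($piece));
}
$p->appendChild($doc->createElement('br'));
$p->appendChild($doc->createTextNode('Energy (kj)'));

echo 'child nodes before normalize(): ', $p->childNodes->length, "\n";

// Merge adjacent text nodes so each run of text between <br> is one node
$p->normalize();

echo 'child nodes after normalize(): ', $p->childNodes->length, "\n";
foreach ($p->childNodes as $node) {
    if ($node->nodeType == XML_TEXT_NODE) {
        echo $node->nodeValue, "\n";
    }
}
```

After the call, the paragraph holds one text node per label with the br element between them, which is what the loop over childNodes relies on.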