Using XPath with PHP to Parse HTML

Using XPath with PHP to parse HTML from a website

Basically, just target the div/table that holds the name of the show and the time slot.

Rough example:

// it seems it doesn't work when there is no user agent
$ch = curl_init('http://www.starplus.in/schedule.aspx');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$shows = array();
$tables = $xpath->query("//div[@class='sech_div_bg']/table"); // target that table

foreach ($tables as $table) {
    $time_slot = $xpath->query('./tr[1]/td/span', $table)->item(0)->nodeValue;
    $show_name = $xpath->query('./tr[3]/td/span', $table)->item(0)->nodeValue;
    $shows[] = array('time_slot' => $time_slot, 'show_name' => $show_name);
    echo "$time_slot - $show_name <br/>";
}

// echo '<pre>';
// print_r($shows);
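
The loop above calls ->item(0)->nodeValue directly, which fails on any table that lacks the expected row. A minimal sketch of a guarded variant follows, assuming made-up markup in place of the fetched page; the first_text() helper and the sample HTML are hypothetical and not part of the original answer.

```php
<?php
// Hypothetical markup standing in for the fetched schedule page.
$page = '
<div class="sech_div_bg">
  <table>
    <tr><td><span>8:00 PM</span></td></tr>
    <tr><td></td></tr>
    <tr><td><span>Show A</span></td></tr>
  </table>
</div>
<div class="sech_div_bg">
  <table>
    <tr><td><span>9:00 PM</span></td></tr>
  </table>
</div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Hypothetical helper: trimmed text of the first match, or null when nothing matches.
function first_text(DOMXPath $xpath, string $expr, DOMNode $context): ?string
{
    $node = $xpath->query($expr, $context)->item(0);
    return $node === null ? null : trim($node->nodeValue);
}

$shows = array();
foreach ($xpath->query("//div[@class='sech_div_bg']/table") as $table) {
    $time_slot = first_text($xpath, './tr[1]/td/span', $table);
    $show_name = first_text($xpath, './tr[3]/td/span', $table);
    if ($time_slot === null || $show_name === null) {
        continue; // skip incomplete tables instead of raising an error
    }
    $shows[] = array('time_slot' => $time_slot, 'show_name' => $show_name);
}

print_r($shows);
```

In this sample the second table has no third row, so it is skipped and only the first show ends up in $shows.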

Using XPath to parse HTML from a website

Yes, you can use that date to get the shows for that day. Use it as a needle to find that particular table row.

First find the row the date falls in, then get the entries in that row. Example:

$dat = "Oct 18";
$ch = curl_init('http://www.starplus.in/schedule.aspx');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$shows = array();
$node_list = $xpath->query("
//td[@class='bdr_R_dot']/span[text() = '$dat']
/parent::td/following-sibling::td
/table/tr/td[3]/div/ul/li
");

echo $dat . '<br/><br/>';
foreach ($node_list as $el) {
    $time_slot = $xpath->query('./div/table/tr[1]/td/span', $el)->item(0)->nodeValue;
    $show_name = $xpath->query('./div/table/tr[3]/td/span', $el)->item(0)->nodeValue;

    echo "$time_slot : $show_name <br/>";
}
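
The query above finds the cell holding the date, steps up to its td with parent::, and then moves sideways with following-sibling::. A reduced sketch of the same navigation is shown here, assuming a much simpler, hypothetical table layout than the real page uses. Note that $dat is interpolated into the expression, so a value containing a single quote would break it.

```php
<?php
// Hypothetical, simplified markup: one row per date, shows listed in the next cell.
$page = '
<table>
  <tr>
    <td class="bdr_R_dot"><span>Oct 17</span></td>
    <td><ul><li>7:00 PM : Show X</li></ul></td>
  </tr>
  <tr>
    <td class="bdr_R_dot"><span>Oct 18</span></td>
    <td><ul><li>8:00 PM : Show A</li><li>9:00 PM : Show B</li></ul></td>
  </tr>
</table>';
$dat = 'Oct 18';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Find the date span, step up to its <td>, move to the sibling cell,
// then collect the <li> items inside that cell.
$items = $xpath->query(
    "//td[@class='bdr_R_dot']/span[text() = '$dat']"
    . "/parent::td/following-sibling::td//li"
);

$shows = array();
foreach ($items as $li) {
    $shows[] = trim($li->nodeValue);
}

print_r($shows);
```

Only the two entries from the "Oct 18" row are collected; the "Oct 17" row is never visited.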

Parsing an HTML page using curl and xpath in PHP

Here is a PHP script that extracts the data you asked for into a nicely structured array. You can look at the result and change the structure as you need. Cheers!

$html = file_get_contents("https://www.galliera.it/118");

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

// find all divs class row
$rows = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' row ')]");

$data = array();
foreach ($rows as $row) {
    $groupName = $row->getElementsByTagName('h2')->item(0)->textContent;
    $data[$groupName] = array();

    // find all div class box
    $boxes = $finder->query("./*[contains(concat(' ', normalize-space(@class), ' '), ' box ')]", $row);
    foreach ($boxes as $box) {
        $subgroupName = $box->getElementsByTagName('h3')->item(0)->textContent;
        $data[$groupName][$subgroupName] = array();

        $listItems = $box->getElementsByTagName('li');
        foreach ($listItems as $k => $li) {

            $class = $li->getAttribute('class');
            $text = $li->textContent;

            if (!strlen(trim($text))) {
                // this should be the graph bar, so skip it
                continue;
            }

            // I see only integer numbers so I cast to int; otherwise you can change the type or even not cast it
            $data[$groupName][$subgroupName][] = array('type' => $class, 'value' => (int) $text);
        }
    }
}

echo '<pre>' . print_r($data, true) . '</pre>';

The output looks something like:

Array
(
[SAN MARTINO - 15:30] => Array
(
[ATTESA: 22] => Array
(
[0] => Array
(
[type] => rosso
[value] => 1
)

[1] => Array
(
[type] => giallo
[value] => 12
)

[2] => Array
(
[type] => verde
[value] => 7
)

[3] => Array
(
[type] => bianco
[value] => 2
)

)

[VISITA: 45] => Array
(
[0] => Array
(
[type] => rosso
[value] => 5
)
...
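
The script above matches classes with contains(concat(' ', normalize-space(@class), ' '), ' row ') rather than a plain contains(@class, 'row'). A small sketch of why, using made-up markup (the class names "narrow" and "wide" are only there for illustration): the padded form matches whole class tokens, while the plain form also matches substrings.

```php
<?php
// Hypothetical markup: only the first and third div carry the "row" class token.
$html = '<div class="row">a</div>'
      . '<div class="narrow">b</div>'
      . '<div class="box row  wide">c</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$finder = new DOMXPath($dom);

// Substring test: also matches "narrow" because it contains "row".
$naive = $finder->query("//div[contains(@class, 'row')]");

// Token test: pad the normalized class list with spaces and look for " row ".
$token = $finder->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' row ')]"
);

echo $naive->length, "\n"; // 3
echo $token->length, "\n"; // 2
```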

Parsing HTML with XPath and PHP

To just get all the P elements not within a table and only before the first h1, you can do

$xp = new DOMXPath($dom);
$expression = '//p[not(preceding::h1[1]) and not(ancestor::table)]';
foreach ($xp->query($expression) as $node) {
    echo $dom->saveXml($node);
}

In general, if you know the position of the first h1 in the document, it is more performant to use a direct path to that element instead of the // query, which searches the whole document. For instance, as an alternative you could use the XPath suggested by Alejandro in the comments:

/descendant::h1[1]/preceding::p[not(ancestor::table)]
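
To see that both expressions select the same nodes, here is a short sketch run against a made-up XML document (the sample content is an assumption, not the asker's data):

```php
<?php
// Hypothetical source document with paragraphs before and after the first h1.
$xml = '<root>
  <p>intro</p>
  <table><tr><td><p>in table</p></td></tr></table>
  <h1>First</h1>
  <p>after first h1</p>
  <h1>Second</h1>
  <p>after second h1</p>
</root>';

$dom = new DOMDocument;
$dom->loadXML($xml);
$xp = new DOMXPath($dom);

// Variant 1: every p with no h1 before it and no table ancestor.
$a = $xp->query('//p[not(preceding::h1[1]) and not(ancestor::table)]');

// Variant 2: start at the first h1 and look backwards for p elements.
$b = $xp->query('/descendant::h1[1]/preceding::p[not(ancestor::table)]');

foreach ($a as $node) {
    echo $node->textContent, "\n";
}
foreach ($b as $node) {
    echo $node->textContent, "\n";
}
```

Both queries return only the "intro" paragraph: the paragraph inside the table is filtered out, and the later paragraphs have an h1 before them.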

If you want to create a new DOM Document from the nodes in the source document, you have to import the nodes into a new document.

// src document
$dom = new DOMDocument;
$dom->loadXML($xml);

// dest document
$new = new DOMDocument;
$new->formatOutput = TRUE;

// xpath setup
$xp = new DOMXPath($dom);
$expr = '//p[not(preceding::h1[1]) and not(ancestor::table)]';

// importing nodes into dest document
foreach ($xp->query($expr) as $node) {
    $new->appendChild($new->importNode($node, TRUE));
}

// output dest document
echo $new->saveXML();


Some more additions

In your example, you used the error suppression operator. This is bad practice. If you want to disregard any parsing errors from DOM, use

libxml_use_internal_errors(TRUE); // catch any DOM errors with libxml
$dom = new DOMDocument; // remove the @ as it is bad practice
$dom->loadXML($xhtml); // use loadHTML if it's not valid XHTML
libxml_clear_errors(); // disregards any DOM related errors

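Removing nodes with DOM is always the same approach. Find the node you want to remove. Get to its parentNode and call removeChild on it with the node to be removed as the argument.

// Copy the live node list to an array first; removing nodes while
// iterating the live list directly would skip elements.
foreach (iterator_to_array($dom->getElementsByTagName('foo')) as $node) {
    $node->parentNode->removeChild($node);
}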

You can also navigate to sibling nodes (and child nodes) without XPath. Here is how to remove all following siblings after the first h1 element:

$firstH1 = $dom->getElementsByTagName('h1')->item(0);
while ($firstH1->nextSibling !== NULL) {
    $firstH1->parentNode->removeChild($firstH1->nextSibling);
}
echo $dom->saveXml();

Removing nodes from the DOMDocument affects the DOMDocument immediately. In the code above, we are always querying for the first following sibling of the first h1. If there is one, it is removed from the DOMDocument. nextSibling will then point to the sibling after the one just removed (if any).
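
A self-contained sketch of that sibling-removal loop, run on a tiny made-up document (the sample XML is an assumption for illustration):

```php
<?php
// Hypothetical document: one paragraph before the h1, two after it.
$xml = '<body><p>keep</p><h1>Title</h1><p>drop 1</p><p>drop 2</p></body>';

$dom = new DOMDocument;
$dom->loadXML($xml);

$firstH1 = $dom->getElementsByTagName('h1')->item(0);

// Each removal takes effect immediately, so nextSibling always refers to
// the node that now directly follows the h1, until none is left.
while ($firstH1->nextSibling !== null) {
    $firstH1->parentNode->removeChild($firstH1->nextSibling);
}

$result = $dom->saveXML($dom->documentElement);
echo $result, "\n"; // <body><p>keep</p><h1>Title</h1></body>
```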


Fetching and printing all paragraphs is equally easy. To get the outerXML, just pass the node for which you want the outerXML to the saveXML method.

foreach ($dom->getElementsByTagName('p') as $paragraph) {
    echo $dom->saveXml($paragraph);
}

Anyway, that should get you going. I suggest you familiarize yourself with the DOM API. It's not difficult. You will find that most of the things you will do revolve around properties and methods in DOMDocument, DOMNode and DOMElement (which is a subclass of DOMNode).

PHP Getting Text and Href from HTML page using XPATH

Focusing only on extracting the data (and not on formatting, etc.) and assuming your html is fixed like below, try something along the lines of:

 $str = '
<tbody>
<tr>
<td>
19-10-2020 @ 17:33
</td>
<td class="hidden-xs hidden-sm">
<a href="#" data-identifier="5f8db1c332ea9b22d375b7c0"></a>
</td>
</tr>
</tbody>
';
$doc = new DOMDocument();
$doc->loadHTML($str);
$doc = simplexml_import_dom($doc);
$dates = $doc->xpath('//td[1]');
$identifiers = $doc->xpath('//td/a[@href]/@data-identifier');

foreach (array_combine($dates, $identifiers) as $date => $identifier) {
    echo trim($date) . "\n";
    echo trim($identifier) . "\n";
}

Output:

19-10-2020 @ 17:33
5f8db1c332ea9b22d375b7c0
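
One caveat with array_combine: the dates become array keys, so two rows with the same timestamp would collapse into one entry. A possible alternative, sketched here with DOMXPath and a hypothetical second row added to the sample, is to read both values relative to each tr so they stay paired per row.

```php
<?php
// Sample markup; the second row and its identifier are made up for illustration.
$str = '
<table><tbody>
<tr>
  <td> 19-10-2020 @ 17:33 </td>
  <td class="hidden-xs hidden-sm"><a href="#" data-identifier="5f8db1c332ea9b22d375b7c0"></a></td>
</tr>
<tr>
  <td> 19-10-2020 @ 17:33 </td>
  <td class="hidden-xs hidden-sm"><a href="#" data-identifier="hypothetical-second-id"></a></td>
</tr>
</tbody></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($str);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$rows = array();
foreach ($xpath->query('//tr') as $tr) {
    // evaluate() returns a plain string for string-valued XPath expressions.
    $rows[] = array(
        'date'       => $xpath->evaluate('normalize-space(./td[1])', $tr),
        'identifier' => $xpath->evaluate('string(./td/a[@href]/@data-identifier)', $tr),
    );
}

print_r($rows);
```

Both rows survive even though they share the same date.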

How to parse html with php DomXpath, modify and save

The solution is simple:
I solved my problem with a set of methods such as createElement, setAttribute and appendChild. Example:

$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($output->getHTML(), 'HTML-ENTITIES', 'utf-8'));
$xpath = new DOMXPath($dom);
$tableProp = $xpath->query('//*[@class="smwb-factbox"][2]');
...
$th_el = $dom->createElement('th', $th_outer_inner_span_a_el);
...
$td_el = $dom->createElement('td', '');
$td_el->appendChild($td_el_outer_span);
$tr_el = $dom->createElement('tr', '');
$tr_el->setAttribute('class', 'smwb-propvalue');
$tr_el->appendChild($th_el);
$tr_el->appendChild($td_el);
$tableProp->item(0)->appendChild($tr_el);
$dom->saveHTML();
...

The idea is pretty simple.
I have a table in MediaWiki, so I find it, create a new row, insert it and then save the document. That's all.
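
Since the snippet above is elided in places, here is a complete minimal sketch of the same find, create, insert and save sequence. The table markup and cell texts are made up; only the class names come from the answer. It uses createTextNode for cell content because the value argument of createElement is not escaped, and it captures the return value of saveHTML, which returns the markup rather than printing it.

```php
<?php
// Hypothetical factbox table with a single existing row.
$html = '<table class="smwb-factbox">'
      . '<tr class="smwb-propvalue"><th>Existing</th><td>1</td></tr>'
      . '</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Find the table to extend.
$table = $xpath->query('//table[@class="smwb-factbox"]')->item(0);

// Build the new row: <tr class="smwb-propvalue"><th>..</th><td>..</td></tr>
$th = $dom->createElement('th');
$th->appendChild($dom->createTextNode('Added'));
$td = $dom->createElement('td');
$td->appendChild($dom->createTextNode('2 & 3')); // text nodes are escaped on output

$tr = $dom->createElement('tr');
$tr->setAttribute('class', 'smwb-propvalue');
$tr->appendChild($th);
$tr->appendChild($td);

// Insert the row and serialize just the table.
$table->appendChild($tr);
$result = $dom->saveHTML($table);
echo $result, "\n";
```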

Parsing HTML with PHP DOMXpath

It can be done, but because DOMXPath only supports XPath 1.0, it's not the most elegant solution.

Starting from $nodeList; given that your sample xml has only 3 events, this code will output the required information about the first two. Obviously, you can modify it for your actual code:

$nodeList = $xpath->query('//div[./div[@class="recording-item-inner"]]//div[@class="info"]');
$i = 1;
echo htmlspecialchars("<ul>", ENT_QUOTES);
echo "<br>";
foreach ($nodeList as $result) {
    if ($i++ > 2) break;
    echo htmlspecialchars("<li>", ENT_QUOTES);
    echo "Event 1 - " . $result->childNodes[1]->textContent . ", ";
    echo $result->childNodes[4]->textContent . ", ";
    echo $result->parentNode->getAttribute('href');
    echo htmlspecialchars("</li>", ENT_QUOTES);
    echo "<br>";
}
echo htmlspecialchars("</ul>", ENT_QUOTES);

Output:

<ul>
<li>Event 1 - Daily Event, 29 Jun 2020, /recordings/191</li>
<li>Event 1 - Daily Event B, 26 Jun 2020, /recordings/190</li>
</ul>

Retrieve data from html page using xpath and php

You have some errors in your code:

  1. You try to get the table from the URL http://aice.anie.it/quotazione-lme-rame/, but it's actually in an iframe located at http://www.aiceweb.it/it/frame_rame.asp, so fetch the iframe URL directly.

  2. You use the function loadHTML(), which loads an HTML string. What you need is the loadHTMLFile function, which takes the path or URL of an HTML document as a parameter (see http://www.php.net/manual/fr/domdocument.loadhtmlfile.php).

  3. You assume there is a tbody element on the page, but there is none. So remove that from your query.

Working code:

$url = 'http://www.aiceweb.it/it/frame_rame.asp';
echo "hello!\r\n";
$html = new DOMDocument();
@$html->loadHTMLFile($url);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query(".//*[@id='table33']/tr[2]/td[3]/b");

foreach ($nodelist as $n) {
    echo $n->nodeValue . "\n";
}
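
Point 3 is a common trap: browser developer tools show a tbody because browsers insert one, but DOMDocument's parser does not add it when the source lacks it. A small sketch with made-up table content (the cell values are assumptions) shows the difference between the two queries:

```php
<?php
// Hypothetical table without a <tbody> in the source markup.
$source = '<table id="table33">'
        . '<tr><td>a</td><td>b</td><td>c</td></tr>'
        . '<tr><td>x</td><td>y</td><td><b>123.45</b></td></tr>'
        . '</table>';

$html = new DOMDocument();
libxml_use_internal_errors(true);
$html->loadHTML($source);
libxml_clear_errors();
$xpath = new DOMXPath($html);

// Path copied from a browser inspector: finds nothing, there is no tbody node.
$withTbody = $xpath->query("//*[@id='table33']/tbody/tr[2]/td[3]/b");

// Path matching the markup as parsed by DOMDocument.
$withoutTbody = $xpath->query("//*[@id='table33']/tr[2]/td[3]/b");

echo $withTbody->length, "\n";    // 0
echo $withoutTbody->length, "\n"; // 1
```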

