Web scraping in PHP
I recommend the simple_html_dom library for this; it makes the job very easy.
Here is a working example that pulls the page title and the first image.
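<?php
require 'simple_html_dom.php';
// file_get_html() returns false if the page cannot be loaded.
$html = file_get_html('http://www.google.com/');
if (!$html) {
    exit("Could not load the page\n");
}
// find($selector, 0) returns the first match, or null if there is none.
$title = $html->find('title', 0);
$image = $html->find('img', 0);
echo ($title ? $title->plaintext : '')."<br>\n";
echo $image ? $image->src : '';
?>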
Here is a second example that does the same without an external library. Note that using regex on HTML is NOT a good idea, because the patterns break easily when the markup changes.
<?php
$data = file_get_contents('http://www.google.com/');
// preg_match() returns 1 on a match; fall back to an empty string otherwise.
$title = preg_match('/<title>([^<]+)<\/title>/i', $data, $matches) ? $matches[1] : '';
$img = preg_match('/<img[^>]*src=[\'"]([^\'"]+)[\'"][^>]*>/i', $data, $matches) ? $matches[1] : '';
echo $title."<br>\n";
echo $img;
?>
Simple web scraping in PHP
There are a couple of ways to scrape websites: one is to use CSS selectors and another is to use XPath. Both select elements from the DOM.
Since I can't see the full HTML of the webpage, it is hard for me to say which method is better for you. There is another option that may be frowned upon, but in this case it might work.
You could use a regex (regular expression) to find the characters. I'm not the best at regular expressions, but here is some sample code showing how that might work:
<?php
$subject = "<html><body><p>Some User</p><p>User status: Online.</p></body></html>";
$pattern = '/User status: (.*)\<\/p\>/';
preg_match($pattern, $subject, $matches);
print_r($matches);
?>
Sample output:
Array
(
    [0] => User status: Online.</p>
    [1] => Online.
)
Basically, the regex above looks for the string "User status: " and then captures all the characters (.*) up to the closing paragraph tag (escaped). Note that .* is greedy, so if several closing paragraph tags follow on the same line it captures up to the last one.
Here is a pattern that returns just "Online" without the period. I wasn't sure whether all statuses end in a period, but this is what it would look like:
'/User status: (.*)\.\<\/p\>/'
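For comparison, the XPath route mentioned at the start could look roughly like this on the same sample string. This is only a sketch: it assumes the status always sits in a paragraph whose text starts with "User status:", and the variable names are my own.

```php
<?php
$subject = "<html><body><p>Some User</p><p>User status: Online.</p></body></html>";

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXPath($doc);

// Assumption: the status lives in a <p> whose text starts with "User status:".
$prefix = 'User status:';
$nodes = $xpath->query("//p[starts-with(normalize-space(.), '$prefix')]");

$status = null;
if ($nodes->length > 0) {
    $text = trim($nodes->item(0)->textContent);
    // Drop the label, the surrounding whitespace and a trailing period, if present.
    $status = rtrim(trim(substr($text, strlen($prefix))), '.');
}
echo $status, "\n";
```

For the sample string this prints Online. Unlike the regex, it does not depend on how the closing tag is written.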
Website Scraping Using PHP
I'm not an XPath guru, but what I would do is first target that particular table, using the text CATEGORIES as the needle, then get the rows that follow it and loop over the rows found.
Rough example:
$grep = new DOMDocument();
@$grep->loadHTMLFile("http://www.tradeindia.com/");
$finder = new DOMXpath($grep);
$products = array();
$nodes = $finder->query("
//td[@class='showroom1'][contains(text(), 'CATEGORIES')]
/parent::tr/parent::table/parent::td/parent::tr
/following-sibling::tr
/td[1]/table/tr/td/table/tr
");
if ($nodes->length > 0) {
    foreach ($nodes as $tr) {
        // Skip rows that contain no links at all.
        if ($finder->evaluate('count(./td/a)', $tr) > 0) {
            foreach ($finder->query('./td/a[@class="cate_menu"]', $tr) as $row) {
                $text = $row->nodeValue;
                // The count is the text node right after the link; it may be missing.
                $sibling = $finder->query('./following-sibling::text()', $row)->item(0);
                $number = $sibling ? $sibling->nodeValue : '';
                $products[] = "$text $number";
            }
        }
    }
}
echo '<pre>';
print_r($products);
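Since the query above depends on the live page, here is a small offline sketch of the same idea: a link followed by a text node that holds the count. The markup and the category names are made up for illustration; only the cate_menu class name comes from the code above.

```php
<?php
// Hypothetical markup standing in for one block of the category table.
$html = '<table>'
      . '<tr><td><a class="cate_menu" href="#">Apparel</a> (120)</td></tr>'
      . '<tr><td><a class="cate_menu" href="#">Chemicals</a> (85)</td></tr>'
      . '<tr><td>No link in this row</td></tr>'
      . '</table>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$finder = new DOMXPath($doc);

$products = array();
foreach ($finder->query('//tr') as $tr) {
    foreach ($finder->query('./td/a[@class="cate_menu"]', $tr) as $link) {
        // The count is the first text node that follows the link inside the same cell.
        $sibling = $finder->query('./following-sibling::text()', $link)->item(0);
        $number = $sibling ? trim($sibling->nodeValue) : '';
        $products[] = trim($link->nodeValue . ' ' . $number);
    }
}
print_r($products);
```

Rows without a matching link simply contribute nothing, which is why the third row is skipped.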
Web Scraping with PHP Goutte
The redmart.com website uses React to generate its content in the browser. A scraper like Goutte only fetches the initial HTML and does not run JavaScript, so it never sees that content. Instead, open the developer console in Firefox or Google Chrome and look at the network requests the page makes.
In this case, the page requests a URL (via Ajax) that returns JSON, which React then renders: https://api.redmart.com/v1.6.0/catalog/search?q=apple&pageSize=18&sort=1024&variation=BETA
With PHP, you just call json_decode on the response and you have everything you need.
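A minimal sketch of the decoding step follows. The response format of that API is not shown here, so the JSON below is a hypothetical stand-in and the field names "products", "title" and "price" are placeholders to adapt to what you see in the developer console.

```php
<?php
// Hypothetical response body; in a real script it would come from
// file_get_contents() or cURL on the API URL.
$response = '{"products":[{"title":"Fuji Apple","price":2.5},{"title":"Gala Apple","price":3.1}]}';

// Passing true decodes JSON objects into associative arrays.
$data = json_decode($response, true);
if (json_last_error() !== JSON_ERROR_NONE) {
    exit('Invalid JSON: ' . json_last_error_msg() . "\n");
}

// Collect one field from every item in the (assumed) "products" list.
$titles = array();
foreach ($data['products'] as $product) {
    $titles[] = $product['title'];
}
print_r($titles);
```

Checking json_last_error() matters because json_decode returns null both for invalid input and for a literal JSON null.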
How to crawl page in PHP?
The site is protected by Cloudflare. You can get past the Cloudflare check only with JavaScript enabled, so a plain command-line request is not going to work. You can, however, automate this with Puppeteer, for example, which is also available in PHP (via PuPHPeteer). Note that you have to disable headless mode to make it work.
Installation
composer require nesk/puphpeteer
npm install @nesk/puphpeteer
The script (test.php)
<?php
use Nesk\Puphpeteer\Puppeteer;
require_once __DIR__ . "/vendor/autoload.php";
// Return the value of the hidden csrfmiddlewaretoken input, or null if it is absent.
function getToken($content)
{
    if (preg_match('/input type="hidden" name="csrfmiddlewaretoken" value="(.+?)"/i', $content, $matches)) {
        return $matches[1];
    }
    return null;
}
$puppeteer = new Puppeteer;
$browser = $puppeteer->launch(['headless'=>false]);
/** @var \Nesk\Puphpeteer\Resources\Page $page */
$page = $browser->newPage();
$page->goto('https://v2.gcchmc.org/medical-status-search/');
var_dump(getToken($page->content()));
$browser->close();
Now, you probably don't need the csrfmiddlewaretoken when running the script like this, but you can take it further from here if you choose to use it.
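If you do take it further, the token can also be read with DOMDocument rather than a regex. This is a sketch on a made-up form: the HTML and the token value are invented, getTokenDom is a hypothetical helper name, and only the field name csrfmiddlewaretoken comes from the script above.

```php
<?php
// Hypothetical helper: returns the token value, or null if the field is missing.
function getTokenDom($content)
{
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($content);
    $xpath = new DOMXPath($doc);
    $input = $xpath->query('//input[@name="csrfmiddlewaretoken"]')->item(0);
    return $input ? $input->getAttribute('value') : null;
}

// Made-up page content; with the script above you would pass $page->content().
$sample = '<form method="post"><input type="hidden" name="csrfmiddlewaretoken" value="abc123"></form>';
var_dump(getTokenDom($sample));
```

Matching on the name attribute alone means the lookup keeps working if the attribute order in the input tag changes.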
web scraping php specific div
Successfully tested now. Sometimes simpler is better. Divide and conquer: use explode and a helper function that grabs the string between two other strings (in your case, the contents of the table cell with the "number" class, up to the closing td tag).
$htmlOficial = file_get_contents('https://www.dolarhoy.com/cotizaciondolaroficial');
$chunk = strbtw($htmlOficial, 'Banco Nacion', '</tr>');
$number_chunks = explode('class="number"', $chunk);
$ventaOficial = strbtw($number_chunks[1], '>', '</td>');
$compraOficial = strbtw($number_chunks[2], '>', '</td>');
echo "ventaOficial[{$ventaOficial}]<br/>";
echo "compraOficial[{$compraOficial}]<br/>";
function strbtw($text, $str1, $str2 = "", $trim = true) {
    $len = strlen($str1);
    $pos_str1 = strpos($text, $str1);
    if ($pos_str1 === false) return "";
    $pos_str1 += $len;
    if (empty($str2)) { // try to search up to the end of line
        $pos_str2 = strpos($text, "\n", $pos_str1);
        if ($pos_str2 === false) $pos_str2 = strpos($text, "\r\n", $pos_str1);
    }
    else $pos_str2 = strpos($text, $str2, $pos_str1);
    if ($pos_str2 !== false) {
        if ($pos_str2 - $pos_str1 === 0) $rez = substr($text, $pos_str1);
        else $rez = substr($text, $pos_str1, $pos_str2 - $pos_str1);
    }
    else $rez = substr($text, $pos_str1);
    return ($trim) ? trim($rez) : ($rez);
}
Please let me know if it works.
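If the string splitting ever becomes brittle, the same cells can be picked out with DOMXPath. The sketch below runs on a made-up table row: the real page markup is not shown here, so the structure and the values are assumptions, and only the "number" class and the "Banco Nacion" label come from the code above.

```php
<?php
// Hypothetical row; the values are invented for illustration.
$html = '<table><tr><td>Banco Nacion</td>'
      . '<td class="number">$ 60,00</td>'
      . '<td class="number">$ 64,00</td></tr></table>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// All "number" cells in the row that mentions the bank name.
$cells = $xpath->query('//tr[td[contains(., "Banco Nacion")]]/td[@class="number"]');

$values = array();
foreach ($cells as $cell) {
    $values[] = trim($cell->textContent);
}
print_r($values);
```

Which cell is the buy rate and which is the sell rate depends on the real column order, so check that against the page before naming the variables.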