Web scraping in PHP
I recommend you consider simple_html_dom for this. It will make it very easy.
Here is a working example of how to pull the title, and first image.
<?php
require 'simple_html_dom.php';
$html = file_get_html('http://www.google.com/');
$title = $html->find('title', 0);
$image = $html->find('img', 0);
echo $title->plaintext."<br>\n";
echo $image->src;
?>
Here is a second example that will do the same without an external library. I should note that using regex on HTML is NOT a good idea.
<?php
$data = file_get_contents('http://www.google.com/');
preg_match('/<title>([^<]+)<\/title>/i', $data, $matches);
$title = $matches[1];
preg_match('/<img[^>]*src=[\'"]([^\'"]+)[\'"][^>]*>/i', $data, $matches);
$img = $matches[1];
echo $title."<br>\n";
echo $img;
?>
Simple web scraping in PHP
There are a couple ways to scrape websites, one would be to use CSS Selectors and another would be to use XPath, which both select elements from the DOM.
Since I can't see the full HTML of the webpage it would be hard for me to determine which method is better for you. There is another option which may be frowned upon, but in this case it might work.
You could use a Regex (regular expressions) to find the characters, I'm not the best at regular expressions but here is some sample code of how that might work:
<?php
$subject = "<html><body><p>Some User</p><p>User status: Online.</p></body></html>";
$pattern = '/User status: (.*)\<\/p\>/';
preg_match($pattern, $subject, $matches);
print_r($matches);
?>
Sample output:
Array
(
[0] => User status: Online.</p>
[1] => Online.
)
Basically what the regex above is doing is matching a pattern, in this case it looks for the string "User status: " then matches all the characters (.*) up to the ending paragraph tag (escaped).
Here is the pattern that will return just "Online" without the period, wasn't sure if all statuses ended in a period but here is what it would look like:
'/User status: (.*)\.\<\/p\>/'
Website Scraping Using PHP
I'm not an xpath guru, but what I would do is to target first that particular table using that needle categories, then from there get those rows based on that and start looping on found rows.
Rough example:
$grep = new DOMDocument();
@$grep->loadHTMLFile("http://www.tradeindia.com/");
$finder = new DOMXpath($grep);
$products = array();
$nodes = $finder->query("
//td[@class='showroom1'][contains(text(), 'CATEGORIES')]
/parent::tr/parent::table/parent::td/parent::tr
/following-sibling::tr
/td[1]/table/tr/td/table/tr
");
if($nodes->length > 0) {
foreach($nodes as $tr) {
if($finder->evaluate('count(./td/a)', $tr) > 0) {
foreach($finder->query('./td/a[@class="cate_menu"]', $tr) as $row) {
$text = $row->nodeValue;
$number = $finder->query('./following-sibling::text()', $row)->item(0)->nodeValue;
$products[] = "$text $number";
}
}
}
}
echo '<pre>';
print_r($products);
Sample Output
How to scrape a large table from a php website using R
The page source is generated by JS. Here is what you do:
- Open the Dev Tool of the browser and go to the Network tab.
- Click on one of the pages and see what's going on (I clicked to page 4). You can see that the page sent a POST request to
https://www.metabolomicsworkbench.org/data/mb_structure_tableonly.php
and get the content of it.
Here are the parameters: - Mimic the POST request by
rvest
. Here is the code to scrape all pages:
library(rvest)
url <- "https://www.metabolomicsworkbench.org/data/mb_structure_tableonly.php"
pg <- html_session(url)
data <-
purrr::map_dfr(
1:4288, # you might wanna change it to a small number to try first or scrape multiple times and combine data frames later, in case something happens in the middle
function(i) {
pg <- rvest:::request_POST(pg,
url,
body = list(
page = i
))
read_html(pg) %>%
html_node("table") %>%
html_table()
}
)
PHP Web Scraping from table HTML tags
Yes, i would recommend using Xpath
<h1>This is scraping flight radar:</h1>
<?php
$url = "https://www.flightradar24.com/data/flights/southwest-airlines-wn-swa";
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$doc = new \DOMDocument();
if($doc->loadHTML($html))
{
$result = new \DOMDocument();
$result->formatOutput = true;
$table = $result->appendChild($result->createElement("table"));
$thead = $table->appendChild($result->createElement("thead"));
$tbody = $table->appendChild($result->createElement("tbody"));
$xpath = new \DOMXPath($doc);
$newRow = $thead->appendChild($result->createElement("tr"));
foreach($xpath->query("//table[@id='tbl-datatable']/thead/tr/th[position()>1]") as $header)
{
$newRow->appendChild($result->createElement("th", trim($header->nodeValue)));
}
foreach($xpath->query("//table[@id='tbl-datatable']/tbody/tr") as $row)
{
$newRow = $tbody->appendChild($result->createElement("tr"));
foreach($xpath->query("./td[position()>1 and position()<7]", $row) as $cell)
{
$newRow->appendChild($result->createElement("td", trim($cell->nodeValue)));
}
}
echo $result->saveXML($result->documentElement);
}
?>
Related Topics
Strict Standards: MySQLi_Next_Result() Error With MySQLi_Multi_Query
Print Numeric Values to Two Decimal Places
PHPdoc Type Hinting For Array of Objects
How to Post Pictures to Instagram Using API
How Do PHP Sessions Work? (Not "How Are They Used")
How to Test a Url For 404 in PHP
Getting the Names of All Files in a Directory With PHP
PHP: Is MySQL_Real_Escape_String Sufficient For Cleaning User Input
PHP - Insert a Variable in an Echo String
Detect Language from String in PHP
MySQL Prepared Statements With a Variable Size Variable List
Sum Values of Multidimensional Array by Key Without Loop
How to Echo or Print an Array in PHP
Get the Client Ip Address Using PHP
Declaration of Methods Should Be Compatible With Parent Methods in PHP