Decode Gzipped Web Page Retrieved Via Curl in PHP

I use cURL and set:

curl_setopt($ch, CURLOPT_ENCODING , "gzip");

PHP cURL response shows gzip or encoded data

In the header I allowed only gzip and deflate and removed br, and it worked for me. So instead of $header[] = 'Accept-Encoding: gzip, deflate, br'; I used $header[] = 'Accept-Encoding: gzip, deflate';

Thanks for the help, everyone.
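
A minimal sketch of the same fix using CURLOPT_ENCODING instead of a hand-written Accept-Encoding header. The helper name gzip_deflate_options and the example URL are hypothetical and not from the original answer. With CURLOPT_ENCODING set, cURL both advertises these encodings and decodes the response itself.

```php
<?php
// Hypothetical helper: cURL options that advertise only gzip and deflate.
function gzip_deflate_options(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        // Sends "Accept-Encoding: gzip, deflate" and lets cURL decode the body.
        CURLOPT_ENCODING       => 'gzip, deflate',
    ];
}

// Usage (performs a real request, so it is left commented out):
// $ch = curl_init();
// curl_setopt_array($ch, gzip_deflate_options('http://example.com/'));
// $body = curl_exec($ch);
// curl_close($ch);
```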

How to decode Content-Encoding: gzip, gzip using curl?

If the body is still gzip-compressed after cURL has processed it, you can decode it by trimming off the 10-byte gzip header and using gzinflate.

$url = "http://www.dealstan.com";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); // Define target site
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Return page in string
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/5.0.342.3 Safari/533.2');
curl_setopt($ch, CURLOPT_ENCODING, "gzip"); // Ask for gzip; cURL decodes what it can
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects

$return = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);

// If the body is still gzip-compressed (Content-Encoding: gzip, gzip),
// skip the 10-byte gzip header and inflate the rest
$return = gzinflate(substr($return, 10));
print_r($return);
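
To show why the fixed offset of 10 works, here is a small sketch that builds a gzip stream locally instead of fetching one. The sample string and variable names are made up for illustration. The 10-byte offset only holds when the gzip header carries no optional fields (file name, comment, extra data), so gzdecode is usually the safer choice.

```php
<?php
// Build a gzip stream locally so the example needs no network access.
$original = 'Hello, gzip!';
$gz = gzencode($original);

// A gzip stream starts with the magic bytes 0x1f 0x8b.
$isGzip = substr($gz, 0, 2) === "\x1f\x8b";

// Manual route: drop the 10-byte header and the 8-byte CRC32/length trailer,
// then inflate the raw deflate data in between.
$manual = gzinflate(substr($gz, 10, -8));

// gzdecode() parses the header and checks the trailer for you.
$decoded = gzdecode($gz);

// A body that was compressed twice needs two passes.
$twice = gzdecode(gzdecode(gzencode($gz)));
```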

Requesting a GZIP'ed page and processing with cURL and PHP

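You can request a gzip-encoded response with curl_setopt, like this:

curl_setopt($curl, CURLOPT_ENCODING, 'gzip');

With this option set, cURL normally decodes the response for you. If the body is still compressed (for example, because the server compressed it twice), you can decompress it with gzdecode like this: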

$response = gzdecode($response);
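
Calling gzdecode on a body that cURL has already decoded fails, so it can help to check first. A minimal sketch, assuming a hypothetical helper named maybe_gunzip that is not part of the original answer:

```php
<?php
// Hypothetical helper: decompress only when the data looks like gzip.
function maybe_gunzip(string $data): string
{
    // gzip streams begin with the magic bytes 0x1f 0x8b.
    if (strncmp($data, "\x1f\x8b", 2) !== 0) {
        return $data; // already decoded, or never compressed
    }
    $decoded = @gzdecode($data);
    // Fall back to the input if it only looked like gzip.
    return $decoded === false ? $data : $decoded;
}
```

Usage would be $response = maybe_gunzip($response); right after curl_exec.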

decode curl response gzip multipart attachment in PHP

It turns out the solution is really simple, but I hadn't thought of it before.
Once I had extracted the decoded attachment, all I needed was:

$xml_string = gzdecode($decoded_attachment);

The result is the expected XML attachment.

curl php not getting response

Try removing this line: $headers[]='Accept-Encoding:gzip,deflate,br';
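
One possible reason this helps is that the local cURL build cannot decode Brotli (br). A minimal sketch for checking that, assuming a hypothetical helper named curl_supports_brotli; the CURL_VERSION_BROTLI constant only exists on newer PHP and libcurl versions, so the code guards for it:

```php
<?php
// Hypothetical helper: does the local libcurl build support Brotli ("br")?
function curl_supports_brotli(): bool
{
    // CURL_VERSION_BROTLI is only defined by newer PHP/libcurl combinations.
    if (!defined('CURL_VERSION_BROTLI')) {
        return false;
    }
    $info = curl_version();
    return ($info['features'] & CURL_VERSION_BROTLI) !== 0;
}

// Only advertise "br" when cURL can actually decode it.
$encodings = curl_supports_brotli() ? 'gzip, deflate, br' : 'gzip, deflate';
```

The resulting string could then be passed to CURLOPT_ENCODING rather than set as a raw header.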

How to find out in PHP, if the output will be gzipped by Apache?

After searching and trying out a bit, I found out several things.

1) PHP itself may compress the output through the zlib extension. This can be disabled at runtime:

ini_set('zlib.output_compression', false);

2) Gzip compression might additionally be applied by an Apache module. It is not possible to see from PHP whether this will happen after the code has run, but there is a fairly reliable way to prevent it:

header("Content-Encoding: none");

This is not standards-compliant, but it makes Apache assume that the content cannot be compressed, so it won't step in.

There may be many other situations (nginx, another gzipping extension, and so on), but in most cases this combination will do the trick:

// disable zlib
ini_set('zlib.output_compression', false);
// Force termination of all instantiated buffers
while (@ob_end_flush());

// prevent apache from gzipping
header("Content-Encoding: none");

// prevent the browsers from showing a cached version before showing the new one
header('Cache-Control: no-cache');

// Start the output to enable buffering
header('Content-Type: text/html; charset=utf-8' );

// Push the beginning of the page to the browser
ob_flush();
flush();

// Do stuff here.

Hope this helps someone.
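
As a rough way to inspect the two cases above from PHP, here is a minimal sketch. Both helper names are hypothetical. Note that a loaded mod_deflate only means Apache could compress the response, not that it will.

```php
<?php
// Hypothetical helper: is PHP's own zlib output compression switched on?
function php_output_compression_enabled(): bool
{
    $value = (string) ini_get('zlib.output_compression');
    // The setting may be a boolean-like string or a buffer size in bytes.
    return filter_var($value, FILTER_VALIDATE_BOOLEAN) || (int) $value > 0;
}

// Hypothetical helper: is mod_deflate loaded? Returns null when unknown,
// because apache_get_modules() only exists when PHP runs inside Apache.
function apache_deflate_loaded(): ?bool
{
    if (!function_exists('apache_get_modules')) {
        return null;
    }
    return in_array('mod_deflate', apache_get_modules(), true);
}
```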

How to parse response compressed using GZip in php?

Quote: "As you see, the response is .gz format". No, it is not. The server says Content-Type: application/x-gzip, but that is wrong: it is an XML file named "markets_20151127T065210.gz".

Quote: "how to parse that response compressed using GZip in php script and echo xml format?"
Your XML is in $xml by line 4 here:

<?php
$ch=hhb_curl_init();
$xml=hhb_curl_exec2($ch,'http://services.eoddsmaker.net/demo/feeds/V1.0/markets.ashx?l=1&bid=43&sid=50&cid=58&lid=10&u=kwaninmacau&p=kwaninmacau',$headers,$cookies,$requeststring);
var_dump('headers:',$headers,'cookies:',$cookies,'requeststring:',$requeststring,'xml:',$xml);

function hhb_curl_init($custom_options_array = array())
{
if (empty($custom_options_array)) {
$custom_options_array = array();
//i feel kinda bad about this.. argv[1] of curl_init wants a string(url), or NULL
//at least i want to allow NULL aswell :/
}
if (!is_array($custom_options_array)) {
throw new InvalidArgumentException('$custom_options_array must be an array!');
}
;
$options_array = array(
CURLOPT_AUTOREFERER => true,
CURLOPT_BINARYTRANSFER => true,
CURLOPT_COOKIESESSION => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_FORBID_REUSE => false,
CURLOPT_HTTPGET => true,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 11,
CURLOPT_ENCODING => ""
//CURLOPT_REFERER=>'example.org',
//CURLOPT_USERAGENT=>'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0'
);
if (!array_key_exists(CURLOPT_COOKIEFILE, $custom_options_array)) {
//do this only conditionally because tmpfile() call..
static $curl_cookiefiles_arr = array(); //workaround for https://bugs.php.net/bug.php?id=66014
$curl_cookiefiles_arr[] = $options_array[CURLOPT_COOKIEFILE] = tmpfile();
$options_array[CURLOPT_COOKIEFILE] = stream_get_meta_data($options_array[CURLOPT_COOKIEFILE]);
$options_array[CURLOPT_COOKIEFILE] = $options_array[CURLOPT_COOKIEFILE]['uri'];

}
//we can't use array_merge() because of how it handles integer-keys, it would/could cause corruption
foreach ($custom_options_array as $key => $val) {
$options_array[$key] = $val;
}
unset($key, $val, $custom_options_array);
$curl = curl_init();
if($curl===false){
throw new RuntimeException('could not create a curl handle! curl_init() returned false');
}
if(false===curl_setopt_array($curl, $options_array)){
$errno=curl_errno($curl);
$error=curl_error($curl);
throw new RuntimeException('could not set options on curl! curl_setopt_array returned false. curl_errno :'.$errno.'. curl_error: '.$error);
}
return $curl;
}
function hhb_curl_exec($ch, $url)
{
static $hhb_curl_domainCache = "";//warning, this will not work properly with 2 different curl's visiting 2 different sites.
//should probably use SplObjectStorage here, so each curl can have its own cache..
//$hhb_curl_domainCache=&$this->hhb_curl_domainCache;
//$ch=&$this->curlh;
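//PHP 8+ returns a CurlHandle object from curl_init(), older versions return a resource
if (!($ch instanceof CurlHandle) && !(is_resource($ch) && get_resource_type($ch) === 'curl')) {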
throw new InvalidArgumentException('$ch must be a curl handle!');
}
if (!is_string($url)) {
throw new InvalidArgumentException('$url must be a string!');
}

$tmpvar = "";
if (parse_url($url, PHP_URL_HOST) === null) {
if (substr($url, 0, 1) !== '/') {
$url = $hhb_curl_domainCache . '/' . $url;
} else {
$url = $hhb_curl_domainCache . $url;
}
}
;

if(false===curl_setopt($ch, CURLOPT_URL, $url)){
$errno=curl_errno($ch);
$error=curl_error($ch);
throw new RuntimeException('could not set CURLOPT_URL on curl! curl_setopt returned false. curl_errno :'.$errno.'. curl_error: '.$error.'. url: '.var_export($url,true));
}
$html = curl_exec($ch);
if (curl_errno($ch)) {
throw new Exception('Curl error (curl_errno=' . curl_errno($ch) . ') on url ' . var_export($url, true) . ': ' . curl_error($ch));
// echo 'Curl error: ' . curl_error($ch);
}
if ($html === '' && 204 != ($tmpvar = curl_getinfo($ch, CURLINFO_HTTP_CODE)) /*204 is "No Content": success, but no output..*/ ) {
throw new Exception('Curl returned nothing for ' . var_export($url, true) . ' but HTTP_RESPONSE_CODE was ' . var_export($tmpvar, true));
}
;
//remember that curl (usually) auto-follows the "Location: " http redirects..
$hhb_curl_domainCache = parse_url(curl_getinfo($ch, CURLINFO_EFFECTIVE_URL), PHP_URL_HOST);
return $html;
}
function hhb_curl_exec2($ch, $url, &$returnHeaders = array(), &$returnCookies = array(), &$verboseDebugInfo = "")
{
$returnHeaders = array();
$returnCookies = array();
$verboseDebugInfo = "";
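//PHP 8+ returns a CurlHandle object from curl_init(), older versions return a resource
if (!($ch instanceof CurlHandle) && !(is_resource($ch) && get_resource_type($ch) === 'curl')) {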
throw new InvalidArgumentException('$ch must be a curl handle!');
}
if (!is_string($url)) {
throw new InvalidArgumentException('$url must be a string!');
}
$verbosefileh = tmpfile();
if($verbosefileh===false){
throw new RuntimeException('can not create a tmpfile for curl\'s stderr. tmpfile returned false');
}
$verbosefile = stream_get_meta_data($verbosefileh);
$verbosefile = $verbosefile['uri'];
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_STDERR, $verbosefileh);
curl_setopt($ch, CURLOPT_HEADER, 1);
$html = hhb_curl_exec($ch, $url);
$verboseDebugInfo = file_get_contents($verbosefile);
curl_setopt($ch, CURLOPT_STDERR, NULL);
fclose($verbosefileh);
unset($verbosefile, $verbosefileh);
$headers = array();
$crlf = "\x0d\x0a";
$thepos = strpos($html, $crlf . $crlf, 0);
$headersString = substr($html, 0, $thepos);
$headerArr = explode($crlf, $headersString);
$returnHeaders = $headerArr;
unset($headersString, $headerArr);
$htmlBody = substr($html, $thepos + 4); //should work on utf8/ascii headers... utf32? not so sure..
unset($html);
//I REALLY HOPE THERE EXIST A BETTER WAY TO GET COOKIES.. good grief this looks ugly..
//at least it's tested and seems to work perfectly...
$grabCookieName = function($str,&$len)
{
$len=0;
$ret = "";
$i = 0;
for ($i = 0; $i < strlen($str); ++$i) {
++$len;
if ($str[$i] === ' ') {
continue;
}
if ($str[$i] === '=') {
--$len;
break;
}
$ret .= $str[$i];
}
return urldecode($ret);
};
foreach ($returnHeaders as $header) {
//Set-Cookie: crlfcoookielol=crlf+is%0D%0A+and+newline+is+%0D%0A+and+semicolon+is%3B+and+not+sure+what+else
/*Set-Cookie:ci_spill=a%3A4%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%22305d3d67b8016ca9661c3b032d4319df%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A14%3A%2285.164.158.128%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A109%3A%22Mozilla%2F5.0+%28Windows+NT+6.1%3B+WOW64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F43.0.2357.132+Safari%2F537.36%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1436874639%3B%7Dcab1dd09f4eca466660e8a767856d013; expires=Tue, 14-Jul-2015 13:50:39 GMT; path=/
Set-Cookie: sessionToken=abc123; Expires=Wed, 09 Jun 2021 10:18:14 GMT;
//Cookie names cannot contain any of the following '=,; \t\r\n\013\014'
//
*/
if (stripos($header, "Set-Cookie:") !== 0) {
continue;
/**/
}
$header = trim(substr($header, strlen("Set-Cookie:")));
$len=0;
while (strlen($header) > 0) {
$cookiename = $grabCookieName($header,$len);
$returnCookies[$cookiename] = '';
$header = substr($header, $len + 1); //also remove the =
if (strlen($header) < 1) {
break;
}
;
$thepos = strpos($header, ';');
if ($thepos === false) { //last cookie in this Set-Cookie.
$returnCookies[$cookiename] = urldecode($header);
break;
}
$returnCookies[$cookiename] = urldecode(substr($header, 0, $thepos));
$header = trim(substr($header, $thepos + 1)); //also remove the ;
}
}
unset($header, $cookiename, $thepos);
return $htmlBody;
}
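
The header/body split that hhb_curl_exec2 performs can be sketched on its own. The helper name split_http_response and the sample response are hypothetical. Like the code above, it splits only at the first blank line, so with CURLOPT_FOLLOWLOCATION the body may still begin with the header block of a later response; CURLINFO_HEADER_SIZE is an alternative in that case.

```php
<?php
// Hypothetical helper: split a raw HTTP response into header lines and body.
function split_http_response(string $raw): array
{
    $crlf = "\r\n";
    $pos = strpos($raw, $crlf . $crlf);
    if ($pos === false) {
        // No blank line found: treat everything as body.
        return [[], $raw];
    }
    $headers = explode($crlf, substr($raw, 0, $pos));
    $body = (string) substr($raw, $pos + 4); // skip the CRLF CRLF separator
    return [$headers, $body];
}

[$headers, $body] = split_http_response(
    "HTTP/1.1 200 OK\r\nContent-Type: text/xml\r\n\r\n<markets/>"
);
```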

Is there any way to get curl to decompress a response without sending the Accept headers in the request?

Probably the easiest thing to do is just use gunzip to do it:

curl -sH 'Accept-encoding: gzip' http://example.com/ | gunzip -

Or there's also --compressed, which makes curl request a compressed response and decompress it automatically. Note that it also sends an Accept-Encoding header, so I'm not sure whether it meets your needs.


