PHP Curl Utf-8 Charset

Simple:
cURL hands you the response bytes exactly as the server sent them. If the page is UTF-8 and you need ISO-8859-1, you just need to decode the string.

Description

string utf8_decode ( string $data )

This function decodes data, assumed to be UTF-8 encoded, to ISO-8859-1. Note that utf8_decode is deprecated as of PHP 8.2.
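Because utf8_decode is deprecated as of PHP 8.2, the same conversion is usually written with mb_convert_encoding. The following is a minimal sketch, assuming the mbstring extension is loaded; the sample string is made up for illustration.

```php
<?php
// Illustrative UTF-8 input: "café", where é is the two bytes C3 A9.
$utf8 = "caf\xC3\xA9";

// Same direction as utf8_decode($utf8): UTF-8 in, ISO-8859-1 out.
// Argument order is (string, to-encoding, from-encoding).
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// é is now the single byte E9.
echo bin2hex($latin1), "\n"; // 636166e9
```

Characters that have no ISO-8859-1 equivalent cannot survive this conversion, so it only makes sense when the destination really has to be ISO-8859-1.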

HTML and PHP cURL response utf-8 encoding problem

Both pages are UTF-8 encoded, and cURL returns them as is. The problem lies in the subsequent processing: assuming that libxml2 is involved, it tries to guess the encoding from <meta> elements, and if there are none, it assumes ISO-8859-1. It can be forced to assume UTF-8 if a UTF-8 BOM ("\xEF\xBB\xBF") is prepended to the HTML.
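The BOM approach described above could be sketched like this. The helper name withUtf8Bom and the sample markup are hypothetical, and the DOMDocument part only runs when the DOM extension is available.

```php
<?php
// Prepend a UTF-8 BOM unless the markup already starts with one.
// (withUtf8Bom is a hypothetical helper name, not a built-in.)
function withUtf8Bom(string $html): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($html, $bom, 3) === 0 ? $html : $bom . $html;
}

// Hypothetical UTF-8 fragment with no <meta> charset declaration.
$html = "<p>caf\xC3\xA9</p>";
$prepared = withUtf8Bom($html);

// Hand the prepared markup to the parser if the DOM extension is loaded.
if (class_exists('DOMDocument')) {
    $doc = new DOMDocument();
    @$doc->loadHTML($prepared);
    $p = $doc->getElementsByTagName('p')->item(0);
    if ($p !== null) {
        echo $p->textContent, "\n";
    }
}
```

The helper is idempotent, so it is safe to call on markup that may already carry a BOM.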

Getting correct encoding from php cURL

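Chrome's default encoding is UTF-8. The suggestion here is to set cURL to UTF-8 as well:
curl_setopt($ch, CURLOPT_ENCODING, 'UTF-8');
Be aware, though, that CURLOPT_ENCODING only sets the Accept-Encoding request header, which negotiates content compression (gzip, deflate) and not the character set, so it does not transcode the response body.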

Detecting the encoding with mb_detect_encoding is also painful, since it can misfire in many ways. In this case, though, it can help if you specify the expected order of detection, like so:

mb_detect_encoding($val, 'UTF-8,ISO-8859-15');

In my personal experience, it is worthless unless you specify the candidate encodings in the right order. For example, you need to list UTF-8 before ISO-8859-1 in your encoding_list, or it will return ISO-8859-1 in most cases.
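Since detection is a guess, a stricter variant is to test for valid UTF-8 first and only fall back to a single-byte encoding when that test fails. This swaps mb_detect_encoding for mb_check_encoding, which gives a yes/no answer. The following is a sketch, assuming mbstring is loaded; ensureUtf8 and its ISO-8859-1 fallback are assumptions chosen for illustration.

```php
<?php
// Return $bytes as UTF-8. $fallback is an assumed source encoding that is
// used only when the input is not already valid UTF-8.
// (ensureUtf8 is a hypothetical helper name.)
function ensureUtf8(string $bytes, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

// "café" as ISO-8859-1 (é = E9) becomes UTF-8 (é = C3 A9).
echo bin2hex(ensureUtf8("caf\xE9")), "\n"; // 636166c3a9
```

This mirrors the "UTF-8 first" ordering advice: any byte string is valid ISO-8859-1, so UTF-8 has to be ruled in or out before the fallback is considered.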

UPDATE:

The docs say that CURLOPT_ENCODING => '' sends an Accept-Encoding header listing every encoding cURL supports, so you can try that. But as I said, since you are dealing with a known encoding, which is UTF-8, please try:

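$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_URL => $url,              // URL to fetch
    CURLOPT_HTTPHEADER => $headers,   // array of request header lines
    CURLOPT_ENCODING => 'UTF-8',      // sent as the Accept-Encoding request header
    CURLOPT_RETURNTRANSFER => true    // make curl_exec() return the body instead of printing it
));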

PHP CURL isn't processing encoded return data properly

Set the CURLOPT_ENCODING option to "" (an empty string). cURL then advertises every compression it supports and decompresses the response itself, so you do not get compressed garbage back:

   curl_setopt($ch, CURLOPT_ENCODING, "");

How do I encode content fetched via CURL in PHP?

After doing a lot more experimenting I stumbled on this solution, which fixed everything.

My script fetched the URL contents and loaded them into a DOM document like this:

$html = file_get_contents_curl($link_url);
$doc = new DOMDocument();
@$doc->loadHTML($html);

Per the linked article, I changed it to this:

$html = file_get_contents_curl($link_url);
$doc = new DOMDocument();
@$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

I also eliminated the use of utf8_decode.

And everything displayed properly.
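On PHP 8.2 and later, the 'HTML-ENTITIES' target of mb_convert_encoding is deprecated. One commonly used replacement is mb_encode_numericentity, sketched below under the assumption that the input is UTF-8 and mbstring is loaded. The helper name utf8ToNumericEntities and the sample markup are hypothetical.

```php
<?php
// Turn every non-ASCII character of a UTF-8 string into a numeric entity,
// leaving tags and ASCII text untouched.
// (utf8ToNumericEntities is a hypothetical helper name.)
function utf8ToNumericEntities(string $html): string
{
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($html, $map, 'UTF-8');
}

// Hypothetical UTF-8 fragment standing in for the fetched page.
$html = "<p>caf\xC3\xA9</p>";
$safe = utf8ToNumericEntities($html);
echo $safe, "\n"; // <p>caf&#233;</p>

// The result is pure ASCII, so the parser's encoding guess no longer matters.
if (class_exists('DOMDocument')) {
    $doc = new DOMDocument();
    @$doc->loadHTML($safe);
}
```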

php-curl doesn't support utf-8 in url

You just need to urlencode() the UTF-8 characters, like this:

php > $api = 'http://www.aparat.com/etc/api/';
php > $searchEndpoint = 'videoBySearch/text/';

php > var_dump($api . $searchEndpoint . urlencode('نوروز'));
string(79) "http://www.aparat.com/etc/api/videoBySearch/text/%D9%86%D9%88%D8%B1%D9%88%D8%B2"

php > $encodedUrl = $api . $searchEndpoint . urlencode('نوروز');
php > var_dump($encodedUrl);
string(79) "http://www.aparat.com/etc/api/videoBySearch/text/%D9%86%D9%88%D8%B1%D9%88%D8%B2"
php > var_dump(json_decode(file_get_contents($encodedUrl)));
object(stdClass)#1 (2) {
["videobysearch"]=>
array(20) {
[0]=>
object(stdClass)#2 (17) {
["id"]=>
string(7) "4126322"
["title"]=>
string(35) "استقبال از نوروز 1395"
["username"]=>
string(6) "tabaar"
...
}
...
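The session above uses urlencode, which is fine for this search word. For text placed in a URL path, rawurlencode is often preferred because it encodes a space as %20 rather than +. The following is a minimal sketch; buildSearchUrl is a hypothetical helper, and the endpoint is the one from the session.

```php
<?php
// Build a search URL by percent-encoding the search text as one path segment.
// (buildSearchUrl is a hypothetical helper name.)
function buildSearchUrl(string $base, string $text): string
{
    // rawurlencode works on the raw UTF-8 bytes of $text.
    return rtrim($base, '/') . '/' . rawurlencode($text);
}

$url = buildSearchUrl('http://www.aparat.com/etc/api/videoBySearch/text/', 'نوروز');
echo $url, "\n";
```

Only the variable part is encoded; running the whole URL through the encoder would also escape the slashes and the colon.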

Unicode is getting Encoded CURL PHP

First you have to identify the source website character encoding.

Choose a page and download it. In a terminal, type:

$ curl -D headers.txt -o page.html http://www.example.com/index.html

The response headers are saved into headers.txt, while the page's HTML source is stored in page.html.

Inspect the two files with a text editor and search for Content-Type. You should find an indication of the character encoding in at least one of them.

If you're not successful, you can use file to try to "guess" the character encoding by inspecting the file contents (-I is the macOS/BSD flag; on Linux the equivalent is usually -i):

$ file -I page.html

The output looks like this:

page.html: text/plain; charset=iso-8859-1

Second, you have to decide or understand what the destination character set is:

  • Are you storing the web page in a text file? What is the expected character encoding of the file?

  • Are you parsing the web page within PHP in order to extract some data of interest?

  • Are you serving the web page (totally or partially) back on your website? What is the character encoding of the website?

Let's assume (for example) you want to end up with Unicode characters encoded as UTF-8.


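Finally, improve your PHP script to do the proper charset conversion after the page is retrieved with $page = curl_exec($curl);

You may use mb_convert_encoding. Note the argument order: the target encoding comes before the source encoding. For the ISO-8859-1 page above, converted to UTF-8:

$page = mb_convert_encoding( $page, 'UTF-8', 'ISO-8859-1' );
// argument order: string, to-encoding, from-encoding

Alternatively, iconv can be used for the same purpose; its arguments run the other way round (from, to, string).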
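The three steps can be tied together in PHP by reading the charset from the Content-Type response header instead of inspecting files by hand. This is only a sketch: charsetFromContentType and toUtf8 are hypothetical helpers, the ISO-8859-1 fallback is an assumption, and a charset declared only inside the HTML is not handled.

```php
<?php
// Extract the charset parameter from a Content-Type value,
// e.g. "text/html; charset=iso-8859-1". Returns null when none is declared.
function charsetFromContentType(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Convert a response body to UTF-8 based on the declared charset.
// $fallback is an assumed source encoding for responses that declare none.
// mb_convert_encoding raises an error for encoding names it does not know.
function toUtf8(string $body, ?string $contentType, string $fallback = 'ISO-8859-1'): string
{
    $from = charsetFromContentType($contentType) ?? $fallback;
    return $from === 'UTF-8' ? $body : mb_convert_encoding($body, 'UTF-8', $from);
}

// Intended use after a real request (not executed here):
//   $page = curl_exec($curl);
//   $type = curl_getinfo($curl, CURLINFO_CONTENT_TYPE);
//   $page = toUtf8($page, is_string($type) ? $type : null);
echo bin2hex(toUtf8("caf\xE9", 'text/html; charset=iso-8859-1')), "\n"; // 636166c3a9
```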


