file_get_contents() Breaks Up UTF-8 Characters
Alright. I have found out the file_get_contents() is not causing this problem. There's a different reason which I talk about in another question. Silly me.
See this question: Why Does DOM Change Encoding?
How fix UTF-8 Characters in PHP file_get_contents()
According to PHP manual:
you can use this code, if you have problem with file_get_contents!
<?php
function file_get_contents_utf8($fn) {
$content = file_get_contents($fn);
return mb_convert_encoding($content, 'UTF-8',
mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}
?>
file_get_contents not working with utf8
Change your Accept-Charset
to UTF-8
because ISO-8859-1 does not support Thai characters. If you are running your PHP script on a windows machine, you may also use the windows-874
charset, and you may also try adding this header :
Content-Language: th
But in most cases, UTF-8 will handle pretty much most characters or character sets without any other declaration.
** UPDATE **
Very strange, but this works for me.
$opts = array(
'http'=>array(
'method'=>"GET",
'header'=> implode("\r\n", array(
'Content-type: text/plain; charset=TIS-620'
//'Content-type: text/plain; charset=windows-874' // same thing
))
)
);
$context = stream_context_create($opts);
//$fp = fopen('http://thaipope.org/webbible/01_002.htm', 'rb', false, $context);
//$contents = stream_get_contents($fp);
//fclose($fp);
$contents = file_get_contents("http://thaipope.org/webbible/01_002.htm",false, $context);
header('Content-type: text/html; charset=TIS-620');
//header('Content-type: text/html; charset=windows-874'); // same thing
echo $contents;
Apparently, I was wrong for this one about UTF-8. See here for more details. Though you can still have an UTF-8 output :
$in_charset = 'TIS-620'; // == 'windows-874'
$out_charset = 'utf-8';
$opts = array(
'http'=>array(
'method'=>"GET",
'header'=> implode("\r\n", array(
'Content-type: text/plain; charset=' . $in_charset
))
)
);
$context = stream_context_create($opts);
$contents = file_get_contents("http://thaipope.org/webbible/01_002.htm",false, $context);
if ($in_charset != $out_charset) {
$contents = iconv($in_charset, $out_charset, $contents);
}
header('Content-type: text/html; charset=' . $out_charset);
echo $contents; // output in UTF-8
file_get_contents show characters of utf-8 like question marks
[php]
//charset.php?case=1
//charset.php?case=2
//charset.php?case=3
$case = isset($_GET['case']) ? $_GET['case'] : 1;
if( !in_array($case,range(1,3)) ) $case = 1;
if( $case==1 ) {
header("Content-type: text/html; charset=tis-620"); //http://htmlpurifier.org/docs/enduser-utf8.html
$str = "https://www.google.co.th/search?q=sd";
}
if( $case==2 ) {
header("Content-type: text/html; charset=ISO-8859-1");
$str = "https://www.google.de/search?q=sd";
}
if( $case==3 ) {
header("Content-type: text/html; charset=ISO-8859-9");
$str = "https://www.google.com.tr/search?q=sd";
}
$data = file_get_contents($str);
echo $data;
[/php]
as you can see ... the correct charset in php header is the solution
file_get_contents return strange symbols instead website content
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL, 'https://olbi.su');
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_ENCODING , 'gzip');
$content = curl_exec ($ch);
print_r($content);
This will show you the whole page like you expect. Use curl library so you can apply encoding.
file_get_contents - Special characters in URL - Special case
URLs cannot contain "Ö"! Start from this basic premise. Any characters not within a narrowly defined subset of ASCII must be URL-encoded to be represented within a URL. The right way to do that is to urlencode
or rawurlencode
(depending on which format the server expects) the individual segment of the URL, not the URL as a whole.
E.g.:
$url = sprintf('https://se.timeedit.net/web/liu/db1/schema/s/s.html?tab=3&object=%s&type=subgroup&startdate=20150101&enddate=20300501',
rawurlencode('CM_949A11_1534_1603_DAG_DST_50_ÖVRIGT_1_1'));
You will still need to use the correct encoding for the string! Ö
in ISO-8859-1 would be URL encoded to %D6
, while in UTF-8 it would be encoded to %C3%96
. Which one is the correct one depends on what the server expects.
How to get file content with a proper utf-8 encoding using file_get_contents?
How about this one????
For this one I used header('Content-Type: text/plain;; charset=Windows-1250');
bergamot, citrón, tráva, rebarbora, bazalka;levanduľa, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn
This code works for me
<?php
header('Content-Type: text/plain;charset=Windows-1250');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>
The problem is not with file_get_contents()
I save the $data to a file and the characters were correct but still not encoded correctly by my text editor. See image below.
$data = file_get_contents('http://www.parfumeriafox.sk/source_file.html');
file_put_contents('doc.txt',$data);
UPDATE
Seems to be one problematic character as shown here.
It also is seen on the HTML image below. Renders as ¾
Its Hex value is xBE (190 decimal)
I tried these two character sets. Neither worked.
header('Content-Type: text/plain; charset=ISO 8859-1');
header('Content-Type: text/plain; charset=ISO 8859-2');
END OF UPDATE
It works by adding a header WITHOUT charset=utf-8.
These two headers work
header('Content-Type: text/plain');
header('Content-Type: text/html');
These two headers do NOT work
header('Content-Type: text/plain; charset=utf-8');
header('Content-Type: text/html; charset=utf-8');
This code is tested and displayed all characters.
<?php
header('Content-Type: text/plain');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>
<?php
header('Content-Type: text/html');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>
These are some of the problematic characters with their Hex values.
This is the saved file viewed in Notepad++ with UTF-8 Encoding.
Check the Hex values against these character sets.
From the above table I saw the character set was Latin2.
I went to Wikipedia Windows code page and found that Latin2 is Windows-1250
bergamot, citrón, tráva, rebarbora, bazalka;levanduľa, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn
Related Topics
How to Increase Maximum Execution Time in PHP
PHP Simplexml How to Save the File in a Formatted Way
PHP.Ini & Smtp= - How to Pass Username & Password
How to Properly Url Encode a String in PHP
A Non Well Formed Numeric Value Encountered
Where Does PHP'S Error Log Reside in Xampp
Convert Dot Syntax Like "This.That.Other" to Multi-Dimensional Array in PHP
PHP If Statement With Multiple Conditions
MySQLi Equivalent of MySQL_Result()
How to Create a Simple 'Hello World' Module in Magento
PHP, Get File Name Without File Extension
Upload Photo to Album With Facebook'S Graph API
File_Put_Contents - Failed to Open Stream: Permission Denied
How to Gracefully Handle Files That Exceed PHP'S 'Post_Max_Size'