File_Get_Contents() Breaks Up Utf-8 Characters

file_get_contents() Breaks Up UTF-8 Characters

Alright. I have found out the file_get_contents() is not causing this problem. There's a different reason which I talk about in another question. Silly me.

See this question: Why Does DOM Change Encoding?

How fix UTF-8 Characters in PHP file_get_contents()

According to PHP manual:

you can use this code, if you have problem with file_get_contents!

<?php
function file_get_contents_utf8($fn) {
     $content = file_get_contents($fn);
      return mb_convert_encoding($content, 'UTF-8',
          mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}
?>

file_get_contents not working with utf8

Change your Accept-Charset to UTF-8 because ISO-8859-1 does not support Thai characters. If you are running your PHP script on a windows machine, you may also use the windows-874 charset, and you may also try adding this header :

Content-Language: th

But in most cases, UTF-8 will handle pretty much most characters or character sets without any other declaration.

** UPDATE **

Very strange, but this works for me.

$opts = array(
  'http'=>array(
    'method'=>"GET",
    'header'=> implode("\r\n", array(
                   'Content-type: text/plain; charset=TIS-620'
                   //'Content-type: text/plain; charset=windows-874'  // same thing
                ))
  )
);

$context = stream_context_create($opts);

//$fp = fopen('http://thaipope.org/webbible/01_002.htm', 'rb', false, $context);
//$contents = stream_get_contents($fp);
//fclose($fp);
$contents = file_get_contents("http://thaipope.org/webbible/01_002.htm",false, $context);

header('Content-type: text/html; charset=TIS-620');
//header('Content-type: text/html; charset=windows-874');  // same thing

echo $contents;

Apparently, I was wrong for this one about UTF-8. See here for more details. Though you can still have an UTF-8 output :

$in_charset = 'TIS-620';   // == 'windows-874'
$out_charset = 'utf-8';

$opts = array(
  'http'=>array(
    'method'=>"GET",
    'header'=> implode("\r\n", array(
                   'Content-type: text/plain; charset=' . $in_charset
                ))
  )
);

$context = stream_context_create($opts);

$contents = file_get_contents("http://thaipope.org/webbible/01_002.htm",false, $context);
if ($in_charset != $out_charset) {
    $contents = iconv($in_charset, $out_charset, $contents);
}

header('Content-type: text/html; charset=' . $out_charset);

echo $contents;   // output in UTF-8

file_get_contents show characters of utf-8 like question marks

[php]

//charset.php?case=1
//charset.php?case=2
//charset.php?case=3

$case = isset($_GET['case']) ? $_GET['case'] : 1;

if( !in_array($case,range(1,3)) ) $case = 1;


if( $case==1 ) {
    header("Content-type: text/html; charset=tis-620"); //http://htmlpurifier.org/docs/enduser-utf8.html
    $str = "https://www.google.co.th/search?q=sd";
}

if( $case==2 ) {
    header("Content-type: text/html; charset=ISO-8859-1");
    $str = "https://www.google.de/search?q=sd";
}   

if( $case==3 ) {
    header("Content-type: text/html; charset=ISO-8859-9");
    $str = "https://www.google.com.tr/search?q=sd";
}


$data = file_get_contents($str);
echo $data;

[/php]

as you can see ... the correct charset in php header is the solution

file_get_contents return strange symbols instead website content

$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL, 'https://olbi.su');
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_ENCODING , 'gzip');
$content = curl_exec ($ch);

print_r($content);

This will show you the whole page like you expect. Use curl library so you can apply encoding.

file_get_contents - Special characters in URL - Special case

URLs cannot contain "Ö"! Start from this basic premise. Any characters not within a narrowly defined subset of ASCII must be URL-encoded to be represented within a URL. The right way to do that is to urlencode or rawurlencode (depending on which format the server expects) the individual segment of the URL, not the URL as a whole.

E.g.:

$url = sprintf('https://se.timeedit.net/web/liu/db1/schema/s/s.html?tab=3&object=%s&type=subgroup&startdate=20150101&enddate=20300501',
               rawurlencode('CM_949A11_1534_1603_DAG_DST_50_ÖVRIGT_1_1'));

You will still need to use the correct encoding for the string! Ö in ISO-8859-1 would be URL encoded to %D6, while in UTF-8 it would be encoded to %C3%96. Which one is the correct one depends on what the server expects.

How to get file content with a proper utf-8 encoding using file_get_contents?

How about this one????

For this one I used header('Content-Type: text/plain;; charset=Windows-1250');

bergamot, citrón, tráva, rebarbora, bazalka;levanduľa, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn

Sample Image

This code works for me

<?php
header('Content-Type: text/plain;charset=Windows-1250');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>

The problem is not with file_get_contents()

I save the $data to a file and the characters were correct but still not encoded correctly by my text editor. See image below.

$data = file_get_contents('http://www.parfumeriafox.sk/source_file.html');
file_put_contents('doc.txt',$data);

UPDATE

Seems to be one problematic character as shown here.
It also is seen on the HTML image below. Renders as ¾

Its Hex value is xBE (190 decimal)

I tried these two character sets. Neither worked.

header('Content-Type: text/plain; charset=ISO 8859-1');
header('Content-Type: text/plain; charset=ISO 8859-2');

Sample Image

END OF UPDATE

It works by adding a header WITHOUT charset=utf-8.

These two headers work

header('Content-Type: text/plain');
header('Content-Type: text/html');

These two headers do NOT work

header('Content-Type: text/plain; charset=utf-8');
header('Content-Type: text/html; charset=utf-8');

This code is tested and displayed all characters.

<?php
header('Content-Type: text/plain');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>

Sample Image

<?php
header('Content-Type: text/html');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>

Sample Image

These are some of the problematic characters with their Hex values.

This is the saved file viewed in Notepad++ with UTF-8 Encoding.

Sample Image

Check the Hex values against these character sets.

Sample Image

From the above table I saw the character set was Latin2.

I went to Wikipedia Windows code page and found that Latin2 is Windows-1250

bergamot, citrón, tráva, rebarbora, bazalka;levanduľa, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn

File_Get_Contents() Breaks Up Utf-8 Characters