Can a PHP File Name (or a Dir in Its Full Path) Have UTF-8 Characters?

Can a PHP file name (or a dir in its full path) have UTF-8 characters?

I came across the same problem, did some research, and concluded the following. This applies to PHP 5 on Windows; it is probably true on other platforms, but I haven't checked.

  1. All PHP file system functions (dir, is_dir, is_file, file, filemtime, filesize, file_exists, etc.) only accept and return file names in ISO-8859-1, irrespective of the default_charset set in the program or ini files.

  2. Where a filename contains a Unicode character, dir->read returns it as the corresponding ISO-8859-1 character if there is one; otherwise it substitutes a question mark.

  3. When referencing a file, e.g. in is_file or file, a UTF-8 file name will not be found if the name contains any character encoded as two or more bytes. However, is_file(utf8_decode($filename)) etc. will work provided every character in the name is representable in ISO-8859-1.

In other words, PHP 5 is not capable of addressing files whose names contain characters outside ISO-8859-1 at all.

If a UTF-8 URL with multibyte characters is requested and this corresponds directly to a file, PHP won't be able to open the file because it cannot address it.

If you simply want pretty URLs in your language the suggestion of using mod_rewrite seems like a good one.

But if you are storing and retrieving files uploaded and downloaded by users, this problem has to be resolved. There are several ways:

  • Use an arbitrary ASCII file name, such as an incrementing number, on the server and index the files in a database, an XML file or similar.
  • Store the files in the database itself as BLOBs.
  • Encode the filenames yourself. This makes it easier to see what is going on and is not subject to problems if your index gets corrupted. A good technique is to urlencode all incoming filenames when storing them on the server disk and urldecode them before setting the filename in the MIME header for the download. Every character other than letters, digits, '-', '_' and '.' is then encoded as %nn (urlencode turns a space into '+'; rawurlencode produces %20 instead), so problems with spaces in file names, cross-platform support and pattern matching are largely avoided.
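The encode-on-store, decode-on-download idea can be sketched roughly as follows. The helper names encodeStoredName() and decodeStoredName() are made up for this illustration; only rawurlencode() and rawurldecode() are standard PHP. The sketch assumes PHP 7 or later for the type declarations and uses rawurlencode() so that a space becomes %20 rather than '+'.

```php
<?php
// Hypothetical helpers for illustration; not part of PHP or of the original answer.

/** Turn a user-supplied UTF-8 name into an ASCII-only name for the disk. */
function encodeStoredName(string $utf8Name): string
{
    // '.' is not percent-encoded, so reject the special directory entries.
    if ($utf8Name === '' || $utf8Name === '.' || $utf8Name === '..') {
        throw new InvalidArgumentException('Not a usable file name');
    }
    // Every byte outside A-Z a-z 0-9 - _ . ~ becomes %XX, including '/' and '\'.
    return rawurlencode($utf8Name);
}

/** Recover the original UTF-8 name, e.g. for the download header. */
function decodeStoredName(string $storedName): string
{
    return rawurldecode($storedName);
}

$original = "r\xc3\xa9sum\xc3\xa9 2024.pdf";   // "résumé 2024.pdf" as UTF-8 bytes
$stored   = encodeStoredName($original);
echo $stored, "\n";                             // r%C3%A9sum%C3%A9%202024.pdf
echo decodeStoredName($stored) === $original ? "round trip ok\n" : "mismatch\n";
```

Because '/' is encoded as %2F, a name such as "../etc/passwd" ends up as a single harmless file name instead of a path.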

How do I use filesystem functions in PHP, using UTF-8 strings?

Just urlencode the string you want to use as a filename. All characters returned by urlencode are valid in filenames (NTFS/HFS/UNIX), and you can urldecode the filenames back to UTF-8 (or whatever encoding they were in).

Caveats (all apply to the solutions below as well):

  • After URL-encoding, the filename must still fit the filesystem's limit, typically 255 characters (bytes on most Unix filesystems). Each non-ASCII byte becomes three characters, so names grow quickly.
  • Unicode has multiple representations for many characters (precomposed forms versus combining characters). If you don't normalize your UTF-8, you may have trouble searching with glob or reopening an individual file.
  • You can't rely on scandir or similar functions for alphabetical sorting. You must urldecode the filenames and then use a sorting algorithm that is aware of UTF-8 (and collations).
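A rough sketch of how these caveats might be handled in code. prepareStoredName() and sortDecodedNames() are hypothetical helpers, the 255-byte limit and the 'en_US' locale are assumptions, and Normalizer and Collator come from the optional intl extension, so the sketch falls back to plain behaviour when intl is missing.

```php
<?php
// Sketch only: both helpers are hypothetical names invented for this example.

/** Normalize (when intl is available), encode, and enforce a length limit. */
function prepareStoredName(string $utf8Name, int $maxBytes = 255): string
{
    if (class_exists('Normalizer')) {
        // NFC folds "o" + combining acute into the single precomposed character.
        $nfc = Normalizer::normalize($utf8Name, Normalizer::FORM_C);
        if (is_string($nfc)) {
            $utf8Name = $nfc;
        }
    }
    $encoded = rawurlencode($utf8Name);
    if (strlen($encoded) > $maxBytes) {
        throw new LengthException(
            'Encoded name is ' . strlen($encoded) . " bytes; limit is $maxBytes"
        );
    }
    return $encoded;
}

/**
 * Decode stored names and sort them for display.
 * @param string[] $storedNames
 * @return string[]
 */
function sortDecodedNames(array $storedNames, string $locale = 'en_US'): array
{
    $decoded = array_map('rawurldecode', $storedNames);
    if (class_exists('Collator')) {
        $collator = new Collator($locale);   // locale-aware comparison
        $collator->sort($decoded);
    } else {
        sort($decoded, SORT_STRING);         // fallback: plain byte order
    }
    return $decoded;
}

print_r(sortDecodedNames(['z.txt', '%C3%A9.txt', 'a.txt']));
```

With intl loaded, the accented name sorts between "a.txt" and "z.txt"; with the byte-order fallback it sorts last, which is exactly the problem the third caveat describes.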

Worse Solutions

The following are less attractive solutions, more complicated and with more caveats.

On Windows, the PHP filesystem wrapper expects and returns ISO-8859-1 strings for file/directory names. This gives you two choices:

  1. Use UTF-8 freely in your filenames, but understand that non-ASCII characters will appear incorrect outside PHP. A non-ASCII UTF-8 character will be stored as multiple single ISO-8859-1 characters. E.g. ó will appear as Ã³ in Windows Explorer.

  2. Limit your file/directory names to characters representable in ISO-8859-1. In practice, you'll pass your UTF-8 strings through utf8_decode before using them in filesystem functions, and pass the entries scandir gives you through utf8_encode to get the original filenames in UTF-8.

Caveats galore!

  • If any byte passed to a filesystem function matches an invalid Windows filesystem character in ISO-8859-1, you're out of luck.
  • Windows may use an encoding other than ISO-8859-1 in non-English locales. I'd guess it will usually be one of ISO-8859-# (or, more likely, the matching Windows-125# code page), which means you'll need to use mb_convert_encoding instead of utf8_decode.
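As a sketch of the second choice with a configurable code page, something like the following could work. toFsName(), fromFsName() and isRepresentable() are hypothetical helper names, and the default of ISO-8859-1 is an assumption that you would replace with whatever encoding your Windows install actually uses. mb_convert_encoding() requires the mbstring extension.

```php
<?php
// Sketch only: the three helpers below are made-up names for this example.

/** Convert a UTF-8 name to the encoding the filesystem functions expect. */
function toFsName(string $utf8Name, string $fsEncoding = 'ISO-8859-1'): string
{
    return mb_convert_encoding($utf8Name, $fsEncoding, 'UTF-8');
}

/** Convert a name returned by scandir() etc. back to UTF-8. */
function fromFsName(string $fsName, string $fsEncoding = 'ISO-8859-1'): string
{
    return mb_convert_encoding($fsName, 'UTF-8', $fsEncoding);
}

/** True when every character survives the trip through the code page. */
function isRepresentable(string $utf8Name, string $fsEncoding = 'ISO-8859-1'): bool
{
    return fromFsName(toFsName($utf8Name, $fsEncoding), $fsEncoding) === $utf8Name;
}

$latin    = "caf\xc3\xa9.txt";   // "café.txt" in UTF-8
$cyrillic = "\xd0\x96.txt";      // "Ж.txt" in UTF-8

var_dump(isRepresentable($latin));                     // bool(true)
var_dump(isRepresentable($cyrillic));                  // bool(false)
var_dump(isRepresentable($cyrillic, 'Windows-1251'));  // bool(true)
```

Checking isRepresentable() before touching the disk lets you reject or transliterate a name up front instead of silently creating a file with a '?' in it.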

This nightmare is why you should probably just transliterate to create filenames.

How to open file in PHP that has unicode characters in its name?

These are the conclusions so far:

  1. PHP 5 cannot open a filename containing Unicode characters unless the source filename is Unicode.
  2. PHP 5 (at least on Windows XP) is not able to process PHP source in Unicode.

Thus the conclusion is that this is not doable in PHP 5.

Make PHP pathinfo() return the correct filename if the filename is UTF-8

A temporary work-around for this problem appears to be to make sure there is a 'normal' character in front of the accented characters, like so:

function getFilename($path)
{
    // If there's no '/', we're probably dealing with just a filename,
    // so just put an 'a' in front of it.
    if (strpos($path, '/') === false)
    {
        $path_parts = pathinfo('a' . $path);
    }
    else
    {
        // Prefix every path component with 'a' so that none of them
        // starts with an accented character.
        $path = str_replace('/', '/a', $path);
        $path_parts = pathinfo($path);
    }
    // Drop the 'a' that was added in front of the filename.
    return substr($path_parts["filename"], 1);
}

Note that we replace all occurrences of '/' with '/a' but this is okay, since we return starting at offset 1 of the result. Interestingly enough, the dirname part of pathinfo() does seem to work, so no workaround is needed there.
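An alternative sketch, assuming '/' is the only directory separator in your paths, avoids pathinfo() altogether and splits on bytes. utf8Filename() is a hypothetical helper, not a PHP built-in. It relies on the fact that the bytes for '/' and '.' never occur inside a multi-byte UTF-8 sequence, so plain byte-oriented string functions are safe here.

```php
<?php
// Sketch only: utf8Filename() is a made-up helper for this example.

/** Return the file name without directory and without the last extension. */
function utf8Filename(string $path): string
{
    // Everything after the last '/' is the base name.
    $slash = strrpos($path, '/');
    $base  = $slash === false ? $path : substr($path, $slash + 1);

    // Everything before the last '.' is the file name proper.
    $dot = strrpos($base, '.');
    return $dot === false ? $base : substr($base, 0, $dot);
}

echo utf8Filename("/tmp/\xc3\xa9t\xc3\xa9.txt"), "\n";   // été
```

Unlike pathinfo(), this version does not strip a trailing slash first, so "a/b/" yields an empty string rather than "b".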

PHP glob directory UTF-8

One option is to convert the file names with a class such as this one:

<?php
class Encoding {

protected static $win1252ToUtf8 = array(
128 => "\xe2\x82\xac",

130 => "\xe2\x80\x9a",
131 => "\xc6\x92",
132 => "\xe2\x80\x9e",
133 => "\xe2\x80\xa6",
134 => "\xe2\x80\xa0",
135 => "\xe2\x80\xa1",
136 => "\xcb\x86",
137 => "\xe2\x80\xb0",
138 => "\xc5\xa0",
139 => "\xe2\x80\xb9",
140 => "\xc5\x92",

142 => "\xc5\xbd",

145 => "\xe2\x80\x98",
146 => "\xe2\x80\x99",
147 => "\xe2\x80\x9c",
148 => "\xe2\x80\x9d",
149 => "\xe2\x80\xa2",
150 => "\xe2\x80\x93",
151 => "\xe2\x80\x94",
152 => "\xcb\x9c",
153 => "\xe2\x84\xa2",
154 => "\xc5\xa1",
155 => "\xe2\x80\xba",
156 => "\xc5\x93",

158 => "\xc5\xbe",
159 => "\xc5\xb8"
);

protected static $brokenUtf8ToUtf8 = array(
"\xc2\x80" => "\xe2\x82\xac",

"\xc2\x82" => "\xe2\x80\x9a",
"\xc2\x83" => "\xc6\x92",
"\xc2\x84" => "\xe2\x80\x9e",
"\xc2\x85" => "\xe2\x80\xa6",
"\xc2\x86" => "\xe2\x80\xa0",
"\xc2\x87" => "\xe2\x80\xa1",
"\xc2\x88" => "\xcb\x86",
"\xc2\x89" => "\xe2\x80\xb0",
"\xc2\x8a" => "\xc5\xa0",
"\xc2\x8b" => "\xe2\x80\xb9",
"\xc2\x8c" => "\xc5\x92",

"\xc2\x8e" => "\xc5\xbd",

"\xc2\x91" => "\xe2\x80\x98",
"\xc2\x92" => "\xe2\x80\x99",
"\xc2\x93" => "\xe2\x80\x9c",
"\xc2\x94" => "\xe2\x80\x9d",
"\xc2\x95" => "\xe2\x80\xa2",
"\xc2\x96" => "\xe2\x80\x93",
"\xc2\x97" => "\xe2\x80\x94",
"\xc2\x98" => "\xcb\x9c",
"\xc2\x99" => "\xe2\x84\xa2",
"\xc2\x9a" => "\xc5\xa1",
"\xc2\x9b" => "\xe2\x80\xba",
"\xc2\x9c" => "\xc5\x93",

"\xc2\x9e" => "\xc5\xbe",
"\xc2\x9f" => "\xc5\xb8"
);

protected static $utf8ToWin1252 = array(
"\xe2\x82\xac" => "\x80",

"\xe2\x80\x9a" => "\x82",
"\xc6\x92" => "\x83",
"\xe2\x80\x9e" => "\x84",
"\xe2\x80\xa6" => "\x85",
"\xe2\x80\xa0" => "\x86",
"\xe2\x80\xa1" => "\x87",
"\xcb\x86" => "\x88",
"\xe2\x80\xb0" => "\x89",
"\xc5\xa0" => "\x8a",
"\xe2\x80\xb9" => "\x8b",
"\xc5\x92" => "\x8c",

"\xc5\xbd" => "\x8e",

"\xe2\x80\x98" => "\x91",
"\xe2\x80\x99" => "\x92",
"\xe2\x80\x9c" => "\x93",
"\xe2\x80\x9d" => "\x94",
"\xe2\x80\xa2" => "\x95",
"\xe2\x80\x93" => "\x96",
"\xe2\x80\x94" => "\x97",
"\xcb\x9c" => "\x98",
"\xe2\x84\xa2" => "\x99",
"\xc5\xa1" => "\x9a",
"\xe2\x80\xba" => "\x9b",
"\xc5\x93" => "\x9c",

"\xc5\xbe" => "\x9e",
"\xc5\xb8" => "\x9f"
);

static function toUTF8($text){
/**
* Function Encoding::toUTF8
*
* This function leaves UTF-8 characters alone, while converting
* almost all non-UTF8 to UTF8.
*
* It assumes that the encoding of the original string is
* either Windows-1252 or ISO 8859-1.
*
* It may fail to convert characters to UTF-8 if they fall
* into one of these scenarios:
*
* 1) when any of these characters: ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
* are followed by any of these: ("group B")
* ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶•¸¹º»¼½¾¿
*
* For example: %ABREPRESENT%C9%BB. «REPRESENTÉ»
* The "«" (%AB) character will be converted, but the "É"
* followed by "»" (%C9%BB) is also a valid unicode
* character, and will be left unchanged.
*
* 2) when any of these: àáâãäåæçèéêëìíîï are followed by TWO
* characters from group B,
*
* 3) when any of these: ðñòó are followed by THREE
* characters from group B.
*
* @name toUTF8
* @param string $text Any string.
* @return string The same string, UTF-8 encoded
*
*/

if(is_array($text))
{
foreach($text as $k => $v)
{
$text[$k] = self::toUTF8($v);
}
return $text;
} elseif(is_string($text)) {

$max = strlen($text);
$buf = "";
for($i = 0; $i < $max; $i++){
$c1 = $text[$i];
if($c1>="\xc0"){ // Should be converted to UTF-8, if it's not UTF-8 already
$c2 = $i+1 >= $max? "\x00" : $text[$i+1];
$c3 = $i+2 >= $max? "\x00" : $text[$i+2];
$c4 = $i+3 >= $max? "\x00" : $text[$i+3];
if($c1 >= "\xc0" && $c1 <= "\xdf"){ // Looks like 2 bytes UTF-8
if($c2 >= "\x80" && $c2 <= "\xbf"){ // Yeah, almost sure it's UTF-8 already
$buf .= $c1 . $c2;
$i++;
} else { // Not valid UTF-8. Convert it.
$cc1 = (chr(ord($c1) >> 6) | "\xc0");
$cc2 = ($c1 & "\x3f") | "\x80";
$buf .= $cc1 . $cc2;
}
} elseif($c1 >= "\xe0" && $c1 <= "\xef"){ // Looks like 3 bytes UTF-8
if($c2 >= "\x80" && $c2 <= "\xbf" && $c3 >= "\x80" && $c3 <= "\xbf"){ // Yeah, almost sure it's UTF-8 already
$buf .= $c1 . $c2 . $c3;
$i = $i + 2;
} else { // Not valid UTF-8. Convert it.
$cc1 = (chr(ord($c1) >> 6) | "\xc0");
$cc2 = ($c1 & "\x3f") | "\x80";
$buf .= $cc1 . $cc2;
}
} elseif($c1 >= "\xf0" && $c1 <= "\xf7"){ // Looks like 4 bytes UTF-8
if($c2 >= "\x80" && $c2 <= "\xbf" && $c3 >= "\x80" && $c3 <= "\xbf" && $c4 >= "\x80" && $c4 <= "\xbf"){ // Yeah, almost sure it's UTF-8 already
$buf .= $c1 . $c2 . $c3 . $c4; // keep all four bytes of the sequence
$i = $i + 3;
} else { // Not valid UTF-8. Convert it.
$cc1 = (chr(ord($c1) >> 6) | "\xc0");
$cc2 = ($c1 & "\x3f") | "\x80";
$buf .= $cc1 . $cc2;
}
} else { // It doesn't look like UTF-8, but should be converted
$cc1 = (chr(ord($c1) >> 6) | "\xc0");
$cc2 = (($c1 & "\x3f") | "\x80");
$buf .= $cc1 . $cc2;
}
} elseif(($c1 & "\xc0") == "\x80"){ // Needs conversion
if(isset(self::$win1252ToUtf8[ord($c1)])) { // Found in Windows 1252 special cases
$buf .= self::$win1252ToUtf8[ord($c1)];
} else {
$cc1 = (chr(ord($c1) >> 6) | "\xc0");
$cc2 = (($c1 & "\x3f") | "\x80");
$buf .= $cc1 . $cc2;
}
} else { // It doesn't need conversion
$buf .= $c1;
}
}
return $buf;
} else {
return $text;
}
}

static function toWin1252($text) {
if(is_array($text)) {
foreach($text as $k => $v) {
$text[$k] = self::toWin1252($v);
}
return $text;
} elseif(is_string($text)) {
return utf8_decode(str_replace(array_keys(self::$utf8ToWin1252), array_values(self::$utf8ToWin1252), self::toUTF8($text)));
} else {
return $text;
}
}

static function toISO8859($text) {
return self::toWin1252($text);
}

static function toLatin1($text) {
return self::toWin1252($text);
}

static function fixUTF8($text){
if(is_array($text)) {
foreach($text as $k => $v) {
$text[$k] = self::fixUTF8($v);
}
return $text;
}

$last = "";
while($last <> $text){
$last = $text;
$text = self::toUTF8(utf8_decode(str_replace(array_keys(self::$utf8ToWin1252), array_values(self::$utf8ToWin1252), $text)));
}
$text = self::toUTF8(utf8_decode(str_replace(array_keys(self::$utf8ToWin1252), array_values(self::$utf8ToWin1252), $text)));
return $text;
}

static function UTF8FixWin1252Chars($text){
// If you received an UTF-8 string that was converted
// from Windows-1252 as it was ISO8859-1
// (ignoring Windows-1252 chars from 80 to 9F) use
// this function to fix it.
// See: http://en.wikipedia.org/wiki/Windows-1252

return str_replace(array_keys(self::$brokenUtf8ToUtf8), array_values(self::$brokenUtf8ToUtf8), $text);
}

static function removeBOM($str=""){
if(substr($str, 0,3) == pack("CCC",0xef,0xbb,0xbf)) {
$str=substr($str, 3);
}
return $str;
}
}
?>

To use it, include the script containing this class and call it like this:

Encoding::toUtf8('Bankdrücken');
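If mbstring is available, a much shorter sketch of the same "leave valid UTF-8 alone, convert everything else" idea is possible. ensureUtf8() and globUtf8() are hypothetical helper names, and the Windows-1252 fallback is an assumption about where the non-UTF-8 names came from. Unlike the class above, this treats each name as a whole and does not attempt to repair strings that mix encodings.

```php
<?php
// Sketch only: ensureUtf8() and globUtf8() are made-up names for this example.

/** Return $name unchanged if it is valid UTF-8, otherwise convert it. */
function ensureUtf8(string $name, string $fallbackEncoding = 'Windows-1252'): string
{
    if (mb_check_encoding($name, 'UTF-8')) {
        return $name;                       // already UTF-8 (or plain ASCII)
    }
    return mb_convert_encoding($name, 'UTF-8', $fallbackEncoding);
}

/**
 * glob() wrapper whose results are always UTF-8.
 * @return string[]
 */
function globUtf8(string $pattern): array
{
    $matches = glob($pattern);
    return $matches === false ? [] : array_map('ensureUtf8', $matches);
}

echo bin2hex(ensureUtf8("Bankdr\xfccken")), "\n";       // Latin-1 input, converted
echo bin2hex(ensureUtf8("Bankdr\xc3\xbccken")), "\n";   // UTF-8 input, unchanged
```

Both echo lines print the same hex string, because the Latin-1 byte FC and the UTF-8 pair C3 BC both end up as "ü" in UTF-8.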

PHP Unicode file name

It is not possible.

Here is the thread explaining why:

Can a PHP file name (or a dir in its full path) have UTF-8 characters?

How to use UTF-8 characters in a path for scandir in PHP

scandir works well in PHP 5.6 on Linux (as long as the files are uploaded only via FTP), but it does not work in PHP 5.6 on Windows. To fix this on Windows, either use the WFIO extension or upgrade to PHP 7.1.
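A small sketch of how a script might decide at run time whether it can pass UTF-8 paths straight to scandir. nativeUtf8PathsExpected() is a hypothetical helper; it only encodes the rule stated above (Windows needs PHP 7.1 or later, other platforms pass bytes through) and does not probe the filesystem.

```php
<?php
// Sketch only: nativeUtf8PathsExpected() is a made-up helper for this example.

/**
 * Guess whether filesystem functions accept UTF-8 paths directly.
 * $os and $versionId default to the running interpreter so the rule can
 * also be checked for other combinations.
 */
function nativeUtf8PathsExpected(string $os = PHP_OS, int $versionId = PHP_VERSION_ID): bool
{
    // PHP_OS is "WINNT" on Windows; a prefix test avoids matching "Darwin".
    $isWindows = strncasecmp($os, 'WIN', 3) === 0;

    // 70100 is the PHP_VERSION_ID of PHP 7.1.0.
    return !$isWindows || $versionId >= 70100;
}

if (nativeUtf8PathsExpected()) {
    echo "UTF-8 paths should work with scandir() here\n";
} else {
    echo "Encode or convert file names before calling scandir()\n";
}
```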

Unicode issue with PHP

I am using win7 ntfs

Sorry, PHP running under Windows can't support filenames containing general Unicode characters. It can only cope with filenames made entirely of characters that lie within the current code page.

That code page is probably 1252 for you (Western European, similar to ISO-8859-1), which doesn't contain Cyrillic. If you run it on a Russian-language install then your code page would be 1251, and the Cyrillic characters would work - but accented Latin would break.

This is a problem that affects all applications that use the standard C stdio library calls from the MS C runtime, including PHP, Java and others. (Some languages, like Python, have special support for Unicode filenames using Windows-specific APIs instead of the C stdlib; there is Request 45517 to get the same into PHP but don't hold your breath.)

On non-Windows platforms, Unicode tends to be supported by using byte strings with the UTF-8 encoding, and so all Unicode characters just work. Unfortunately Windows does not have this capability (code page 65001 is kind-of UTF-8, but badly broken).


