How to Use Filesystem Functions in PHP, Using UTF-8 Strings

How do I use filesystem functions in PHP, using UTF-8 strings?

Just urlencode the string you want to use as a filename. All characters returned by urlencode are valid in filenames (NTFS/HFS/UNIX), and you can urldecode the filenames back to UTF-8 (or whatever encoding they were in).
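A minimal sketch of that round trip is shown here. The helper names are made up for illustration; only urlencode and urldecode are PHP built-ins.

```php
<?php
// Hypothetical helper names; they are not part of PHP itself.

// Turn a UTF-8 string into an ASCII-only name that is safe to store on disk.
function utf8ToDiskName(string $utf8Name): string
{
    return urlencode($utf8Name);
}

// Recover the original UTF-8 string from a name read back from disk.
function diskNameToUtf8(string $diskName): string
{
    return urldecode($diskName);
}

$diskName = utf8ToDiskName('ó.txt');   // "%C3%B3.txt"
$original = diskNameToUtf8($diskName); // "ó.txt"
```

Note that urlencode leaves dots untouched, so the special names "." and ".." still need to be rejected separately.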

Caveats (all apply to the solutions below as well):

  • After url-encoding, the filename must be no longer than 255 characters (urlencode output is plain ASCII, so characters and bytes are the same here).
  • Unicode has multiple representations for many characters (a precomposed form versus a base letter plus combining characters). If you don't normalize your UTF-8 strings, you may have trouble searching with glob or reopening an individual file.
  • You can't rely on scandir or similar functions for alpha-sorting. You must urldecode the filenames then use a sorting algorithm aware of UTF-8 (and collations).
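The three caveats above could be handled roughly as follows. This is a sketch under assumptions: the function names and the 'en_US' default locale are illustrative, and Normalizer and Collator are only available when the intl extension is loaded, so the code falls back to plain behavior without it.

```php
<?php
// Hypothetical helpers; Normalizer and Collator come from the optional intl extension.

// Normalize to NFC (when intl is available), url-encode, and enforce the length limit.
function prepareDiskName(string $utf8Name): string
{
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($utf8Name, Normalizer::FORM_C);
        if ($normalized !== false) {
            $utf8Name = $normalized;
        }
    }

    $diskName = urlencode($utf8Name);
    if (strlen($diskName) > 255) {
        throw new LengthException('Encoded filename is longer than 255 bytes');
    }
    return $diskName;
}

// Decode names as returned by scandir and sort them with a collation-aware
// comparison when intl is available, or a plain byte-wise sort otherwise.
function sortDecodedNames(array $diskNames, string $locale = 'en_US'): array
{
    $names = array_map('urldecode', $diskNames);
    if (class_exists('Collator')) {
        $collator = new Collator($locale);
        $collator->sort($names);
    } else {
        sort($names, SORT_STRING);
    }
    return $names;
}
```

Each non-ASCII byte becomes three characters after encoding, so a name with many accented or CJK characters reaches the 255 limit much sooner than its visible length suggests.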

Worse Solutions

The following are less attractive solutions, more complicated and with more caveats.

On Windows, the PHP filesystem wrapper (in PHP versions before 7.1, which added UTF-8 path support) expects and returns ISO-8859-1 strings for file/directory names. This gives you two choices:

  1. Use UTF-8 freely in your filenames, but understand that non-ASCII characters will appear incorrect outside PHP. A non-ASCII UTF-8 character will be stored as multiple single-byte ISO-8859-1 characters. E.g. ó will appear as Ã³ in Windows Explorer.

  2. Limit your file/directory names to characters representable in ISO-8859-1. In practice, you'll pass your UTF-8 strings through utf8_decode before using them in filesystem functions, and pass the entries scandir gives you through utf8_encode to get the original filenames in UTF-8. (Both functions are deprecated as of PHP 8.2; mb_convert_encoding does the same conversion.)

Caveats galore!

  • If any byte passed to a filesystem function matches an invalid Windows filesystem character in ISO-8859-1, you're out of luck.
  • Windows may use an encoding other than ISO-8859-1 in non-English locales. I'd guess it will usually be one of ISO-8859-#, but this means you'll need to use mb_convert_encoding instead of utf8_decode.
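Choice 2, together with the locale caveat, might look like the sketch below. It uses mb_convert_encoding so the filesystem encoding is a parameter rather than hard-coded. The function names and the round-trip check are illustrative assumptions, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helpers; assumes the mbstring extension is available.

// Convert a UTF-8 name to the single-byte encoding the filesystem layer expects.
function utf8ToFsName(string $utf8Name, string $fsEncoding = 'ISO-8859-1'): string
{
    return mb_convert_encoding($utf8Name, $fsEncoding, 'UTF-8');
}

// Convert a name returned by scandir and friends back to UTF-8.
function fsNameToUtf8(string $fsName, string $fsEncoding = 'ISO-8859-1'): string
{
    return mb_convert_encoding($fsName, 'UTF-8', $fsEncoding);
}

// True only if every character survives the round trip, i.e. the name can be
// stored without losing characters to substitution.
function isRepresentable(string $utf8Name, string $fsEncoding = 'ISO-8859-1'): bool
{
    return fsNameToUtf8(utf8ToFsName($utf8Name, $fsEncoding), $fsEncoding) === $utf8Name;
}
```

Checking isRepresentable before writing a file avoids silently creating a name in which unsupported characters were replaced by a substitute.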

This nightmare is why you should probably just transliterate to create filenames.
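One possible way to transliterate is sketched below. It is an assumption-heavy example: the function name is made up, and the result of iconv with //TRANSLIT depends on the iconv implementation and the current locale, so the exact output for accented characters is not guaranteed. Anything that does not end up as a plain ASCII letter, digit, dot, underscore, or hyphen is replaced by an underscore.

```php
<?php
// Hypothetical helper; iconv //TRANSLIT behavior varies between platforms.
function transliterateFilename(string $utf8Name): string
{
    // Try to map characters to ASCII look-alikes. Errors are suppressed because
    // some iconv implementations reject characters they cannot transliterate.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8Name);
    if ($ascii === false) {
        $ascii = $utf8Name; // fall back to the raw bytes; they are cleaned up below
    }

    // Collapse every run of unsafe bytes into a single underscore.
    $safe = preg_replace('/[^A-Za-z0-9._-]+/', '_', $ascii);

    return ($safe === null || $safe === '') ? '_' : $safe;
}
```

Transliteration is lossy, so two different original names can map to the same filename; the original UTF-8 name has to be stored elsewhere (for example in a database) if it needs to be shown again.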

PHP and Linux filesystem names in utf-8

The important thing to remember is that in Linux, filenames don't have a character encoding and are instead just 8-bit strings.

For example, if you upload a file via FTP and the FTP server uses the Windows-1252 character encoding, the filename will be stored as 8-bit Windows-1252. Trying to open the file using a UTF-8 name will fail, no matter what the locale or LANG is.

This is unlike OS X, where the filename is always UTF-8, and Windows where the filename is always UTF-16.

As you've probably found, strings in PHP are also just 8-bit strings, so it's impossible to know for sure what encoding is being used for a string. You can easily have two strings that are encoded in different character sets.

My advice is to ensure that you know the encoding of any string you read or output, including form fields and filenames.

Therefore, make sure the filename on disk is UTF-8 and the filename value you put into the database is UTF-8. Then, when you pull the value from the DB, the file variable should be UTF-8 encoded already and will be ready to pass to the fopen command.
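A small sketch of that advice: check whether an incoming name is already valid UTF-8 and convert it if not. The function name and the Windows-1252 default are assumptions for illustration (matching the FTP example above), and mbstring is assumed to be loaded. Since the real source encoding cannot be detected reliably, it has to be known from context.

```php
<?php
// Hypothetical helper; assumes mbstring. The legacy encoding must be known from
// context (here Windows-1252 is assumed), because it cannot be detected reliably.
function ensureUtf8Filename(string $rawName, string $assumedLegacyEncoding = 'Windows-1252'): string
{
    // Already valid UTF-8: use it unchanged.
    if (mb_check_encoding($rawName, 'UTF-8')) {
        return $rawName;
    }
    // Otherwise treat the bytes as the assumed legacy encoding and convert.
    return mb_convert_encoding($rawName, 'UTF-8', $assumedLegacyEncoding);
}

// Usage idea: rename the file on disk to the UTF-8 name and store that same
// string in the database, so the value read back later can go straight to fopen.
```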

PHP - Upload utf-8 filename

I'm on the Chinese version of Windows 8, and I dealt with a similar problem like this:

$filename = iconv("utf-8", "cp936", $filename);

cp stands for code page, and cp936 is code page 936, the default code page of the Simplified Chinese version of Windows.


So I think maybe your problem could be solved in a similar way:

$fn2 = iconv("UTF-8","cp1258", $base_dir.$fn);

I'm not sure whether the default code page of your OS is 1258, so you should check it yourself by opening a command prompt and running the chcp command. Then change 1258 to whatever the command gives you.
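Because iconv returns false when a character is not available in the target code page, it may be worth wrapping the call. The following is only a sketch: the function name is made up, and the code page passed in is whatever you determined for your own system.

```php
<?php
// Hypothetical wrapper around iconv; the code page name is supplied by the caller.
function toCodePage(string $utf8Path, string $codePage): string
{
    // iconv emits a notice and returns false for characters that do not exist
    // in the target code page, so suppress the notice and check the result.
    $converted = @iconv('UTF-8', $codePage, $utf8Path);
    if ($converted === false) {
        throw new RuntimeException("Path cannot be represented in $codePage");
    }
    return $converted;
}

// Usage idea, with the code page from the example above:
// $fn2 = toCodePage($base_dir . $fn, 'cp1258');
```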

UPDATE

It seems that PHP filesystem functions can only handle characters that are in the system code page, according to this answer. So you have 2 choices here:

  1. Limit the characters in the filename to the system code page. In your case, it's 437, but I'm pretty sure that code page 437 does not include all the Vietnamese characters.

  2. Change your system code page to the Vietnamese one, 1258, and convert the filename to cp1258. Then the filesystem functions should work.

Both choices are deficient:

Choice 1: You can't use Vietnamese characters anymore, which is not what you want.

Choice 2: You have to change the system code page, and filename characters are limited to code page 1258.

UPDATE

How to change system code page:

Go to Control Panel > Region > Administrative > Change system locale and select Vietnamese (Vietnam) in the drop-down menu.

PHP: How to create unicode filenames

It can't be done on Windows with older PHP versions (UTF-8 path support on Windows only arrived in PHP 7.1). In those versions, you can only write filenames using the code page Windows is set to. If the code page does not include a given character, you cannot use it. Worse, if you have a file on Windows with such a character in its filename, you'll have trouble accessing it.

In Linux, at least with ext*, it's a different story. You can use whatever filenames you want, the OS doesn't care about the encoding. So if you consistently use filenames in UTF-8, you should be OK. UTF-16 is however excluded because filenames cannot include bytes with value 0.
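The UTF-16 restriction is easy to see in a short example. The helper name is made up for illustration, and the conversion assumes the mbstring extension.

```php
<?php
// Hypothetical helper; assumes mbstring for the UTF-16 conversion.
function containsNulByte(string $bytes): bool
{
    return strpos($bytes, "\0") !== false;
}

$utf8Name  = 'a.txt';
// In UTF-16LE every ASCII character is followed by a zero byte.
$utf16Name = mb_convert_encoding($utf8Name, 'UTF-16LE', 'UTF-8');

var_dump(containsNulByte($utf8Name));  // bool(false)
var_dump(containsNulByte($utf16Name)); // bool(true)
```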


