What Character Sequence Should I Not Allow in a Filename

What character sequence should I not allow in a filename?

Your question is somewhat confusing since you talk at length about Linux, but then in a comment to another answer you say that you are generating filenames for people to download, which presumably means that you have absolutely no control whatsoever over the filesystem and operating system that the files will be stored on, making Linux completely irrelevant.

For the purpose of this answer I'm going to assume that your question is wrong and your comment is correct.

The vast majority of operating systems and filesystems in use today fall roughly into three categories: POSIX, Windows and MacOS.

The POSIX specification is very clear on what a filename that is guaranteed to be portable across all POSIX systems looks like. The characters that you can use are defined in Section 3.276 (Portable Filename Character Set) of the Open Group Base Specification as:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789._-
The maximum filename length that you can rely on is defined in Section 13.23.3.5 (<limits.h> Minimum Values) as 14. (The relevant constant is _POSIX_NAME_MAX.)

So a filename that is up to 14 characters long and contains only the 65 characters listed above is safe to use on all POSIX-compliant systems, which gives you 24407335764928225040435790 combinations (or roughly 84 bits).

If you don't want to annoy your users, you should add two more restrictions: don't start the filename with a dash or a dot. Filenames starting with a dot are customarily interpreted as "hidden" files and are not displayed in directory listings unless explicitly requested. And filenames starting with a dash may be interpreted as an option by many commands. (Sidenote: it is amazing how many users don't know about the rm ./-rf or rm -- -rf tricks.)

This leaves you at 23656340818315048885345458 combinations (still 84 bits).

Windows adds a couple of new restrictions to this: filenames cannot end with a dot, and filenames are case-insensitive. This reduces the character set from 65 to 39 characters (37 for the first, 38 for the last character). It doesn't add any length restrictions; Windows can deal with 14 characters just fine.

This reduces the possible combinations to 17866587696996781449603 (73 bits).

Another restriction is that Windows treats everything after the last dot as a filename extension which denotes the type of the file. If you want to avoid potential confusion (say, if you generate a filename like abc.mp3 for a text file), you should avoid dots altogether.

You still have 13090925539866773438463 combinations (73 bits).

If you have to worry about DOS, then additional restrictions apply: the filename consists of one or two parts (separated by a dot), where neither of the two parts can contain a dot. The first part has a maximum length of 8 characters, the second of 3. Again, the second part is usually reserved to indicate the file type, which leaves you only 8 characters.

Now you have 4347792138495 possible filenames or 41 bits.

The good news is that you can use the 3 character extension to actually correctly indicate the file type, without breaking the POSIX filename limit (8+3+1 = 12 < 14).

If you want your users to be able to burn the files onto a CD-R formatted with ISO 9660 Level 1, then you have to disallow hyphens anywhere, not just as the first character. Now the remaining character set looks like this:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789_
Counting names of 1 to 8 characters in the same way as before, this gives you 3610048327640 combinations (41 bits).
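The counts in this answer can be checked with a few lines of arbitrary-precision arithmetic. What follows is a minimal sketch and not part of the original answer; the class name `FilenameCounts` and the helper `count` are made up for illustration. It assumes the counting model the text implies: add up the names of every length from 1 to the maximum, with separate numbers of choices for the first, middle and last character.

```java
import java.math.BigInteger;

// Hypothetical helper for re-deriving the filename counts discussed above.
final class FilenameCounts {

    // Counts names of length 1..maxLen (maxLen >= 1).
    // first/middle/last: number of characters allowed in each position.
    // single: number of characters allowed in a one-character name, which
    // has to satisfy the first-character and last-character rules at once.
    static BigInteger count(int maxLen, int first, int middle, int last, int single) {
        BigInteger total = BigInteger.valueOf(single);
        BigInteger inner = BigInteger.ONE; // middle^(n-2) for the current n
        for (int n = 2; n <= maxLen; n++) {
            total = total.add(BigInteger.valueOf((long) first * last).multiply(inner));
            inner = inner.multiply(BigInteger.valueOf(middle));
        }
        return total;
    }

    static void print(String label, BigInteger n) {
        // bitLength() - 1 is floor(log2(n)), the whole number of bits.
        System.out.println(label + ": " + n + " (" + (n.bitLength() - 1) + " bits)");
    }

    public static void main(String[] args) {
        print("POSIX portable, up to 14 chars", count(14, 65, 65, 65, 65));
        print("no leading dash or dot", count(14, 63, 65, 65, 63));
        print("Windows: no trailing dot, one case", count(14, 37, 39, 38, 37));
        print("no dots at all", count(14, 37, 38, 38, 37));
        print("DOS, 8-character base name", count(8, 37, 38, 38, 37));
        print("ISO 9660 Level 1, no hyphen", count(8, 37, 37, 37, 37));
    }
}
```

For the DOS case the sum collapses to 38^8 - 1 = 4347792138495, which matches the figure quoted above.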

What characters are forbidden in Windows and Linux directory names?

A “comprehensive guide” of forbidden filename characters is not going to work on Windows because it reserves filenames as well as characters. Yes, characters like * " ? and others are forbidden, but there are an infinite number of names composed only of valid characters that are forbidden. For example, spaces and dots are valid filename characters, but names composed only of those characters are forbidden.

Windows does not distinguish between upper-case and lower-case characters, so you cannot create a folder named A if one named a already exists. Worse, seemingly-allowed names like PRN and CON, and many others, are reserved and not allowed. Windows also has several length restrictions; a filename valid in one folder may become invalid if moved to another folder. The rules for naming files and folders are in the Microsoft docs.

You cannot, in general, use user-generated text to create Windows directory names. If you want to allow users to name anything they want, you have to create safe names like A, AB, A2 et al., store user-generated names and their path equivalents in an application data file, and perform path mapping in your application.

If you absolutely must allow user-generated folder names, the only way to tell if they are invalid is to catch exceptions and assume the name is invalid. Even that is fraught with peril, as the exceptions thrown for denied access, offline drives, and out of drive space overlap with those that can be thrown for invalid names. You are opening up one huge can of hurt.

Allowed characters in filename

You should start with the Wikipedia Filename page. It has a decent-sized table (Comparison of filename limitations), listing the reserved characters for quite a lot of file systems.

It also has a plethora of other information about each file system, including reserved file names such as CON under MS-DOS. I mention that only because I was bitten by that once when I shortened an include file from const.h to con.h and spent half an hour figuring out why the compiler hung.

Turns out DOS ignored extensions for devices so that con.h was exactly the same as con, the input console (meaning, of course, the compiler was waiting for me to type in the header file before it would continue).
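A check for this kind of collision could look roughly like the sketch below. It is only an illustration: the class and method names are hypothetical, and the device list (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9) is the commonly documented set, which I am assuming is enough for the example rather than claiming it is complete. As in the con.h story, everything from the first dot onward is ignored.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: flags names that collide with DOS/Windows device names.
final class WindowsReservedNames {

    // Assumed device list; treat it as a starting point, not a guarantee.
    private static final Set<String> DEVICES = Set.of(
            "CON", "PRN", "AUX", "NUL",
            "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
            "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9");

    static boolean isReservedDeviceName(String fileName) {
        // The extension does not help: "con.h" still names the console.
        int dot = fileName.indexOf('.');
        String base = (dot < 0) ? fileName : fileName.substring(0, dot);
        // trim() also drops leading whitespace, which makes the check a
        // little stricter than strictly necessary.
        return DEVICES.contains(base.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isReservedDeviceName("con.h"));   // device name plus extension
        System.out.println(isReservedDeviceName("const.h")); // ordinary name
    }
}
```

Comparing case-insensitively with `Locale.ROOT` avoids surprises from locale-specific upper-casing rules.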

Invalid characters in a filename on Windows?

Yes, in an ASCII-based file system Path.GetInvalidFileNameChars() will guarantee you a safe file name. If you check an ASCII chart you will find that everything from the left column is excluded and certain characters from the remaining columns are also excluded. Check the decimal representation of each char in the returned array for a full list of what's excluded.

How do I check if a given string is a legal/valid file name under Windows?

You can get a list of invalid characters from Path.GetInvalidPathChars and GetInvalidFileNameChars.

UPD: See Steve Cooper's suggestion on how to use these in a regular expression.

UPD2: Note that according to the Remarks section in MSDN "The array returned from this method is not guaranteed to contain the complete set of characters that are invalid in file and directory names." The answer provided by sixlettervaliables goes into more details.

Looking for a character that is allowed in Filenames but not allowed in email addresses... Any clue?

Commas and semicolons are not allowed in email addresses, but they are allowed in filenames on most file systems.

string sanitizer for filename

Instead of worrying about overlooking characters - how about using a whitelist of characters you are happy to be used? For example, you could allow just good ol' a-z, 0-9, _, and a single instance of a period (.). That's obviously more limiting than most filesystems, but should keep you safe.
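Here is a rough sketch of that whitelist idea. The class and method names are made up, and two details are my own assumptions rather than anything the suggestion prescribes: the input is lower-cased first, and when there are several periods only the last one is kept (on the theory that it is the one in front of the extension).

```java
import java.util.Locale;

// Hypothetical whitelist sanitizer: keeps a-z, 0-9, '_' and one period.
final class WhitelistSanitizer {

    static String sanitize(String input) {
        String lower = input.toLowerCase(Locale.ROOT);
        int lastDot = lower.lastIndexOf('.'); // the only period that survives
        StringBuilder sb = new StringBuilder(lower.length());
        for (int i = 0; i < lower.length(); i++) {
            char ch = lower.charAt(i);
            boolean allowed = (ch >= 'a' && ch <= 'z')
                    || (ch >= '0' && ch <= '9')
                    || ch == '_'
                    || (ch == '.' && i == lastDot);
            if (allowed) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("My Report.v2.PDF")); // prints "myreportv2.pdf"
    }
}
```

Dropping characters is lossy, so two different inputs can end up with the same name, and the result can be empty or a lone period; a caller would still need a fallback for those cases.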

How can I safely encode a string in Java to use as a filename?

If you want the result to resemble the original file, SHA-1 or any other hashing scheme is not the answer. If collisions must be avoided, then simple replacement or removal of "bad" characters is not the answer either.

Instead you want something like this. (Note: this should be treated as an illustrative example, not something to copy and paste.)

char fileSep = '/'; // ... or do this portably.
char escape = '%'; // ... or some other legal char.
String s = ...
int len = s.length();
StringBuilder sb = new StringBuilder(len);
for (int i = 0; i < len; i++) {
    char ch = s.charAt(i);
    if (ch < ' ' || ch >= 0x7F || ch == fileSep || ... // add other illegal chars
        || (ch == '.' && i == 0) // we don't want to collide with "." or ".."!
        || ch == escape) {
        // Write the character as the escape char followed by two hex digits.
        sb.append(escape);
        if (ch < 0x10) {
            sb.append('0'); // pad single-digit values to two digits
        }
        sb.append(Integer.toHexString(ch));
    } else {
        sb.append(ch);
    }
}
// File and PrintWriter come from java.io.
File currentFile = new File(System.getProperty("user.home"), sb.toString());
PrintWriter currentWriter = new PrintWriter(currentFile);

This solution gives a reversible encoding (with no collisions) where the encoded strings resemble the original strings in most cases. I'm assuming that you are using 8-bit characters.

URLEncoder works, but it has the disadvantage that it encodes a whole lot of legal file name characters.

If you want a not-guaranteed-to-be-reversible solution, then simply remove the 'bad' characters rather than replacing them with escape sequences.


The reverse of the above encoding should be equally straightforward to implement.
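As a sketch of what that reverse could look like: the decoder below assumes the same conventions as the encoder above, namely that every escape is the escape character followed by exactly two hex digits (which only holds under the 8-bit character assumption). The class name `FilenameDecoder` is hypothetical, and like the encoder this is meant as an illustration rather than something to copy and paste.

```java
// Hypothetical decoder for the escape-plus-two-hex-digits encoding.
final class FilenameDecoder {

    static String decode(String encoded, char escape) {
        StringBuilder sb = new StringBuilder(encoded.length());
        int i = 0;
        while (i < encoded.length()) {
            char ch = encoded.charAt(i);
            if (ch != escape) {
                sb.append(ch); // ordinary character, copied through
                i++;
                continue;
            }
            // An escape must be followed by two hex digits.
            if (i + 2 >= encoded.length()) {
                throw new IllegalArgumentException("truncated escape at index " + i);
            }
            int hi = Character.digit(encoded.charAt(i + 1), 16);
            int lo = Character.digit(encoded.charAt(i + 2), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("bad hex digits at index " + i);
            }
            sb.append((char) (hi * 16 + lo));
            i += 3;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("a%2fb", '%')); // prints "a/b"
    }
}
```

Because the encoder also escapes the escape character itself, a literal `%` in the original string comes back from `%25` without ambiguity.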

Regex to replace characters that Windows doesn't accept in a filename

Windows filename rules are tricky. You're only scratching the surface.

For example, here are some things that are not valid filenames, in addition to the characters you listed:

                                    (yes, that's an empty string)
.
.a
a.
" a" (that's a leading space)
"a " (or a trailing space)
com
prn.txt
[anything over 240 characters]
[any control characters]
[any non-ASCII characters that don't fit in the system codepage,
if the filesystem is FAT32]

Removing special characters in a single regex substitution like String.replaceAll() isn't enough; you can easily end up with something invalid like an empty string or a trailing ‘.’ or ‘ ’. Replacing something like “[^A-Za-z0-9_.]+” with ‘_’ would be a better first step. But you will still need higher-level processing on whatever platform you're using.
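To give an idea of what that higher-level processing might involve, here is a minimal sketch in Java. All names are hypothetical, and the individual choices are assumptions on my part: the `+` quantifier is used so that empty matches are not replaced, leading and trailing dots are stripped, problem names get a `_` prefix, and 240 is used as the length limit because that is the figure in the list above. The reserved-name pattern covers only the commonly documented device names and may not be complete.

```java
import java.util.regex.Pattern;

// Hypothetical clean-up pipeline for names headed to a Windows filesystem.
final class WindowsNameCleaner {

    private static final Pattern DISALLOWED = Pattern.compile("[^A-Za-z0-9_.]+");
    private static final Pattern EDGE_DOTS = Pattern.compile("^\\.+|\\.+$");
    // Device name, optionally followed by a dot and anything else.
    private static final Pattern RESERVED =
            Pattern.compile("(?i)(con|prn|aux|nul|com[1-9]|lpt[1-9])(\\..*)?");
    private static final int MAX_LENGTH = 240; // assumed limit, see lead-in

    static String clean(String raw) {
        // 1. Replace each run of disallowed characters (spaces included).
        String name = DISALLOWED.matcher(raw).replaceAll("_");
        // 2. Drop leading and trailing dots.
        name = EDGE_DOTS.matcher(name).replaceAll("");
        // 3. Never return an empty or reserved name.
        if (name.isEmpty() || RESERVED.matcher(name).matches()) {
            name = "_" + name;
        }
        // 4. Enforce the length limit; cutting may expose a trailing dot.
        if (name.length() > MAX_LENGTH) {
            name = name.substring(0, MAX_LENGTH);
            name = EDGE_DOTS.matcher(name).replaceAll("");
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(clean("my file?.txt")); // prints "my_file_.txt"
        System.out.println(clean("prn.txt"));      // prints "_prn.txt"
    }
}
```

This is lossy by design, so distinct inputs can map to the same output, and it does nothing about the codepage issue mentioned in the list.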


