String Sanitizer for Filename

string sanitizer for filename

Instead of worrying about overlooking characters - how about using a whitelist of characters you are happy to be used? For example, you could allow just good ol' a-z, 0-9, _, and a single instance of a period (.). That's obviously more limiting than most filesystems, but should keep you safe.

Sanitizing strings to make them URL and filename safe?

Some observations on your solution:

'u' at the end of your pattern means that the pattern, and not the text it's matching will be interpreted as UTF-8 (I presume you assumed the latter?).
\w matches the underscore character. You specifically include it for files which leads to the assumption that you don't want them in URLs, but in the code you have URLs will be permitted to include an underscore.
The inclusion of "foreign UTF-8" seems to be locale-dependent. It's not clear whether this is the locale of the server or client. From the PHP docs:

A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

Creating the slug

You probably shouldn't include accented etc. characters in your post slug since, technically, they should be percent encoded (per URL encoding rules) so you'll have ugly looking URLs.

So, if I were you, after lowercasing, I'd convert any 'special' characters to their equivalent (e.g. é -> e) and replace non [a-z] characters with '-', limiting to runs of a single '-' as you've done. There's an implementation of converting special characters here: https://web.archive.org/web/20130208144021/http://neo22s.com/slug

Sanitization in general

OWASP have a PHP implementation of their Enterprise Security API which among other things includes methods for safe encoding and decoding input and output in your application.

The Encoder interface provides:

canonicalize (string $input, [bool $strict = true])
decodeFromBase64 (string $input)
decodeFromURL (string $input)
encodeForBase64 (string $input, [bool $wrap = false])
encodeForCSS (string $input)
encodeForHTML (string $input)
encodeForHTMLAttribute (string $input)
encodeForJavaScript (string $input)
encodeForOS (Codec $codec, string $input)
encodeForSQL (Codec $codec, string $input)
encodeForURL (string $input)
encodeForVBScript (string $input)
encodeForXML (string $input)
encodeForXMLAttribute (string $input)
encodeForXPath (string $input)

https://github.com/OWASP/PHP-ESAPI
https://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API

Sanitizing strings with filenames and extension in Java

You may add alternatives to the regex to match all kinds of scenarios:

(?:(\.\w+)\1*|\.|([^.]))$

And replace with $2.pdf. See the regex demo.

EDIT: In case the extensions that can be duplicated are known, you may use the whitelisting approach via an alternation group:

(?:(\.(?:pdf|gif|jpe?g))\1*|\.|([^.]))$

See another regex demo.

Details:

(?: - start of grouping, the $ end of string anchor is applied to all the alternatives below (they must be at the end of string)
- (\.\w+)\1* - duplicated (or not) extensions (. + 1+ word chars repeated zero or more times) (with the whitelisting approach, only the indicated extensions will be taken into account - (?:pdf|gif|jpe?g) will only match pdf, gif, jpeg, jpg, etc. if more alternatives are added)
- | - or
- \. - a dot
- | - or
- ([^.]) - any char that is not a dot captured into Group 2
) - end of the outer grouping
$ - end of string.

See Java demo:

List<String> strs = Arrays.asList("doubleexsension.pdf.pdf","noextension","nameWithDot.","properName.pdf");
for (String str : strs)
    System.out.println(str.replaceAll("(?:(\\.\\w+)\\1*|\\.|([^.]))$", "$2.pdf"));

How can I sanitize a string for use as a filename?

You can use PathGetCharType function, PathCleanupSpec function or the following trick:

  function IsValidFilePath(const FileName: String): Boolean;
  var
    S: String;
    I: Integer;
  begin
    Result := False;
    S := FileName;
    repeat
      I := LastDelimiter('\/', S);
      MoveFile(nil, PChar(S));
      if (GetLastError = ERROR_ALREADY_EXISTS) or
         (
           (GetFileAttributes(PChar(Copy(S, I + 1, MaxInt))) = INVALID_FILE_ATTRIBUTES)
           and
           (GetLastError=ERROR_INVALID_NAME)
         ) then
        Exit;
      if I>0 then
        S := Copy(S,1,I-1);
    until I = 0;
    Result := True;
  end;

This code divides string into parts and uses MoveFile to verify each part. MoveFile will fail for invalid characters or reserved file names (like 'COM') and return success or ERROR_ALREADY_EXISTS for valid file name.

PathCleanupSpec is in the Jedi Windows API under Win32API/JwaShlObj.pas

Sanitizing a file path in C# without compromising the drive letter

You definitely should make sure that you only receive valid filenames.

If you can't, and you're certain your directory names will be, you could split the path the last backslash (assuming Windows) and reassemble the string:

public static string SanitizePath(string path)
{
    var lastBackslash = path.LastIndexOf('\\');

    var dir = path.Substring(0, lastBackslash);
    var file = path.Substring(lastBackslash, path.Length - lastBackslash);

    foreach (var invalid in Path.GetInvalidFileNameChars())
    {
        file = file.Replace(invalid, '_');
    }

    return dir + file;
}

Turn a string into a valid filename?

You can look at the Django framework for how they create a "slug" from arbitrary text. A slug is URL- and filename- friendly.

The Django text utils define a function, slugify(), that's probably the gold standard for this kind of thing. Essentially, their code is the following.

import unicodedata
import re

def slugify(value, allow_unicode=False):
    """
    Taken from https://github.com/django/django/blob/master/django/utils/text.py
    Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
    dashes to single dashes. Remove characters that aren't alphanumerics,
    underscores, or hyphens. Convert to lowercase. Also strip leading and
    trailing whitespace, dashes, and underscores.
    """
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value.lower())
    return re.sub(r'[-\s]+', '-', value).strip('-_')

And the older version:

def slugify(value):
    """
    Normalizes string, converts to lowercase, removes non-alpha characters,
    and converts spaces to hyphens.
    """
    import unicodedata
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
    value = unicode(re.sub('[^\w\s-]', '', value).strip().lower())
    value = unicode(re.sub('[-\s]+', '-', value))
    # ...
    return value

There's more, but I left it out, since it doesn't address slugification, but escaping.

javascript url-safe filename-safe string

Well, here's one that replaces anything that's not a letter or a number, and makes it all lower case, like your example.

var s = "John Smith's Cool Page";
var filename = s.replace(/[^a-z0-9]/gi, '_').toLowerCase();

Explanation:

The regular expression is /[^a-z0-9]/gi. Well, actually the gi at the end is just a set of options that are used when the expression is used.

i means "ignore upper/lower case differences"
g means "global", which really means that every match should be replaced, not just the first one.

So what we're looking as is really just [^a-z0-9]. Let's read it step-by-step:

The [ and ] define a "character class", which is a list of single-characters. If you'd write [one], then that would match either 'o' or 'n' or 'e'.
However, there's a ^ at the start of the list of characters. That means it should match only characters not in the list.
Finally, the list of characters is a-z0-9. Read this as "a through z and 0 through 9". It's a short way of writing abcdefghijklmnopqrstuvwxyz0123456789.

So basically, what the regular expression says is: "Find every letter that is not between 'a' and 'z' or between '0' and '9'".

String Sanitizer for Filename