string sanitizer for filename
Instead of worrying about overlooking characters - how about using a whitelist of characters you are happy to be used? For example, you could allow just good ol' a-z, 0-9, _, and a single instance of a period (.). That's obviously more limiting than most filesystems, but should keep you safe.
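A minimal Python sketch of that whitelist idea follows. The function name sanitize_filename, the lowercasing step, the "_" replacement character, and the choice to keep the last period are all assumptions made for illustration, not part of the answer above.

```python
import re


def sanitize_filename(name: str, replacement: str = "_") -> str:
    """Keep only a-z, 0-9, _ and at most one period (assumed policy)."""
    # Assumption: fold to lowercase first so A-Z survive as a-z.
    name = name.lower()
    # Replace everything outside the whitelist; periods are handled below.
    name = re.sub(r"[^a-z0-9_.]", replacement, name)
    # Keep only the last period so a single extension separator remains.
    stem, dot, ext = name.rpartition(".")
    if dot:
        name = stem.replace(".", replacement) + "." + ext
    return name


print(sanitize_filename("My Report.v2.PDF"))
```

Keeping the last period rather than the first is a design choice: it preserves what most users would read as the extension.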
Sanitizing strings to make them URL and filename safe?
Some observations on your solution:
- 'u' at the end of your pattern means that both the pattern and the text it's matching are treated as UTF-8; a subject string that is not valid UTF-8 makes the match fail rather than being cleaned up.
- \w matches the underscore character. You explicitly add the underscore for filenames, which suggests you don't want it in URLs, but with the code as written URLs are allowed to contain underscores too.
- The inclusion of "foreign UTF-8" seems to be locale-dependent. It's not clear whether this is the locale of the server or client. From the PHP docs:
A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
Creating the slug
You probably shouldn't include accented or other non-ASCII characters in your post slug since, technically, they should be percent-encoded (per URL encoding rules), so you'll have ugly-looking URLs.
So, if I were you, after lowercasing, I'd convert any 'special' characters to their equivalent (e.g. é -> e) and replace non [a-z] characters with '-', limiting to runs of a single '-' as you've done. There's an implementation of converting special characters here: https://web.archive.org/web/20130208144021/http://neo22s.com/slug
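The steps described above (lowercase, map accented characters to their base letter, replace anything outside a-z with a single '-') can be sketched in Python as follows. This is an illustration rather than the linked implementation: make_slug is a hypothetical name, NFKD normalization stands in for a real transliteration table, and stripping leading and trailing hyphens is an added assumption.

```python
import re
import unicodedata


def make_slug(title: str) -> str:
    """Sketch of the slug rules described above; not the linked implementation."""
    lowered = title.lower()
    # NFKD splits "é" into "e" plus a combining accent; the ASCII encode drops the accent.
    decomposed = unicodedata.normalize("NFKD", lowered)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Each run of characters outside a-z becomes exactly one hyphen.
    slug = re.sub(r"[^a-z]+", "-", ascii_only)
    # Assumption: leading and trailing hyphens are not wanted in a slug.
    return slug.strip("-")


print(make_slug("Café Déjà Vu!"))
```

Note that digits are replaced too, because the rule as stated only keeps a-z. Characters with no ASCII decomposition (for example "ø") are dropped by this approach instead of being transliterated.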
Sanitization in general
OWASP have a PHP implementation of their Enterprise Security API which among other things includes methods for safe encoding and decoding input and output in your application.
The Encoder interface provides:
canonicalize (string $input, [bool $strict = true])
decodeFromBase64 (string $input)
decodeFromURL (string $input)
encodeForBase64 (string $input, [bool $wrap = false])
encodeForCSS (string $input)
encodeForHTML (string $input)
encodeForHTMLAttribute (string $input)
encodeForJavaScript (string $input)
encodeForOS (Codec $codec, string $input)
encodeForSQL (Codec $codec, string $input)
encodeForURL (string $input)
encodeForVBScript (string $input)
encodeForXML (string $input)
encodeForXMLAttribute (string $input)
encodeForXPath (string $input)
https://github.com/OWASP/PHP-ESAPI
https://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API
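To illustrate the idea behind such an encoder (one encoding function per output context), here is a small Python sketch that uses only the standard library. It is not ESAPI and does not reproduce its behavior; the wrapper names below are hypothetical and merely echo the method names listed above.

```python
import base64
import html
from urllib.parse import quote, unquote


def encode_for_html(text: str) -> str:
    # Escapes &, <, > and both quote characters for HTML text or attribute values.
    return html.escape(text, quote=True)


def encode_for_url(text: str) -> str:
    # safe="" percent-encodes "/" as well, so the result is a single URL component.
    return quote(text, safe="")


def decode_from_url(text: str) -> str:
    return unquote(text)


def encode_for_base64(text: str) -> str:
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


untrusted = '<b>"x" & y</b>'
print(encode_for_html(untrusted))
print(encode_for_url("a b/c"))
```

The point of the design is that the same input needs a different encoding depending on where it ends up, so the choice is made at output time, not at input time.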
Sanitizing strings with filenames and extension in Java
You may add alternatives to the regex to match all kinds of scenarios:
(?:(\.\w+)\1*|\.|([^.]))$
And replace with $2.pdf. See the regex demo.
EDIT: In case the extensions that can be duplicated are known, you may use the whitelisting approach via an alternation group:
(?:(\.(?:pdf|gif|jpe?g))\1*|\.|([^.]))$
See another regex demo.
Details:
- (?: - start of grouping; the $ end of string anchor is applied to all the alternatives below (they must be at the end of string)
- (\.\w+)\1* - duplicated (or not) extensions (. + 1+ word chars, repeated zero or more times) (with the whitelisting approach, only the indicated extensions will be taken into account - (?:pdf|gif|jpe?g) will only match pdf, gif, jpeg, jpg, etc. if more alternatives are added)
- | - or
- \. - a dot
- | - or
- ([^.]) - any char that is not a dot, captured into Group 2
- ) - end of the outer grouping
- $ - end of string.
See Java demo:
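// Requires: import java.util.Arrays; import java.util.List;
List<String> strs = Arrays.asList("doubleexsension.pdf.pdf","noextension","nameWithDot.","properName.pdf");
for (String str : strs)
    System.out.println(str.replaceAll("(?:(\\.\\w+)\\1*|\\.|([^.]))$", "$2.pdf"));
// Prints: doubleexsension.pdf, noextension.pdf, nameWithDot.pdf, properName.pdf (one per line)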
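For comparison, the same pattern can be tried in Python. This is a sketch under two assumptions: force_pdf_extension is a hypothetical name, and the interpreter is Python 3.5 or later, where an unmatched group in the replacement is substituted with an empty string (the behavior the $2 reference relies on in Java).

```python
import re

# Same pattern as above: Group 1 is a (possibly repeated) extension,
# Group 2 is a final character that is not a dot.
PATTERN = re.compile(r"(?:(\.\w+)\1*|\.|([^.]))$")


def force_pdf_extension(name: str) -> str:
    # \g<2> is empty unless the third alternative matched.
    return PATTERN.sub(r"\g<2>.pdf", name)


for name in ["doubleexsension.pdf.pdf", "noextension", "nameWithDot.", "properName.pdf"]:
    print(force_pdf_extension(name))
```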
How can I sanitize a string for use as a filename?
You can use PathGetCharType function, PathCleanupSpec function or the following trick:
function IsValidFilePath(const FileName: String): Boolean;
var
  S: String;
  I: Integer;
begin
  Result := False;
  S := FileName;
  repeat
    I := LastDelimiter('\/', S);
    MoveFile(nil, PChar(S));
    if (GetLastError = ERROR_ALREADY_EXISTS) or
       (
         (GetFileAttributes(PChar(Copy(S, I + 1, MaxInt))) = INVALID_FILE_ATTRIBUTES)
         and
         (GetLastError = ERROR_INVALID_NAME)
       ) then
      Exit;
    if I > 0 then
      S := Copy(S, 1, I - 1);
  until I = 0;
  Result := True;
end;
This code divides the string into parts and uses MoveFile to verify each part. MoveFile will fail for invalid characters or reserved file names (like 'COM1') and return success or ERROR_ALREADY_EXISTS for a valid file name.
PathCleanupSpec is in the Jedi Windows API under Win32API/JwaShlObj.pas
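If you would rather not touch the file system at all, the same kind of check can be approximated with static rules. The Python sketch below is not a port of the MoveFile trick: the function name is hypothetical, and the character and reserved-name lists are written from memory of the Windows file naming conventions, so treat them as assumptions to verify.

```python
import re

# Assumed rule set: characters Windows rejects in a file name component.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
# Assumed reserved device names: CON, PRN, AUX, NUL, COM1-COM9, LPT1-LPT9.
_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def is_valid_windows_name(component: str) -> bool:
    """Check one path component (no separators) against the static rules."""
    if not component or _INVALID_CHARS.search(component):
        return False
    # Names ending in a dot or a space are rejected here.
    if component[-1] in ". ":
        return False
    # A reserved name is still reserved when an extension follows it.
    base = component.split(".", 1)[0].rstrip(" ").upper()
    return base not in _RESERVED


print(is_valid_windows_name("report.txt"), is_valid_windows_name("com1.txt"))
```

A full path would be split on its delimiters and each component checked in turn, much as the loop above does.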
Sanitizing a file path in C# without compromising the drive letter
You definitely should make sure that you only receive valid filenames.
If you can't, and you're certain your directory names will be valid, you could split the path at the last backslash (assuming Windows) and reassemble the string:
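// Requires: using System.IO;
public static string SanitizePath(string path)
{
    // Split after the last backslash so the separator stays with the directory part;
    // LastIndexOf returns -1 when there is none, which leaves dir empty.
    var lastBackslash = path.LastIndexOf('\\');
    var dir = path.Substring(0, lastBackslash + 1);
    var file = path.Substring(lastBackslash + 1);
    // '\\' is itself an invalid file name character, so it must not be part of file.
    foreach (var invalid in Path.GetInvalidFileNameChars())
    {
        file = file.Replace(invalid, '_');
    }
    return dir + file;
}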
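The same split-and-reassemble idea can be sketched in Python. The set of invalid characters below is an assumption standing in for Path.GetInvalidFileNameChars(), and sanitize_path is a hypothetical name.

```python
# Assumed stand-in for Path.GetInvalidFileNameChars() on Windows.
INVALID_FILENAME_CHARS = set('<>:"/\\|?*') | {chr(c) for c in range(32)}


def sanitize_path(path: str, replacement: str = "_") -> str:
    # rpartition leaves the directory part untouched, drive letter colon included.
    # With no backslash present it returns ("", "", path).
    directory, sep, filename = path.rpartition("\\")
    cleaned = "".join(
        replacement if ch in INVALID_FILENAME_CHARS else ch for ch in filename
    )
    return directory + sep + cleaned


print(sanitize_path(r"C:\data\re:port?.txt"))
```

Only the part after the last backslash is rewritten, which is why the colon in the drive letter survives while the colon in the file name does not.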
Turn a string into a valid filename?
You can look at the Django framework for how they create a "slug" from arbitrary text. A slug is URL- and filename-friendly.
The Django text utils define a function, slugify(), that's probably the gold standard for this kind of thing. Essentially, their code is the following.
import unicodedata
import re

def slugify(value, allow_unicode=False):
    """
    Taken from https://github.com/django/django/blob/master/django/utils/text.py
    Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
    dashes to single dashes. Remove characters that aren't alphanumerics,
    underscores, or hyphens. Convert to lowercase. Also strip leading and
    trailing whitespace, dashes, and underscores.
    """
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value.lower())
    return re.sub(r'[-\s]+', '-', value).strip('-_')
And the older version:
def slugify(value):
    """
    Normalizes string, converts to lowercase, removes non-alpha characters,
    and converts spaces to hyphens.
    """
    import unicodedata
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
    value = unicode(re.sub('[^\w\s-]', '', value).strip().lower())
    value = unicode(re.sub('[-\s]+', '-', value))
    # ...
    return value
There's more, but I left it out, since it doesn't address slugification, but escaping.
javascript url-safe filename-safe string
Well, here's one that replaces anything that's not a letter or a number, and makes it all lower case, like your example.
var s = "John Smith's Cool Page";
var filename = s.replace(/[^a-z0-9]/gi, '_').toLowerCase();
Explanation:
The regular expression is /[^a-z0-9]/gi. Well, actually the gi at the end is just a set of options that are used when the expression is used.
- i means "ignore upper/lower case differences"
- g means "global", which really means that every match should be replaced, not just the first one.
So what we're looking at is really just [^a-z0-9]. Let's read it step-by-step:
- The [ and ] define a "character class", which is a list of single characters. If you'd write [one], then that would match either 'o' or 'n' or 'e'.
- However, there's a ^ at the start of the list of characters. That means it should match only characters not in the list.
- Finally, the list of characters is a-z0-9. Read this as "a through z and 0 through 9". It's a short way of writing abcdefghijklmnopqrstuvwxyz0123456789.
So basically, what the regular expression says is: "Find every character that is not between 'a' and 'z' or between '0' and '9'".
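For readers working in Python, the same replacement can be sketched as below. to_safe_name is a hypothetical name. re.sub already replaces every match, so there is no counterpart to the g option, and the re.ASCII flag is an addition that keeps case-insensitive matching limited to ASCII letters.

```python
import re


def to_safe_name(text: str) -> str:
    # IGNORECASE plays the role of the "i" option; re.sub is global by default.
    # ASCII stops Unicode case folding from letting non-ASCII letters through.
    return re.sub(r"[^a-z0-9]", "_", text, flags=re.IGNORECASE | re.ASCII).lower()


print(to_safe_name("John Smith's Cool Page"))  # john_smith_s_cool_page
```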