Sanitizing strings to make them URL and filename safe?
Some observations on your solution:
- The 'u' modifier at the end of your pattern makes PCRE treat both the pattern and the subject string as UTF-8. The subject therefore has to be valid UTF-8; if it is not, the preg_* call fails instead of matching.
- \w matches the underscore character. You specifically include it for files which leads to the assumption that you don't want them in URLs, but in the code you have URLs will be permitted to include an underscore.
- The inclusion of "foreign UTF-8" seems to be locale-dependent. It's not clear whether this is the locale of the server or client. From the PHP docs:
A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
Creating the slug
You probably shouldn't include accented or other non-ASCII characters in your post slug since, technically, they should be percent-encoded (per URL encoding rules), so you'll end up with ugly-looking URLs.
So, if I were you, after lowercasing, I'd convert any 'special' characters to their equivalent (e.g. é -> e) and replace non [a-z] characters with '-', limiting to runs of a single '-' as you've done. There's an implementation of converting special characters here: https://web.archive.org/web/20130208144021/http://neo22s.com/slug
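The steps described above (lowercase, transliterate, replace, collapse) could be sketched in Python roughly as follows. This is not the linked implementation: the function name `slugify` is made up for illustration, and using Unicode NFKD decomposition to strip accents is an assumption that works for characters like é but simply drops letters with no ASCII decomposition (ß, ø).

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Hypothetical helper: build an ASCII slug from an arbitrary title."""
    # Decompose accented characters (é -> e + combining accent), then drop
    # everything that is not plain ASCII, which removes the combining marks.
    decomposed = unicodedata.normalize("NFKD", title.lower())
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace each run of characters outside a-z with a single hyphen.
    hyphenated = re.sub(r"[^a-z]+", "-", ascii_only)
    # Trim hyphens left over at either end.
    return hyphenated.strip("-")
```

Because the character class is exactly [a-z], digits are replaced as well; widen it to [a-z0-9] if numbers should survive in the slug.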
Sanitization in general
OWASP have a PHP implementation of their Enterprise Security API which among other things includes methods for safe encoding and decoding input and output in your application.
The Encoder interface provides:
canonicalize (string $input, [bool $strict = true])
decodeFromBase64 (string $input)
decodeFromURL (string $input)
encodeForBase64 (string $input, [bool $wrap = false])
encodeForCSS (string $input)
encodeForHTML (string $input)
encodeForHTMLAttribute (string $input)
encodeForJavaScript (string $input)
encodeForOS (Codec $codec, string $input)
encodeForSQL (Codec $codec, string $input)
encodeForURL (string $input)
encodeForVBScript (string $input)
encodeForXML (string $input)
encodeForXMLAttribute (string $input)
encodeForXPath (string $input)
https://github.com/OWASP/PHP-ESAPI
https://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API
Sanitizing string to make it URL and filename safe?
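You can put your PHP value in a JavaScript variable like this. json_encode() emits a correctly quoted and escaped JavaScript literal, so the value cannot break out of the string or the script block:
<script>
var JSvar = <?= json_encode($phpVar, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT) ?>;
</script>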
string sanitizer for filename
Instead of worrying about overlooking characters, how about using a whitelist of characters you are happy to allow? For example, you could allow just good ol' a-z, 0-9, _, and a single instance of a period (.). That's obviously more limiting than most filesystems, but should keep you safe.
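A minimal Python sketch of such a whitelist follows. The function name `sanitize_filename` is hypothetical, and two details are assumptions rather than part of the answer: the input is lowercased first, and the "single period" kept is the last one, on the guess that it separates the extension.

```python
import re


def sanitize_filename(name: str) -> str:
    """Hypothetical helper: keep only a-z, 0-9, '_' and one '.'."""
    # Drop every character that is not on the whitelist.
    kept = re.sub(r"[^a-z0-9_.]", "", name.lower())
    # Keep only the last period (assumed to precede the extension).
    stem, dot, ext = kept.rpartition(".")
    return stem.replace(".", "") + dot + ext
```

The result can be empty or start with a period (for example when the input was a path such as ../../etc/passwd), so a caller would still want to reject or rename those cases.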
javascript url-safe filename-safe string
Well, here's one that replaces anything that's not a letter or a number, and makes it all lower case, like your example.
var s = "John Smith's Cool Page";
var filename = s.replace(/[^a-z0-9]/gi, '_').toLowerCase();
Explanation:
The regular expression is /[^a-z0-9]/gi. Well, actually the gi at the end is just a set of options that are used when the expression is used.
- i means "ignore upper/lower case differences"
- g means "global", which really means that every match should be replaced, not just the first one.
So what we're looking at is really just [^a-z0-9]. Let's read it step-by-step:
- The [ and ] define a "character class", which is a list of single characters. If you'd write [one], then that would match either 'o' or 'n' or 'e'.
- However, there's a ^ at the start of the list of characters. That means it should match only characters not in the list.
- Finally, the list of characters is a-z0-9. Read this as "a through z and 0 through 9". It's a short way of writing abcdefghijklmnopqrstuvwxyz0123456789.
So basically, what the regular expression says is: "Find every character that is not between 'a' and 'z' or between '0' and '9'" (ignoring case, because of the i option).
Create (sane/safe) filename from any (unsafe) string
Python:
"".join([c for c in filename if c.isalpha() or c.isdigit() or c==' ']).rstrip()
This accepts Unicode characters but removes line breaks, etc.
Example:
filename = u"ad\nbla'{-+\\)(ç?"
gives: adblaç
Edit:
str.isalnum() does the alphanumeric check in one step (from queueoverflow's comment below), and danodonovan suggested also keeping the dot.
keepcharacters = (' ','.','_')
"".join(c for c in filename if c.isalnum() or c in keepcharacters).rstrip()
How can I sanitize a string while maintaining all non-English alphabet support
Strictly speaking, a NUL byte is valid UTF-8 (it encodes U+0000), so if NUL bytes are a concern, reject them with a separate check. Beyond that, assuming you use UTF-8 internally, all you need to do is verify that the passed variables are UTF-8. There's no need to support UTF-16, for example, because you, as the author of the API or form, define the expected encoding and can limit yourself to UTF-8. Further, "Unicode" is not an encoding you need to support, simply because it is not an encoding. Rather, Unicode is a standard and the UTF encodings are part of it.
Now, back to PHP: the function you are looking for is mb_check_encoding(). Error handling is simple: if any parameter doesn't pass that test, you reply with a "bad request" response. No need to try to guess what the user might have wanted.
While the question doesn't specifically ask this, here are some examples and how they should be handled on input:
- non-UTF-8 bytes: Reject with 400 ("bad request").
- strings containing path elements (like ../): Accept.
- filename (not file path) containing path elements (like ../): Reject with 400.
- filenames شعار.jpg, 标志.png or логотип.png: Accept.
- filename foo <0> bar.jpg: Accept.
- number abc: Reject with 400.
- number 1234: Accept.
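The input rules in the list above could be sketched like this. The sketch is in Python rather than PHP, and the function names are hypothetical; each function raises ValueError, which the surrounding request handler is assumed to turn into a 400 response. The extra checks in `parse_filename` (backslash, NUL, "." and "..") are assumptions about what counts as a "path element".

```python
import re


def decode_utf8(raw: bytes) -> str:
    """Decode request bytes, raising ValueError if they are not UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("not valid UTF-8") from exc


def parse_filename(raw: bytes) -> str:
    """Accept a bare filename in any script; reject anything path-like."""
    name = decode_utf8(raw)
    # A filename (not a path) must not contain separators, NUL, or be a
    # dot entry such as "." or "..".
    if not name or name in (".", "..") or any(c in name for c in "/\\\x00"):
        raise ValueError("not a plain filename")
    return name


def parse_number(raw: bytes) -> int:
    """Accept only non-empty runs of ASCII digits."""
    text = decode_utf8(raw)
    if not re.fullmatch(r"[0-9]+", text):
        raise ValueError("not a number")
    return int(text)
```

Note that `parse_filename` deliberately does nothing about "<", ">" or spaces: those are valid in the internal type and are dealt with only when the value is written to a particular output.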
Here's how to handle them for different outputs:
- non-UTF-8 bytes: Can't happen, they were rejected before.
- filename containing path elements: Can't happen, they were rejected before.
- filenames شعار.jpg, 标志.png or логотип.png in HTML: Use verbatim if the HTML encoding is UTF-8, replace with HTML entities when using the default ISO-8859-1.
- filenames شعار.jpg, 标志.png or логотип.png in Bash: Use verbatim, assuming the filesystem's encoding is UTF-8.
- filenames شعار.jpg, 标志.png or логотип.png in SQL: Probably just quote, depends on the driver, DB, tables etc. Consult the manual.
- filename foo <0> bar.jpg in HTML: Escape as "foo &lt;0&gt; bar.jpg". Maybe use "&nbsp;" for the spaces.
- filename foo <0> bar.jpg in Bash: Quote or escape " ", "<" and ">" with backslashes.
- filename foo <0> bar.jpg in SQL: Just quote.
- number abc: Can't happen, it was rejected before.
- number 1234 in HTML: Use verbatim.
- number 1234 in Bash: Use verbatim (not sure).
- number 1234 in SQL: Use verbatim.
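To make the "one value, different escaping per output" idea concrete, here is a small Python sketch using only the standard library. It is an illustration under assumptions, not the PHP approach itself: the `ls` command and the `files` table are invented, and for SQL it uses a bound parameter (the technique DB wrappers provide) rather than manual quoting.

```python
import html
import shlex
import sqlite3

name = "foo <0> bar.jpg"

# HTML: turn markup characters into entities.
html_fragment = "<li>" + html.escape(name) + "</li>"

# Shell: quote so that spaces, "<" and ">" lose their special meaning.
shell_command = "ls -l -- " + shlex.quote(name)

# SQL: pass the value as a bound parameter instead of building query text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT)")
conn.execute("INSERT INTO files (name) VALUES (?)", (name,))
stored = conn.execute("SELECT name FROM files").fetchone()[0]
```

In all three cases the stored or displayed name is still the original string; only its representation in the target context changes.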
The general procedure should be:
- Define your internal types (string, filename, number) and reject anything that doesn't match. These types create constraints (filename doesn't include path elements) and offer guarantees (filename can be appended to a directory to form a filename inside that directory).
- Use a template library (Mustache comes to mind) for HTML.
- Use a DB wrapper library (PDO, Propel, Doctrine) for SQL.
- Escape shell parameters. I'm not sure which way to go here, but I'm sure you will find proper ways.
Escaping is not a single defined procedure but a family of procedures. The actual escaping algorithm depends on the target context. Contrary to what you wrote ("escaping will also screw up the names"), the exact opposite should be the case! Basically, it makes sure that a string containing a less-than sign in XML remains a string containing a less-than sign and doesn't turn into a malformed XML snippet. To achieve that, escaping converts the string so that any character with a special meaning in the target context, like the space character in the shell, is read as plain text instead.
Is there some way to make Twig stop sanitizing HTML URL links?
You can use the raw filter to prevent HTML from being escaped:
{{ some_html|raw }}
Or maybe a better option would be to use it with the striptags filter and whitelist <a> tags:
{{ some_html|striptags('<a>')|raw }}
Internally, Twig uses the PHP strip_tags function. Note that its documentation has this warning:
Warning
This function does not modify any attributes on the tags that you allow using allowable_tags, including the style and onmouseover attributes that a mischievous user may abuse when posting text that will be shown to other users.
See TwigFiddle.
Sanitizing url and parameters
Because you append params_str_submitted_by_user to the base URL after the ? delimiter, you are safe from the type of attack in which the intended domain is turned into a username or password:
Say the URL was http://example.com and params_str_submitted_by_user was @evil.com, and you did not have the / or ? characters in your URL string concatenation.
This would make your URL http://example.com@evil.com, which actually means username example.com at domain evil.com.
However, the username cannot contain the ? (nor slash) character, so you should be safe: the ? you insert ends the host part of the URL before the user-supplied text begins. In your case the URL becomes:
http://example.com?@evil.com
or
http://example.com/?@evil.com
if you include the slash in your base URL (better practice). These are safe because all they do is pass evil.com to your website as a query string value: @evil.com will no longer be interpreted as a domain by the parser.
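The difference can be checked with a URL parser. The following sketch uses Python's urllib.parse.urlsplit as a stand-in; whatever parser sits behind your own http_get may behave differently, so treat it as an illustration of the URL syntax rather than a guarantee about your stack.

```python
from urllib.parse import urlsplit

base = "http://example.com"
user_input = "@evil.com"

# No delimiter: the base URL is parsed as the user name for evil.com.
unsafe = urlsplit(base + user_input)

# With "?" or "/?" in between, the user input ends up in the query string.
safe_query_only = urlsplit(base + "?" + user_input)
safe_with_slash = urlsplit(base + "/?" + user_input)

print(unsafe.username, unsafe.hostname)
print(safe_query_only.hostname, safe_query_only.query)
print(safe_with_slash.hostname, safe_with_slash.query)
```

In the first case the parser reports evil.com as the host; in the other two the host stays example.com and @evil.com is only a query string.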
What is the worst case scenario if even newlines are left in and the user can arbitrarily manipulate the HTTP headers?
This depends on how good your http_get function is at sanitizing values. If http_get does not strip newlines internally, it could be possible for an attacker to control the headers sent from your application.
E.g. if http_get internally created the following request
GET <url> HTTP/1.1
Host: <url.domain>
so under legitimate use it would work like the following:
http_get("https://example.com/foo/bar")
generates
GET /foo/bar HTTP/1.1
Host: example.com
an attacker could set params_str_submitted_by_user to
<space>HTTP/1.1\r\nHost: example.org\r\nCookie: foo=bar\r\n\r\n
this would cause your code to call
http_get("https://example.com/" + "?" + "<space>HTTP/1.1\r\nHost: example.org\r\nCookie: foo=bar\r\n\r\n")
which would cause the request to be
GET /? HTTP/1.1
Host: example.org
Cookie: foo=bar
HTTP/1.1
Host: example.com
Depending on how http_get parses the domain, this might not cause the request to go to example.org instead of example.com; it is just manipulating the header (unless example.org was another site on the same IP address as your site). However, the attacker has managed to manipulate headers and add their own cookie value. The advantage to the attacker depends on what can be gained under your particular setup from them doing this. There is not necessarily any general advantage; it would be more of a logic flaw exploit if they could trick your code into behaving in an unexpected way by causing it to make requests under the control of the attacker.
What should you do?
To guard against the unexpected and unknown, either use a version of http_get that handles header injection properly (many modern languages now deal with this situation internally) or, if http_get is your own implementation, make sure it sanitizes or rejects URLs that contain invalid characters such as carriage returns, line feeds and any other characters that are not valid in a URL. See this question for list of valid characters.
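As a rough illustration of the second option, a guard like the following could run before a URL is handed to a home-grown http_get. It is a Python sketch with hypothetical names (`check_url`, `encode_param`); it only rejects control characters and spaces and is not a full URL validator.

```python
import re
from urllib.parse import quote

# Control characters (including CR and LF) and spaces never appear
# unencoded in a valid URL.
_FORBIDDEN = re.compile(r"[\x00-\x20\x7f]")


def check_url(url: str) -> str:
    """Hypothetical guard: refuse URLs that could split the request."""
    if _FORBIDDEN.search(url):
        raise ValueError("URL contains control characters or spaces")
    return url


def encode_param(value: str) -> str:
    """Percent-encode user input before appending it to the URL."""
    return quote(value, safe="")
```

Percent-encoding the user-supplied part first means CR and LF arrive as %0D and %0A, which `check_url` then accepts because they can no longer terminate the request line.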