Sanitizing Strings to Make Them URL and Filename Safe

Sanitizing strings to make them URL and filename safe?

Some observations on your solution:

'u' at the end of your pattern means that both the pattern and the text it's matching are treated as UTF-8 (a subject string that is not valid UTF-8 makes the match fail instead of being matched byte by byte).
\w matches the underscore character. You explicitly add the underscore for filenames, which suggests you don't want it in URLs, but with the code as written URLs will still be allowed to contain underscores.
The inclusion of "foreign UTF-8" seems to be locale-dependent. It's not clear whether this is the locale of the server or client. From the PHP docs:

A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

Creating the slug

You probably shouldn't include accented or other non-ASCII characters in your post slug: technically they must be percent-encoded (per URL encoding rules), so you'll end up with ugly-looking URLs.

So, if I were you, after lowercasing, I'd convert any 'special' characters to their equivalent (e.g. é -> e) and replace non [a-z] characters with '-', limiting to runs of a single '-' as you've done. There's an implementation of converting special characters here: https://web.archive.org/web/20130208144021/http://neo22s.com/slug
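A minimal Python sketch of those steps, for illustration only: the function name slugify is hypothetical, and keeping the digits 0-9 is an assumption that goes slightly beyond the [a-z] rule above. It relies on Unicode decomposition to strip accents, which handles characters like é but silently drops letters that have no ASCII base form (ß, ø, non-Latin scripts), so a lookup table like the linked implementation may still be needed.

```python
import re
import unicodedata

def slugify(text: str) -> str:
    # Decompose accented characters (é -> e + combining accent), then drop everything non-ASCII
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Lowercase, then collapse every run of other characters into a single '-'
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower())
    # Avoid leading or trailing dashes
    return slug.strip("-")

print(slugify("Crème Brûlée à la carte!"))  # creme-brulee-a-la-carte
```

Note that the result can be empty (for example for input consisting only of punctuation), so callers probably want a fallback such as a numeric ID.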

Sanitization in general

OWASP have a PHP implementation of their Enterprise Security API which among other things includes methods for safe encoding and decoding input and output in your application.

The Encoder interface provides:

canonicalize (string $input, [bool $strict = true])
decodeFromBase64 (string $input)
decodeFromURL (string $input)
encodeForBase64 (string $input, [bool $wrap = false])
encodeForCSS (string $input)
encodeForHTML (string $input)
encodeForHTMLAttribute (string $input)
encodeForJavaScript (string $input)
encodeForOS (Codec $codec, string $input)
encodeForSQL (Codec $codec, string $input)
encodeForURL (string $input)
encodeForVBScript (string $input)
encodeForXML (string $input)
encodeForXMLAttribute (string $input)
encodeForXPath (string $input)

https://github.com/OWASP/PHP-ESAPI
https://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API

Sanitizing string to make it URL and filename safe?

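You can put your PHP value into a JavaScript variable like this, letting json_encode() produce a correctly quoted and escaped literal (echoing the raw value between quotes breaks, and allows script injection, as soon as the value contains a quote, a newline or </script>):

<script>

var JSvar = <?= json_encode($phpVar, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT) ?>;

</script>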

string sanitizer for filename

Instead of worrying about overlooking characters - how about using a whitelist of characters you are happy to be used? For example, you could allow just good ol' a-z, 0-9, _, and a single instance of a period (.). That's obviously more limiting than most filesystems, but should keep you safe.
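One way such a whitelist could look, sketched in Python. The function name whitelist_filename and the choice to treat the last period as the extension separator are assumptions made for this example, not part of the suggestion above.

```python
import re

def whitelist_filename(name: str) -> str:
    name = name.lower()
    # Split on the LAST period so at most one period survives
    stem, dot, ext = name.rpartition(".")
    if not dot:
        # rpartition returns ('', '', name) when there is no period at all
        stem, ext = name, ""
    # Drop every character that is not on the whitelist
    stem = re.sub(r"[^a-z0-9_]", "", stem)
    ext = re.sub(r"[^a-z0-9_]", "", ext)
    if stem and ext:
        return f"{stem}.{ext}"
    return stem or ext

print(whitelist_filename("My Photo.v2.JPG"))   # myphotov2.jpg
print(whitelist_filename("../../etc/passwd"))  # etcpasswd
```

The result may be an empty string if nothing in the input is on the whitelist, so the caller still has to decide what to do in that case.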

javascript url-safe filename-safe string

Well, here's one that replaces anything that's not a letter or a number, and makes it all lower case, like your example.

var s = "John Smith's Cool Page";
var filename = s.replace(/[^a-z0-9]/gi, '_').toLowerCase();

Explanation:

The regular expression is /[^a-z0-9]/gi. Well, actually the gi at the end is just a set of options that are used when the expression is used.

i means "ignore upper/lower case differences"
g means "global", which really means that every match should be replaced, not just the first one.

So what we're looking at is really just [^a-z0-9]. Let's read it step-by-step:

The [ and ] define a "character class", which is a list of single-characters. If you'd write [one], then that would match either 'o' or 'n' or 'e'.
However, there's a ^ at the start of the list of characters. That means it should match only characters not in the list.
Finally, the list of characters is a-z0-9. Read this as "a through z and 0 through 9". It's a short way of writing abcdefghijklmnopqrstuvwxyz0123456789.

So basically, what the regular expression says is: "Find every character that is not between 'a' and 'z' or between '0' and '9'".
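For comparison, the same character class can be used from Python. This is only a sketch of the equivalent call, not part of the JavaScript answer: re.sub replaces every match by default, so no counterpart to the g option is needed, and re.IGNORECASE stands in for i.

```python
import re

s = "John Smith's Cool Page"
# Replace every character outside a-z / 0-9 (case-insensitively) with '_', then lowercase
filename = re.sub(r"[^a-z0-9]", "_", s, flags=re.IGNORECASE).lower()
print(filename)  # john_smith_s_cool_page
```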

Create (sane/safe) filename from any (unsafe) string

Python:

"".join(c for c in filename if c.isalpha() or c.isdigit() or c == ' ').rstrip()

This accepts Unicode letters and digits but removes line breaks, punctuation, etc.

Example:

filename = "ad\nbla'{-+\\)(ç?"

gives: adblaç

edit
str.isalnum() does the alphanumeric check in one step (from a comment by queueoverflow below). danodonovan suggested also keeping the dot.

    keepcharacters = (' ','.','_')
    "".join(c for c in filename if c.isalnum() or c in keepcharacters).rstrip()

How can I sanitize a string while maintaining all non-English alphabet support

A NUL byte is, strictly speaking, valid UTF-8 (it encodes U+0000), so an encoding check alone will not reject it; reject control characters separately where they make no sense, such as in filenames. Beyond that, assuming you use UTF-8 internally, all you need to do is verify that the passed variables are UTF-8. There's no need to support UTF-16, for example, because you, as author of the corresponding API or form, define the correct encoding and can limit yourself to UTF-8. Further, "Unicode" is not something you need to support as an encoding, simply because it is not an encoding. Rather, Unicode is a standard and the UTF encodings are part of it.

Now, back to PHP: the function you are looking for is mb_check_encoding(). Error handling is simple: if any parameter doesn't pass that test, you reply with a "bad request" response. No need to try to guess what the user might have wanted.

While the question doesn't specifically ask this, here are some examples and how they should be handled on input:

non-UTF-8 bytes: Reject with 400 ("bad request").
strings containing path elements (like ../): Accept.
filename (not file path) containing path elements (like ../): Reject with 400.
filenames شعار.jpg, 标志.png or логотип.png: Accept.
filename foo <0> bar.jpg: Accept.
number abc: Reject with 400.
number 1234: Accept.
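The input rules above could be sketched as follows. This is an illustration in Python rather than PHP; the BadRequest exception and the parse_* function names are hypothetical, and what exactly counts as a "path element" (here: slashes, backslashes, "." and "..", plus NUL) is an assumption.

```python
class BadRequest(ValueError):
    """Signals that the caller should answer with HTTP 400."""

def parse_string(raw: bytes) -> str:
    try:
        return raw.decode("utf-8")  # strict decoding: invalid bytes raise
    except UnicodeDecodeError as exc:
        raise BadRequest("not valid UTF-8") from exc

def parse_filename(raw: bytes) -> str:
    name = parse_string(raw)
    # A filename (not a path) must not contain separators or be a directory reference
    if name in ("", ".", "..") or "/" in name or "\\" in name or "\x00" in name:
        raise BadRequest("filename must not contain path elements")
    return name

def parse_number(raw: bytes) -> int:
    text = parse_string(raw)
    if not (text.isascii() and text.isdigit()):
        raise BadRequest("not a number")
    return int(text)

print(parse_filename("логотип.png".encode("utf-8")))  # логотип.png
print(parse_filename(b"foo <0> bar.jpg"))              # foo <0> bar.jpg
print(parse_number(b"1234"))                           # 1234
```

Note that nothing is escaped or rewritten at this stage: values are either accepted unchanged or rejected.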

Here's how to handle them for different outputs:

non-UTF-8 bytes: Can't happen, they were rejected before.
filename containing path elements: Can't happen, they were rejected before.
filenames شعار.jpg, 标志.png or логотип.png in HTML: Use verbatim if the HTML encoding is UTF-8, replace with HTML entities when using the default ISO-8859-1.
filenames شعار.jpg, 标志.png or логотип.png in Bash: Use verbatim, assuming the filesystem's encoding is UTF-8.
filenames شعار.jpg, 标志.png or логотип.png in SQL: Probably just quote, depends on the driver, DB, tables etc. Consult the manual.
filename foo <0> bar.jpg in HTML: Escape as "foo &lt;0&gt; bar.jpg". Maybe use "&nbsp;" for the spaces.
filename foo <0> bar.jpg in Bash: Quote or escape " ", "<" and ">" with backslashes.
filename foo <0> bar.jpg in SQL: Just quote.
number abc: Can't happen, they were rejected before.
number 1234 in HTML: Use verbatim.
number 1234 in Bash: Use verbatim (not sure).
number 1234 in SQL: Use verbatim.

The general procedure should be:

Define your internal types (string, filename, number) and reject anything that doesn't match. These types create constraints (filename doesn't include path elements) and offer guarantees (filename can be appended to a directory to form a filename inside that directory).
Use a template library (Mustache comes to mind) for HTML.
Use a DB wrapper library (PDO, Propel, Doctrine) for SQL.
Escape shell parameters. I'm not sure which way to go here, but I'm sure you will find proper ways.

Escaping is not a single defined procedure but a family of procedures. The actual escaping algorithm used depends on the target context. Contrary to what you wrote ("escaping will also screw up the names"), the exact opposite should be the case! Basically, it makes sure that a string containing a less-than sign in XML remains a string containing a less-than sign and doesn't turn into a malformed XML snippet. To achieve that, escaping converts strings so that any character that is normally not interpreted as plain text, like the space character in the shell, loses its special interpretation.
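To illustrate that each output context has its own escaping, here is a small Python sketch using only the standard library. The table name files and the ls command are made up for the example; in PHP the corresponding tools would be different ones.

```python
import html
import shlex
import sqlite3

name = "foo <0> bar.jpg"

# HTML context: '<' and '>' become entities, the text stays the same for the reader
html_fragment = f"<li>{html.escape(name)}</li>"

# Shell context: the whole value is quoted as a single argument
shell_command = f"ls -l -- {shlex.quote(name)}"

# SQL context: no manual escaping at all, the driver binds the value as a parameter
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT)")
conn.execute("INSERT INTO files (name) VALUES (?)", (name,))
stored = conn.execute("SELECT name FROM files").fetchone()[0]

print(html_fragment)   # <li>foo &lt;0&gt; bar.jpg</li>
print(shell_command)   # ls -l -- 'foo <0> bar.jpg'
print(stored == name)  # True
```

In all three cases the stored or displayed name is still foo <0> bar.jpg; only its representation in the target language differs.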

Is there some way to make Twig stop sanitizing HTML URL links?

You can use the raw filter to prevent HTML from being escaped:

{{ some_html|raw }}

Or maybe a better option would be to use it with the striptags filter and whitelist <a> tags:

{{ some_html|striptags('<a>')|raw }}

Internally, Twig uses the PHP strip_tags function. Note that its documentation has this warning:

Warning

This function does not modify any attributes on the tags that you allow using allowable_tags, including the style and onmouseover attributes that a mischievous user may abuse when posting text that will be shown to other users.

See TwigFiddle.

Sanitizing url and parameters

Because you append params_str_submitted_by_user to the base URL after the ? delimiter, you are safe from the type of attack in which the intended domain is reinterpreted as a username or password:

Say URL was http://example.com and params_str_submitted_by_user was @evil.com and you did not have the / or ? characters in your URL string concatenation.

This would make your URL http://example.com@evil.com which actually means username example.com at domain evil.com.

However, the username cannot contain the ? (or /) character, so you are safe because your concatenation always puts a ? between the base URL and the user input. In your case the URL becomes:

http://example.com?@evil.com

or

http://example.com/?@evil.com

if you include the slash in your base URL (better practice). Both are safe: all they do is pass @evil.com to your website as a query string value, because @evil.com will no longer be interpreted as a domain by the parser.
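The difference can be checked with a URL parser. The sketch below uses Python's urllib.parse purely as an example of a standards-following parser; whatever http_get uses internally may behave differently.

```python
from urllib.parse import urlsplit

for url in ("http://example.com@evil.com",
            "http://example.com?@evil.com",
            "http://example.com/?@evil.com"):
    parts = urlsplit(url)
    # hostname is where the request would go; username is the part before '@'
    print(url, "-> host:", parts.hostname, "| user:", parts.username, "| query:", parts.query)
```

Only the first URL, built without the ? or /, resolves to evil.com as the host.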

What is the worst case scenario if even newlines are left in and the user can arbitrarily manipulate the HTTP headers?

This depends on how good your http_get function is at sanitizing values. If http_get does not strip newlines internally it could be possible for an attacker to control the headers sent from your application.

e.g. If http_get internally created the following request

GET <url> HTTP/1.1
Host: <url.domain>

so under legitimate use it would work like the following:

http_get("https://example.com/foo/bar")

generates

GET /foo/bar HTTP/1.1
Host: example.com

an attacker could set params_str_submitted_by_user to

<space>HTTP/1.1\r\nHost: example.org\r\nCookie: foo=bar\r\n\r\n

this would cause your code to call

http_get("https://example.com/" + "?" + "<space>HTTP/1.1\r\nHost: example.org\r\nCookie: foo=bar\r\n\r\n")

which would cause the request to be

GET /? HTTP/1.1
Host: example.org
Cookie: foo=bar

 HTTP/1.1
Host: example.com

Depending on how http_get parses the domain, this might not send the request to example.org instead of example.com; it only manipulates the Host header (unless example.org is another site on the same IP address as your site). However, the attacker has managed to manipulate headers and add their own cookie value. What the attacker gains depends on your particular setup. There is not necessarily any general advantage; it would be more of a logic-flaw exploit if they could trick your code into behaving unexpectedly by making it send requests under the attacker's control.
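The effect can be reproduced with a deliberately naive request builder. naive_request below is a hypothetical stand-in for what an unsafe http_get might do internally; it is not the API of any real library.

```python
def naive_request(host: str, target: str) -> str:
    # UNSAFE on purpose: the target is pasted into the request line unchecked
    return f"GET {target} HTTP/1.1\r\nHost: {host}\r\n\r\n"

payload = " HTTP/1.1\r\nHost: example.org\r\nCookie: foo=bar\r\n\r\n"
request = naive_request("example.com", "/" + "?" + payload)

lines = request.split("\r\n")
# Everything up to the first empty line is what the server treats as the request head
head = lines[:lines.index("")]
for line in head:
    print(line)
```

The printed request head contains the attacker's Host and Cookie lines; the genuine Host: example.com header ends up after the blank line, where the server no longer reads it as a header.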

What should you do?

To guard against the unexpected and unknown, use a version of http_get that handles header injection properly. Many modern languages now deal with this situation internally.

Or, if http_get is your own implementation, make sure it sanitizes or rejects URLs that contain invalid characters such as carriage returns or line feeds, or anything else that is invalid in a URL. See this question for a list of valid characters.
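A sketch of both defences in Python, under the assumption that the user-supplied part is a query string in which = and & should keep their meaning. The function name build_url is hypothetical.

```python
import re
from urllib.parse import quote

# ASCII control characters (including CR and LF), space and DEL never appear raw in a URL
_FORBIDDEN = re.compile(r"[\x00-\x20\x7f]")

def build_url(base: str, user_params: str) -> str:
    # Percent-encode the untrusted part, so CR, LF and space become %0D, %0A and %20
    url = base + "?" + quote(user_params, safe="=&")
    # Reject anything that still contains raw whitespace or control characters
    if _FORBIDDEN.search(url):
        raise ValueError("URL contains whitespace or control characters")
    return url

print(build_url("https://example.com/", "a=1&b=2"))
print(build_url("https://example.com/", " HTTP/1.1\r\nHost: example.org"))
```

With the injection attempt, the second call yields a URL whose query contains %0D%0A instead of a real line break, so it can no longer split the request line.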
