What Does Filter_Sanitize_String Do

What does FILTER_SANITIZE_STRING do?

According to PHP Manual:

Strip tags, optionally strip or encode special characters.

According to W3Schools:

The FILTER_SANITIZE_STRING filter strips or encodes unwanted characters.

This filter removes data that is potentially harmful for your application. It is used to strip tags and remove or encode unwanted characters.

Now, that doesn't tell us much. Let's go see some PHP sources.

ext/filter/filter.c:

static const filter_list_entry filter_list[] = {
    /*...*/
    { "string", FILTER_SANITIZE_STRING, php_filter_string },
    { "stripped", FILTER_SANITIZE_STRING, php_filter_string },
    { "encoded", FILTER_SANITIZE_ENCODED, php_filter_encoded },
    /*...*/

Now, let's go see how php_filter_string is defined.

ext/filter/sanitizing_filters.c:

/* {{{ php_filter_string */
void php_filter_string(PHP_INPUT_FILTER_PARAM_DECL)
{
    size_t new_len;
    unsigned char enc[256] = {0};

    /* strip high/strip low ( see flags )*/
    php_filter_strip(value, flags);

    if (!(flags & FILTER_FLAG_NO_ENCODE_QUOTES)) {
        enc['\''] = enc['"'] = 1;
    }
    if (flags & FILTER_FLAG_ENCODE_AMP) {
        enc['&'] = 1;
    }
    if (flags & FILTER_FLAG_ENCODE_LOW) {
        memset(enc, 1, 32);
    }
    if (flags & FILTER_FLAG_ENCODE_HIGH) {
        memset(enc + 127, 1, sizeof(enc) - 127);
    }

    php_filter_encode_html(value, enc);

    /* strip tags, implicitly also removes \0 chars */
    new_len = php_strip_tags_ex(Z_STRVAL_P(value), Z_STRLEN_P(value), NULL, NULL, 0, 1);
    Z_STRLEN_P(value) = new_len;

    if (new_len == 0) {
        zval_dtor(value);
        if (flags & FILTER_FLAG_EMPTY_STRING_NULL) {
            ZVAL_NULL(value);
        } else {
            ZVAL_EMPTY_STRING(value);
        }
        return;
    }
}

I'll skip commenting flags since they're already explained on the Internet, like you said, and focus on what is always performed instead, which is not so well documented.

First - php_filter_strip. It doesn't do much, just takes the flags you pass to the function and processes them accordingly. It does the well-documented stuff.

Then we construct a lookup table of bytes to encode and call php_filter_encode_html. It's more interesting: depending on the flags, it converts ", ', & and characters with byte values below 32 or from 127 upwards to numeric HTML entities, so with FILTER_FLAG_ENCODE_AMP an & in your string becomes &#38;.

Then we get a call to php_strip_tags_ex, which just strips HTML, XML and PHP tags (according to its definition in /ext/standard/string.c) and removes NULL bytes, like the comment says.

The code that follows it is used for internal string management and doesn't really do any sanitization. Well, not exactly: passing the undocumented flag FILTER_FLAG_EMPTY_STRING_NULL returns NULL if the sanitized string is empty, instead of an empty string, but that isn't really very useful. An example:

var_dump(filter_var("yo", FILTER_SANITIZE_STRING, FILTER_FLAG_EMPTY_STRING_NULL));
var_dump(filter_var("\0", FILTER_SANITIZE_STRING, FILTER_FLAG_EMPTY_STRING_NULL));
var_dump(filter_var("yo", FILTER_SANITIZE_STRING));
var_dump(filter_var("\0", FILTER_SANITIZE_STRING));

string(2) "yo"
NULL
string(2) "yo"
string(0) ""

There isn't much more going on, so the manual was fairly correct - to sum it up:

  • Always: strip HTML, XML and PHP tags, strip NULL bytes.
  • FILTER_FLAG_NO_ENCODE_QUOTES - This flag does not encode quotes.
  • FILTER_FLAG_STRIP_LOW - Strip characters with ASCII value below 32.
  • FILTER_FLAG_STRIP_HIGH - Strip characters with ASCII value above 127.
  • FILTER_FLAG_ENCODE_LOW - Encode characters with ASCII value below 32.
  • FILTER_FLAG_ENCODE_HIGH - Encode characters with ASCII value above 127.
  • FILTER_FLAG_ENCODE_AMP - Encode the & character to &#38; (not &amp;).
  • FILTER_FLAG_EMPTY_STRING_NULL - Return NULL instead of empty strings.
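
The summary above can be checked with a short script. This is only a sketch: the sample strings are made up, and since FILTER_SANITIZE_STRING is deprecated as of PHP 8.1 the script silences deprecation notices and assumes a PHP version that still ships the constant.

```php
<?php
// Illustrative only: FILTER_SANITIZE_STRING is deprecated as of PHP 8.1,
// so deprecation notices are silenced for this demonstration.
error_reporting(E_ALL & ~E_DEPRECATED);

$input = '<b>Tom & "Jerry"</b>';

// Default: tags are stripped and quotes become numeric entities.
$default = filter_var($input, FILTER_SANITIZE_STRING);

// Quotes are left alone when FILTER_FLAG_NO_ENCODE_QUOTES is set.
$noQuotes = filter_var($input, FILTER_SANITIZE_STRING, FILTER_FLAG_NO_ENCODE_QUOTES);

// The ampersand is only encoded when FILTER_FLAG_ENCODE_AMP is set.
$amp = filter_var($input, FILTER_SANITIZE_STRING, FILTER_FLAG_ENCODE_AMP);

// Stripping works on bytes, so a UTF-8 "é" (0xC3 0xA9) disappears entirely.
$stripHigh = filter_var("caf\xC3\xA9", FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);

var_dump($default, $noQuotes, $amp, $stripHigh);
```

Note that the STRIP and ENCODE flags operate on single bytes, so they are not safe for multibyte text such as UTF-8.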

FILTER_SANITIZE_STRING is stripping the character and any text after it

The root issue is that when you use FILTER_SANITIZE_STRING to strip HTML tags you are handling your input as HTML. According to your description, your input is plain text. As such, the filter can only corrupt the input data, as users have already reported.

While it seems to be quite a popular technique, I've never understood the concept of stripping HTML tags from plain text as a sanitization method. If it isn't HTML you don't need to care about HTML tags, for the same reason that you don't need to care about SQL keywords or command line commands. It's nothing but data.

But, of course, when you inject your string into HTML afterwards you need to escape it in order to ensure that:

  1. Your data is displayed as-is
  2. The result is still valid HTML

That's why htmlspecialchars() exists. Similarly, you need to use the corresponding escape mechanism when you dynamically generate any other kind of code: SQL, JavaScript, JSON...
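
As a minimal sketch of that idea (the $comment value is a made-up example), the plain text is kept exactly as typed and only escaped at the moment it is written into HTML:

```php
<?php
// Hypothetical plain-text input; it is stored and processed unmodified.
$comment = "I <3 PHP & 'quotes'";

// Escape only at the moment the value is written into HTML.
$html = '<p>' . htmlspecialchars($comment, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8') . '</p>';

echo $html, PHP_EOL;
// <p>I &lt;3 PHP &amp; &#039;quotes&#039;</p>
```

The browser renders the original text, "<" included, and nothing is lost from the stored value.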

Constant FILTER_SANITIZE_STRING is deprecated

This filter had an unclear purpose. It's difficult to say what exactly it was meant to accomplish or when it should be used. It was also confused with the default string filter, due to its name, when in reality the default string filter is called FILTER_UNSAFE_RAW. The PHP community decided that the usage of this filter should not be supported anymore.

The behaviour of this filter was very unintuitive. It removed everything between < and the end of the string or until the next >. It also removed all NUL bytes. Finally, it encoded ' and " into their HTML entities.

If you want to replace it, you have a couple of options:

  1. Use the default string filter FILTER_UNSAFE_RAW that doesn't do any filtering. This should be used if you had no idea about the behaviour of FILTER_SANITIZE_STRING and you just want to use a default filter that will give you the string value.

  2. If you used this filter to protect against XSS vulnerabilities, then replace its usage with htmlspecialchars(). Don't call this function on the input data. To protect against XSS you need to encode the output!

  3. If you knew exactly what that filter does and you want to create a polyfill, you can do that easily with regex.

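    function filter_string_polyfill(string $string): string
    {
        // Remove NUL bytes and everything from "<" up to the next ">" (or the end of the string).
        $str = preg_replace('/\x00|<[^>]*>?/', '', $string);
        // Encode single and double quotes as numeric HTML entities.
        return str_replace(["'", '"'], ['&#39;', '&#34;'], $str);
    }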
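
To see the unintuitive behaviour described above, here is a small sketch that runs the same regex-based logic on a few made-up inputs. The function name legacy_sanitize_string is hypothetical and only used to keep the example self-contained.

```php
<?php
// Same logic as the regex polyfill, under a hypothetical name.
function legacy_sanitize_string(string $string): string
{
    // Drop NUL bytes and anything from "<" up to the next ">" (or end of string).
    $stripped = preg_replace('/\x00|<[^>]*>?/', '', $string);
    // Encode quotes as numeric entities, as the old filter did by default.
    return str_replace(["'", '"'], ['&#39;', '&#34;'], $stripped);
}

echo legacy_sanitize_string("<b>it's</b>"), PHP_EOL;         // it&#39;s
echo legacy_sanitize_string("price < 10 dollars"), PHP_EOL;  // "price " (rest is gone)
echo legacy_sanitize_string("a\0b"), PHP_EOL;                // ab
```

The second call shows why plain text gets corrupted: a lone "<" is treated as the start of a tag, and everything after it is discarded.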

Don’t try to sanitize input. Escape output.

Why does FILTER_SANITIZE_STRING remove part of the SQL string?

You are explicitly removing everything between < and > when you pass the string through your filter. I'm not sure why you are doing this and expecting different results.

public function SetSQL($sql) {
    $this->sql = filter_var($sql, FILTER_SANITIZE_STRING, FILTER_FLAG_NO_ENCODE_QUOTES);
}

Maybe you are confused about the name of FILTER_SANITIZE_STRING. This filter removes all substrings between < and > (or the end of the whole string) including the brackets. It doesn't encode quotes as you have disabled that with a flag.

To fix this, simply remove that filter altogether. This will do what you want:

public function SetSQL($sql) {
    $this->sql = $sql;
}

By the way, constant FILTER_SANITIZE_STRING is deprecated. Please stop using it.

Sanitize filter_var PHP string but keep '

Computer data itself is neither harmful nor innocuous. It's just a piece of information that can later be used for a given purpose.

Sometimes, data is used as computer source code and such code eventually leads to physical actions (a disk spins, an LED blinks, a picture is uploaded to a remote computer, a thermostat turns off the boiler...). And it's then (and only then) that data can become harmful; we even lose expensive space ships now and then because of software bugs.

Code you write yourself can be as harmful or innocuous as your abilities or good faith dictate. The big problem comes when your application has a vulnerability that allows execution of untrusted third-party code. This is particularly serious in web applications, which are connected to the open internet and are expected to receive data from anywhere in the world. But how is that physically possible? There are several ways, but the most typical case is dynamically generated code, and this happens all the time on the modern web. You use PHP to generate SQL, HTML, JavaScript... If you pick untrusted arbitrary data (e.g. a URL parameter or a form field) and use it to compose code that will later be executed (either by your server or by the visitor's browser), someone can be hacked (either you or your users).

You'll see that every day here at Stack Overflow:

$username = $_POST["username"];
$row = mysql_fetch_array(mysql_query("select * from users where username='$username'"));
<td><?php echo $row["title"]; ?></td>
var id = "<?php echo $_GET["id"]; ?>";

Faced with this problem, some claim: let's sanitize! It's obvious that some characters are evil so we'll remove them all and we're done, right? And then we see stuff like this:

$username = $_POST["username"];
$username = strip_tags($username);
$username = htmlentities($username);
$username = stripslashes($username);
$row = mysql_fetch_array(mysql_query("select * from users where username='$username'"));

This is a surprisingly widespread misconception adopted even by some professionals. You see the symptoms everywhere: your comment is mutilated at the first < symbol, you get "your password cannot contain spaces" on sign-up and you read Why can’t I use certain words like "drop" as part of my Security Question answers? in the FAQ. It's even inside computer languages: whenever you read "sanitize", "escape"... in a function name (without further context), you have a good hint that it might be a misguided effort.

It's all about establishing a clear separation between data and code: the user provides data but only you provide code. And there isn't a universal one-size-fits-all solution because each computer language has its own syntax and rules. DROP TABLE users; can be terribly dangerous in SQL:

mysql> DROP TABLE users;
Query OK, 56020 rows affected (0.52 sec)

(oops!)... but it's not as bad in e.g. JavaScript. Look, it doesn't even run:

C:\>node
> DROP TABLE users;
SyntaxError: Unexpected identifier
at Object.exports.createScript (vm.js:24:10)
at REPLServer.defaultEval (repl.js:235:25)
at bound (domain.js:287:14)
at REPLServer.runBound [as eval] (domain.js:300:12)
at REPLServer.<anonymous> (repl.js:427:12)
at emitOne (events.js:95:20)
at REPLServer.emit (events.js:182:7)
at REPLServer.Interface._onLine (readline.js:211:10)
at REPLServer.Interface._line (readline.js:550:8)
at REPLServer.Interface._ttyWrite (readline.js:827:14)
>

This last example also illustrates that it's not only a security concern. Even if you're not being hacked, generating code from random input can simply make your app crash:

SELECT * FROM customers WHERE last_name='O'Brian';

You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'Brian''
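
To make the "each language has its own rules" point concrete, here is a small sketch that prepares the same made-up value for three different output contexts. The variable names are illustrative; the functions are standard PHP.

```php
<?php
// Hypothetical untrusted value that ends up in three different contexts.
$name = "O'Brian <3";

// HTML text or attribute value.
$forHtml = htmlspecialchars($name, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// JavaScript literal inside a <script> block; the HEX flags keep
// <, >, &, ' and " out of the generated source.
$forJs = json_encode($name, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT);

// Single component of a URL (path segment or query value).
$forUrl = rawurlencode($name);

echo $forHtml, PHP_EOL; // O&#039;Brian &lt;3
echo $forJs, PHP_EOL;
echo $forUrl, PHP_EOL;  // O%27Brian%20%3C3
```

The three results differ because each target syntax reserves different characters, which is why no single "sanitize" step can cover all of them.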

So, what shall be done then if there isn't a universal solution?

  1. Understand the problem:

    If you inject raw literal data improperly it can become code (and sometimes invalid code).

  2. Use the specific mechanism for each technology:

    If the target language requires escaping:

    <p><3 to code</p>
    <p>&lt;3 to code</p>

    ... find a specific tool to escape in the source language:

    echo '<p>' . htmlspecialchars($motto) . '</p>';

    If the language/framework/technology allows sending data in a separate channel, do it:

    $sql = 'SELECT password_hash FROM user WHERE username=:username';
    $params = array(
        'username' => $username,
    );
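
The separate-channel approach can be sketched end to end with PDO. This assumes the pdo_sqlite extension and an in-memory database purely for illustration; the table layout and the 'placeholder-hash' value are made up, and any other PDO driver would be used the same way.

```php
<?php
// Assumption: pdo_sqlite is available; an in-memory database keeps the sketch self-contained.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical schema and sample row.
$pdo->exec('CREATE TABLE user (username TEXT PRIMARY KEY, password_hash TEXT)');
$insert = $pdo->prepare('INSERT INTO user (username, password_hash) VALUES (:username, :hash)');
$insert->execute(['username' => "O'Brian", 'hash' => 'placeholder-hash']);

// The query text is fixed; the value travels separately as a bound parameter.
$username = "O'Brian";
$sql = 'SELECT password_hash FROM user WHERE username=:username';
$params = ['username' => $username];

$stmt = $pdo->prepare($sql);
$stmt->execute($params);
$hash = $stmt->fetchColumn();

var_dump($hash);
```

The apostrophe in O'Brian needs no escaping or filtering here, because the value is never spliced into the SQL text.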

PHP filter_input_array with FILTER_SANITIZE_STRING allows empty string

You can create a custom filter:

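$filter = array('filter' => FILTER_CALLBACK, 'options' => function ($input) {
    $filtered = filter_var($input, FILTER_SANITIZE_STRING);
    // Compare explicitly: a plain truthiness test would also turn "0" into null.
    return ($filtered === false || $filtered === '') ? null : $filtered;
});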

And then use it in $args:

$args = array(
    'value' => $filter
);

$inputs = filter_input_array(INPUT_POST, $args);
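
Since FILTER_SANITIZE_STRING is deprecated, here is a sketch of the same callback idea without it. It is a variant, not the original answer's code: it trims the value instead of stripping tags, and it uses filter_var_array on a made-up array so it can run outside a web request. filter_input_array(INPUT_POST, ...) accepts the same definition.

```php
<?php
// Callback filter that turns empty (or whitespace-only) strings into null.
$emptyToNull = [
    'filter'  => FILTER_CALLBACK,
    'options' => static function ($input) {
        $trimmed = trim((string) $input);
        return $trimmed === '' ? null : $trimmed;
    },
];

// Hypothetical stand-in for $_POST.
$post = ['value' => '   ', 'name' => " O'Brian <3 "];

$inputs = filter_var_array($post, [
    'value' => $emptyToNull,
    'name'  => $emptyToNull,
]);

var_dump($inputs);
```

The text itself is left intact, so it still has to be escaped when it is output.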

PHP - FILTER_SANITIZE_ENCODED doesn't work correctly

not sure i understand exactly what you want, but.. does this work?

function encode(string $str): string
{
    $ret = "";
    for ($i = 0, $max = strlen($str); $i < $max; ++$i) {
        if (ord($str[$i]) > 127) {
            $ret .= $str[$i];
        } else {
            $ret .= urlencode($str[$i]);
        }
    }
    return $ret;
}

it urlencodes all characters except those with uint8-values above 127, which are left untouched.. is that what you want?
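
For what it's worth, the same byte-wise idea can be sketched more compactly with preg_replace_callback. The helper name urlencode_ascii_only is made up, and like the loop above it assumes that bytes above 127 should pass through untouched.

```php
<?php
// Hypothetical helper: URL-encode runs of ASCII bytes, keep bytes >= 0x80 as-is.
function urlencode_ascii_only(string $str): string
{
    return preg_replace_callback(
        '/[\x00-\x7F]+/',            // no "u" flag, so the pattern matches raw bytes
        static function (array $m): string {
            return urlencode($m[0]);
        },
        $str
    );
}

// "é" is given as its two UTF-8 bytes and is left alone.
echo urlencode_ascii_only("a b/caf\xC3\xA9"), PHP_EOL; // a+b%2Fcafé
```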

edit: another possibility, if you're talking about CP-437, you want to urlencode every character except characters above 127 in CP-437? in which case,

function urlencode_cp437(string $str): string
{
    $ret = "";
    for ($i = 0, $max = strlen($str); $i < $max; ++$i) {
        if (ord($str[$i]) > 127 && mb_check_encoding($str[$i], 'CP-437')) {
            $ret .= $str[$i];
        } else {
            $ret .= urlencode($str[$i]);
        }
    }
    return $ret;
}
  • i guess? (untested, but should work, i guess?)
  • should probably use mb_ord() instead of ord(), but since CP-437 is a single-byte ascii-compatible encoding, and mb_ord doesn't exist until php 7.2, i guess ord() would work regardless, idk

FILTER_SANITIZE_STRING filters letters that shouldn't be filtered

Solution: Don't filter.

HTML encode when you output to an HTML page.

e.g.
& becomes &amp;, < becomes &lt;.

Use htmlentities to do this.

HTML code isn't dangerous in your database - it's only dangerous when user input is output unencoded.

Since you're already using prepared statements, you're already protected against SQL injection (assuming there isn't any query concatenation going on anywhere, of course, for example within any stored procedures you are calling).


