Prevent XSS with strip_tags()

Why not use strip_tags() instead of htmlspecialchars() to prevent XSS attacks?

Because strip_tags doesn't fix every possible abuse case. True, it fixes the worst offenders, but there are other cases, e.g. when inserting values back into <input> tags yourself, where the quotes can be broken out of.

Consider:
<input type="text" value="my string" />

If my string comes from some other data source that isn't XSS-protected, it could conceivably contain something like:
"><script ....

which can use the original closing > of the input tag - and strip_tags may or may not catch that case. I seem to remember it looks for < followed by > which wouldn't be found in the above string.
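
To make the attribute case concrete, here is a minimal sketch. The payload and variable names are made up for illustration. The point is that a value containing no < or > at all passes through strip_tags() untouched, while htmlspecialchars() with ENT_QUOTES encodes the quotes it would need to break out of the attribute.

```php
<?php
// Hypothetical attacker-controlled value: no angle brackets, so there is no tag to strip.
$value = '" onmouseover="alert(1)';

// strip_tags() finds nothing to remove and returns the string unchanged.
$stripped = strip_tags($value);

// htmlspecialchars() with ENT_QUOTES encodes both quote styles,
// so the value stays inside the value="..." attribute.
$escaped = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');

echo '<input type="text" value="' . $escaped . '" />', "\n";
```

With the escaped value, the browser sees a single attribute whose text happens to contain quote characters, instead of a second, injected attribute.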

Is strip_tags() vulnerable to scripting attacks?

As its name may suggest, strip_tags should remove all HTML tags. The only way we can prove it is by analyzing the source code. The following analysis applies to a strip_tags('...') call, without a second argument for whitelisted tags.

First of all, some theory about HTML tags: a tag starts with a < followed by non-whitespace characters. If this string starts with a ?, it should not be parsed. If this string starts with !--, it is considered a comment and the following text should not be parsed either. A comment is terminated with -->; inside such a comment, characters like < and > are allowed. Attributes can occur in tags, and their values may optionally be surrounded by a quote character (' or "). If such a quote exists, it must be closed; otherwise, a > that is encountered does not close the tag.

The code <a href="example>xxx</a><a href="second">text</a> is interpreted in Firefox as:

<a href="http://example.com%3Exxx%3C/a%3E%3Ca%20href=" second"="">text</a>

The PHP function strip_tags is referenced in line 4036 of ext/standard/string.c. That function calls the internal function php_strip_tags_ex.

Two buffers exist, one for the output, the other for "inside HTML tags". A counter named depth holds the number of open angle brackets (<).

The variable in_q contains the quote character (' or ") if any, and 0 otherwise. The last character is stored in the variable lc.

The function has five states, three of which are mentioned in the description above the function. Based on this information and the function body, the following states can be derived:

  • State 0 is the output state (not in any tag)
  • State 1 means we are inside a normal HTML tag (the tag buffer contains <)
  • State 2 means we are inside a PHP tag
  • State 3: we came from the output state and encountered the < and ! characters (the tag buffer contains <!)
  • State 4: inside HTML comment

We just need to make sure that no tag can be inserted, that is, no < followed by a non-whitespace character. Line 4326 checks the case of a < character, which is handled as follows:

  • If inside quotes (e.g. <a href="inside quotes">), the < character is ignored (removed from the output).
  • If the next character is a whitespace character, < is added to the output buffer.
  • If outside an HTML tag, the state becomes 1 ("inside HTML tag") and the last character lc is set to <.
  • Otherwise, if inside an HTML tag, the counter named depth is incremented and the character is ignored.

If > is met while the tag is open (state == 1), in_q becomes 0 ("not in a quote") and state becomes 0 ("not in a tag"). The tag buffer is discarded.
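
The rules above can be observed from the outside by calling strip_tags() itself. This is only a small behavioural sketch with made-up inputs, not a walk through the C code:

```php
<?php
// '<' followed by whitespace is treated as text; a bare '>' outside a tag is kept too.
$a = strip_tags('1 < 2 and 3 > 2');

// A '>' inside a quoted attribute value does not close the tag.
$b = strip_tags('<a href="a>b">link</a>');

// '<' followed by a non-whitespace character opens a tag, which is removed.
$c = strip_tags('<b>bold</b> text');

var_dump($a, $b, $c);
```

The first string comes back unchanged, the second is reduced to "link", and the third to "bold text".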

Attribute checks (for characters like ' and ") are done on the tag buffer which is discarded. So the conclusion is:

strip_tags without a tag whitelist is safe for inclusion outside tags, no tag will be allowed.

By "outside tags", I mean not in tags as in <a href="in tag">outside tag</a>. Text may contain < and > though, as in >< a>>. The result is not valid HTML though: <, > and & still need to be escaped, especially the &. That can be done with htmlspecialchars().
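
A minimal sketch of that two-step approach, assuming the text is destined for HTML element content (the sample string and variable names are made up for illustration):

```php
<?php
// Sample text with stray angle brackets, an ampersand, and one real tag.
$text = 'Tom & Jerry >< a>> <b>bold</b>';

// Step 1: remove anything strip_tags() recognises as a tag.
$noTags = strip_tags($text);

// Step 2: escape the special characters that are left over.
$safe = htmlspecialchars($noTags, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
```

After step 1 the stray >< a>> and the & are still there; only step 2 turns them into entities.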

The description for strip_tags without a whitelist argument would be:

Makes sure that no HTML tag exists in the returned string.

Should I use both strip_tags() and htmlspecialchars() to prevent XSS?

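htmlspecialchars() is enough to prevent XSS when the value is output as HTML text or inside a quoted attribute value (pass ENT_QUOTES so that both " and ' are encoded).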

strip_tags() removes tags but not special characters like " or ', so if you use strip_tags() you also have to use htmlspecialchars().

If you want users' comments to be displayed like they typed them, don't use strip_tags, use htmlspecialchars() only.
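
As a rough illustration of the difference, assume a user writes a comment that talks about HTML (the comment text is invented for this sketch):

```php
<?php
// Hypothetical comment in which the user mentions tags literally.
$comment = 'Use <b> for bold & <i> for italics';

// strip_tags() silently deletes the parts that look like tags.
$stripped = strip_tags($comment);

// htmlspecialchars() keeps every character and only encodes the special ones,
// so the browser renders the comment exactly as it was typed.
$escaped = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

echo $stripped, "\n", $escaped, "\n";
```

The stripped version loses the <b> and <i> the user actually typed, while the escaped version displays them as text.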

Read HTML Tags from DB without XSS Vulnerabilities

First of all, you can use a harmless BBCode in your commenting system for that matter, but I think you didn't understand strip_tags() well. strip_tags() takes two arguments: the first is your string, and the second is the allowed tags (tags that can pass through strip_tags()). For example:

<?php
$text = '<p>Test texts.</p><!-- Comment --> <a href="#fragment">and other text</a>';
echo strip_tags($text);

# Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>

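The first call strips every tag and prints "Test texts. and other text"; the second call keeps <p> and <a> and outputs this: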

<p>Test texts.</p> <a href="#fragment">and other text</a>

See the strip_tags() documentation for details.
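
One caveat worth checking before relying on the allowed-tags argument: the PHP manual warns that strip_tags() does not modify attributes on the tags you allow. The sketch below uses a made-up stored comment to show what that means for XSS:

```php
<?php
// Hypothetical stored comment: the <a> tag carries an event-handler attribute.
$html = '<a href="#" onmouseover="alert(1)">hover me</a><script>alert(2)</script>';

// <a> is on the allow-list, so the whole tag is kept, attributes included.
// <script> is not, so its tags are removed (its text content stays).
$result = strip_tags($html, '<a>');

echo $result, "\n";
```

The onmouseover handler survives, so an allow-list on its own is not enough when the HTML comes from untrusted users.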

Is Markdown (with strip_tags) sufficient to stop XSS attacks?

I think stripping every HTML tag from the input will get you something pretty secure -- except if someone finds a way to inject some really messed-up data into Markdown, having it generate some even more messed-up output ^^

Still, here are two things that come to my mind :

First one : strip_tags is not a miracle function : it has some flaws...

For instance, it'll strip everything after the '<', in a situation like this one :

$str = "10 apples is <than 12 apples";
var_dump(strip_tags($str));

The output I get is :

string '10 apples is ' (length=13)

Which is not that nice for your users :-(


Second one : One day or another, you might want to allow some HTML tags/attributes ; or, even today, you might want to be sure that Markdown doesn't generate some HTML Tags/attributes.

You might be interested in something like HTMLPurifier : it allows you to specify which tags and attributes should be kept, and filters a string, so that only those remain.

It also generates valid HTML code -- which is always nice ;-)


