Strip Tags and Everything in Between

As you’re dealing with HTML, you should use an HTML parser to process it correctly. You can use PHP’s DOMDocument and query the elements with DOMXPath, e.g.:

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//h1') as $node) {
    $node->parentNode->removeChild($node);
}
$html = $doc->saveHTML();
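
The same parser-based idea can be sketched in Python with only the standard library. This is a rough illustration, not a port of the PHP above: the names ElementRemover and remove_elements are made up here, and the sketch assumes reasonably well-formed input and re-emits the markup it keeps rather than building a full DOM.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (the HTML void elements).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ElementRemover(HTMLParser):
    """Re-emit the input, skipping the named elements and their contents."""

    def __init__(self, tags):
        super().__init__(convert_charrefs=False)
        self.tags = {t.lower() for t in tags}
        self.skipping = None   # name of the element currently being dropped
        self.depth = 0         # nesting depth of that same element name
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.skipping:
            if tag == self.skipping:
                self.depth += 1
        elif tag in self.tags:
            if tag not in VOID:
                self.skipping, self.depth = tag, 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br/> have no content to skip.
        if not self.skipping and tag not in self.tags:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skipping:
            if tag == self.skipping:
                self.depth -= 1
                if self.depth == 0:
                    self.skipping = None
        elif tag not in self.tags:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skipping:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skipping:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.skipping:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def remove_elements(html, tags):
    parser = ElementRemover(tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(remove_elements("<div><h1>Title <b>x</b></h1><p>Keep &amp; this</p></div>", ["h1"]))
# prints: <div><p>Keep &amp; this</p></div>
```

Passing ["script", "style"] instead of ["h1"] drops those elements together with their contents in the same way.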

How to remove text between tags in php?

$str = preg_replace('#(<a\b[^>]*>).*?(</a>)#is', '$1$2', $str);
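
For comparison, here is roughly the same substitution in Python. It is only a sketch: it assumes the anchors are not nested and that a regex is acceptable for the input at hand, and the function name empty_anchor_tags is invented for the example.

```python
import re

# \b keeps <abbr> or <article> from matching; DOTALL lets the content span lines.
ANCHOR = re.compile(r'(<a\b[^>]*>).*?(</a>)', re.IGNORECASE | re.DOTALL)


def empty_anchor_tags(html):
    """Keep each <a ...> and </a>, drop whatever sits between them."""
    return ANCHOR.sub(r'\1\2', html)


print(empty_anchor_tags('<p><a href="/x">click\nme</a> and <abbr>a</abbr></p>'))
# prints: <p><a href="/x"></a> and <abbr>a</abbr></p>
```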

Removing html image tags and everything in between from a string

I would vote that in your case it is acceptable to use a regular expression. Something like this should work:

import re

def remove_html_tags(data):
    # Lazy match: strips each <...> tag but keeps the text between tags.
    p = re.compile(r'<.*?>')
    return p.sub('', data)

I found that snippet here (http://love-python.blogspot.com/2008/07/strip-html-tags-using-python.html)

Edit: a version that removes only tags of the form <img .... />:

def remove_img_tags(data):
    p = re.compile(r'<img.*?/>')
    return p.sub('', data)
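
One thing to watch with the <img.*?/> pattern: if an image tag is written without the trailing slash, the lazy match keeps going until the next "/>" it can find and may swallow text on the way. A slightly stricter variant, as a sketch (the name remove_all_img_tags is made up, and it assumes no ">" appears inside attribute values):

```python
import re

# [^>]* stops at the end of the tag itself, with or without a trailing slash.
IMG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)


def remove_all_img_tags(data):
    return IMG.sub('', data)


print(remove_all_img_tags('a<img src="x.png">b<br/>c<IMG src="y"/>d'))
# prints: ab<br/>cd
```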

Strip tags ( and )

If you do not have nested < and >, you can try the following to match all occurrences:

$matches = array();
preg_match_all('/<([\s\S]*?)>/s', $string, $matches);

Note the ? in the pattern, which makes the quantifier inside the parentheses ungreedy, so each match stops at the first >.

If you want to strip away the values, then use preg_replace_callback:

<?php
$string = '<p>This is a paragraph with <strong>bold</strong> text</p>';
echo "$string <br />";
$string = preg_replace_callback(
        '/<([\s\S]*?)>/s',
        function ($matches) {
            // do whatever you need with $matches here, e.g. save it somewhere
            return '';
        },
        $string
    );
echo $string;
?>
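
In Python the callback approach looks much the same, since re.sub accepts a function as the replacement. A small sketch, with strip_and_collect as a made-up name, that removes the tags and keeps a record of what was removed:

```python
import re

# Same pattern as in the PHP example: anything between < and the next >.
TAG = re.compile(r'<([\s\S]*?)>')


def strip_and_collect(text):
    removed = []

    def on_match(match):
        # Do whatever you need with the tag contents, e.g. save them.
        removed.append(match.group(1))
        return ''

    return TAG.sub(on_match, text), removed


clean, tags = strip_and_collect('<p>This is a paragraph with <strong>bold</strong> text</p>')
print(clean)  # This is a paragraph with bold text
print(tags)   # ['p', 'strong', '/strong', '/p']
```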

How would I remove all script tags (and everything in between) from multiple files using UNIX?

For example, with gawk:

$ cat file
blah
<script type="text/javascript">function(foo);</script>
<script type="text/javascript" src="scripts.js"></script>
blah
<script type="text/javascript"
    src="script1.js">
</script>
end

$ awk 'BEGIN{RS="</script>"}/<script/{gsub("<script.*","")}{print}END{if(RS=="")print}' file
blah

blah

end

Then run it inside a for loop to go over your files (e.g. the HTML files):

for file in *.html
do
  awk 'BEGIN{RS="</script>"}/<script/{gsub("<script.*","")}{print}END{if(RS=="")print}' "$file" >temp && mv temp "$file"
done

You can also do it with Perl:

perl -i.bak -0777ne 's|<script.*?</script>||gms;print' *.html
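
If you would rather script it in Python, something along these lines should do the same job. It is a sketch under a few assumptions: the files are UTF-8, script blocks are not nested, and a .bak copy next to each changed file is wanted (mirroring perl -i.bak). The function names are made up.

```python
import re
import shutil
from pathlib import Path

# DOTALL lets a script block span several lines; IGNORECASE covers <SCRIPT>.
SCRIPT = re.compile(r'<script\b.*?</script\s*>', re.IGNORECASE | re.DOTALL)


def strip_scripts(text):
    """Remove every <script ...>...</script> block from the text."""
    return SCRIPT.sub('', text)


def strip_scripts_in_dir(directory, pattern='*.html'):
    """Rewrite matching files in place and return the names that changed."""
    changed = []
    for path in sorted(Path(directory).glob(pattern)):
        original = path.read_text(encoding='utf-8')
        cleaned = strip_scripts(original)
        if cleaned != original:
            # Keep a backup copy before overwriting the file.
            shutil.copy2(path, path.with_name(path.name + '.bak'))
            path.write_text(cleaned, encoding='utf-8')
            changed.append(path.name)
    return changed
```

Calling strip_scripts_in_dir('.') would then process every .html file in the current directory.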

Remove everything within script and style tags

Do not use regular expressions on HTML. PHP provides a tool for parsing DOM structures, appropriately called DOMDocument.

<?php
// some HTML for example
$myHtml = '<html><head><script>alert("hi mom!");</script></head><body><style>body { color: red;} </style><h1>This is some content</h1><p>content is awesome</p></body><script src="someFile.js"></script></html>';

// create a new DomDocument object
$doc = new DOMDocument();

// load the HTML into the DomDocument object (this would be your source HTML)
$doc->loadHTML($myHtml);

removeElementsByTagName('script', $doc);
removeElementsByTagName('style', $doc);
removeElementsByTagName('link', $doc);

// output cleaned html
echo $doc->saveHTML();

function removeElementsByTagName($tagName, $document) {
  $nodeList = $document->getElementsByTagName($tagName);
  for ($nodeIdx = $nodeList->length; --$nodeIdx >= 0; ) {
    $node = $nodeList->item($nodeIdx);
    $node->parentNode->removeChild($node);
  }
}

You can try it here: https://eval.in/private/4f225fa0dcb4eb

Documentation

DomDocument - http://php.net/manual/en/class.domdocument.php
DomNodeList - http://php.net/manual/en/class.domnodelist.php
DomDocument::getElementsByTagName - http://us3.php.net/manual/en/domdocument.getelementsbytagname.php

strip_tags disallow some tags

EDIT

To use the HTML Purifier HTML.ForbiddenElements config directive, it seems you would do something like:

require_once '/path/to/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.ForbiddenElements', array('script','style','applet'));
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);

http://htmlpurifier.org/docs

HTML.ForbiddenElements should be set to an array. What I don't know is what form the array members should take:

array('script','style','applet')

Or:

array('<script>','<style>','<applet>')

Or... Something else?

I think it's the first form, without delimiters; HTML.AllowedElements uses a configuration string format somewhat similar to TinyMCE's valid_elements syntax:

tinyMCE.init({
    ...
    valid_elements : "a[href|target=_blank],strong/b,div[align],br",
    ...
});

So my guess is it's just the element name, and no attributes should be provided (since you're banning the element... although there is an HTML.ForbiddenAttributes, too). But that's a guess.

I'll add this note from the HTML.ForbiddenAttributes docs, as well:

Warning: This directive complements %HTML.ForbiddenElements,
accordingly, check out that directive for a discussion of why you
should think twice before using this directive.

Blacklisting is just not as "robust" as whitelisting, but you may have your reasons. Just beware and be careful.

Without testing, I'm not sure what to tell you. I'll keep looking for an answer, but I will likely go to bed first. It is very late. :)

Although I think you really should use HTML Purifier and its HTML.ForbiddenElements configuration directive, a reasonable alternative, if you really, really want to use strip_tags(), is to derive a whitelist from the blacklist. In other words, remove what you don't want and then use what's left.

For instance:

function blacklistElements($blacklisted = '', &$errors = array()) {
    if ((string)$blacklisted == '') {
        $errors[] = 'Empty string.';
        return array();
    }

    $html5 = array(
        "<menu>","<command>","<summary>","<details>","<meter>","<progress>",
        "<output>","<keygen>","<textarea>","<option>","<optgroup>","<datalist>",
        "<select>","<button>","<input>","<label>","<legend>","<fieldset>","<form>",
        "<th>","<td>","<tr>","<tfoot>","<thead>","<tbody>","<col>","<colgroup>",
        "<caption>","<table>","<math>","<svg>","<area>","<map>","<canvas>","<track>",
        "<source>","<audio>","<video>","<param>","<object>","<embed>","<iframe>",
        "<img>","<del>","<ins>","<wbr>","<br>","<span>","<bdo>","<bdi>","<rp>","<rt>",
        "<ruby>","<mark>","<u>","<b>","<i>","<sup>","<sub>","<kbd>","<samp>","<var>",
        "<code>","<time>","<data>","<abbr>","<dfn>","<q>","<cite>","<s>","<small>",
        "<strong>","<em>","<a>","<div>","<figcaption>","<figure>","<dd>","<dt>",
        "<dl>","<li>","<ul>","<ol>","<blockquote>","<pre>","<hr>","<p>","<address>",
        "<footer>","<header>","<hgroup>","<aside>","<article>","<nav>","<section>",
        "<body>","<noscript>","<script>","<style>","<meta>","<link>","<base>",
        "<title>","<head>","<html>"
    );

    $list = trim(strtolower($blacklisted));
    $list = preg_replace('/[^a-z ]/i', '', $list);
    $list = '<' . str_replace(' ', '> <', $list) . '>';
    $list = array_map('trim', explode(' ', $list));

    return array_diff($html5, $list);
}

Then run it:

$blacklisted = '<html> <bogus> <EM> em li ol';
$errors = array();
$whitelist = blacklistElements($blacklisted, $errors);

if (count($errors)) {
    echo "There were errors.\n";
    print_r($errors);
    echo "\n";
} else {
    // Do strip_tags() ...
}

http://codepad.org/LV8ckRjd

So if you pass in what you don't want to allow, it will give you back the HTML5 element list in an array form that you can then feed into strip_tags() after joining it into a string:

$stripped = strip_tags($html, implode('', $whitelist));
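
Stripped of the PHP specifics, the blacklist-to-whitelist step is just a set difference. Here is that logic as a short Python sketch; the element list is deliberately abbreviated and the name allowed_tags is made up. The resulting string is what would be handed to strip_tags() as its second argument.

```python
import re

# Abbreviated on purpose; a real list would cover every element you expect.
KNOWN_ELEMENTS = ["html", "head", "body", "p", "div", "span", "a", "em",
                  "strong", "ul", "ol", "li", "script", "style"]


def allowed_tags(blacklisted, known=KNOWN_ELEMENTS):
    """Return '<tag1><tag2>...' for every known element that is not banned."""
    # Accept 'em', '<em>' or '<EM>' alike; unknown names are simply ignored.
    banned = set(re.findall(r'[a-z][a-z0-9]*', blacklisted.lower()))
    return ''.join(f'<{name}>' for name in known if name not in banned)


print(allowed_tags('<html> <bogus> <EM> em li ol'))
# prints: <head><body><p><div><span><a><strong><ul><script><style>
```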

Caveat Emptor

Now, I've kind of hacked this together, and I know there are some issues I haven't thought through yet. For instance, from the strip_tags() manual page for the $allowable_tags argument:

Note:

This parameter should not contain whitespace. strip_tags() sees a tag
as a case-insensitive string between < and the first whitespace or >.
It means that strip_tags("<br/>", "<br>") returns an empty string.
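
To make the quoted rule concrete, here is a toy model of it in Python. It only mimics the rule as worded in the note (tag name = everything between < and the first whitespace or >, compared case-insensitively); it is not PHP's implementation, and actual behavior may differ between PHP versions. The function names are invented.

```python
import re


def tag_name(tag):
    """Name as the note describes it: text after '<' up to whitespace or '>'."""
    match = re.match(r'</?([^\s>]*)', tag)
    return match.group(1).lower() if match else None


def is_allowed(tag, allowable_tags):
    """True if the tag's name appears among the names in allowable_tags."""
    allowed = {tag_name(t) for t in re.findall(r'<[^>]*>', allowable_tags)}
    return tag_name(tag) in allowed


print(tag_name('<br/>'))                      # br/
print(is_allowed('<br/>', '<br>'))            # False
print(is_allowed('<BR class="x">', '<br>'))   # True
print(is_allowed('<br />', '<br>'))           # True
```

Under this reading, <br/> written without a space has the name "br/", which does not match a whitelisted <br>, so it would be stripped even though <br> is allowed.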

It's late and for some reason I can't quite figure out what this means for this approach, so I'll have to think about that tomorrow. I also compiled the HTML element list in the function's $html5 array from the MDN documentation. Sharp-eyed readers might notice that all of the tags are in this form:

<tagName>

I'm not sure how this will affect the outcome, or whether I need to take into account variations such as the short tag <tagName/> and some of the, ahem, odder variations. And, of course, there are more tags out there.

So it's probably not production ready. But you get the idea.
