Strip Tags and everything in between
As you’re dealing with HTML, you should use an HTML parser to process it correctly. You can use PHP’s DOMDocument and query the elements with DOMXPath, e.g.:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//h1') as $node) {
    $node->parentNode->removeChild($node);
}
$html = $doc->saveHTML();
How to remove text between tags in php?
$str = preg_replace('#(<a.*?>).*?(</a>)#', '$1$2', $str);
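For comparison, here is a rough Python sketch of the same idea: keep the opening and closing anchor tags and drop whatever sits between them. The function name empty_anchors is made up for this example. Unlike the one-liner above it uses \b so that tags such as <abbr> are not mistaken for <a>, and DOTALL so the match can span lines. It is still a regex, so it assumes anchors are not nested and that no attribute value contains ">".

```python
import re

# Hypothetical helper, not part of any library.
ANCHOR = re.compile(r'(<a\b[^>]*>).*?(</a>)', re.DOTALL | re.IGNORECASE)

def empty_anchors(html):
    """Keep each <a ...> and </a> tag but remove the content between them."""
    return ANCHOR.sub(r'\1\2', html)

print(empty_anchors('see <a href="/x">this\nlink</a> and <abbr>HTML</abbr>'))
```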
Removing html image tags and everything in between from a string
I would vote that in your case it is acceptable to use a regular expression. Something like this should work:
import re

def remove_html_tags(data):
    # Non-greedy match of anything that looks like a tag
    p = re.compile(r'<.*?>')
    return p.sub('', data)
I found that snippet here (http://love-python.blogspot.com/2008/07/strip-html-tags-using-python.html)
Edit: a version which will only remove things of the form <img .... />:
def remove_img_tags(data):
    # Only matches self-closing tags; <img ...> without the trailing slash is left alone
    p = re.compile(r'<img.*?/>')
    return p.sub('', data)
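The pattern above only catches the self-closing form. If the input may also contain plain <img ...> tags, a slightly looser pattern could look like the sketch below. The name remove_img_tags_any_form is invented for this example, and it assumes no attribute value contains a literal ">".

```python
import re

# \b stops the pattern from matching longer tag names that merely start with "img".
IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)

def remove_img_tags_any_form(data):
    """Remove <img ...> and <img ... /> tags, in any letter case."""
    return IMG_TAG.sub('', data)

print(remove_img_tags_any_form('a<img src="x.png">b<IMG src="y.png" />c'))
```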
Strip tags ( and )
If you do not have nested < and >, then you can try the following to match occurrences:
$matches = array();
preg_match_all('/<([\s\S]*?)>/s', $string, $matches);
Note the ? in the pattern, which makes the match in parentheses ungreedy.
You can find an answer to a similar question here on SO.
If you want to strip away the values, then use preg_replace_callback:
<?php
$string = '<p>This is a paragraph with <strong>bold</strong> text</p>';
echo "$string <br />";
$string = preg_replace_callback(
    '/<([\s\S]*?)>/s',
    function ($matches) {
        // do whatever you need with $matches here, e.g. save it somewhere
        return '';
    },
    $string
);
echo $string;
?>
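The same callback approach can be sketched in Python with re.sub and a function as the replacement. The helper name strip_and_collect is an assumption for illustration. It returns both the cleaned text and the list of removed tag bodies, which is one way of "saving them somewhere".

```python
import re

def strip_and_collect(text):
    """Remove every <...> span and return (cleaned_text, removed_tag_bodies)."""
    removed = []

    def on_match(match):
        # group(1) is the text between "<" and ">"
        removed.append(match.group(1))
        return ''

    cleaned = re.sub(r'<(.*?)>', on_match, text, flags=re.DOTALL)
    return cleaned, removed

cleaned, removed = strip_and_collect(
    '<p>This is a paragraph with <strong>bold</strong> text</p>'
)
print(cleaned)
print(removed)
```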
How would I remove all script tags (and everything in between) from multiple files using UNIX?
For example, with gawk:
$ cat file
blah
<script type="text/javascript">function(foo);</script>
<script type="text/javascript" src="scripts.js"></script>
blah
<script type="text/javascript"
src="script1.js">
</script>
end
$ awk 'BEGIN{RS="</script>"}/<script/{gsub("<script.*","")}{print}END{if(RS=="")print}' file
blah
blah
end
So run it inside a for loop to go over your files (e.g. html):
for file in *.html
do
    awk 'BEGIN{RS="</script>"}/<script/{gsub("<script.*","")}{print}END{if(RS=="")print}' "$file" >temp
    mv temp "$file"
done
You can also do it with Perl:
perl -i.bak -0777ne 's|<script.*?</script>||gms;print' *.html
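If neither awk nor Perl is handy, roughly the same thing can be sketched in Python. The function names and the UTF-8 encoding are assumptions for this example. Like the Perl one-liner, it keeps a .bak copy of every file it changes. It is still regex-based, so a "</script>" inside a JavaScript string literal would end the match early.

```python
import re
from pathlib import Path

SCRIPT = re.compile(r'<script\b.*?</script\s*>', re.DOTALL | re.IGNORECASE)

def strip_scripts(text):
    """Remove every <script ...>...</script> block, including multi-line ones."""
    return SCRIPT.sub('', text)

def strip_scripts_in_place(paths):
    """Rewrite each file without its script blocks, keeping a .bak backup."""
    for path in map(Path, paths):
        original = path.read_text(encoding='utf-8')  # encoding is an assumption
        cleaned = strip_scripts(original)
        if cleaned != original:
            path.with_name(path.name + '.bak').write_text(original, encoding='utf-8')
            path.write_text(cleaned, encoding='utf-8')

# Example call over the current directory:
# strip_scripts_in_place(Path('.').glob('*.html'))
```

Note that the newline after each removed block stays in place, so the output keeps blank lines where the script tags used to be.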
Remove everything within script and style tags
Do not use RegEx on HTML. PHP provides a tool for parsing DOM structures, appropriately called DOMDocument.
<?php
// some HTML for example
$myHtml = '<html><head><script>alert("hi mom!");</script></head><body><style>body { color: red;} </style><h1>This is some content</h1><p>content is awesome</p></body><script src="someFile.js"></script></html>';
// create a new DomDocument object
$doc = new DOMDocument();
// load the HTML into the DomDocument object (this would be your source HTML)
$doc->loadHTML($myHtml);
removeElementsByTagName('script', $doc);
removeElementsByTagName('style', $doc);
removeElementsByTagName('link', $doc);
// output cleaned html
echo $doc->saveHtml();
function removeElementsByTagName($tagName, $document) {
    $nodeList = $document->getElementsByTagName($tagName);
    // iterate backwards, because the node list is live and shrinks as nodes are removed
    for ($nodeIdx = $nodeList->length; --$nodeIdx >= 0; ) {
        $node = $nodeList->item($nodeIdx);
        $node->parentNode->removeChild($node);
    }
}
You can try it here: https://eval.in/private/4f225fa0dcb4eb
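For readers working in Python rather than PHP, the parser-based idea can be sketched with the standard library's html.parser. The class and function names here are invented for illustration, and this is a simplification, not an equivalent of DOMDocument: it re-emits tags as it sees them, does not repair broken markup, and silently drops comments and doctype declarations.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not open a "dropped" region.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class ElementStripper(HTMLParser):
    """Re-emit HTML while dropping chosen elements together with their contents."""

    def __init__(self, drop):
        super().__init__(convert_charrefs=False)
        self.drop = set(drop)
        self.depth = 0   # greater than 0 while inside a dropped element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.drop:
            if tag not in VOID:
                self.depth += 1
        elif self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag not in self.drop and self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.drop:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append(f"&#{name};")

def strip_elements(html, drop=("script", "style")):
    parser = ElementStripper(drop)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

print(strip_elements('<body><style>p {}</style><p>content is awesome</p></body>'))
```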
Documentation
DomDocument - http://php.net/manual/en/class.domdocument.php
DomNodeList - http://php.net/manual/en/class.domnodelist.php
DomDocument::getElementsByTagName - http://us3.php.net/manual/en/domdocument.getelementsbytagname.php
strip_tags disallow some tags
EDIT
To use the HTML Purifier HTML.ForbiddenElements config directive, it seems you would do something like:
require_once '/path/to/HTMLPurifier.auto.php';
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.ForbiddenElements', array('script','style','applet'));
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);
http://htmlpurifier.org/docs
HTML.ForbiddenElements should be set to an array. What I don't know is what form the array members should take:
array('script','style','applet')
Or:
array('<script>','<style>','<applet>')
Or... Something else?
I think it's the first form, without delimiters; HTML.AllowedElements uses a form of configuration string somewhat common to TinyMCE's valid elements syntax:
tinyMCE.init({
    ...
    valid_elements : "a[href|target=_blank],strong/b,div[align],br",
    ...
});
So my guess is it's just the term, and no attributes should be provided (since you're banning the element... although there is a HTML.ForbiddenAttributes, too). But that's a guess.
I'll add this note from the HTML.ForbiddenAttributes docs, as well:
Warning: This directive complements %HTML.ForbiddenElements, accordingly, check out that directive for a discussion of why you should think twice before using this directive.
Blacklisting is just not as "robust" as whitelisting, but you may have your reasons. Just beware and be careful.
Without testing, I'm not sure what to tell you. I'll keep looking for an answer, but I will likely go to bed first. It is very late. :)
Although I think you really should use HTML Purifier and utilize its HTML.ForbiddenElements configuration directive, I think a reasonable alternative if you really, really want to use strip_tags() is to derive a whitelist from the blacklist. In other words, remove what you don't want and then use what's left.
For instance:
function blacklistElements($blacklisted = '', &$errors = array()) {
    if ((string)$blacklisted == '') {
        $errors[] = 'Empty string.';
        return array();
    }
    $html5 = array(
        "<menu>","<command>","<summary>","<details>","<meter>","<progress>",
        "<output>","<keygen>","<textarea>","<option>","<optgroup>","<datalist>",
        "<select>","<button>","<input>","<label>","<legend>","<fieldset>","<form>",
        "<th>","<td>","<tr>","<tfoot>","<thead>","<tbody>","<col>","<colgroup>",
        "<caption>","<table>","<math>","<svg>","<area>","<map>","<canvas>","<track>",
        "<source>","<audio>","<video>","<param>","<object>","<embed>","<iframe>",
        "<img>","<del>","<ins>","<wbr>","<br>","<span>","<bdo>","<bdi>","<rp>","<rt>",
        "<ruby>","<mark>","<u>","<b>","<i>","<sup>","<sub>","<kbd>","<samp>","<var>",
        "<code>","<time>","<data>","<abbr>","<dfn>","<q>","<cite>","<s>","<small>",
        "<strong>","<em>","<a>","<div>","<figcaption>","<figure>","<dd>","<dt>",
        "<dl>","<li>","<ul>","<ol>","<blockquote>","<pre>","<hr>","<p>","<address>",
        "<footer>","<header>","<hgroup>","<aside>","<article>","<nav>","<section>",
        "<body>","<noscript>","<script>","<style>","<meta>","<link>","<base>",
        "<title>","<head>","<html>"
    );
    $list = trim(strtolower($blacklisted));
    $list = preg_replace('/[^a-z ]/i', '', $list);
    $list = '<' . str_replace(' ', '> <', $list) . '>';
    $list = array_map('trim', explode(' ', $list));
    return array_diff($html5, $list);
}
Then run it:
$blacklisted = '<html> <bogus> <EM> em li ol';
$errors = array();
// pass $errors by reference so the function can report problems
$whitelist = blacklistElements($blacklisted, $errors);
if (count($errors)) {
    echo "There were errors.\n";
    print_r($errors);
    echo "\n";
} else {
    // Do strip_tags() ...
}
http://codepad.org/LV8ckRjd
So if you pass in what you don't want to allow, it will give you back the HTML5 element list in an array form that you can then feed into strip_tags() after joining it into a string:
$stripped = strip_tags($html, implode('', $whitelist));
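To make the "derive a whitelist from the blacklist" step easier to see in isolation, here is a small Python sketch of just that logic. The names KNOWN_TAGS and allowed_tags are made up, and the tag list is deliberately tiny. One difference from the PHP above, flagged as my own choice: digits in tag names are kept when tokenizing, so a name like h1 would survive.

```python
import re

# Short illustrative list only; a real one would cover every element you care about.
KNOWN_TAGS = ["a", "b", "em", "i", "li", "ol", "p", "strong", "ul"]

def allowed_tags(blocked, known=KNOWN_TAGS):
    """Return the known tags minus the blocked ones, joined as '<a><b>...'."""
    # Accepts sloppy input such as '<EM> em li': case and angle brackets are ignored.
    blocked_names = set(re.findall(r'[a-z][a-z0-9]*', blocked.lower()))
    return ''.join(f'<{tag}>' for tag in known if tag not in blocked_names)

print(allowed_tags('<html> <bogus> <EM> em li ol'))
```

The resulting string has the same shape as the second argument that strip_tags() expects. Unknown names such as bogus are simply ignored.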
Caveat Emptor
Now, I've kind of hacked this together and I know there are some issues I haven't thought out yet. For instance, from the strip_tags() man page for the $allowable_tags argument:
Note: This parameter should not contain whitespace. strip_tags() sees a tag as a case-insensitive string between < and the first whitespace or >. It means that strip_tags("<br/>", "<br>") returns an empty string.
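One way to read that note is to model the quoted rule directly. The Python sketch below is only my interpretation of the sentence above, not PHP's actual implementation, and the function name is hypothetical. It extracts the "tag" as the text between < and the first whitespace or >, lowercased.

```python
def tag_name_as_quoted(tag_text):
    """Model of the quoted rule: text between '<' and the first whitespace or '>'."""
    body = tag_text[1:]  # assumes tag_text starts with '<'
    end = len(body)
    for index, char in enumerate(body):
        if char.isspace() or char == '>':
            end = index
            break
    return body[:end].lower()

for text in ('<br>', '<BR class="x">', '<br/>'):
    print(text, '->', tag_name_as_quoted(text))
```

Under this reading, <br> and <BR class="x"> both yield br, while <br/> yields br/, which does not match an allowed <br> and so gets stripped. That would explain the empty string in the quoted example.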
It's late and for some reason I can't quite figure out what this means for this approach. So I'll have to think about that tomorrow. I also compiled the HTML element list in the function's $html5 array from this MDN documentation page. Sharp-eyed readers might notice all of the tags are in this form: <tagName>
I'm not sure how this will affect the outcome, whether I need to take into account variations in the use of a shorttag <tagName/> and some of the, ahem, odder variations. And, of course, there are more tags out there.
So it's probably not production ready. But you get the idea.