Remove Script Tag from HTML Content

Because this question is tagged with regex, I'm going to answer with a poor man's solution for this situation:

$html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);

However, regular expressions are not meant for parsing HTML/XML. Even if you write the perfect expression, it will break eventually, so it's not worth it. In some cases a regex is useful to quickly fix some markup, but as with all quick fixes, forget about security: use regex only on content/markup you trust.

Remember, anything a user inputs should be considered unsafe.
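
To make the security warning concrete, here is a minimal sketch in JavaScript (not PHP), assuming a runtime that supports the `s` regex flag, such as a current Node.js. The pattern is a literal translation of the PHP one above; the function name and the hostile input string are made up for illustration.

```javascript
// Literal JavaScript translation of the PHP pattern above
// (i = case-insensitive, s = dot also matches newlines).
const SCRIPT_TAG = /<script(.*?)>(.*?)<\/script>/gis;

function stripScriptsOnce(html) {
  return html.replace(SCRIPT_TAG, '');
}

// Hypothetical hostile input: a script tag split around decoy script tags.
const hostile = '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>';

// A single pass removes only the decoys; the leftovers join into a real tag.
console.log(stripScriptsOnce(hostile)); // <script>alert(1)</script>
```

Running the replacement repeatedly until nothing changes would close this particular hole, but that only covers this one trick, which is why the regex should stay limited to trusted markup.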

A better solution here would be to use DOMDocument, which is designed for this.
Here is a snippet that demonstrates how easy, clean (compared to regex), (almost) reliable and (nearly) safe it is to do the same:

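<?php

$html = <<<HTML
...
HTML;

$dom = new DOMDocument();

$dom->loadHTML($html);

$script = $dom->getElementsByTagName('script');

// Collect the nodes first: the node list is live, so removing
// nodes while iterating over it would skip elements.
$remove = [];
foreach ($script as $item)
{
    $remove[] = $item;
}

// Now detach every collected <script> node from its parent.
foreach ($remove as $item)
{
    $item->parentNode->removeChild($item);
}

$html = $dom->saveHTML();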

I have intentionally left out the sample HTML, because even this approach can break on some markup.

How to remove all script tags from html file

If I understood your question correctly and you want to delete everything inside <script></script>, I think you have to split the sed command into parts (you can still write it as a one-liner, separated by ;):

Using:

sed 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g'

The first part (s/<script>.*<\/script>//g) handles script tags that open and close on the same line;

The second part (/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}}) is almost a quote of @akingokay's answer, except that I excluded the lines containing the opening and closing tags themselves (just in case they have something before or after the tag). There is a great explanation of this in "Using sed to delete all lines between two matching patterns";

The last two (s/<script>.*//g and s/.*<\/script>//g) take care of the lines where a script starts but does not end, or ends without having started.

Now if you have an index.html that has:

<html>
<body>
foo
<script> console.log("bar") </script>
<div id="something"></div>
<script>
// Multiple Lines script
// Blah blah
</script>
foo <script> //Some
console.log("script")</script> bar
</body>
</html>

and you run this sed command, you will get:

cat index.html | sed 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g'
<html>
<body>
foo

<div id="something"></div>


foo
bar
</body>
</html>

In the end you will be left with some blank lines, but the command should work as expected. Of course, you could easily remove those with sed as well.
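
For comparison only, here is a rough sketch of the same clean-up in JavaScript rather than sed, assuming a runtime such as Node.js. Because the regex can span newlines, one pattern covers the same-line, multi-line and partial-line cases, and a filter drops the blank lines afterwards. The function name is hypothetical, and the sample page is the one above with the string literal closed.

```javascript
// Sample page from the answer above, joined into one string.
const page = [
  '<html>', '<body>', 'foo',
  '<script> console.log("bar") </script>',
  '<div id="something"></div>',
  '<script>', '// Multiple Lines script', '// Blah blah', '</script>',
  'foo <script> //Some',
  'console.log("script")</script> bar',
  '</body>', '</html>',
].join('\n');

function stripScriptsAndBlankLines(text) {
  return text
    .replace(/<script>[\s\S]*?<\/script>/g, '') // [\s\S] also matches newlines
    .split('\n')
    .filter((line) => line.trim() !== '')       // drop lines left empty
    .join('\n');
}

console.log(stripScriptsAndBlankLines(page));
```

Like the sed command, this only matches a bare <script> opening tag without attributes, and it carries the same caveats as any regex-based approach.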

Hope it helps.

PS: I think that @l0b0 is right, and this is not the correct tool.

How do I remove a script tag by its content and its src?

This seems to work:

jQuery(document).ready(function($){
    $('script').each(function() {
        if (this.src === 'URL_HERE') {
            this.parentNode.removeChild(this);
        }
    });
});
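
The snippet above only checks src. As a hedged sketch of how a match on the inline content could be added, here is a plain DOM variant without jQuery. The function name, the options object and the placeholder URL and snippet are assumptions for illustration; `doc` is expected to be a Document such as window.document.

```javascript
// Remove <script> elements whose resolved src equals `url`, or whose inline
// code contains `snippet`. Returns the number of removed elements.
function removeScripts(doc, { url = null, snippet = null } = {}) {
  // Copy into an array first: getElementsByTagName returns a live collection.
  const scripts = Array.from(doc.getElementsByTagName('script'));
  let removed = 0;
  for (const script of scripts) {
    const srcMatches = url !== null && script.src === url;
    const contentMatches =
      snippet !== null && script.textContent.includes(snippet);
    if (srcMatches || contentMatches) {
      script.parentNode.removeChild(script);
      removed += 1;
    }
  }
  return removed;
}

// In a browser (URL and snippet are placeholders):
// removeScripts(document, { url: 'https://example.com/tracker.js', snippet: 'trackVisit(' });
```

Note that the src property holds the fully resolved URL, so it has to be compared against an absolute URL; getAttribute('src') would return the value exactly as written in the markup. Removing a script that has already run does not undo what it did.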

How to remove script tag from html using javascript

The content of your <script> tag is invalid JavaScript, but here is one possible way of achieving what you are after:

// Function to convert an HTML string to a DocumentFragment
String.prototype.toDOM = function () {
    var d = document,
        i,
        a = d.createElement('div'),
        b = d.createDocumentFragment();
    a.innerHTML = this;
    // Move every child of the temporary <div> into the fragment
    while ((i = a.firstChild)) {
        b.appendChild(i);
    }
    return b;
};

// The <script> we wish to replace
var st = document.getElementById('carousel_ui_buttons_next-nav_next');

// Replace it with the <button> that is inside of it
st.parentNode.replaceChild(st.innerHTML.trim().toDOM(), st);

DOMDocument remove script tags from HTML source

Your error is actually trivial. The DOMNodeList returned by getElementsByTagName() is live: it is automatically updated whenever the document changes, most notably when matching nodes are removed. This is mentioned in a couple of lines in the PHP docs, but is mostly swept under the carpet.

If you loop up to the list's length property and remove elements inside the loop, you'll notice that length actually changes as you go! I had to write my own library to counteract this and a few other quirks.

The solution:

if ($dom->loadHTML($result))
{
    // The list is live, so keep removing its first item until it is empty.
    while (($r = $dom->getElementsByTagName("script")) && $r->length) {
        $r->item(0)->parentNode->removeChild($r->item(0));
    }
    echo $dom->saveHTML();
}
I'm not actually looping - just popping the first element one at a time. The result: http://sebrenauld.co.uk/domremovescript.php
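
The pitfall can be illustrated without PHP. The following is only an analogy in JavaScript: a plain array stands in for the live node list, and splice() stands in for removeChild(). Both function names are made up for this sketch.

```javascript
// Index loop over a "live" list: removing item i shifts the rest left,
// so the loop skips every other item and stops early.
function removeByIndexLoop(items) {
  const live = [...items];
  for (let i = 0; i < live.length; i++) {
    live.splice(i, 1);
  }
  return live; // whatever the loop failed to remove
}

// The approach from the answer: always remove the first item until none is left.
function removeFirstUntilEmpty(items) {
  const live = [...items];
  while (live.length) {
    live.splice(0, 1);
  }
  return live;
}

console.log(removeByIndexLoop(['a', 'b', 'c', 'd']));     // [ 'b', 'd' ]
console.log(removeFirstUntilEmpty(['a', 'b', 'c', 'd'])); // []
```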

How to remove script tags from an HTML page using C#?

It can be done using regex:

// Requires: using System.Text.RegularExpressions;
Regex rRemScript = new Regex(@"<script[^>]*>[\s\S]*?</script>", RegexOptions.IgnoreCase);
output = rRemScript.Replace(input, "");

How to remove a script tag reliably before it is loaded?

The script should be added instead

As discussed in the comments, it is unreliable to remove the script with an observer.

  • The browser can prefetch the file as soon as the src URL is known, independently of the HTML tag being rendered
  • The browser does not have to be honest or perfect. It may prefetch the file without listing it (yet) in the developer tools.
  • Not all versions of all browsers implement mutation observers (see caniuse: MutationObserver). Visitors who deliberately use older browsers would not be covered and may complain, so this is not a reliable way to meet legal requirements.

Instead the script should be added when needed:

(function(p,a,n,t,s){
t=p.createElement(a),s=p.getElementsByTagName(a)[0];
t.async=1;t.src=n;s.parentNode.insertBefore(t,s)
})(document,'script','https://www.googletagmanager.com/gtag/js?id=UA-12345678-9');
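
Spelled out, the loader above could look like the following sketch. The function name and the `allowed` flag (for example the result of a consent dialog) are assumptions and not part of the original snippet. Like the original, it expects the page to contain at least one script element already.

```javascript
// Readable form of the loader above, with a guard in front of it.
// `doc` is a Document such as window.document.
function addScriptIfAllowed(doc, src, allowed) {
  if (!allowed) {
    return null; // no tag is created, so this function triggers no request
  }
  const tag = doc.createElement('script');
  const firstScript = doc.getElementsByTagName('script')[0];
  tag.async = true;
  tag.src = src;
  // Insert the new tag in front of the first existing script element.
  firstScript.parentNode.insertBefore(tag, firstScript);
  return tag;
}

// In a browser (userHasConsented is a placeholder):
// addScriptIfAllowed(document, 'https://www.googletagmanager.com/gtag/js?id=UA-12345678-9', userHasConsented);
```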

Testing your observer locally

The following will illustrate what may be the issue here.

I have created an index.html just like yours, but with a locally loaded script, test.js. I also added console.log(node); to your observer.

index.html

<!DOCTYPE html>
<html lang="en">

<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>MutationObserver test</title>

<link rel="shortcut icon" href="#">

<script src="https://code.jquery.com/jquery-3.6.0.min.js"
integrity="sha256-/xUj+3OJU5yExlq6GSYGSHk7tPXikynS7ogEvDej/m4=" crossorigin="anonymous"></script>

<script>

const observer = new MutationObserver( (mutations) =>{
mutations.forEach(({addedNodes}) => {
[...addedNodes]
.forEach(node => {
console.log(node);
$(node).remove()
});
});
});

observer.observe(document.head, { childList: true });

</script>

<script src="test.js"></script>


</head>

<body></body>

</html>

test.js

console.log("I got loaded before you could remove me.");

This results in the following console log:

index.html:21 " "
index.html:21 <script src=​"test.js">​</script>​
index.html:21 " "

So the observer is working, as it also catches empty nodes with just whitespace characters.
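
Since the observer in the test page removes every added node, whitespace included, a narrower callback may be easier to reason about. This is a sketch under assumptions: the helper names are hypothetical, and the src value is compared exactly as written in the markup. It only changes which nodes get removed, not whether the browser downloads the file.

```javascript
// True only for <script> elements whose src attribute equals blockedSrc.
function isBlockedScript(node, blockedSrc) {
  return node.nodeType === 1 &&       // 1 = element node (text nodes are 3)
    node.tagName === 'SCRIPT' &&      // tagName is upper case in HTML documents
    node.getAttribute('src') === blockedSrc;
}

// Build a MutationObserver callback that removes matching scripts only.
function makeScriptBlocker(blockedSrc) {
  return (mutations) => {
    for (const { addedNodes } of mutations) {
      for (const node of Array.from(addedNodes)) {
        if (isBlockedScript(node, blockedSrc)) {
          node.remove();
        }
      }
    }
  };
}

// In a browser:
// new MutationObserver(makeScriptBlocker('test.js'))
//   .observe(document.head, { childList: true });
```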

However in the Network tab the test.js is still listed as loaded:

Name         Status    Type       Initiator   Size     Time
index.html   Finished  document   Other       972 B    1 ms
test.js      Finished  script     index.html  56 B     3 ms
jquery-3...  200       script     index.html  31.0 kB  10 ms
index.html   Finished  text/html  Other       972 B    1 ms

It seems that although the HTML node was removed, the browser still fetched the code. This likely happens because the HTML document itself is transferred regardless; the browser then parses the HTML and sees all the files it has to request. As soon as the HTML is turned into a DOM, the browser runs the observer, which removes the node just in time, so the JavaScript from nodes added after the observer is not executed.

See this counterexample where I comment out the observing part:

// observer.observe(document.head, { childList: true });

In this case the console output will read:

test.js:1 I got loaded before you could remove me.

The network panel will look no different. Since I'm using Chrome, it seems the browser optimizes for speed: it loads and caches the JavaScript, but does not execute it once its node has been removed.

Possible TagManager / Google Analytics insights

If you have access to alter the TagManager's tags, you could add <script>console.log("TagManager code executed");</script> and confirm that code execution is correctly suppressed for the removed tag. This confirms the same thing as my example with a local JS file, but directly in TagManager.

Another way is Google TagManager's excellent Debug mode.

If Google Analytics is used here as well, you might see your requests (or the lack thereof) in the GA Real-Time view. The documentation is poor and has no screenshots, but if you look for "real-time" in Analytics you will find it.

Remove script tags only from variable containing entire web page

You can do:

$("script", AVMI_tree).remove();

But mind that you're taking the outerHTML of documentElement, which includes HEAD and BODY, and putting it into a DIV, which is invalid.

You could do:

var htmlPage = $("html");
$("script", htmlPage).remove();
AVMI_thisPage = htmlPage.html();

Mind that it doesn't matter that you're actually removing the scripts from the HTML page itself rather than from a copied DOM: once a script has been loaded and processed by the JavaScript engine, removing it from the DOM changes nothing. The script stays loaded and active.
