How to Remove All Script Tags from HTML File

How to remove all script tags from html file

If I understood your question correctly and you want to delete everything inside <script></script>, I think you have to split the sed expression into parts (you can still keep it a one-liner by joining the parts with ;):

Using:

sed 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g'

The first part (s/<script>.*<\/script>//g) handles scripts that open and close on the same line;

The second part (/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}}) is almost a quote of @akingokay's answer, except that I excluded the opening and closing lines themselves (just in case they have something before or after the tags). There is a great explanation of this in "Using sed to delete all lines between two matching patterns";

The last two (s/<script>.*//g and s/.*<\/script>//g) finally take care of the lines where a script opens but does not close, or closes without having opened.

Note that every pattern matches only a bare <script> tag, so tags with attributes (for example <script type="text/javascript">) are left untouched.

Now if you have an index.html that has:

<html>
  <body>
        foo
        <script> console.log("bar") </script>
  <div id="something"></div>
        <script>
                // Multiple Lines script
                // Blah blah
        </script>
        foo <script> //Some
        console.log("script")</script> bar
  </body>
</html>

and you run this sed command, you will get:

sed 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g' index.html
<html>
  <body>
        foo
        
  <div id="something"></div>
        

        foo 
 bar
  </body>
</html>

You will end up with blank lines and stray whitespace, but the scripts are removed as expected. Of course, you could easily strip those with sed as well.

Hope it helps.

PS: I think @l0b0 is right that sed is not the correct tool for this.
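
To illustrate what a parser-based tool looks like, in contrast to the line-oriented sed approach above, here is a minimal sketch in Python using only the standard library's html.parser. The names ScriptStripper and strip_scripts are made up for this example, and the sketch assumes reasonably well-formed input. html.parser is not a full HTML5 parser, and this version silently drops processing instructions, so treat it as a starting point rather than a sanitizer.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit the markup it is fed, skipping <script> elements and their content."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; exactly as written.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            # get_starttag_text() returns the raw tag, attributes included.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # XML-style self-closing tags such as <br/>.
        if tag != "script" and not self.in_script:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_scripts(html):
    """Return html with every <script> element removed (hypothetical helper)."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(strip_scripts("<p>hi</p><script>alert(1)</script>"))  # -> <p>hi</p>
```

Unlike the sed one-liner, this does not care whether a script spans one line or many, or whether the opening tag carries attributes.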

Removing all script tags from html with JS Regular Expression

Attempting to remove HTML markup using a regular expression is problematic, because script content and attribute values can contain anything, including text that looks like tags. One way around this is to insert the markup as the innerHTML of a div, remove any script elements, and return the resulting innerHTML, e.g.

  function stripScripts(s) {
    var div = document.createElement('div');
    div.innerHTML = s;
    var scripts = div.getElementsByTagName('script');
    var i = scripts.length;
    while (i--) {
      scripts[i].parentNode.removeChild(scripts[i]);
    }
    return div.innerHTML;
  }

alert(
 stripScripts('<span><script type="text/javascript">alert(\'foo\');<\/script><\/span>')
);

Note that at present, browsers will not execute a script inserted using the innerHTML property, and likely never will, especially as the element is never added to the document.

how to remove all script tags in a html content with CsQuery

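Solved. This code removes every script element inside body (use the selector "script" instead of "body script" to also cover scripts in head).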

dom = dom["body script"].Remove();

How to remove script tags from an HTML page using C#?

It can be done using regex:

// Requires: using System.Text.RegularExpressions;
Regex rRemScript = new Regex(@"<script[^>]*>[\s\S]*?</script>", RegexOptions.IgnoreCase);
output = rRemScript.Replace(input, "");

remove script tag from HTML content

Because this question is tagged with regex, I'm going to answer with a poor man's solution for this situation:

$html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);

However, regular expressions are not for parsing HTML/XML. Even if you write the perfect expression, it will break eventually, so it's not worth it. In some cases a regex is useful for quickly fixing some markup, but as with all quick fixes, forget about security. Use regex only on content/markup you trust.

Remember, anything a user inputs should be considered unsafe.
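
To make the "it will break eventually" point concrete, here is a small sketch in Python. It uses the same pattern as the PHP line above (PHP's i and s modifiers correspond to re.IGNORECASE and re.DOTALL). The payload strings are hypothetical hostile inputs invented for this illustration, and strip_until_stable is a made-up helper name.

```python
import re

# Same pattern as the preg_replace call above.
SCRIPT_RE = re.compile(r"<script(.*?)>(.*?)</script>", re.IGNORECASE | re.DOTALL)

# A complete script element hidden inside a split-up tag name.
nested = "<scr<script>x</script>ipt>alert(1)</script>"

# A single pass removes the inner element and thereby assembles a new one.
print(SCRIPT_RE.sub("", nested))  # -> <script>alert(1)</script>


def strip_until_stable(html):
    """Repeat the substitution until the text stops changing."""
    while True:
        cleaned = SCRIPT_RE.sub("", html)
        if cleaned == html:
            return cleaned
        html = cleaned


print(repr(strip_until_stable(nested)))  # -> ''

# Browsers accept whitespace before the closing '>', but the pattern does not match it.
spaced = "<script>alert(1)</script >"
print(SCRIPT_RE.sub("", spaced))  # -> <script>alert(1)</script >
```

Repeating the substitution closes the first hole only. The second one shows why patching the expression case by case is a losing game on untrusted input.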

A better solution here would be to use DOMDocument, which is designed for this.
Here is a snippet that demonstrates how easy, clean (compared to regex), (almost) reliable and (nearly) safe it is to do the same:

<?php

$html = <<<HTML
...
HTML;

$dom = new DOMDocument();

$dom->loadHTML($html);

$script = $dom->getElementsByTagName('script');

$remove = [];
foreach($script as $item)
{
  $remove[] = $item;
}

foreach ($remove as $item)
{
  $item->parentNode->removeChild($item); 
}

$html = $dom->saveHTML();

I have removed the HTML intentionally, because even this approach can bork on some markup.

How would I remove all script tags (and everything in between) from multiple files using UNIX?

For example, with gawk:

$ cat file
blah
<script type="text/javascript">function(foo);</script>
<script type="text/javascript" src="scripts.js"></script>
blah
<script type="text/javascript"
    src="script1.js">
</script>
end

$ awk 'BEGIN{RS="</script>"}/<script/{gsub("<script.*","")}{print}END{if(RS=="")print}' file
blah




blah


end

Then run it inside a for loop to go over your files (e.g. the HTML ones):

for file in *.html
do
  awk 'BEGIN{RS="</script>"}/<script/{gsub("<script.*","")}{print}END{if(RS=="")print}' "$file" >temp
  mv temp "$file"
done

You can also do it with Perl:

perl -i.bak -0777ne 's|<script.*?</script>||gms;print' *.html
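
For comparison, the same batch edit can be sketched in Python. This is an illustration under assumptions, not a drop-in replacement: strip_scripts_in_dir is a made-up name, the files are assumed to be UTF-8, the regex mirrors the Perl one but adds case-insensitivity, and a .bak copy is written only for files that actually change (perl -i.bak backs up every file it processes). The usual caveats about regex and HTML still apply.

```python
import re
import shutil
from pathlib import Path

# Mirrors s|<script.*?</script>||gms, plus IGNORECASE (an addition in this sketch).
SCRIPT_RE = re.compile(r"<script.*?</script>", re.IGNORECASE | re.DOTALL)


def strip_scripts_in_dir(directory, pattern="*.html"):
    """Rewrite matching files in place and return the names of the files changed."""
    changed = []
    for path in sorted(Path(directory).glob(pattern)):
        original = path.read_text(encoding="utf-8")
        cleaned = SCRIPT_RE.sub("", original)
        if cleaned != original:
            # Keep a backup next to the file, e.g. index.html.bak.
            shutil.copy2(path, path.with_name(path.name + ".bak"))
            path.write_text(cleaned, encoding="utf-8")
            changed.append(path.name)
    return changed


if __name__ == "__main__":
    # Process the HTML files in the current directory.
    print(strip_scripts_in_dir("."))
```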

How can I save an html file and remove script and unwanted tag when downloading?

Set the class attribute of the elements you want to drop to something like delete and select them with querySelectorAll(".delete, script"). Then, in your clean() function, loop through these elements and call element.remove(), in addition to removing the contenteditable attribute from the other elements.

body {
  margin: 0;
  padding: 0;
  background-color: #f2f2f2;
}

p {
  margin: 0 0 8px;
}

<!--Delete Start-->
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.6.0/jquery.min.js" integrity="sha512-894YE6QWD5I59HgZOGReFYm4dnWc1Qt5NtvYSaNcOP+u1T9qYdvdihz0PPSiiqn/+/3e7Jo4EaG7TubfWGUrMQ==" crossorigin="anonymous" referrerpolicy="no-referrer"></script>
<!--Delete End-->

<p contenteditable="true">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor
  in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>

<!--Delete Start-->
<div class="delete">
  <input type="button" id="add" value="add content">
  <input type="button" id="del" value="remove content">
</div>
<!--Delete End-->

<!--Delete Start-->
<div class="delete">
  <a href="#" id="download-link" onClick="myFunction()">download html content</a>
</div>
<!--Delete End-->

<!--Delete Start-->
<script>
  function myFunction() {
    clean();
    var content = document.documentElement.innerHTML;
    download(content, "index", "html");
  }

  function clean() {
    var contentToDelete = document.querySelectorAll(".delete, script");
    var editableContent = document.querySelectorAll("[contenteditable=true]");
    for (var i = 0; i < editableContent.length; i++) {
      editableContent[i].removeAttribute('contenteditable');
    }
    contentToDelete.forEach((element) => element.remove());
  }

  function download(content, fileName, fileType) {
    var link = document.createElement("a");
    var file = new Blob([content], {
      type: "text/html"
    });
    var downloadFile = fileName + "." + fileType;
    link.href = URL.createObjectURL(file);
    link.download = downloadFile;
    link.click();
  }
</script>
<!--Delete End-->
