Regexp to Strip HTML Comments

RegExp to strip HTML comments

Are you just trying to remove the comments? How about

s/<!--[^>]*-->//g

or the slightly better (suggested by the questioner himself):

<!--(.*?)-->

But remember, HTML is not regular, so using regular expressions to parse it will lead you into a world of hurt when somebody throws bizarre edge cases at it.
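To see why the second pattern is the better of the two, here is a small sketch in JavaScript (the sample string and variable names are made up for illustration). The negated class [^>]* cannot cross a > inside the comment body, while the lazy .*? simply stops at the first -->.

```javascript
// Hypothetical sample input: the first comment body contains a '>' character.
const sample = 'a<!-- x > y -->b<!-- z -->c';

// [^>]* cannot cross the '>' inside the first comment, so that comment survives.
const withNegatedClass = sample.replace(/<!--[^>]*-->/g, '');

// The lazy .*? stops at the first '-->' and removes both comments.
const withLazyDot = sample.replace(/<!--(.*?)-->/g, '');

console.log(withNegatedClass); // a<!-- x > y -->bc
console.log(withLazyDot);      // abc
```

Neither pattern handles comments that span several lines, since . does not match a newline by default.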

How to remove html comments from a string in Javascript

Like this:

var str = `<div></div>

<!-- some comment -->

<p></p>

<!-- some comment -->`;

// Note: . does not match newlines, so this only removes comments that fit on one line.
str = str.replace(/<!--.*?-->/g, "");

console.log(str);

Remove HTML comments with Regex, in Javascript

The regex /<!--[\s\S]*?-->/g should work.
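A quick sketch of the difference, assuming a made-up input with a comment that spans two lines: the plain . version leaves it alone, while [\s\S] matches across the newline.

```javascript
// Hypothetical input with a comment spanning two lines.
const multiLine = '<p>one</p>\n<!-- first line\n     second line -->\n<p>two</p>';

// Without the s (dotAll) flag, . never matches '\n', so nothing is removed.
const dotOnly = multiLine.replace(/<!--.*?-->/g, '');

// [\s\S] matches any character, including newlines.
const anyChar = multiLine.replace(/<!--[\s\S]*?-->/g, '');

console.log(dotOnly === multiLine);   // true
console.log(JSON.stringify(anyChar)); // "<p>one</p>\n\n<p>two</p>"
```

In engines that support the s flag, /<!--.*?-->/gs should behave the same way as the [\s\S] form.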

Be careful: this will also mangle escaping text spans in CDATA blocks, where the matched text is not actually a comment.

E.g.

<script><!-- notACommentHere() --></script>

and literal text in formatted code blocks

<xmp>I'm demoing HTML <!-- comments --></xmp>

<textarea><!-- Not a comment either --></textarea>

EDIT:

This also won't prevent new comments from being introduced as in

<!-<!-- A comment -->- not comment text -->

which after one round of that regexp would become

<!-- not comment text -->

If this is a problem, you can escape < that are not part of a comment or tag (complicated to get right) or you can loop and replace as above until the string settles down.
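The "loop until the string settles down" option could look roughly like this. The function name stripCommentsUntilStable is only an illustrative choice, and the same caveats about script, xmp and textarea content still apply.

```javascript
const COMMENT = /<!--[\s\S]*?-->/g;

// Re-apply the replacement until a pass no longer changes the string.
function stripCommentsUntilStable(html) {
  let previous;
  let current = html;
  do {
    previous = current;
    current = previous.replace(COMMENT, '');
  } while (current !== previous);
  return current;
}

const nested = '<!-<!-- A comment -->- not comment text -->';

// One pass leaves a freshly formed comment behind.
console.log(nested.replace(COMMENT, ''));      // <!-- not comment text -->

// Looping removes that one as well, leaving an empty string.
console.log(JSON.stringify(stripCommentsUntilStable(nested))); // ""
```

Each pass that changes the string makes it shorter, so the loop always terminates.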


Here's a regex that will match comments, including pseudo-comments and unclosed comments, per the HTML5 spec. CDATA sections are only strictly allowed in foreign XML content. This suffers from the same caveats as above.

var COMMENT_PSEUDO_COMMENT_OR_LT_BANG = new RegExp(
'<!---?>' // A comment with no body: <!--> or <!--->
+ '|<!--[\\s\\S]*?(?:-->|$)' // A comment, closed or running to the end of input
+ '|<!(?![dD][oO][cC][tT][yY][pP][eE]|\\[CDATA\\[)[^>]*>?'
+ '|<[?][^>]*>?', // A pseudo-comment
'g');

Remove almost all HTML comments using Regex

This should remove all comments that contain neither "batcache" nor "[endif]". Matching runs from the opening <!-- to the closing -->.

$result = preg_replace("/<!--((?!batcache)(?!\\[endif\\])[\\s\\S])*?-->/", "", $str);

You can test it here.

As other users have already stated, it is not always safe to parse HTML with a regex, but if you are reasonably sure what kind of HTML you will be parsing, it should work as expected. If the regex fails on a particular use case, let me know.
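For comparison, here is the same pattern sketched in JavaScript rather than PHP (the sample markup is made up). Each character of the comment body is only consumed if neither "batcache" nor "[endif]" starts at that position, so comments containing either string are left alone.

```javascript
// JavaScript rendering of the PHP pattern above; the g flag replaces every match.
const KEEP_SPECIAL = /<!--(?:(?!batcache)(?!\[endif\])[\s\S])*?-->/g;

// Hypothetical sample: a plain comment, a batcache comment, a conditional comment.
const page =
  '<!-- plain --><p>x</p><!-- batcache stats -->' +
  '<!--[if IE]><b>ie</b><![endif]-->';

const cleaned = page.replace(KEEP_SPECIAL, '');

console.log(cleaned);
// <p>x</p><!-- batcache stats --><!--[if IE]><b>ie</b><![endif]-->
```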

Using Regular Expression remove HTML comments from content

It looks like you are missing something. Try this:

 $content = preg_replace( '/<!--(.|\s)*?-->/' , '' , $content );

You can test it here http://www.phpliveregex.com/p/1LX

delete html comment tags using regexp

patrickmdnet has the correct answer. Here it is on one line using extended regex:

cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'

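Be aware that sed's regular expressions have no non-greedy quantifier: .*? behaves like .* here, so when two comments share a line, the text between them is removed as well. Here is a good resource for learning more about sed. This command is an adaptation of one-liner #92: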

http://www.catonmat.net/blog/sed-one-liners-explained-part-three/

How to remove all conditional HTML comments?

Here are two important facts about (f)lex regular expressions. (See the flex manual for complete documentation of Flex patterns. The section is not very long.)

  1. In (f)lex, the . wildcard matches anything except a newline character. In other words, it is equivalent to [^\n]. So "<!".* will only match to the end of the line. You could fix that by using (.|\n) instead, but see below.

  2. (F)lex does not provide non-greedy repetition (*?). All repetitions extend to the longest possible match. (.*?)--> will therefore match up to the last --> on the line, and (.|\n)*?--> would match up to the last --> in the file.

It is possible to write a regular expression which does what you want, although it's a bit messy:

<!--([^-]|-[^-]|--+[^->])*--+>

should work, as long as the input text does not end with an unterminated comment. (The quotes in your pattern are unnecessary, since none of the quoted characters has any special meaning to (f)lex, but they don't hurt. I left them out because I don't think they contribute to make the pattern less unreadable.)

The repeated sequence matches any of:

  • A character other than -; or
  • A - followed by something other than another -; or
  • Two or more - followed by something other than >.

The last alternative in the repetition might require some explanation. The underlying problem is to avoid problems with inputs like

<!-- Comment with too many dashes --->

If we'd just written the tempting --[^>] as the third alternative, ---> would not be recognised as terminating the pattern, since --- would match --[^>] (a dash is not a right angle bracket) and > would then match [^-], and the scan would continue. Adding the + to match a longer sequence of dashes is not enough, because, like many regex engines, (f)lex is looking for the longest overall match, not the longest submatch in each set of alternatives. So we need to write --+[^->], which cannot match ---.
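The effect of that third alternative can be checked outside of (f)lex. The sketch below uses JavaScript, whose regex engine backtracks instead of picking the longest overall match, so it is only an illustration of the pattern on a made-up input, not a model of how (f)lex scans.

```javascript
// Hypothetical input: the first comment is closed with three dashes.
const text = 'a<!-- too many dashes --->b<!-- second -->c';

// The pattern from above: --+[^->] can never swallow the closing dashes.
const careful = /<!--([^-]|-[^-]|--+[^->])*--+>/g;

// The tempting variant: --[^>] matches '---', so the scan runs past '--->'.
const tempting = /<!--([^-]|-[^-]|--[^>])*-->/g;

console.log(text.replace(careful, ''));  // abc
console.log(text.replace(tempting, '')); // ac
```

With the tempting variant the first comment is only closed by the --> of the second one, so the b between them is lost.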

If that was not clear -- and I can see why it wouldn't be --, you could instead use a start condition to write a much simpler set of patterns:

%x COMMENT
%%
"<!--" { BEGIN(COMMENT); }
<COMMENT>{
"-->" { BEGIN(INITIAL); }
[^-]+ ;
.|\n ;
}

The second <COMMENT> rule is really just an efficiency hack; it avoids triggering a no-op action on every character. With the second rule in place, the last rule really can only match a single -, so it could have been written that way. But writing it in full allows you to remove the second rule and demonstrate to yourself that it works without it.

The key insight for matching the comment in pieces like this is that (f)lex always chooses the longest match, which is in some ways similar to the goal of non-greedy matches. While inside the <COMMENT> start condition, - will only match the single character fallback rule if it cannot be part of the match of -->, which is longer.
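If you are not using (f)lex, the same two-state idea can be sketched by hand. The JavaScript below is a rough analogue, not a translation: the function name stripCommentsByState is made up, the boolean stands in for the INITIAL and COMMENT start conditions, and copying text outside comments to the output mimics flex's default echo of unmatched input.

```javascript
// Minimal two-state scanner: inComment === false plays the role of INITIAL,
// inComment === true plays the role of the COMMENT start condition.
function stripCommentsByState(input) {
  let inComment = false;
  let output = '';
  let i = 0;
  while (i < input.length) {
    if (!inComment && input.startsWith('<!--', i)) {
      inComment = true;   // like BEGIN(COMMENT)
      i += 4;
    } else if (inComment && input.startsWith('-->', i)) {
      inComment = false;  // like BEGIN(INITIAL)
      i += 3;
    } else {
      // Outside a comment the character is kept; inside it is dropped.
      if (!inComment) output += input[i];
      i += 1;
    }
  }
  return output;
}

console.log(stripCommentsByState('a<!-- x --->b<!-- y\nz -->c')); // abc
```

Checking for --> before falling back to a single character is what gives the longer match priority here. An unterminated comment swallows the rest of the input in this sketch.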


