Regex to Strip Comments and Multi-Line Comments and Empty Lines

$text = preg_replace('!/\*.*?\*/!s', '', $text);
$text = preg_replace('/\n\s*\n/', "\n", $text);
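
For comparison, here is a minimal Python sketch of the same two steps. The function name and the sample text are made up for illustration; only the two patterns come from the PHP snippet above.

```python
import re

def strip_block_comments_and_blank_lines(text):
    # Step 1: remove /* ... */ comments. re.S lets '.' span newlines,
    # and the lazy '.*?' stops at the first closing '*/'.
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.S)
    # Step 2: collapse a newline followed by whitespace-only lines
    # into a single newline.
    return re.sub(r"\n\s*\n", "\n", text)

sample = "a = 1;\n/* block\n   comment */\n\n\nb = 2;\n"
result = strip_block_comments_and_blank_lines(sample)
print(result)
```

Like the PHP version, this assumes that /* never appears inside a string literal.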

Remove all comments (single-/multi-line) & blank lines from source file

To remove the comments, see this answer.
After that, removing empty lines is trivial.

Replace multi-line comment with empty lines

Assuming you do not have comment-like strings in string literals and that comments can't be nested (since the string comes from T-SQL), you can try

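using System;
using System.Linq;
using System.Text.RegularExpressions;

var rx = new Regex(@"/\*(?s:.*?)\*/");
var txt = @"/*
Some
multiline
comment
*/";
// Replace each comment with one "#" line per line of the original comment.
// The split assumes the text uses \r\n line endings.
var replaced = rx.Replace(txt, m => String.Concat(Enumerable.Repeat("#\r\n", m.Value.Split(new string[] {"\r\n"}, StringSplitOptions.None).Count())).Trim());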

The /\*(?s:.*?)\*/ regex matches any text between /* and */. The logic is that we get the whole match, split it with linebreaks, and then build a replacement string based on the number of lines.
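
The same idea can be sketched in Python roughly as follows. The helper names and the sample text are invented for this example, and the sketch assumes \r\n line endings just as the C# code does.

```python
import re

BLOCK_COMMENT = re.compile(r"/\*(?s:.*?)\*/")

def comment_to_markers(match):
    # Count the lines covered by the comment and emit one "#" per line,
    # so the total number of lines in the text stays the same.
    line_count = len(match.group(0).split("\r\n"))
    return "\r\n".join(["#"] * line_count)

def replace_comments(text):
    return BLOCK_COMMENT.sub(comment_to_markers, text)

txt = "SELECT 1\r\n/*\r\nSome\r\nmultiline\r\ncomment\r\n*/\r\nSELECT 2"
replaced = replace_comments(txt)
print(replaced)
```

Joining the markers with "\r\n" has the same effect as concatenating "#\r\n" and trimming the result, which is what the C# version does.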

If you want to match only the lines that consist entirely of a comment, you can use the following regex:

(?m)^\s*/\*(?s:.*?)\*/\s*$
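
A quick way to see the difference is to run that pattern over text containing both an inline and a standalone comment. This is a Python sketch with an invented SQL sample; the pattern itself is unchanged.

```python
import re

# Matches a comment only when nothing but whitespace surrounds it
# on its first and last line.
WHOLE_LINE_COMMENT = re.compile(r"(?m)^\s*/\*(?s:.*?)\*/\s*$")

sql = "SELECT a /* inline */ FROM t\n/* standalone\n   comment */\nWHERE a > 1"
matches = WHOLE_LINE_COMMENT.findall(sql)
print(matches)
```

The inline comment is skipped because the line it sits on does not start with the comment.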

How to strip comments starting at ** line from text in SQL

Apply regexp_replace twice if you want to remove the comments and also replace the newlines inside the text with spaces:

regexp_replace(regexp_replace($1, '((\n*)(\*\*.*)?)$', ''),'\n',' ')
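
To illustrate the two-step idea outside the database, here is a rough Python sketch. It rests on assumptions that the SQL above does not state: that the ** comment runs to the end of the text, and that . is allowed to match newlines. Regex dialects differ between database engines, so treat this as an approximation and not as an exact port. The function name is hypothetical.

```python
import re

def strip_trailing_comment_and_flatten(text):
    # Step 1 (assumption: the comment runs to the end of the text):
    # drop trailing newlines plus an optional '**' comment.
    without_comment = re.sub(r"\n*(?:\*\*.*)?\Z", "", text, count=1, flags=re.S)
    # Step 2: turn the remaining newlines into spaces.
    return without_comment.replace("\n", " ")

flattened = strip_trailing_comment_and_flatten(
    "line one\nline two\n** a comment\nstill comment"
)
print(flattened)
```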

Visual Studio regex to remove all comments and blank lines in VB.NET code using a macro

To get rid of a line that contains whitespace or nothing, you can use this regex:

(?m)^[ \t]*[\r\n]+
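
As a small check, this is how that pattern behaves in Python on an invented VB-like sample with \r\n line endings (the sample text is an assumption for the example):

```python
import re

# A line that holds only spaces/TABs (or nothing), plus its line separator.
BLANK_LINE = re.compile(r"(?m)^[ \t]*[\r\n]+")

source = "Sub A()\r\nEnd Sub\r\n\r\n   \t\r\nPublic Sub B()\r\nEnd Sub\r\n"
cleaned = BLANK_LINE.sub("", source)
print(cleaned)
```

Both the empty line and the whitespace-only line disappear, and the other lines keep their separators.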

Your regex, ^[\s|\t]*$\n would work if you specified Multiline mode ((?m)), but it's still incorrect. For one thing, the | matches a literal |; there's no need to specify "or" in a character class. For another, \s matches any whitespace character, including TAB (\t), carriage-return (\r), and linefeed (\n), making it needlessly redundant and inefficient. For example, at the first blank line (after the end of the first Sub), the ^[\s|\t]* will initially try to match everything before the word Public, then it will back off to the end of the previous line, where the $\n can match.

But a blank line, in addition to being empty or containing only horizontal whitespace (spaces or TABs), may also contain a comment. I choose to treat these "comment-only" lines as blank lines because it's relatively easy to do, and it simplifies the task of matching comments in non-blank lines, which is much harder. Here's my regex:

^[ \t]*(?:(?:REM|')[^\r\n]*)?[\r\n]+

After consuming any leading horizontal whitespace, if I see a REM or ' signifying a comment, I consume that and everything after it until the next line separator. Notice that the only thing that's required to be present is the line separator itself. Also notice the absence of the end anchor, $. It's never necessary to use that when you're explicitly matching the line separators, and in this case it would break the regex. In Multiline mode, $ matches only before a linefeed (\n), not before a carriage-return (\r). (This behavior of the .NET flavor is incorrect and rather surprising, given Microsoft's longstanding preference for \r\n as a line separator.)
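
Here is a short Python sketch of that regex in action. The sample source is invented, and (?m) is written into the pattern explicitly because Python, like .NET, only treats ^ as a line anchor in multiline mode.

```python
import re

# Blank lines and comment-only lines, including their line separators.
BLANK_OR_COMMENT_LINE = re.compile(r"(?m)^[ \t]*(?:(?:REM|')[^\r\n]*)?[\r\n]+")

source = (
    "Sub A()\r\n"
    "    ' only a comment\r\n"
    "REM another one\r\n"
    "\r\n"
    "    x = 1 ' trailing\r\n"
    "End Sub\r\n"
)
cleaned = BLANK_OR_COMMENT_LINE.sub("", source)
print(cleaned)
```

The trailing comment after x = 1 is left alone here; this pattern only handles lines with nothing but a comment on them.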

Matching the remaining comments is a fundamentally different task. As you've discovered, simply searching for REM or ' is no good because you might find it in a string literal, where it does not signify the start of a comment. What you have to do is start from the beginning of the line, consuming and capturing anything that's not the beginning of a comment or a string literal. If you find a double-quote, go ahead and consume the string literal. If you find a REM or ', stop capturing and go ahead and consume the rest of the line. Then you replace the whole line with just the captured portion--i.e., everything before the comment. Here's the regex:

(?mn)^(?<line>[^\r\n"R']*(("[^"]*"|(?!REM)R)[^\r\n"R']*)*)(REM|')[^\r\n]*

Or, more readably:

(?mn)                   # Multiline and ExplicitCapture modes
^                       # beginning of line
(?<line>                # capture in group "line"
  [^\r\n"R']*           # any number of "safe" characters
  (
    (
      "[^"]*"           # a string literal
      |
      (?!REM)R          # 'R' if it's not the beginning of 'REM'
    )
    [^\r\n"R']*         # more "safe" characters
  )*
)                       # stop capturing
(?:REM|')               # a comment sigil
[^\r\n]*                # consume the rest of the line

The replacement string would be "${line}". Some other notes:

  • Notice that this regex does not end with [\r\n]+ to consume the line separator, like the "blank lines" regex does.
  • It doesn't end with $ either, for the same reason as before. The [^\r\n]* will greedily consume everything before the line separator, so the anchor isn't needed.
  • The only thing that's required to be present is the REM or '; we don't bother matching any line that doesn't contain a comment.
  • ExplicitCapture mode means I can use (...) instead of (?:...) for all the groups I don't want to capture, but the named group, (?<line>...), still works.
  • Gnarly as it is, this regex would be a lot worse if VB supported multiline comments, or if its string literals supported backslash escapes.

I don't do VB, but here's a demo in C#.
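
For readers without .NET at hand, here is a rough Python port of the same regex. Python has no ExplicitCapture mode, so the non-capturing groups are written as (?:...) and the named group as (?P<line>...); the sample source is made up for the example.

```python
import re

# Capture everything before the comment in "line", then consume the comment.
TRAILING_COMMENT = re.compile(
    r'''(?m)^(?P<line>[^\r\n"R']*(?:(?:"[^"]*"|(?!REM)R)[^\r\n"R']*)*)(?:REM|')[^\r\n]*'''
)

source = (
    "x = 1 ' set x\r\n"
    's = "it\'s REM not a comment" REM real comment\r\n'
    "Return x\r\n"
)
# Replace each matching line with just the captured, comment-free part.
cleaned = TRAILING_COMMENT.sub(r"\g<line>", source)
print(cleaned)
```

The REM and the apostrophe inside the string literal survive, and the R in Return is not mistaken for a comment. Whitespace in front of a removed comment is kept, so some lines end with a trailing space.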

PHP regex to remove single line comments

Regular expressions aren't powerful enough to do this (elegantly) in all cases, but you can rely on some assumptions. For instance, since // can only be a) the start of a comment or b) part of a string, you should be able to use the following:

\/\/[^;)]*$

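This requires that no ; or ) appears between the // and the end of the line, so it only works when your comments do not contain those characters. You can of course use other characters, such as ' and/or ", to better fit your needs. Note that $ only matches at the end of every line when the m (multiline) modifier is set.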
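
A hedged Python sketch of this heuristic follows. It makes one change that is not in the pattern above: the character class also excludes \n, because a negated class otherwise matches newlines and a comment without ; or ) could swallow the lines after it. The PHP sample text is invented.

```python
import re

# '//' followed by anything except ';', ')' or a newline, up to end of line.
# Excluding '\n' is an addition to the original pattern (see note above).
LINE_COMMENT = re.compile(r"//[^;)\n]*$", re.M)

php = (
    'echo "http://example.com"; // fetch the page\n'
    "$x = 1; // set x\n"
    "// whole line\n"
)
cleaned = LINE_COMMENT.sub("", php)
print(cleaned)
```

The // inside the URL is left alone because a ; follows it on the same line, which is exactly the assumption this heuristic relies on.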

Delete multi-line comments in C# (// or /*...*/)

Alright, this regex, (^\/\/.*?$)|(\/\*.*?\*\/), will match the following lines out of your example text (and remove them if you use Visual Studio and replace the matches with nothing):

//#define                0x00180000

// #define 0x20000000

// abcd

/*#define 0x00080000

#define 0x40000000*/

/* defg */

and almost gets you what you want. Now, I'm suspicious of this line, /\*#define 0x00000000*/, but if you wanted to capture it as well you could modify the regex to be (^\/\/.*?$)|(\/.*?\*.*?\*\/).
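
To try the first regex outside an editor, here is a minimal Python sketch. Two things are assumptions on my part: the macro names A to E are placeholders, and both the multiline and dot-matches-newline flags are set, since the pattern only covers the multi-line sample if ^ and $ work per line and . can cross line breaks.

```python
import re

# First alternative: a line that starts with '//'.
# Second alternative: a /* ... */ block, possibly spanning lines.
COMMENTS = re.compile(r"(^//.*?$)|(/\*.*?\*/)", re.M | re.S)

text = (
    "//#define A 0x00180000\n"
    "#define B 0x00100000\n"
    "/*#define C 0x00080000\n"
    "#define D 0x40000000*/\n"
    "/* defg */\n"
    "#define E 0x1 // trailing\n"
)
cleaned = COMMENTS.sub("", text)
print(cleaned)
```

Only // comments at the very start of a line are removed, so the trailing one on the last line stays, and the removed comments leave empty lines behind.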

Remove multi-line C style /* comments */ using Perl regex

I would do it like this:

perl -0777pe 's/\/\*(?:(?!\*\/).)*\*\/\n?//sg' file

Example:

$ cat fi
/* comments
comments
comments
comments */
bar
$ perl -0777pe 's/\/\*(?:(?!\*\/).)*\*\/\n?//sg' fi
bar
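
The same pattern carries over to Python almost unchanged, shown here as a small sketch using the sample file contents from above. The re.S flag plays the role of Perl's s modifier, and reading the whole text as one string corresponds to -0777.

```python
import re

# '/*', then any character that does not start '*/', then '*/'
# and an optional newline right after the comment.
C_COMMENT = re.compile(r"/\*(?:(?!\*/).)*\*/\n?", re.S)

text = "/* comments\ncomments\ncomments\ncomments */\nbar\n"
cleaned = C_COMMENT.sub("", text)
print(cleaned)
```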

