Regex to Strip Line Comments from C#


Both of your regular expressions (for block and line comments) have bugs. I can describe the bugs if you like, but I think it is more productive to write new ones, especially since I intend to write a single expression that matches both.

The thing is, every time you have /* and // and literal strings “interfering” with each other, it is always the one that starts first that takes precedence. That’s very convenient because that’s exactly how regular expressions work: find the first match first.

So let’s define a regular expression that matches each of those four tokens:

var blockComments = @"/\*(.*?)\*/";
var lineComments = @"//(.*?)\r?\n";
var strings = @"""((\\[^\n]|[^""\n])*)""";
var verbatimStrings = @"@(""[^""]*"")+";

To answer the question in the title (strip comments), we need to:

Replace the block comments with nothing
Replace the line comments with a newline (because the regex eats the newline)
Keep the literal strings where they are.

Regex.Replace can do this easily using a MatchEvaluator function:

string noComments = Regex.Replace(input,
    blockComments + "|" + lineComments + "|" + strings + "|" + verbatimStrings,
    me => {
        if (me.Value.StartsWith("/*") || me.Value.StartsWith("//"))
            return me.Value.StartsWith("//") ? Environment.NewLine : "";
        // Keep the literal strings
        return me.Value;
    },
    RegexOptions.Singleline);

I ran this code on all the examples that Holystream provided and various other cases that I could think of, and it works like a charm. If you can provide an example where it fails, I am happy to adjust the code for you.
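For anyone who wants to try this quickly, here is the same approach wrapped in a small console program. The class name CommentStripper, the method name Strip, and the sample input are mine and purely illustrative; the four patterns and the evaluator logic are the ones shown above. Note that a line comment on the very last line of the input is only matched if that line ends with a newline, because the line-comment pattern requires one.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommentStripper
{
    // The four token patterns from the answer above.
    const string BlockComments = @"/\*(.*?)\*/";
    const string LineComments = @"//(.*?)\r?\n";
    const string Strings = @"""((\\[^\n]|[^""\n])*)""";
    const string VerbatimStrings = @"@(""[^""]*"")+";

    public static string Strip(string input)
    {
        return Regex.Replace(input,
            BlockComments + "|" + LineComments + "|" + Strings + "|" + VerbatimStrings,
            me =>
            {
                // A line comment swallowed its newline, so put one back.
                if (me.Value.StartsWith("//", StringComparison.Ordinal))
                    return Environment.NewLine;
                // A block comment is replaced with nothing.
                if (me.Value.StartsWith("/*", StringComparison.Ordinal))
                    return "";
                // Anything else is a string literal: keep it unchanged.
                return me.Value;
            },
            RegexOptions.Singleline);
    }

    static void Main()
    {
        // Hypothetical sample input; the "//" inside the string must survive.
        string input = "var url = \"http://example.com\"; // trailing comment\n" +
                       "var x = /* inline */ 42;\n";
        Console.Write(Strip(input));
    }
}
```

The string literal containing // is matched as a string before the engine ever sees the // inside it, which is the "first match wins" behaviour described at the top.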

Remove all comment (single-/multi-line) & blank lines from source file

To remove the comments, see this answer.
After that, removing empty lines is trivial.
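One way to do the blank-line step, as a minimal sketch: the class and method names are made up for illustration, and I am assuming "blank" means empty or containing only spaces and tabs. A blank last line that has no terminating newline is left alone by this pattern.

```csharp
using System;
using System.Text.RegularExpressions;

static class BlankLineRemover
{
    // Deletes every line that is empty or holds only spaces/tabs,
    // together with its line break (\n or \r\n).
    public static string RemoveBlankLines(string text)
    {
        return Regex.Replace(text, @"^[ \t]*\r?\n", "", RegexOptions.Multiline);
    }

    static void Main()
    {
        Console.Write(RemoveBlankLines("a\n\n  \nb\n"));
    }
}
```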

Removing comments using regex

Here is the ugly regex pattern. I believe it will work well. I have tried it with every pathological example I can think of, including lines that contain syntax errors. For example, a quoted string that has too many quotes, or too few, or has a double escaped quote, which is, therefore, not escaped. And with quoted strings in the comments, which I have been known to do when I want to remind myself of alternatives.

The only time that it trips up is if there is a double slash inside a seemingly quoted string and somehow that string is malformed and the double slash ends up legally outside the properly quoted portion. Syntactically that makes it a valid comment, even though not the programmer's intention. So, from the programmer's perspective it's wrong, but by the rules, it's really a comment. Meaning, the pattern only appears to trip up.

When used, the pattern will return the non-comment portion of the line(s). The pattern has a newline \n in it to allow for applying it to an entire file. You may need to modify that if your system interprets newlines in some other fashion, for example as \r or \r\n. To use it in single-line mode you can remove that if you choose. It is at characters 17 and 18 in the one-liner and is on the fifth line, 6th and 7th printing characters, in the multi-line version. You can safely leave it there, however, as in single-line mode it makes no difference, and in multi-line mode it will return a newline for lines of code that are either blank or have a comment beginning in the first column. That will keep the line numbers the same in the original version and the stripped version if you write the results to a new file. Makes comparison easy.

One major caveat for this pattern: it uses a conditional construct that has varying levels of support across regex engines. I believe that, as used here with a lookaround, only the .NET and PCRE engines will accept it (YMMV). It is a ternary-style conditional: (?(condition)then|else). The condition pattern is treated as a zero-width assertion. If it matches, the then pattern is used in the attempted match; otherwise the else pattern is used. Without that construct, the pattern was growing to uncommon lengths and was still failing on some of my pathological test cases.

The pattern presented here is as it needs to be seen by the regex engine. I am not a C# programmer, so I don't know all the nuances of escaping quoted strings. Getting this pattern into your code, such that all the backslashes and quotes are seen properly by the regex engine is still up to you. Maybe C# has the equivalent of Perl's heredoc syntax.

This is the one-liner pattern to use:

^((?:(?:(?:[^"'/\n]|/(?!/))*)(?("(?=(?:\\\\|\\"|[^"])*"))(?:"(?:\\\\|\\"|[^"])*")|(?('(?=(?:\\\\|\\'|[^'])*'))(?:'(?:\\\\|\\'|[^'])*')|(?(/)|.))))*)
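On the question of getting this into C# source: a verbatim string literal (@"...") is probably the closest thing to a heredoc here. Backslashes are taken literally, and the only change needed is to write each double quote in the pattern as "". The sketch below is my transcription of the one-liner under that rule; the class and method names are hypothetical, and it is worth re-checking the literal against the pattern above before relying on it.

```csharp
using System;
using System.Text.RegularExpressions;

static class OneLinerDemo
{
    // The one-liner as a C# verbatim string: backslashes stay as they are,
    // and every double quote of the pattern is doubled.
    const string Pattern =
        @"^((?:(?:(?:[^""'/\n]|/(?!/))*)(?(""(?=(?:\\\\|\\""|[^""])*""))(?:""(?:\\\\|\\""|[^""])*"")|(?('(?=(?:\\\\|\\'|[^'])*'))(?:'(?:\\\\|\\'|[^'])*')|(?(/)|.))))*)";

    // Returns the non-comment portion of a single line (capturing group 1).
    public static string CodePart(string line)
    {
        return Regex.Match(line, Pattern).Groups[1].Value;
    }

    static void Main()
    {
        Console.WriteLine(CodePart("int x = 1; // set x"));
        Console.WriteLine(CodePart("var s = \"a//b\"; // note"));
    }
}
```

To run it over a whole file rather than one line at a time, RegexOptions.Multiline would be needed so that ^ matches at the start of every line.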

If you want to use the ignore pattern whitespace option, you can use this version:

(?x) # Turn on the ignore white space option
^( # Start the only capturing group
    (?: # A non-capturing group to allow for repeating the logic
        (?: # Capture either of the two options below
            [^"'/\n] # Capture everything not a single quote, double quote, a slash, or a newline
            | # OR
            /(?!/) # Capture a slash not followed by a slash [slash and a negative look-ahead for a slash]
        )* # As many times as possible, even if none
        (?(" # Start a conditional match for double-quoted strings
                (?=(?:\\\\|\\"|[^"])*") # Followed by a properly closed double-quoted string
            ) # Then
            (?:"(?:\\\\|\\"|[^"])*") # Capture the whole double-quoted string
            | # Otherwise
            (?(' # Start a conditional match for single-quoted strings
                (?=(?:\\\\|\\'|[^'])*') # Followed by a properly closed single-quoted string
                ) # Then
                (?:'(?:\\\\|\\'|[^'])*') # Capture the whole single-quoted string
                | # Otherwise
                (?([^/]) # If next character is not a slash
                .) # Capture that character; it is either a single quote or a double quote that is not part of a properly closed string
            ) # end the conditional match for single-quoted strings
        ) # End the conditional match for double-quoted strings
    )* # Close the repeating non-capturing group, capturing as many times as possible, even if none
) # Close the only capturing group

This allows for your code to explain this monstrosity so that when someone else looks at it, or in a few months you have to work on it yourself, there's no WTF moment. I think the comments explain it well, but feel free to change them any way you please.

As mentioned above, the conditional match grouping has limited support. One place it will fail is on the site you linked to in an earlier comment. Since you're using C#, I chose to do my testing in the .NET Regex Tester, which can handle those constructs. It includes a nice Reference too. Given the proper selections on the side, you can test either version above, and experiment with it as well. Considering its complexity, I would recommend testing it, somewhere, against data from your files, as well as any edge cases and pathological tests you can dream up.

Just to redeem this small pattern, there is a much bigger pattern for testing email addresses that is 78 columns by 81 lines, with a couple dozen characters to spare. (I do not recommend using it, or any other regex, for testing email addresses. Wrong tool for the job.) If you want to scare yourself, have a peek at it on the ex-parrot site. I had nothing to do with that!!

Remove all lines (comments) starting with ** by using Regex (.NET Framework, C#)

Why not just:

var text = @"** A comment
* A command
Data, data, data
** Some other comment
* Another command
1, 2, 3
4, 5, 6";

// This version leaves a \n at the beginning of the string if the text starts with a comment.
var textWithoutComments = Regex.Replace(text, @"(^|\n)\*\*.*(?=\n)", string.Empty);

// This version deals with that problem, at the cost of a longer regex that treats the first line
// differently from the other lines (it consumes the line break rather than leaving it in the text).
// Note that it expects \r\n line endings.
var textWithoutComments2 = Regex.Replace(text, @"(^\*\*.*\r\n)|((\r\n)\*\*.*($|(?=\r\n)))", string.Empty);

Don't know about performance, I don't have test data at the ready...

PS: I also am inclined to believe that if you want top performance, some streaming might be ideal, you can always return a string from the method if that makes things easier for later processing. I think most people in this thread are suggesting StreamReader for the iteration/reading/interpreting part, regardless of the return type you decide to build.
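To make the streaming idea concrete, here is a minimal sketch that needs no regex at all. It takes any TextReader (a StreamReader over a file would work the same way as the StringReader used here) and returns a string. The class and method names are hypothetical, and I am assuming a comment is any line whose first two characters are **.

```csharp
using System;
using System.IO;
using System.Text;

static class StarCommentFilter
{
    // Reads line by line and keeps every line that does not start with "**".
    // Kept lines are joined with '\n', including after the last one.
    public static string RemoveComments(TextReader reader)
    {
        var sb = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (!line.StartsWith("**", StringComparison.Ordinal))
                sb.Append(line).Append('\n');
        }
        return sb.ToString();
    }

    static void Main()
    {
        var text = "** A comment\n* A command\nData, data, data\n** Some other comment\n1, 2, 3";
        using (var reader = new StringReader(text))
            Console.Write(RemoveComments(reader));
    }
}
```

ReadLine handles \n, \r and \r\n alike, so the line-ending differences between the two regex versions above do not come up.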

Strip out C Style Multi-line Comments

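Use a RegexOptions.Multiline option parameter. (Strictly speaking, Multiline only changes how ^ and $ behave; the pattern below spans lines because \s matches newline characters.)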

string output = Regex.Replace(input, pattern, string.Empty, RegexOptions.Multiline);

Full example

string input = @"this is some stuff right here
    /* blah blah blah 
    blah blah blah 
    blah blah blah */ and this is more stuff
    right here.";

string pattern = @"/[*][\w\d\s]+[*]/";

string output = Regex.Replace(input, pattern, string.Empty, RegexOptions.Multiline);
Console.WriteLine(output);
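A quick check of how far this pattern goes, as a sketch with made-up names and inputs: the character class [\w\d\s] covers only word characters and whitespace, so a comment containing punctuation such as a colon is not matched, and the result is the same with or without RegexOptions.Multiline because the pattern has no ^ or $.

```csharp
using System;
using System.Text.RegularExpressions;

static class BlockCommentDemo
{
    // The pattern from the example above.
    const string Pattern = @"/[*][\w\d\s]+[*]/";

    public static string Strip(string input, RegexOptions options)
    {
        return Regex.Replace(input, Pattern, string.Empty, options);
    }

    static void Main()
    {
        string plain = "stuff /* blah\n blah */ more";
        string punctuated = "x /* note: y */ z";

        // Same result either way: the comment is removed in both calls.
        Console.WriteLine(Strip(plain, RegexOptions.Multiline));
        Console.WriteLine(Strip(plain, RegexOptions.None));

        // The ':' is outside the character class, so nothing is removed here.
        Console.WriteLine(Strip(punctuated, RegexOptions.Multiline));
    }
}
```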

C# removing comments from a string

Replace the || in the if-statement by && and use "\r\n" instead of "\n". Try this:

var lines = textBox2.SelectedText.Split(new [] {"\r\n"}, StringSplitOptions.None);
var sb = new StringBuilder();

for (int i = 0; i < lines.Length; i++)
{
    var line = lines[i].Trim();
    if ((line != string.Empty) && !Regex.IsMatch(line, @"^\s*;(.*)$"))
    {
        if (Regex.IsMatch(line, @"^(.*);(.*)$"))
            sb.AppendLine(line.Substring(0, line.IndexOf(';')).Trim());
        else
            sb.AppendLine(line);
    }
}
textBox2.SelectedText = sb.ToString();

Or with LINQ and the "?:" expression:

var lines = textBox2.SelectedText.Split(new [] {"\r\n"}, StringSplitOptions.None);
var sb = new StringBuilder();

foreach (var line in lines.Select(t => t.Trim())
                          .Where(line => (line != string.Empty) && !Regex.IsMatch(line, @"^\s*;(.*)$")))
{
    sb.AppendLine(Regex.IsMatch(line, @"^(.*);(.*)$") ? line.Substring(0, line.IndexOf(';')).Trim() : line);
}
textBox2.SelectedText = sb.ToString();

How can I strip in-line comments from a text reader

If you are comfortable with regex

string pattern="(?s)/[*].*?[*]/";
var output=Regex.Replace(File.ReadAllText(path),pattern,"");

. matches any character other than a newline.
(?s) turns on single-line mode, in which . also matches newlines.
.* matches zero or more characters, where * is a quantifier.
.*? matches lazily, i.e. it matches as few characters as possible.

NOTE

That won't work if a string literal (text within "") contains /*. You should use a parser instead!
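To see the problem from the note in action, here is a small sketch (the class name, method name, and sample line are hypothetical). The text inside the string literal looks like a block comment to the regex, so it gets deleted along with the real comment.

```csharp
using System;
using System.Text.RegularExpressions;

static class NaiveStripDemo
{
    // The pattern from the answer above; it knows nothing about string literals.
    public static string Strip(string source)
    {
        return Regex.Replace(source, "(?s)/[*].*?[*]/", "");
    }

    static void Main()
    {
        string source = "var s = \"/* not a comment */\"; /* real */";
        // The contents of the string literal are removed too,
        // leaving an empty literal behind.
        Console.WriteLine(Strip(source));
    }
}
```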

Regular expression to find and remove comments in CSS

If you're running the match in C#, have you tried RegexOptions?

Match m = Regex.Match(word, pattern, RegexOptions.Multiline);

"Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string."
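A small sketch of what that quote means in practice (names and input are mine): with no options, ^\w+$ has to span the whole string and finds nothing in a three-line input, while with RegexOptions.Multiline it matches each of the three lines. The sample uses \n line endings, since $ does not match before a \r.

```csharp
using System;
using System.Text.RegularExpressions;

static class MultilineDemo
{
    // Counts matches of a "whole line is one word" pattern under the given options.
    public static int CountLines(string text, RegexOptions options)
    {
        return Regex.Matches(text, @"^\w+$", options).Count;
    }

    static void Main()
    {
        string text = "one\ntwo\nthree";
        Console.WriteLine(CountLines(text, RegexOptions.None));
        Console.WriteLine(CountLines(text, RegexOptions.Multiline));
    }
}
```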

Also see Strip out C Style Multi-line Comments

EDIT:

OK..looks like an issue w/ the regex. Here is a working example using the regex pattern from http://ostermiller.org/findcomment.html. This guy does a good job deriving the regex, and demonstrating the pitfalls and deficiencies of various approaches. Note: RegexOptions.Multiline/RegexOptions.Singleline does not appear to affect the result.

string input = @"this is some stuff right here
    /* blah blah blah 
    blah blah blah 
    blah blah blah */ and this is more stuff /* blah */
    right here.";

string pattern = @"(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)";
string output = Regex.Replace(input, pattern, string.Empty, RegexOptions.Singleline);

delete multi line comments line c# // or /.../

Alright, this Regex (^\/\/.*?$)|(\/\*.*?\*\/) (Rubular proof) will match (and potentially remove if you use Visual Studio and replace it with nothing) the following lines out of your example text:

//#define                0x00180000

// #define                0x20000000

// abcd

/*#define                0x00080000

   #define               0x40000000*/

/* defg  */

and almost gets you what you want. Now, I'm suspicious of this line /\*#define 0x00000000*/, but if you wanted to capture it as well you could modify the Regex to be (^\/\/.*?$)|(\/.*?\*.*?\*\/) (Rubular proof).
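If you would rather run the first regex from C# than from the Visual Studio replace dialog, something like the sketch below should do it. The option choice is my assumption about what is needed in .NET: Multiline so that ^ and $ work per line, and Singleline so that . can cross line breaks inside a block comment. The class name and sample input are made up, and the input uses \n line endings.

```csharp
using System;
using System.Text.RegularExpressions;

static class DefineCommentDemo
{
    // The first regex from the answer above, unchanged.
    const string Pattern = @"(^\/\/.*?$)|(\/\*.*?\*\/)";

    public static string Strip(string input)
    {
        return Regex.Replace(input, Pattern, string.Empty,
            RegexOptions.Multiline | RegexOptions.Singleline);
    }

    static void Main()
    {
        string input = "// abcd\n#define A 1\n/* defg\n   more */\n#define B 2\n";
        Console.Write(Strip(input));
    }
}
```

The comments are removed but their line breaks stay, so the output keeps empty lines where the comments used to be.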
