Remove C and C++ Comments Using Python


I don't know if you're familiar with sed, the UNIX-based (but Windows-available) text-processing program, but I've found a sed script here which will remove C/C++ comments from a file. It's very smart; for example, it will ignore '//' and '/*' if they are found inside a string literal. From within Python, it can be invoked with the following code:

import subprocess

# source_code is a string with the source code.
# sed needs -f to read its commands from a script file; if the script's
# shebang line lists extra options (such as -n), pass them here too.
process = subprocess.run(
    ['sed', '-f', '/path/to/remccoms3.sed'],
    input=source_code,
    capture_output=True,
    text=True,
)
return_code = process.returncode

stripped_code = process.stdout

In this program, source_code is the variable holding the C/C++ source code, and stripped_code ends up holding the same code with the comments removed. Of course, if you have the file on disk, you could instead pass open file handles as the stdin and stdout arguments (the input opened in read mode, the output in write mode). remccoms3.sed is the file from the above link, and it should be saved in a readable location on disk. sed is also available on Windows, and comes installed by default on most GNU/Linux distros and Mac OS X.

This will probably be better than a pure Python solution; no need to reinvent the wheel.

How can I delete multiline comments in a C file using Python?

I haven't looked too closely at the details, but you are using pattern3 for both the start and the end of comments. I guess it should be something like:

    if pattern3.search(line) and cmnt_e == True:
        cmnt_s = True
        cmnt_e = False
    if pattern4.search(line) and cmnt_s == True:
        cmnt_e = True
        cmnt_s = False
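
To make the flag idea concrete, here is a minimal sketch of a line-by-line block-comment remover with separate start and end patterns. The question's pattern3 and pattern4 are not shown, so comment_start, comment_end and strip_block_comments are hypothetical stand-ins, and a single in_comment flag replaces the cmnt_s/cmnt_e pair. It assumes lines without trailing newlines and does not treat string literals specially.

```python
import re

# Hypothetical stand-ins for the question's pattern3 / pattern4.
comment_start = re.compile(r"/\*")
comment_end = re.compile(r"\*/")

def strip_block_comments(lines):
    """Remove /* ... */ comments from a list of lines, keeping the code around them."""
    result = []
    in_comment = False  # True while we are between /* and */
    for line in lines:
        kept = []
        pos = 0
        while pos < len(line):
            if in_comment:
                m = comment_end.search(line, pos)
                if m is None:
                    pos = len(line)        # rest of the line is still comment
                else:
                    in_comment = False
                    pos = m.end()
            else:
                m = comment_start.search(line, pos)
                if m is None:
                    kept.append(line[pos:])
                    pos = len(line)
                else:
                    kept.append(line[pos:m.start()])
                    in_comment = True
                    pos = m.end()          # search for */ only after the /*
        result.append("".join(kept))
    return result

sample = ["int a; /* start", "still comment", "end */ int b;", "int c; /* x */ int d;"]
cleaned = strip_block_comments(sample)
print(cleaned)
```

Because the search for the end pattern starts after the opening /*, a sequence such as /*/ opens a comment without closing it, which matches how C compilers read it.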

I would also be wary of strange patterns like this one (yes, I sometimes use this for experimental stuff; you just flip the first slash to switch which block of code is active):

//*
ACTIVE_CODE
/*/
INACTIVE_CODE
//*/

Stripping off C/C++ comments from a source file using python

If you plan to keep using the current regexp, here is what you can do to also match //... comments:

Below this:

 /                ##  End of /* ... */ comment

Add this:

 |                  ## OR it is a line comment with //
  \s*//.*           ## Single line comment

See demo
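
As a rough sketch of how the extended pattern could look in Python: only the last line of the block-comment branch is quoted above, so the rest of that branch is an assumed reconstruction, and the name COMMENT is hypothetical. Any string-literal handling the original regexp may have is left out here.

```python
import re

# Assumed reconstruction: only the "/  ## End of /* ... */ comment" line
# is quoted in the answer; the lines before it are a guess.
COMMENT = re.compile(
    r"""
      /\*                    ##  Start of /* ... */ comment
      [^*]*\*+               ##  Anything up to a run of stars
      (?: [^/*][^*]*\*+ )*   ##  Repeat until a star run is followed by /
      /                      ##  End of /* ... */ comment
    |                        ## OR it is a line comment with //
      \s*//.*                ## Single line comment
    """,
    re.VERBOSE,
)

code = "int a; /* x */ int b; // tail\nint c;"
stripped = COMMENT.sub("", code)
print(stripped)
```

Note that the leading \s* also swallows whitespace (including blank lines) directly in front of a // comment, while the newline that ends the comment is kept.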

Python program to remove C type single and multiple line comments

This one:

commentEnd.match(line)

Should be:

commentEnd.search(line)

From the docs:

If you want to locate a match anywhere in string, use search() instead
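
A small illustration of the difference, assuming commentEnd is a compiled pattern for "*/" (the question's actual pattern is not shown, so comment_end and line below are made up):

```python
import re

comment_end = re.compile(r"\*/")  # assumed pattern for the end of a block comment
line = "    last line of the comment */ int x = 0;"

at_start = comment_end.match(line)   # only tries to match at position 0
anywhere = comment_end.search(line)  # scans the whole line for a match

print(at_start)              # None
print(anywhere is not None)  # True
```

Since "*/" rarely sits at the very start of a line, match() misses it and the comment never appears to end.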

Removing comments in C language

In order to ignore the escaped newlines, sequences of \ followed by a newline, you could use a function that handles this transparently.

Note also these issues:

ch must be defined as an int to handle EOF correctly.
the macros defined in <iso646.h> make the code less readable.
\ should be handled when parsing strings.
character constants should be parsed too: '//' is a valid character constant, not a comment.

// Function for output to console\
    ns2
/\
*\ This is a valid comment too :) 
*\
/

#define _CRT_SECURE_NO_WARNINGS 
#include <stdio.h>

int mygetc(FILE *in) {
    for (;;) {
        int c = getc(in);
        if (c == '\\') {
            c = getc(in);
            if (c == '\n')
                continue;
            if (c != EOF)
                ungetc(c, in);
            c = '\\';
        }
        return c;
    }
}

int skip_line_comment(FILE *in) {
    int c;
    while ((c = mygetc(in)) != '\n' && c != EOF)
        continue;
    return c;
}

int skip_block_comment(FILE *in) {
    int c;
    for (;;) {
        while ((c = mygetc(in)) != '*') {
            if (c == EOF)
                return c;
        }
        while ((c = mygetc(in)) == '*')
            continue;
        if (c == EOF)
            return c;
        if (c == '/')
            return ' ';
    }
}

int main() {
    FILE *in = fopen("inp.c", "r");
    FILE *out = fopen("out.c", "w");
    int ch;
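    while ((ch = mygetc(in)) != EOF) {
        if (ch == '/') {
            /* '/' starts a comment only when followed by '/' or '*' */
            ch = mygetc(in);
            if (ch == '/') {
                ch = skip_line_comment(in);
            } else if (ch == '*') {
                ch = skip_block_comment(in);
            } else {
                fputc('/', out);    /* ordinary '/', e.g. division */
            }
        }
        if (ch == '"' || ch == '\'') {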
            int sep = ch;
            fputc(ch, out);
            while ((ch = mygetc(in)) != sep && ch != EOF) {
                fputc(ch, out);
                if (ch == '\\') {
                    ch = mygetc(in);
                    if (ch == EOF)
                        break;
                    fputc(ch, out);
                }
            }
        }
        if (ch == EOF)
            break;
        fputc(ch, out);
    }
    fclose(in);
    fclose(out);
    return 0;
}
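
Since the surrounding questions ask for Python, here is a rough Python sketch of the same idea: splice escaped newlines first, then scan character by character, skipping comments and copying string and character literals verbatim. The name strip_c_comments is hypothetical, and like the C program it ignores trigraphs and C++11 raw string literals.

```python
def strip_c_comments(source):
    """Remove // and /* */ comments from C source held in a string."""
    # Splice escaped newlines up front (the job mygetc does in the C version).
    text = source.replace("\\\n", "")
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == "/" and i + 1 < n and text[i + 1] == "/":
            # Line comment: skip to the end of the line, keep the newline.
            j = text.find("\n", i)
            i = n if j == -1 else j
        elif ch == "/" and i + 1 < n and text[i + 1] == "*":
            # Block comment: replace it with a single space.
            j = text.find("*/", i + 2)
            if j == -1:
                break  # unterminated comment: drop the rest
            out.append(" ")
            i = j + 2
        elif ch in "\"'":
            # String or character literal: copy through the closing quote,
            # stepping over backslash escapes.
            j = i + 1
            while j < n and text[j] != ch:
                j += 2 if text[j] == "\\" else 1
            out.append(text[i:j + 1])
            i = j + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

src = (
    "int a = 10 / 2; // half\\\n"
    " still comment\n"
    'char *s = "/* not */ // no"; /* block\n'
    " comment */ char c = '\"';\n"
)
result = strip_c_comments(src)
print(result)
```

Splicing the continuations before scanning keeps the scanner simple, at the cost of also removing backslash-newline pairs from the code that is kept, which is what the C program does as well.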

Using regex to remove comments from source files

re.sub returns a new string rather than modifying its argument, so changing your code to assign the result, as follows, will give results:

import re

def removeComments(string):
    string = re.sub(re.compile(r"/\*.*?\*/", re.DOTALL), "", string)  # remove all block comments (/* COMMENT */) from string
    string = re.sub(re.compile(r"//.*?\n"), "", string)  # remove all single-line comments (// COMMENT\n) from string
    return string
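
A quick usage sketch of the same two substitutions, to show what they do and where they fall short. The function name remove_comments and the sample input are made up for illustration.

```python
import re

def remove_comments(text):
    # Same two substitutions as in the answer above.
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    text = re.sub(r"//.*?\n", "", text)
    return text

code = (
    "int x = 1; /* block */\n"
    "int y = 2; // line\n"
    'char *url = "http://example.com";\n'
)
result = remove_comments(code)
print(repr(result))
```

Two things are worth noticing in the output: the newline after a // comment is removed along with it, so the next line is joined onto the current one, and the // inside the string literal is treated as a comment, which truncates the string. If that matters for your input, a pattern or scanner that recognizes string literals is needed.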

Regex code for removing single and multi-line comments from C code

Try one of these. They both capture comments and non-comments.

This one does not preserve formatting and uses no modifiers.

In a find loop, store Group 1 (comments) in a new file and replace each match with Group 2 (non-comments) in the original file.

Adjust the regex line break as necessary, i.e. change \n to \r\n, etc.

   # (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)


   (                                # (1 start), Comments 
        /\*                              # Start /* .. */ comment
        [^*]* \*+
        (?: [^/*] [^*]* \*+ )*
        /                                # End /* .. */ comment
     |  
        //                               # Start // comment
        (?: [^\\] | \\ \n? )*?           # Possible line-continuation
        \n                               # End // comment
   )                                # (1 end)
|  
   (                                # (2 start), Non - comments 
        "
        (?: \\ [\S\s] | [^"\\] )*        # Double quoted text
        "
     |  '
        (?: \\ [\S\s] | [^'\\] )*        # Single quoted text
        ' 
     |  [\S\s]                           # Any other char
        [^/"'\\]*                        # Chars which don't start a comment, string, escape,
                                         # or line continuation (escape + newline)
   )                                # (2 end)
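
In Python, the find-and-replace loop described above could be sketched with re.sub and a callback. The compact regex is copied as-is from above; PATTERN, split_comments and the sample input are hypothetical names for this illustration.

```python
import re

# Compact form of the first regex, copied from above.
# Group 1 = comments, group 2 = everything else (strings, chars, other code).
PATTERN = re.compile(
    r'''(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)'''
)

def split_comments(source):
    """Return (source_without_comments, list_of_comments)."""
    comments = []

    def keep_non_comment(match):
        if match.group(1) is not None:
            comments.append(match.group(1))  # collect the comment
            return ""                        # and drop it from the code
        return match.group(2)                # keep non-comment text as-is

    return PATTERN.sub(keep_non_comment, source), comments

sample = 'int a = 1; /* c */ int b; // line\nchar *s = "// not a comment";\n'
stripped, found = split_comments(sample)
print(stripped)
print(found)
```

As the answer says, this version does not preserve formatting: the newline that ends a // comment is part of Group 1, so it is removed together with the comment.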

Last rework:

It does a much better job of preserving formatting. The formatting problem pertaining to newlines is addressed from the comment tail. While this fixes the problem of string concatenation, it does leave an occasional blank line where the comment was. For 98% of the comments, this won't be an issue. But it's time to leave this dead dog alone.

This one preserves formatting. It uses the regex modifier Multi-Line (be sure to set that).

Use it the same way as above.

This assumes your engine supports \h (horizontal whitespace). If not, let me know.

Adjust the regex line break as necessary, i.e. change \n to \r\n, etc.

   #  ((?:(?:^\h*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:\h*\n(?=\h*(?:\n|/\*|//)))?|//(?:[^\\]|\\\n?)*?(?:\n(?=\h*(?:\n|/\*|//))|(?=\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\\s]*)

   (                                # (1 start), Comments 
        (?:
             (?: ^ \h* )?                     # <- To preserve formatting
             (?:
                  /\*                              # Start /* .. */ comment
                  [^*]* \*+
                  (?: [^/*] [^*]* \*+ )*
                  /                                # End /* .. */ comment
                  (?:
                       \h* \n                                      
                       (?=                              # <- To preserve formatting 
                            \h*                              # <- To preserve formatting
                            (?: \n | /\* | // )              # <- To preserve formatting
                       )
                  )?                               # <- To preserve formatting
               |  
                  //                               # Start // comment
                  (?: [^\\] | \\ \n? )*?           # Possible line-continuation
                  (?:                              # End // comment
                       \n                               
                       (?=                              # <- To preserve formatting
                            \h*                              # <- To preserve formatting
                            (?: \n | /\* | // )              # <- To preserve formatting
                       )
                    |  (?= \n )
                  )
             )
        )+                               # Grab multiple comment blocks if need be
   )                                # (1 end)

|                                 ## OR

   (                                # (2 start), Non - comments 
        "
        (?: \\ [\S\s] | [^"\\] )*        # Double quoted text
        "
     |  '
        (?: \\ [\S\s] | [^'\\] )*        # Single quoted text
        ' 
     |  [\S\s]                           # Any other char
        [^/"'\\\s]*                      # Chars which don't start a comment, string, escape,
                                         # or line continuation (escape + newline)
   )                                # (2 end)

How to remove C-style comments from code

I've considered the comments (so far) and changed the regex to:

(?:\/\/(?:\\\n|[^\n])*\n)|(?:\/\*[\s\S]*?\*\/)|((?:R"([^(\\\s]{0,16})\([^)]*\)\2")|(?:@"[^"]*?")|(?:"(?:\?\?'|\\\\|\\"|\\\n|[^"])*?")|(?:'(?:\\\\|\\'|\\\n|[^'])*?'))

It handles the C++11 raw string literals that Biffen pointed out (as well as C# verbatim strings), and it has been changed according to Wiktor's suggestions.

I split the handling of single and double quotes because the logic differs (and to avoid the non-working back reference ;).

It's undoubtedly more complex, but still far from the solutions I've seen out there which hardly cover any of the string issues. And it could be stripped of parts not applicable to a specific language.

One comment suggested supporting more languages. That would make the RE (even more) complex and unmanageable. It should be relatively easy to adapt though.

Updated regex101 example.

Thanks everyone for the input so far. And keep the suggestions coming.

Regards

Edit: Update Raw String - this time I actually read the spec. ;)
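
For anyone wanting to try this from Python, here is a minimal sketch. The regex is copied unchanged from above; comments are matched without being captured and strings land in group 1, so each match is replaced with group 1 (or nothing). C_STYLE, strip_c_style and the sample input are hypothetical names for this illustration.

```python
import re

# Regex copied from the answer: comments are uncaptured, strings are group 1.
C_STYLE = re.compile(
    r'''(?:\/\/(?:\\\n|[^\n])*\n)|(?:\/\*[\s\S]*?\*\/)|((?:R"([^(\\\s]{0,16})\([^)]*\)\2")|(?:@"[^"]*?")|(?:"(?:\?\?'|\\\\|\\"|\\\n|[^"])*?")|(?:'(?:\\\\|\\'|\\\n|[^'])*?'))'''
)

def strip_c_style(source):
    # Keep matched strings (group 1); comments have no group 1 and vanish.
    return C_STYLE.sub(lambda m: m.group(1) or "", source)

sample = 'int a; // c\nchar *s = "/* no */"; /* yes */\n'
cleaned = strip_c_style(sample)
print(cleaned)
```

Matching the strings explicitly is what stops the comment alternatives from firing inside them; the price is that the newline ending a // comment is consumed together with the comment.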

Remove part of the string between the comments along with the comments

Here is a simple algorithm that keeps track of the previous character and uses a flag to decide whether or not to keep each character.

a = "word234 /*12aaa12*/ word123 /*xx*xx*/ end"

out = []
add = True
prev = None
for c in a:
    if c == '*' and prev == '/':
        if add:
            del out[-1]
        add = False
    if c == '/' and prev == '*':
        add = True
        prev = c
        continue
    prev = c
    if add:
        out.append(c)
s2 = ''.join(out)
print(s2)

Output:

word234  word123  end

If you want to handle nested comments (standard C and C++ block comments do not nest, but this is fun to do), the algorithm is easy to modify to use a counter that tracks the nesting depth instead of a flag:

a = "word234 /*12aaa12*/ word123 /*xx/*yy*/xx*/ end"

out = []
lvl = 0
prev = None
for c in a:
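    if c == '*' and prev == '/':
        if lvl == 0:
            del out[-1]  # drop the '/' that was already emitted
        lvl += 1         # entering a comment: one level deeper
    if c == '/' and prev == '*':
        lvl -= 1         # leaving a comment: one level up
        prev = c
        continue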
    prev = c
    if lvl == 0:
        out.append(c)
s2 = ''.join(out)
print(s2)
