Regular Expression to Match All Comments in a T-SQL Script

This should work:

(--.*)|(((/\*)+?[\w\W]+?(\*/)+))
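A quick way to see what this pattern catches is to run it over a small script. The following is a minimal Python sketch; the sample SQL is made up for illustration, and other regex engines may behave slightly differently.

```python
import re

# The pattern from the answer above, unchanged.
COMMENT_RE = re.compile(r"(--.*)|(((/\*)+?[\w\W]+?(\*/)+))")

# Hypothetical sample script, invented for this illustration.
sample = (
    "SELECT a -- trailing comment\n"
    "FROM t /* block\n"
    "comment */ WHERE a = 1\n"
)

# Collect every comment the pattern finds.
comments = [m.group(0) for m in COMMENT_RE.finditer(sample)]

# Remove the comments from the script.
stripped = COMMENT_RE.sub("", sample)

# Limitation: a "--" inside a string literal is matched as well.
false_positive = COMMENT_RE.search("SELECT '--not a comment'")
```

Here `comments` holds the line comment and the two-line block comment, while `false_positive` is not `None`, which shows that the pattern does not know about string literals.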

Regex to find sql comments

With the global modifier, if your regex engine accepts it:

/\/\*.*?\*\/|--.*?\n/gs

The s modifier makes . match line breaks too, which is needed for matching multi-line comments.
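In Python the same pattern can be tried roughly as follows: `re.S` plays the role of the s modifier and `findall` the role of g. The sample text and the second pattern (`VARIANT_RE`) are assumptions added for illustration, not part of the original answer; the variant only shows one way to also catch a `--` comment on a last line that has no trailing newline.

```python
import re

# The pattern from the answer, with re.S standing in for the "s" modifier.
ORIGINAL_RE = re.compile(r"/\*.*?\*/|--.*?\n", re.S)

# Assumed variant: stop at the line break instead of requiring one.
VARIANT_RE = re.compile(r"/\*.*?\*/|--[^\n]*", re.S)

# Hypothetical sample; note that the last line has no trailing newline.
sample = "SELECT 1 -- one\n/* two\nlines */\nSELECT 2 -- last"

original_matches = ORIGINAL_RE.findall(sample)
variant_matches = VARIANT_RE.findall(sample)
```

With this sample the original pattern misses `-- last` because no newline follows it, while the variant picks it up.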

Regex to remove single-line SQL comments (--)

I will disappoint all of you: this can't be done with regular expressions. Sure, it's easy to find comments that are not in a string (even the OP could do that); the real problem is a -- inside a string. Lookarounds offer a little hope, but they are still not enough: knowing that a quote precedes the -- on the same line guarantees nothing. The only thing that tells you whether you are inside a string is the parity of the quotes before that point (an odd count means you are inside one), and that is something you can't determine with a regular expression. So simply go with a non-regular-expression approach.

EDIT:
Here's the C# code:

using System;

String sql = "--this is a test\r\nselect stuff where substaff like '--this comment should stay' --this should be removed\r\n";
char[] quotes = { '\'', '"' };
int newCommentLiteral, lastCommentLiteral = 0;
while ((newCommentLiteral = sql.IndexOf("--", lastCommentLiteral)) != -1)
{
    // count the quotes between the previous position and this "--"
    int countQuotes = sql.Substring(lastCommentLiteral, newCommentLiteral - lastCommentLiteral).Split(quotes).Length - 1;
    if (countQuotes % 2 == 0) //this is a comment, since there's an even number of quotes preceding
    {
        // search for the line ending from the comment start, not from the start of the string
        int eol = sql.IndexOf("\r\n", newCommentLiteral);
        if (eol == -1)
            eol = sql.Length; //no more newline, meaning end of the string
        else
            eol += 2; //remove the line ending together with the comment
        sql = sql.Remove(newCommentLiteral, eol - newCommentLiteral);
        lastCommentLiteral = newCommentLiteral;
    }
    else //this is within a string, find the end of the string and move past it
    {
        int singleQuote = sql.IndexOf("'", newCommentLiteral);
        if (singleQuote == -1)
            singleQuote = sql.Length;
        int doubleQuote = sql.IndexOf('"', newCommentLiteral);
        if (doubleQuote == -1)
            doubleQuote = sql.Length;

        // clamp to the string length so the next IndexOf call cannot go out of range
        lastCommentLiteral = Math.Min(Math.Min(singleQuote, doubleQuote) + 1, sql.Length);

        //instead of finding the end of the string you could simply do += 2 but the program will become slightly slower
    }
}

Console.WriteLine(sql);

What this does: it finds every comment literal (--). For each one, it checks whether it is inside a string by counting the quotes between the current match and the previous position. If that count is even, it is a comment, so it is removed (find the next end of line and remove what's in between). If it is odd, the -- is inside a string, so find the end of the string and move past it. This snippet is based on a weird SQL trick: 'this" is a valid string, even though the two quotes differ. If that's not true for your SQL dialect, you should try a completely different approach. I'll write a program for that too if that's the case, but this one's faster and more straightforward.
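For dialects where a string has to be closed by the same quote character that opened it, one possible "different approach" is a small character-by-character scanner. This is a sketch of that alternative, not the method of the C# snippet above. The function name `strip_line_comments` is hypothetical, it only handles `--` comments, and it keeps the line break that follows a removed comment.

```python
def strip_line_comments(sql: str) -> str:
    """Remove '--' comments that are not inside a quoted string."""
    out = []
    i = 0
    n = len(sql)
    quote = None  # the quote character we are currently inside, if any
    while i < n:
        ch = sql[i]
        if quote is not None:
            # Inside a string: copy everything until the matching quote.
            out.append(ch)
            if ch == quote:
                quote = None  # a doubled quote ('') simply closes and reopens
            i += 1
        elif ch in ("'", '"'):
            # A string starts here.
            quote = ch
            out.append(ch)
            i += 1
        elif sql.startswith("--", i):
            # A real comment: skip to the end of the line, keep the line break.
            while i < n and sql[i] not in "\r\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# The test string from the C# snippet above.
sql = ("--this is a test\r\n"
       "select stuff where substaff like '--this comment should stay' "
       "--this should be removed\r\n")
cleaned = strip_line_comments(sql)
```

Because the scanner tracks which quote opened the string, it does not depend on mixed quotes such as 'this" being accepted.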

Regular expression to select a particular content, provided it is not enclosed in comments

I don't know if you can do what you want with a single regular expression, especially since Oracle's implementation of regular expressions does not support lookaround. But there are some things you can do with SQL to get around these limitations. The following will extract the matches for the pattern, first by removing comments from the text, then by matching the pattern src=".*\.js" in what remains. Multiple results are retrieved using CONNECT BY:

SELECT html_id, REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') AS match
FROM (
    SELECT html_id, REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
    FROM (
        SELECT 1 AS html_id, '<!------<script type="text/javascript" src="js/Shop.js"></script> -->
<!----<script type="text/javascript" src="js/Shop.js"></script> -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
-- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>' AS html_text
        FROM dual
    )
)
CONNECT BY REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') IS NOT NULL
    AND PRIOR html_id = html_id
    AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;

If the HTML is stored in a table somewhere, then you would do the following:

SELECT html_id, REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') AS match
FROM (
    SELECT html_id, REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
    FROM mytable
)
CONNECT BY REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') IS NOT NULL
    AND PRIOR html_id = html_id
    AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;

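It seems strange, but the final two lines are necessary to avoid duplicate results: PRIOR html_id = html_id keeps the generated levels tied to their own row, and PRIOR DBMS_RANDOM.VALUE IS NOT NULL is always true but keeps Oracle from reporting a CONNECT BY loop.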

Results as follows:

+---------+----------------------------------+
| HTML_ID | MATCH                            |
+---------+----------------------------------+
| 1       | src="jquery.serialize-object.js" |
| 1       | src="jquery.serialize-object.js" |
| 1       | src="jquery.serialize-object.js" |
| 1       | src="jquery.serialize-object.js" |
| 1       | src="jquery.cookie.js"           |
+---------+----------------------------------+

Hope this helps.

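EDIT: Edited according to my comment: the pattern now uses [^"]* instead of .* so that a match cannot run past the closing quote of the src attribute: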

SELECT html_id, REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i') AS match
FROM (
    SELECT html_id, REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
    FROM (
        SELECT 1 AS html_id, '<!------<script type="text/javascript" src="js/Shop.js"></script> -->
<!----<script type="text/javascript" src="js/Shop.js"></script> -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
-- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>' AS html_text
        FROM dual
    )
)
CONNECT BY REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i') IS NOT NULL
    AND PRIOR html_id = html_id
    AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;
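The same two-step idea (strip the comments first, then match in what remains) can be tried outside Oracle. Below is a minimal Python sketch using a shortened version of the sample HTML; `re.S` stands in for Oracle's 'n' flag and `re.I` for 'i'. The `two_tags` line is a made-up example added to show why the pattern with `[^"]*` is safer than the one with `.*`.

```python
import re

# Shortened version of the sample HTML used in the queries above.
html_text = (
    '<!------<script type="text/javascript" src="js/Shop.js"></script> -->\n'
    '<script type="text/javascript" src="jquery.serialize-object.js"></script>'
    '<!-- a comment starting but not ending\n'
    '-- afterwards -->\n'
    '<script type="text/javascript" src="jquery.cookie.js"></script>'
)

# Step 1: remove comments; re.S lets "." cross line breaks.
clean_html = re.sub(r"<!--.*?-->", "", html_text, flags=re.S)

# Step 2: match the src attributes in what remains.
matches = re.findall(r'src="[^"]*\.js"', clean_html, flags=re.I)

# Hypothetical line with two script tags, to compare the two patterns.
two_tags = '<script src="a.js"></script><script src="b.js"></script>'
greedy = re.findall(r'src=".*\.js"', two_tags)
tight = re.findall(r'src="[^"]*\.js"', two_tags)
```

On `two_tags` the greedy pattern returns a single match that runs from the first `src` to the last `.js"`, while the tightened pattern returns both attributes separately.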

EDIT: If you're searching a CLOB rather than a CHAR column, the first line of the CONNECT BY clause should look like the following. REGEXP_SUBSTR() returns a CLOB when the source column is a CLOB, and the IS NOT NULL comparison on a CLOB just takes forever:

CONNECT BY DBMS_LOB.SUBSTR(REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i'), 4000, 1) IS NOT NULL

Hope this helps.

MSSQL Regular expression

This is what I have used in the end:

SELECT *,
       CASE WHEN [url] NOT LIKE '%[^-A-Za-z0-9/.+$]%'
            THEN 'Valid'
            ELSE 'No valid'
       END [Validate]
FROM *table*
ORDER BY [Validate]
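The LIKE pattern above is not a full regular expression; it only asks whether the value contains any character outside the allowed set. The same check can be sketched in Python with a negated character class. The function name `validate` is hypothetical, and SQL Server's character ranges depend on the column's collation, so edge cases may differ from this sketch.

```python
import re

# Any single character that is NOT in the allowed set.
DISALLOWED = re.compile(r"[^-A-Za-z0-9/.+$]")


def validate(url):
    """Mirror the CASE expression: 'Valid' when no disallowed character occurs."""
    if url is None:
        # In SQL a NULL makes the LIKE test unknown, so the ELSE branch applies.
        return "No valid"
    return "Valid" if DISALLOWED.search(url) is None else "No valid"


# Made-up sample values.
results = [validate(u) for u in ("example.com/path+x$", "example.com/a b", "http://x")]
```

The second value fails because of the space and the third because of the colon, neither of which is in the allowed set.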

Java Regex find/replace pattern in SQL comments

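I would do it like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// subjectString holds the SQL text to scan
try {
    // [^;] also matches line breaks, so a block comment may span several lines
    Pattern regex = Pattern.compile("(?:/\\*[^;]*?\\*/)|(?:--[^;]*?$)", Pattern.DOTALL | Pattern.MULTILINE);
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        // matched text: regexMatcher.group()
        // match start: regexMatcher.start()
        // match end: regexMatcher.end()
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}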

The above will give you all comments that do not contain a ';'. Then I would iterate line by line through the SQL file, and whenever I encountered a line with a comment I would check whether that comment is in my list of matches. If not, I would replace every ; with ' ' in the whole comment. Of course you have to find where the comment ends, but this is easy: a -- comment ends at the end of the same line, and a /* comment ends at the first */. This way you can change any number of ; with the same code.
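A shorter route to the same result is to match every comment and replace the semicolons inside each match with a callback. This is a swapped-in technique, not the two-pass procedure described above, sketched in Python with a made-up sample. It assumes `--` and `/* */` comments only and does not account for comment markers inside string literals.

```python
import re

# Block comments (non-greedy, may span lines) or line comments up to the line break.
COMMENT_RE = re.compile(r"/\*.*?\*/|--[^\n]*", re.S)


def blank_semicolons_in_comments(sql: str) -> str:
    """Replace every ';' that occurs inside a comment with a space."""
    return COMMENT_RE.sub(lambda m: m.group(0).replace(";", " "), sql)


# Hypothetical sample script.
sample = "SELECT 1; -- first; statement\n/* a; b;\nc */ SELECT 2;"
result = blank_semicolons_in_comments(sample)
```

Semicolons outside comments stay untouched, so the statements can still be split on ; afterwards.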

RegEx: Grabbing values between quotation marks

I've been using the following with great success:

(["'])(?:(?=(\\?))\2.)*?\1

It supports nested quotes as well: the other quote type and backslash-escaped quotes may appear inside the string.

For those who want a deeper explanation of how this works, here's an explanation from user ephemient:

(["']) matches a quote; (?:(?=(\\?))\2.) gobbles a backslash if one exists and, whether or not that happens, matches a character; *? repeats that as many times as needed (non-greedily, so as not to eat the closing quote); \1 matches the same quote that was used for opening.
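To see the pattern at work, here is a small Python sketch with a made-up input string. Python's `re` module accepts the pattern as written; `finditer` is used rather than `findall` because the pattern contains capturing groups.

```python
import re

# The pattern from the answer, unchanged.
QUOTED_RE = re.compile(r"""(["'])(?:(?=(\\?))\2.)*?\1""")

# Hypothetical input: one double-quoted value containing an apostrophe,
# and one single-quoted value containing backslash-escaped quotes.
text = r"""a = "it's", b = 'say \'hi\'' end"""

# Full matches, including the surrounding quotes.
quoted = [m.group(0) for m in QUOTED_RE.finditer(text)]

# The values between the quotation marks.
values = [q[1:-1] for q in quoted]
```

The first match keeps the apostrophe because the closing quote must be the same character as the opening one, and the second match runs past the escaped quotes to the real closing quote.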


