Regex Match Keywords That Are Not in Quotes


Here is one answer:

(?<=^([^"]|"[^"]*")*)text

This means:

(?<=          # preceded by...
  ^           # the start of the string, then
  ([^"]       # either not a quote character
  |"[^"]*"    # or a full string
  )*          # as many times as you want
)
text          # then the text

You can easily extend this to handle strings containing escapes as well.
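For engines whose lookbehind must be fixed-width, such as Python's re module, the pattern above will not compile. A minimal sketch of a substitute technique (not the lookbehind itself): consume quoted strings first and capture the keyword only outside them. The names UNQUOTED_TEXT and find_unquoted are made up for this illustration, and the sketch assumes quotes are balanced.

```python
import re

# "[^"]*" consumes a whole quoted string; (text) captures the keyword
# only when it was not swallowed by that first alternative.
UNQUOTED_TEXT = re.compile(r'"[^"]*"|(text)')

def find_unquoted(s):
    """Return start offsets of 'text' occurrences outside double quotes."""
    return [m.start(1) for m in UNQUOTED_TEXT.finditer(s) if m.group(1) is not None]

sample = 'some text "this text is inside a string" more text'
print(find_unquoted(sample))  # offsets of the first and last "text" only
```

Unlike the lookbehind version, this sketch would still report a keyword that follows an unclosed quote.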

In C# code:

Regex.Match("bla bla bla \"this text is inside a string\"",
"(?<=^([^\"]|\"[^\"]*\")*)text", RegexOptions.ExplicitCapture);

Added from comment discussion - extended version (match on a per-line basis and handle escapes). Use RegexOptions.Multiline for this:

(?<=^([^"\r\n]|"([^"\\\r\n]|\\.)*")*)text

In a C# string this looks like:

"(?<=^([^\"\r\n]|\"([^\"\\\\\r\n]|\\\\.)*\")*)text"

Since you now want to use ** instead of ", here is a version for that:

(?<=^([^*\r\n]|\*(?!\*)|\*\*([^*\\\r\n]|\\.|\*(?!\*))*\*\*)*)text

Explanation:

(?<=              # preceded by
  ^               # start of line
  (               # either
    [^*\r\n]|     # not a star or line break
    \*(?!\*)|     # or a single star (star not followed by another star)
    \*\*          # or 2 stars, followed by...
    ([^*\\\r\n]   # either: not a star or a backslash or a linebreak
    |\\.          # or an escaped char
    |\*(?!\*)     # or a single star
    )*            # as many times as you want
    \*\*          # ended with 2 stars
  )*              # as many times as you want
)
text              # then the text

Since this version doesn't contain " characters, it's cleaner to use a verbatim string literal:

@"(?<=^([^*\r\n]|\*(?!\*)|\*\*([^*\\\r\n]|\\.|\*(?!\*))*\*\*)*)text"

Regex match all words except those between quotes

You can match strings between double quotes and then match and capture words optionally followed with dot separated words:

list(filter(None, re.findall(r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)', text, re.ASCII | re.I)))

See the regex demo. Details:

  • "[^"]*" - a " char, zero or more chars other than " and then a " char
  • | - or
  • ([a-z_]\w*(?:\.[a-z_]\w*)*) - Group 1: a letter or underscore followed with zero or more word chars and then zero or more sequences of a . and then a letter or underscore followed with zero or more word chars.

See the Python demo:

import re
text = 'results[0].items[0].packages[0].settings["compiler.version"] '
print(list(filter(None, re.findall(r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)', text, re.ASCII | re.I))))
# => ['results', 'items', 'packages', 'settings']

The re.ASCII option is used to make \w match [a-zA-Z0-9_] without accounting for Unicode chars.

Regex match word in a text but not in quotes or comments

You need to match the contexts you need to discard, and then match and capture those occurrences of your pattern that you need to modify:

/(?<!\\(?:\\{2})*)"[^"\\]*(?:\\[\s\S][^\\"]*)*"|\(\*[\s\S]*?\*\)|\b(true|false|exit|continue|return|constant|retain|public|private|protected|abstract|persistent|internal|final|of|else|elsif|then|__try|__catch|__finally|__endtry|do|to|by|task|with|using|uses|from|until|or|or_else|and|and_then|not|xor|nor|ge|le|eq|ne|gt|lt|__new|__delete|extends|implements|this|super|AT|BOOL|BYTE|(?:D|L)?WORD|U?(?:S|D|L)?INT|L?REAL|TIME(?:_OF_DAY)?|TOD|DT|DATE(?:_AND_TIME)?|STRING|ARRAY|ANY)\b/gi

See this regex demo.

I changed the first (?: in your pattern to ( so that your expected match is captured into Group 1, and added (?<!\\(?:\\{2})*)"[^"\\]*(?:\\[\s\S][^\\"]*)*"|\(\*[\s\S]*?\*\)| at the start of the pattern:

  • (?<!\\(?:\\{2})*)"[^"\\]*(?:\\[\s\S][^\\"]*)*" - a location not preceded with a backslash optionally followed with any even amount of backslashes and then a double quoted string with escape sequence support
  • | - or
  • \(\*[\s\S]*?\*\) - (*, then any 0+ chars, as few as possible and then *).

See JavaScript demo:

const keywords = [
'true', 'false', 'exit', 'continue', 'return', 'constant', 'retain',
'public', 'private', 'protected', 'abstract','persistent','internal',
'final','of','else','elsif','then','__try','__catch','__finally',
'__endtry','do','to','by','task','with','using','uses','from',
'until','or','or_else','and','and_then','not','xor','nor','ge',
'le','eq','ne','gt','lt','__new','__delete', 'extends','implements',
'this','super'
];
const regEx = new RegExp(String.raw`(?<!\\(?:\\{2})*)"[^"\\]*(?:\\.[^\\"]*)*"|\(\*.*?\*\)|\b(${keywords.join('|')}|AT|BOOL|BYTE|(?:D|L)?WORD|U?(?:S|D|L)?INT|L?REAL|TIME(?:_OF_DAY)?|TOD|DT|DATE(?:_AND_TIME)?|STRING|ARRAY|ANY)\b`, "igs");
let text = "TYPE MyStruct : STRUCT\n this.var1 : POINTER TO INT; (* Указатель 1 *)\n var2 : POINTER TO INT; (* this is Указатель 2 *)\n sStr: STRING(200) := \"This \n Test this line\"; \n sStr: STRING(200) := \"Test this line\"; \n sStr: STRING(200) := 'Test this line'; \n END_STRUCT\nEND_TYPE\n\nTHIS.MyStruct := 100;";
text = text.replace(regEx, (match,group) => {
return group != undefined ? match.toUpperCase() : match;
});
console.log(text);
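
The same skip-and-capture idea can be sketched in Python. This is an illustration under assumptions, not a port of the demo: the keyword list is a shortened, hypothetical subset, and the leading (?<!\\(?:\\{2})*) guard is dropped because Python's re module does not accept variable-width lookbehind.

```python
import re

# Hypothetical, shortened keyword list used only for this illustration.
KEYWORDS = ["this", "to", "of", "int", "string"]

PATTERN = re.compile(
    r'"[^"\\]*(?:\\.[^"\\]*)*"'              # double-quoted string with escapes (skipped)
    r'|\(\*.*?\*\)'                          # (* ... *) comment (skipped)
    r'|\b(' + "|".join(KEYWORDS) + r')\b',   # Group 1: a keyword outside both
    re.IGNORECASE | re.DOTALL,
)

def _upper_if_keyword(match):
    # Group 1 is set only when the keyword alternative matched.
    if match.group(1) is not None:
        return match.group(0).upper()
    return match.group(0)

def upper_keywords(source):
    """Uppercase keywords that are outside strings and comments."""
    return PATTERN.sub(_upper_if_keyword, source)

print(upper_keywords('this.x : pointer to int; (* this is a comment *) s := "to this";'))
```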

Regex to match all instances not inside quotes

Actually, you can match all instances of a regex not inside quotes for any string in which every opening quote is closed again. Say, as in your example above, you want to match \+.

The key observation here is that a match is outside quotes if an even number of quotes follows it. This can be modeled as a look-ahead assertion:

\+(?=([^"]*"[^"]*")*[^"]*$)
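
As a quick check of the even-number-of-quotes idea, here is a small Python sketch. The names PLUS_OUTSIDE_QUOTES and plus_positions are made up for this example, and the input is assumed to have balanced quotes.

```python
import re

# "+" followed by an even number of double quotes up to the end of the string.
PLUS_OUTSIDE_QUOTES = re.compile(r'\+(?=([^"]*"[^"]*")*[^"]*$)')

def plus_positions(s):
    """Return the offsets of every "+" that lies outside double quotes."""
    return [m.start() for m in PLUS_OUTSIDE_QUOTES.finditer(s)]

print(plus_positions('a + b "c + d" e + f'))  # the "+" inside "c + d" is skipped
```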

Now, you'd like to not count escaped quotes. This gets a little more complicated. Instead of [^"]*, which advances to the next quote, you need to consider backslashes as well and use [^"\\]*. After you arrive at either a backslash or a quote, you need to skip the next character if you stopped at a backslash, or else advance to the next unescaped quote. That looks like (\\.|"([^"\\]*\\.)*[^"\\]*"). Combined, you arrive at

\+(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)

I admit it is a little cryptic. =)
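
To make the escape-aware version less cryptic, here is a hedged Python sketch that runs it on a string containing an escaped quote. The names PLUS_ESCAPE_AWARE and escape_aware_positions are illustrative only.

```python
import re

# Same look-ahead idea, but backslash-escaped characters (including \")
# are skipped, so they do not count as quote delimiters.
PLUS_ESCAPE_AWARE = re.compile(
    r'\+(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)'
)

def escape_aware_positions(s):
    """Return offsets of "+" outside double-quoted, escape-aware strings."""
    return [m.start() for m in PLUS_ESCAPE_AWARE.finditer(s)]

# Raw string: the quoted part is "x \" + y", with an escaped quote inside.
print(escape_aware_positions(r'a + "x \" + y" + z'))
```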

Regex to detect words that are not quoted

You can use

import re
a = '''big mouse eats cheese? "non-detected string" 'non-detected string too' hello guys'''
print( [x for x in re.findall(r'''"[^"]*"|'[^']*'|\b([^\d\W]+)\b''', a) if x])
# => ['big', 'mouse', 'eats', 'cheese', 'hello', 'guys']

See the Python demo. The list comprehension is used to post-process the output to remove empty items that result from matching the quoted substrings.

This approach works because re.findall returns only the captured substrings when a capturing group is defined in the regex. The "[^"]*"|'[^']*' part matches but does not capture strings in double or single quotes, and the \b([^\d\W]+)\b part matches and captures into Group 1 one or more letters or underscores between word boundaries.

Regex matching closing bracket not in quotes

You can use an expression like this:

(?<! \$ )                     # not preceded by $
\$ (?: \$\$ )? # $ or $$$
\( # opening (

(?> # non-backtracking atomic group
(?> # non-backtracking atomic group
[^"'()]+ # literals, spaces, etc
| " (?: [^"\\]+ | \\. )* " # double quoted string with escapes
| ' (?: [^'\\]+ | \\. )* ' # single quoted string with escapes
| (?<open> \( ) # open += 1
| (?<close-open> \) ) # open -= 1, only if open > 0 (balancing group)
)*
)

(?(open) (?!) ) # fail if open > 0

\) # final )

Which can be quoted as above. For example in C#:

var regex = new Regex(@"(?x)    # enable eXtended mode (ignore spaces, comments)
(?<! \$ ) # not preceded by $
\$ (?: \$\$ )? # $ or $$$
\( # opening (

(?> # non-backtracking atomic group
(?> # non-backtracking atomic group
[^""'()]+ # literals, spaces, etc
| "" (?: [^""\\]+ | \\. )* "" # double quoted string with escapes
| ' (?: [^'\\]+ | \\. )* ' # single quoted string with escapes
| (?<open> \( ) # open += 1
| (?<close-open> \) ) # open -= 1, only if open > 0 (balancing group)
)*
)

(?(open) (?!) ) # fail if open > 0

\) # final )
");

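Balancing groups are specific to .NET. Where they are unavailable, for example in Python's re module, the same bookkeeping can be done by hand. The following is a rough sketch of that substitute technique, not a translation of the regex: find_closing_paren is a hypothetical helper that takes the index of the opening parenthesis and skips quoted strings with escapes.

```python
def find_closing_paren(s, open_idx):
    """Return the index of the ')' matching the '(' at open_idx, or -1."""
    depth = 0
    i = open_idx
    n = len(s)
    while i < n:
        ch = s[i]
        if ch in "\"'":
            # Skip a quoted string; a backslash escapes the next character.
            quote = ch
            i += 1
            while i < n and s[i] != quote:
                i += 2 if s[i] == "\\" else 1
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    return -1  # unbalanced parentheses or unterminated quote

sample = "$(foo(\")\") + '(' )"
print(find_closing_paren(sample, 1))
```
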
Regex to match all except a string in quotes in C#

Try the following RegEx (Edit: fixed).

(?:[^\"]|(?:(?:.*?\"){2})*?)(?: |^)(?<kw>for|while|if)[ (]

Note: Because this RegEx is written with \" escapes, it must be a regular string literal; don't put the @ sign before it (a verbatim string would need "" instead). Remember that if you add any RegEx special chars to the string, you'll need to double-escape them appropriately (e.g. \\w for \w). Ensure that you also specify the Multiline option when matching with the RegEx, so the caret (^) is treated as the start of a line.

This hasn't been tested, but should do the job. Let me know if there are any problems. Also, depending on what more you want to do here, I might recommend using standard text-parsing (non-RegEx), as it will quickly become more readable depending on how much data you want to extract from the code. Hope that helps anyway.

Edit:
Here's some example code, which I've tested and am pretty confident that it works as intended.

var input = "while t < 10 loop\n s => 'this is if stmt'; for u in 8..12 loop \n}"; 
var pattern = "(?:[^\"]|(?:(?:.*?\"){2})*?)(?: |^)(?<kw>for|while|if)[ (]";
var matches = Regex.Matches(input, pattern);
var firstKeyword = matches[0].Groups["kw"].Value;
// The following line is a one-line solution for .NET 3.5/C# 3.0 to get an array of all found keywords.
var keywords = matches.Cast<Match>().Select(match => match.Groups["kw"].Value).ToArray();

Hopefully this should be your complete solution now...

RegEx: Grabbing values between quotation marks

I've been using the following with great success:

(["'])(?:(?=(\\?))\2.)*?\1

It supports nested quotes as well.

For those who want a deeper explanation of how this works, here's an explanation from user ephemient:

(["']) matches a quote; (?:(?=(\\?))\2.) gobbles a backslash if one exists and, whether or not that happens, matches a character; *? repeats this many times (non-greedily, so as not to eat the closing quote); \1 matches the same quote that was used for opening.
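
A short Python sketch of the pattern in use; the sample string and the names QUOTED and found are invented for illustration.

```python
import re

# Group 1 remembers the opening quote; the look-ahead captures an optional
# backslash so that an escaped quote is consumed as part of the value.
QUOTED = re.compile(r'''(["'])(?:(?=(\\?))\2.)*?\1''')

sample = r'''a "b \" c" d 'e' f'''
found = [m.group(0) for m in QUOTED.finditer(sample)]
print(found)
```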


