What Literal Characters Should Be Escaped in a Regex

What special characters must be escaped in regular expressions?

Which characters you must escape, and which you must not, depends on the regex flavor you're working with.

For PCRE, and most other so-called Perl-compatible flavors, escape these outside character classes:

.^$*+?()[{\|

and these inside character classes:

^-]\
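To make the two lists above concrete, here is a minimal sketch in Python. Python's re module is Perl-style but is not PCRE itself, so treat this as an approximation. The helper names escape_outside_class and escape_inside_class are made up for this example.

```python
import re

# Metacharacters from the two lists above (assumed to carry over to Python's re).
OUTSIDE_CLASS = set(".^$*+?()[{\\|")
INSIDE_CLASS = set("^-]\\")


def escape_outside_class(text):
    """Backslash-escape characters that are special outside a character class."""
    return "".join("\\" + ch if ch in OUTSIDE_CLASS else ch for ch in text)


def escape_inside_class(chars):
    """Backslash-escape characters that are special inside a character class."""
    return "".join("\\" + ch if ch in INSIDE_CLASS else ch for ch in chars)


literal = "1+1=2 (really?)"
pattern = escape_outside_class(literal)
print(pattern)
print(re.fullmatch(pattern, literal) is not None)

# Build a class that matches exactly the four class metacharacters.
char_class = "[" + escape_inside_class("^-]\\") + "]"
print(char_class)
print(all(re.fullmatch(char_class, ch) for ch in "^-]\\"))
```

In real code a library function such as re.escape is usually the safer choice; the hand-written helpers are only here to show which characters are involved.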

For POSIX extended regexes (ERE), escape these outside character classes (same as PCRE):

.^$*+?()[{\|

Escaping any other characters is an error with POSIX ERE.

Inside character classes, the backslash is a literal character in POSIX regular expressions. You cannot use it to escape anything. You have to use "clever placement" if you want to include character class metacharacters as literals. Put the ^ anywhere except at the start, the ] at the start, and the - at the start or the end of the character class to match these literally, e.g.:

[]^-]
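The placement rules can be tried out quickly. The sketch below uses Python's re module, which is not a POSIX engine but accepts the same placement for this particular class, so it only approximates what a POSIX tool would do.

```python
import re

# "]" comes first, "^" is not first, "-" comes last: all three are literals here.
bracket = re.compile(r"[]^-]")

for ch in "]^-":
    print(ch, bool(bracket.fullmatch(ch)))

# Any other character is not part of the class.
print(bool(bracket.fullmatch("a")))
```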

In POSIX basic regular expressions (BRE), these are metacharacters that you need to escape to suppress their meaning:

.^$*[\

Escaping parentheses and curly brackets in BREs gives them the special meaning their unescaped versions have in EREs. Some implementations (e.g. GNU) also give special meaning to other characters when escaped, such as \? and \+. Escaping a character other than .^$*(){} is normally an error with BREs.

Inside character classes, BREs follow the same rule as EREs.

If all this makes your head spin, grab a copy of RegexBuddy. On the Create tab, click Insert Token, and then Literal. RegexBuddy will add escapes as needed.

What literal characters should be escaped in a regex?

In many regex implementations, the following rules apply:

Meta characters inside a character class are:

  • ^ (negation)
  • - (range)
  • ] (end of the class)
  • \ (escape char)

So these should all be escaped. There are some corner cases though:

  • - needs no escaping if placed at the very start or end of the class ([abc-] or [-abc]). In quite a few regex implementations, it also needs no escaping when placed directly after a range ([a-c-abc]) or a shorthand character class ([\w-abc]). This is what you observed.
  • ^ needs no escaping when it's not at the start of the class: [^a] means any char except a, and [a^] matches either a or ^, which equals: [\^a]
  • ] needs no escaping in many implementations if it is the first character in the class: []] matches the char ]
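Here is a small sketch of some of these corner cases, using Python's re module as one example implementation. Other flavors may behave differently, and the cases dictionary is just a name chosen for this illustration.

```python
import re

# Each pattern is paired with a character it should match literally.
cases = {
    r"[abc-]": "-",   # hyphen at the end
    r"[-abc]": "-",   # hyphen at the start
    r"[a^]": "^",     # caret not at the start
    r"[\^a]": "^",    # escaped caret at the start
    r"[]]": "]",      # closing bracket right after the opening one
}

for pattern, char in cases.items():
    print(pattern, bool(re.fullmatch(pattern, char)))

# A caret at the start negates the class instead of matching literally.
print(bool(re.fullmatch(r"[^a]", "^")))  # "^" is accepted only because it is not "a"
print(bool(re.fullmatch(r"[^a]", "a")))
```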

List of all characters that should be escaped before being put into a regex?

Take a look at PHP.JS's implementation of PHP's preg_quote function; that should do what you need:

http://phpjs.org/functions/preg_quote:491

The special regular expression characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -
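For illustration only, the same idea can be sketched in Python. This is not a port of PHP's or PHP.JS's actual code: the function name preg_quote_sketch and the lookup table are assumptions built from the character list above.

```python
import re

# The characters listed above. This table is an assumption for illustration,
# not a copy of the PHP or PHP.JS implementation.
SPECIAL_CHARS = ".\\+*?[^]$(){}=!<>|:-"
ESCAPE_TABLE = {ord(ch): "\\" + ch for ch in SPECIAL_CHARS}


def preg_quote_sketch(text):
    """Put a backslash in front of every listed special character."""
    return text.translate(ESCAPE_TABLE)


sample = "price: $5.00 (was <$7>)"
quoted = preg_quote_sketch(sample)
print(quoted)
# The quoted pattern should match the original text and nothing looser.
print(re.fullmatch(quoted, sample) is not None)
```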

What characters need to be escaped in .NET Regex?

I don't know the complete set of characters - but I wouldn't rely on the knowledge anyway, and I wouldn't put it into code. Instead, I would use Regex.Escape whenever I wanted some literal text that I wasn't sure about:

using System.Text.RegularExpressions;

// Don't actually do this to check containment... it's just a little example.
public bool RegexContains(string haystack, string needle)
{
    // Regex.Escape makes every metacharacter in needle match literally.
    Regex regex = new Regex("^.*" + Regex.Escape(needle) + ".*$");
    return regex.IsMatch(haystack);
}
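The same approach works in other languages that ship an escaping function. As a rough Python counterpart, re.escape plays the role of Regex.Escape; the function name regex_contains is made up here, and like the C# version it is only a demonstration, not a sensible way to test containment.

```python
import re


def regex_contains(haystack, needle):
    """Return True if needle occurs literally in haystack (illustration only)."""
    # re.escape backslash-escapes the characters in needle that re treats as special.
    return re.search(re.escape(needle), haystack) is not None


print(regex_contains("total: 3.50 USD", "3.50"))
print(regex_contains("total: 3x50 USD", "3.50"))

# Without escaping, the dot in the needle matches any character:
print(re.search("3.50", "total: 3x50 USD") is not None)
```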

When do I need to escape characters within a regex character set (within [])?

The only thing that needs to be escaped in brackets is a closing bracket, and a minus if it is not initial or final, and a hat if it is initial, AFAIK. And the backslash itself, obviously.

The reason is, these are the only characters with a special significance inside the brackets. A closing bracket ends the brackets, a mid-string minus indicates a range, and an initial hat negates the bracket class. Everything else should be literally interpreted. The backslash is the escape character, so you need a double backslash to match a literal backslash.

Does a dot have to be escaped in a character class (square brackets) of a regular expression?

In a character class (square brackets) any character except ^, -, ] or \ is a literal.
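A quick way to convince yourself about the dot, sketched with Python's re module (other flavors should behave the same way for this case, but check yours):

```python
import re

# Inside brackets the dot is an ordinary character.
print(bool(re.fullmatch(r"[.]", ".")))
print(bool(re.fullmatch(r"[.]", "x")))

# Outside brackets an unescaped dot matches any character except a newline.
print(bool(re.fullmatch(r".", "x")))
```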

This website is a brilliant reference and has lots of info on the nuances of different regex flavours.
http://www.regular-expressions.info/refcharclass.html

List of all special characters that need to be escaped in a regex

You can look at the javadoc of the Pattern class: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

You need to escape any char listed there if you want the regular char and not the special meaning.

As a possibly simpler solution, you can put the template between \Q and \E; everything between them is treated as escaped.

Which symbols should be escaped with a backslash in php regex?

There is a list of special Regex characters in the PHP documentation here: http://php.net/manual/en/function.preg-quote.php

The special regular expression characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -

Why do regexes and string literals use different escape sequences?

The escape sequences found in string literals are there to stop the programming language from getting confused. For example, in many languages a string literal is denoted as characters between quotes, like so:

my_string = 'x string'

But if your string contains a quote character, then you need a way to tell the programming language that it should be interpreted as a literal character:

my_string = 'x's string'  # broken: the second quote ends the string early
my_string = 'x\'s string' # lets the programming language know that the internal quote is literal and not the end of the string

I think that most programming languages have largely the same set of escape sequences for string literals.

Regexes are a different story. You can think of them as their own separate language that is written as a string literal. In a regex, some characters like the period (.) have a special meaning and must be escaped to match their literal counterpart. Other characters work the opposite way: they are literal on their own and take on a special meaning when preceded by a backslash.

For example

regex_string = 'A.C'  # match an A, followed by any character, followed by C
regex_string = 'A\.C' # match an A, followed by a period, followed by C
regex_string = 'AsC'  # match an A, followed by s, followed by C
regex_string = 'A\sC' # match an A, followed by a whitespace character, followed by C

Because regexes are their own mini-language, it would not make sense for all of the regex escape sequences to be available in normal string literals.
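The two layers of escaping can be seen side by side in a language that offers raw string literals. The sketch below uses Python as one such language; the variable names are arbitrary.

```python
import re

# In a normal string literal the backslash must itself be escaped ...
normal = "A\\.C"
# ... while a raw string passes the backslash through unchanged.
raw = r"A\.C"
print(normal == raw)

# Either way the regex engine receives the four characters A \ . C
print(len(raw))
print(bool(re.fullmatch(raw, "A.C")))
print(bool(re.fullmatch(raw, "AxC")))

# "\n" is resolved by the string literal; r"\n" is resolved by the regex engine.
print(len("\n"), len(r"\n"))
print(bool(re.fullmatch("\n", "\n")), bool(re.fullmatch(r"\n", "\n")))
```

The string literal handles its escapes first, and the regex engine then interprets whatever characters are left.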


