How to Escape All Characters with Special Meaning in Regex

What does `escape a string` mean in Regex? (Javascript)

Many characters in regular expressions have special meanings. For instance, the dot character '.' means "any one character". There are a great deal of these specially-defined characters, and sometimes, you want to search for one, not use its special meaning.

See this example to search for any filename that contains a '.':

/^[^.]+\..+/

In the example, there are 3 dots, but our description says that we're only looking for one. Let's break it down by the dots:

  • Dot #1 is used inside a "character class" (the characters inside the square brackets), which tells the regex engine to search for "any one character" that is not a '.', and the "+" says to keep going until there are no more characters or the next character is the '.' that we're looking for.
  • Dot #2 is preceded by a backslash, which says that we're looking for a literal '.' in the string (without the backslash, it would be using its special meaning, which is looking for "any one character"). This dot is said to be "escaped", because it's special meaning is not being used in this context - the backslash immediately before it made that happen.
  • Dot #3 is simply looking for "any one character" again, and the '+' following it says to keep doing that until it runs out of characters.

So, the backslash is used to "escape" the character immediately following it; as such, it's called the "escape character". That just means that the character's special meaning is taken away in that one place.

Now, escaping a string (in regex terms) means finding all of the characters with special meaning and putting a backslash in front of them, including in front of other backslash characters. When you've done this one time on the string, you have officially "escaped the string".

What special characters must be escaped in regular expressions?

Which characters you must and which you mustn't escape indeed depends on the regex flavor you're working with.

For PCRE, and most other so-called Perl-compatible flavors, escape these outside character classes:

.^$*+?()[{\|

and these inside character classes:

^-]\

For POSIX extended regexes (ERE), escape these outside character classes (same as PCRE):

.^$*+?()[{\|

Escaping any other characters is an error with POSIX ERE.

Inside character classes, the backslash is a literal character in POSIX regular expressions. You cannot use it to escape anything. You have to use "clever placement" if you want to include character class metacharacters as literals. Put the ^ anywhere except at the start, the ] at the start, and the - at the start or the end of the character class to match these literally, e.g.:

[]^-]

In POSIX basic regular expressions (BRE), these are metacharacters that you need to escape to suppress their meaning:

.^$*[\

Escaping parentheses and curly brackets in BREs gives them the special meaning their unescaped versions have in EREs. Some implementations (e.g. GNU) also give special meaning to other characters when escaped, such as \? and +. Escaping a character other than .^$*(){} is normally an error with BREs.

Inside character classes, BREs follow the same rule as EREs.

If all this makes your head spin, grab a copy of RegexBuddy. On the Create tab, click Insert Token, and then Literal. RegexBuddy will add escapes as needed.

RegEx for escaping special characters

My doubt is [ \ ^ $ . | ? * + ( ) all these need to be escaped before passing new RegExp() or only (backslashes \) alone need to be escaped. Which one need to be escaped or not be escaped is not clear to me?

Your question is answered right at the start of the document section you refer to. Read that again:

If you need to use any of the special characters literally (actually searching for a '*', for instance), you must escape it …

Conversely, if you need any of the special characters to have its special meaning, you must not escape it.

Besides the above, any backslash which is to be placed in the string has to be doubled if assigned from a string literal.

Java Regular Expression special character escape

You only need to escape ^ when you want to match it literally, that is, you want to look for text containing the ^ character.

If you intend to use the ^ with its special meaning (the start of a line/string) then there is no need to escape it. Simply type

"^[a-zA-Z0-9!~`@#$%\\^]"

in your source code. The backslashes towards the end of this regular expression do not matter. You need to type 2 backslashes because of the special meaning of the backslash in Java but that has nothing to do with its treatment regular expressions. The regular expression engine receives a single backslash which it uses to read the following character as literal but ^ is a literal within brackets anyway.

To elaborate on your comment about [ and ]:

The brackets have a special meaning in regular expressions as they basically form the boundaries of the character list given by a pattern (the mentioned characters form a so called character class). Let's decompose the regular expression from above to make things clear.

^ Matches the start of the text
[ Opening boundary of your character class
a-z Lower case letters of A to Z
A-Z Upper case letters of A to Z
0-9 Numbers from 0 to 9
! Exclamation mark, literally
~ Tilde, literally
` Backtick, literally
@ The @ character, literally
# Hash, literally
$ Dollar, literally
% Percent sign, literally
\\ Backslash. Regular expression engine only receives single backslash as the other backslash is consumed by Java's syntax for Strings. Would be used to mark following character as literal but ^ is a literal in character class definitions anyway so theses backslashes are ignored.
^ Caret, literally
] Closing boundary of your character class

The order of patterns within the character class definition is irrelevant.
The expression above matches matches if the first character of the examined text is part of your character class definition. It depends on how you use the regular expression if the other characters in the examined text matter.

When you start with regular expressions you should always use multiple test texts to match a against and verify the behaviour. It is also advisable to make these test cases a unit test to get high confidence of the correct behaviour of your program.

A simple code sample to test the expression is as follows:

public class Test {
public static void main(String[] args) {
String regexp = "^[ a-zA-Z0-9!~`@#$%\\\\^\\[\\]]+$";
String[] testdata = new String[] {
"abc",
"2332",
"some@test",
"test [ and ] test end",
// Following sample will not match the pattern.
"äöüßµøł"
};
for (String toExamine : testdata) {
if (toExamine.matches(regexp)) {
System.out.println("Match: " + toExamine);
} else {
System.out.println("No match: " + toExamine);
}
}
}
}

Note the I use a modified pattern here. It ensures all characters in the examined string are matching your character class. I did extend the character class to allow for a \ and space and [ and ].
The decomposed description is:

^ Matches the start of the text
[ Opening boundary of your character class
a-z Lower case letters of A to Z
A-Z Upper case letters of A to Z
0-9 Numbers from 0 to 9
! Exclamation mark, literally
~ Tilde, literally
` Backtick, literally
@ The @ character, literally
# Hash, literally
$ Dollar, literally
% Percent sign, literally
\\\\ Backslash, literally. Regular expression engine only receives 2 backslashes as every other backslash is consumed by Java's syntax for Strings. The first backslash is seen as marking the second backslash a occurring literally in the string.
^ Caret, literally
\\[ Opening bracket, literally. The backslash makes the bracket loose its meaning as opening a character class definition.
\\] Closing bracket, literally. The backslash makes the bracket loose its meaning as closing a character class definition.
] Closing boundary of your character class
+ Means any number of characters matching your character class definition can occur, but at least 1 such character needs to be present for a match
$ Matches the start of the text

One thing I don't get though is why one would use the characters of American keyboards as criteria for validation.

std::regex escape special characters for use in regex

File paths can contain many characters that have special meaning in regular expression patterns. Escaping just the backslashes is not enough for robust checking in the general case.

Even a simple path, like C:\Program Files (x86)\Vendor\Product\app.exe, contains several special characters. If you want to turn that into a regular expression (or part of a regular expression), you would need to escape not only the backslashes but also the parentheses and the period (dot).

Fortunately, we can solve our regular expression problem with more regular expressions:

std::string EscapeForRegularExpression(const std::string &s) {
static const std::regex metacharacters(R"([\.\^\$\-\+\(\)\[\]\{\}\|\?\*)");
return std::regex_replace(s, metacharacters, "\\$&");
}

(File paths can't contain * or ?, but I've included them to keep the function general.)

If you don't abide by the "no raw loops" guideline, a probably faster implementation would avoid regular expressions:

std::string EscapeForRegularExpression(const std::string &s) {
static const char metacharacters[] = R"(\.^$-+()[]{}|?*)";
std::string out;
out.reserve(s.size());
for (auto ch : s) {
if (std::strchr(metacharacters, ch))
out.push_back('\\');
out.push_back(ch);
}
return out;
}

Although the loop adds some clutter, this approach allows us to drop a level of escaping on the definition of metacharacters, which is a readability win over the regex version.



Related Topics



Leave a reply



Submit