Regex: How to Escape Backslashes and Special Characters

Can't escape the backslash with regex?

If you're putting this in a string literal within a program, you may actually need four backslashes: the string parser consumes two of them when it unescapes the literal, and the regex engine then needs the remaining two to represent one escaped backslash.

For instance:

regex("\\\\")

is interpreted as...

regex("\\" [escaped backslash] followed by "\\" [escaped backslash])

is interpreted as...

regex(\\)

is interpreted as a regex that matches a single backslash.


Depending on the language, you might be able to use a form of quoting that doesn't process escape sequences, so you need fewer backslashes. For instance, in Python:

re.compile(r'\\')

The r in front of the quotes makes it a raw string which doesn't parse backslash escapes.
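
As a minimal sketch of the two layers in Python (the helper name matches_single_backslash is made up for illustration): both spellings below produce the same two-character pattern, and that pattern matches exactly one backslash.

```python
import re

# Non-raw literal: the Python parser collapses the four typed backslashes to two.
escaped = "\\\\"
# Raw literal: the two typed backslashes reach the regex engine unchanged.
raw = r"\\"


def matches_single_backslash(pattern: str) -> bool:
    """Hypothetical helper: does `pattern` match a string of exactly one backslash?"""
    return re.fullmatch(pattern, "\\") is not None


print(escaped == raw)                  # the two spellings are the same string
print(len(raw))                        # number of characters the regex engine sees
print(matches_single_backslash(raw))
```

The raw form is usually easier to read, since what you type is what the regex engine receives.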

Java Regex double backslash escaping special characters

Use 4 backslashes:

Pattern.compile("((([a-zA-Z0-9])([a-zA-Z0-9 ]*)\\\\?)+)")
                                               ^^^^
  1. You need to match a backslash char: \.
  2. A backslash is a special char for regexps (used for predefined classes such as \d for example), which needs to be escaped by another backslash: \\.
  3. As Java uses string literals for regexps, and a backslash also is a special char for string literals (used for the line feed char \n for example), each backslash needs to be escaped by another backslash: \\\\.
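
For comparison, here is a rough Python analog of the three steps. It uses Python's re module, not java.util.regex, and the variable names are illustrative. It spells the same pattern once as a raw string and once as an ordinary literal, then checks that the \\? part makes a single backslash optional.

```python
import re

# Step 2 only: raw string, so just the regex-level escape is typed (two backslashes).
raw_pattern = r"((([a-zA-Z0-9])([a-zA-Z0-9 ]*)\\?)+)"
# Steps 2 and 3: ordinary literal, so each regex-level backslash is doubled again.
literal_pattern = "((([a-zA-Z0-9])([a-zA-Z0-9 ]*)\\\\?)+)"

print(raw_pattern == literal_pattern)

for text in ["abc def", "abc\\def", "\\abc"]:
    # "\\" inside these test strings is one backslash character.
    print(repr(text), re.fullmatch(raw_pattern, text) is not None)
```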

regex: How to escape backslashes and special characters?

You just need to replace every single backslash with a double backslash. This is complicated a bit because String's replaceAll method takes a regular expression: you first have to escape the backslash for the Java string literal (yielding \\), and then escape it again for the regular expression (yielding \\\\). The replacement string suffers a similar fate, since a backslash is an escape character there too, and it needs two such escape sequences, for a total of 8 backslashes:

System.out.printf("%s ~= %s ? %s  %n", 
args[0].replaceAll("\\\\","\\\\\\\\"), args[1], ...
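
The same counting applies in Python, whose re.sub also treats backslashes in the replacement as escapes. The sketch below is a Python analog rather than the Java call above, and the function names are made up for illustration.

```python
import re


def double_backslashes(text: str) -> str:
    """Replace every single backslash in `text` with two backslashes."""
    # Pattern: 4 typed backslashes -> 2 in the string -> regex for 1 literal backslash.
    # Replacement: 8 typed -> 4 in the string -> template for 2 literal backslashes.
    return re.sub("\\\\", "\\\\\\\\", text)


def double_backslashes_raw(text: str) -> str:
    """Same substitution with raw strings, halving the typed backslashes."""
    return re.sub(r"\\", r"\\\\", text)


path = "C:\\temp\\file.txt"  # the 16-character string C:\temp\file.txt
print(double_backslashes(path))
print(double_backslashes(path) == double_backslashes_raw(path))
```

If no pattern matching is needed, a plain (non-regex) replace such as path.replace("\\", "\\\\") avoids the regex layer entirely.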

std::regex escape special characters for use in regex

File paths can contain many characters that have special meaning in regular expression patterns. Escaping just the backslashes is not enough for robust checking in the general case.

Even a simple path, like C:\Program Files (x86)\Vendor\Product\app.exe, contains several special characters. If you want to turn that into a regular expression (or part of a regular expression), you would need to escape not only the backslashes but also the parentheses and the period (dot).

Fortunately, we can solve our regular expression problem with more regular expressions:

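#include <regex>
#include <string>

std::string EscapeForRegularExpression(const std::string &s) {
  // The character class must be closed with ] and must include the backslash itself.
  static const std::regex metacharacters(R"([\\\.\^\$\-\+\(\)\[\]\{\}\|\?\*])");
  return std::regex_replace(s, metacharacters, "\\$&");
}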

(Windows file paths can't contain * or ?, but I've included them to keep the function general.)

If you don't abide by the "no raw loops" guideline, a probably faster implementation would avoid regular expressions:

#include <cstring>
#include <string>

std::string EscapeForRegularExpression(const std::string &s) {
  static const char metacharacters[] = R"(\.^$-+()[]{}|?*)";
  std::string out;
  out.reserve(s.size());
  for (auto ch : s) {
    // strchr also matches the terminating NUL, so exclude '\0' explicitly.
    if (ch != '\0' && std::strchr(metacharacters, ch))
      out.push_back('\\');
    out.push_back(ch);
  }
  return out;
}

Although the loop adds some clutter, this approach allows us to drop a level of escaping on the definition of metacharacters, which is a readability win over the regex version.
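
For readers who want to try the idea quickly, here is a small Python sketch of the loop-based approach, checked against the example path from above. The names METACHARACTERS and escape_for_regex are hypothetical. Python's standard library also ships re.escape, which does the same job, and the sketch compares the two.

```python
import re

# Same metacharacter list as the loop-based C++ version.
METACHARACTERS = r"\.^$-+()[]{}|?*"


def escape_for_regex(text: str) -> str:
    """Prefix every regex metacharacter in `text` with a backslash."""
    return "".join("\\" + ch if ch in METACHARACTERS else ch for ch in text)


path = r"C:\Program Files (x86)\Vendor\Product\app.exe"
pattern = escape_for_regex(path)

print(pattern)
# The escaped path should match itself, and only itself, literally.
print(re.fullmatch(pattern, path) is not None)
print(re.fullmatch(re.escape(path), path) is not None)
```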

RegEx for escaping special characters

My question is: do all of [ \ ^ $ . | ? * + ( ) need to be escaped before being passed to new RegExp(), or do only backslashes (\) need to be escaped? It is not clear to me which ones need escaping.

Your question is answered right at the start of the document section you refer to. Read that again:

If you need to use any of the special characters literally (actually searching for a '*', for instance), you must escape it …

Conversely, if you need any of the special characters to have its special meaning, you must not escape it.

Besides the above, any backslash that is meant to end up in the pattern has to be doubled if the pattern is written as a string literal.
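
A short sketch of that rule, written in Python's re rather than JavaScript's RegExp (the sample text and variable names are made up): the same character behaves as a quantifier when left alone and as a literal when escaped.

```python
import re

text = "2*3 and 2223"

# Unescaped: * is a quantifier, so 2*3 means "zero or more 2s, then a 3".
quantifier_hits = re.findall(r"2*3", text)

# Escaped: \* is a literal asterisk, so only the text 2*3 itself matches.
literal_hits = re.findall(r"2\*3", text)

print(quantifier_hits)
print(literal_hits)

# In an ordinary (non-raw) string literal the backslash has to be doubled.
print("2\\*3" == r"2\*3")
```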

Append Special Character with Double Backslash [\\] in String Java

Note that you may directly use String#replaceAll to match and replace multiple substrings with a regex of your choice. Also, String#replace does not accept regex, so your c=c.replace("\\+", "\\\\"+"+"); would not work.

You may use

String c = "edX-NYIF+CR.5x";
System.out.println(c.replaceAll("[^a-zA-Z0-9]", "\\\\\\\\$0"));

The [^a-zA-Z0-9] (or "\\P{Alnum}") will match any char other than an ASCII letter or digit, and "\\\\\\\\$0" (the literal string \\\\$0) will replace the match with itself prefixed with 2 literal backslashes. Note that a backslash (written in a Java string literal as two consecutive backslashes) is also an escape char in the replacement string, so it must be doubled again to put a single backslash into the result.

If you are confused by the backslashes and in fact want single (not double) backslashes in the output, remove 4 backslashes from the replacement pattern: .replaceAll("[^a-zA-Z0-9]", "\\\\$0").
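
For comparison, a Python sketch of the same two substitutions. This is an analog, not the Java API: Python's re.sub writes the whole match as \g<0> where Java uses $0, and the variable names are illustrative.

```python
import re

c = "edX-NYIF+CR.5x"

# Template \\\\ -> two literal backslashes, \g<0> -> the whole match (Java's $0).
double_prefixed = re.sub(r"[^a-zA-Z0-9]", r"\\\\\g<0>", c)
# Template \\ -> one literal backslash, followed by the match.
single_prefixed = re.sub(r"[^a-zA-Z0-9]", r"\\\g<0>", c)

print(double_prefixed)
print(single_prefixed)
```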

Java Regular Expression special character escape

You only need to escape ^ when you want to match it literally, that is, you want to look for text containing the ^ character.

If you intend to use the ^ with its special meaning (the start of a line/string) then there is no need to escape it. Simply type

"^[a-zA-Z0-9!~`@#$%\\^]"

in your source code. The backslash before the ^ at the end of this regular expression does not matter. You need to type 2 backslashes because of the special meaning of the backslash in Java string literals, but that has nothing to do with its treatment in regular expressions. The regular expression engine receives a single backslash, which it uses to read the following character as a literal, but ^ is a literal within brackets anyway unless it is the first character after the opening bracket.

To elaborate on your comment about [ and ]:

The brackets have a special meaning in regular expressions, as they form the boundaries of the list of characters a pattern accepts at that position (the listed characters form a so-called character class). Let's decompose the regular expression from above to make things clear.

^ Matches the start of the text
[ Opening boundary of your character class
a-z Lowercase letters from a to z
A-Z Uppercase letters from A to Z
0-9 Digits from 0 to 9
! Exclamation mark, literally
~ Tilde, literally
` Backtick, literally
@ The @ character, literally
# Hash, literally
$ Dollar, literally
% Percent sign, literally
\\ Backslash. The regular expression engine only receives a single backslash, as the other one is consumed by Java's syntax for Strings. It would mark the following character as a literal, but ^ is a literal at this position in a character class anyway, so the backslash has no effect.
^ Caret, literally
] Closing boundary of your character class

The order of the items within the character class definition is irrelevant, apart from special positions such as a ^ directly after the opening bracket, which would negate the class.
The expression above matches if the first character of the examined text is part of your character class definition. Whether the other characters in the examined text matter depends on how you use the regular expression.
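
A quick way to convince yourself that the backslash before ^ is redundant here is to compare both spellings. The sketch below uses Python's re with raw strings, so only the regex-level backslash is typed. The sample strings and variable names are made up for illustration.

```python
import re

escaped_caret = re.compile(r"^[a-zA-Z0-9!~`@#$%\^]")
plain_caret = re.compile(r"^[a-zA-Z0-9!~`@#$%^]")
# A ^ directly after the opening bracket negates the class instead.
negated = re.compile(r"^[^a-zA-Z0-9]")

samples = ["^caret", "abc", "$5", " space", "-dash"]
for text in samples:
    print(repr(text),
          bool(escaped_caret.search(text)),
          bool(plain_caret.search(text)),
          bool(negated.search(text)))
```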

When you start with regular expressions, you should always use multiple test texts to match against and verify the behaviour. It is also advisable to turn these test cases into a unit test to get high confidence in the correct behaviour of your program.

A simple code sample to test the expression is as follows:

public class Test {
    public static void main(String[] args) {
        String regexp = "^[ a-zA-Z0-9!~`@#$%\\\\^\\[\\]]+$";
        String[] testdata = new String[] {
            "abc",
            "2332",
            "some@test",
            "test [ and ] test end",
            // Following sample will not match the pattern.
            "äöüßµøł"
        };
        for (String toExamine : testdata) {
            if (toExamine.matches(regexp)) {
                System.out.println("Match: " + toExamine);
            } else {
                System.out.println("No match: " + toExamine);
            }
        }
    }
}

Note that I use a modified pattern here. It ensures that all characters in the examined string match your character class. I also extended the character class to allow \, space, [ and ].
The decomposed description is:

^ Matches the start of the text
[ Opening boundary of your character class
a-z Lowercase letters from a to z
A-Z Uppercase letters from A to Z
0-9 Digits from 0 to 9
! Exclamation mark, literally
~ Tilde, literally
` Backtick, literally
@ The @ character, literally
# Hash, literally
$ Dollar, literally
% Percent sign, literally
\\\\ Backslash, literally. The regular expression engine only receives 2 backslashes, as every other backslash is consumed by Java's syntax for Strings. The first backslash marks the second backslash as occurring literally in the string.
^ Caret, literally
\\[ Opening bracket, literally. The backslash makes the bracket lose its meaning as opening a character class definition.
\\] Closing bracket, literally. The backslash makes the bracket lose its meaning as closing a character class definition.
] Closing boundary of your character class
+ Means any number of characters matching your character class definition can occur, but at least 1 such character needs to be present for a match
$ Matches the end of the text
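
Following the advice about unit tests, the decomposition can be pinned down as a small table of expected results. This is a sketch in Python's re, not the Java test above. The names PATTERN, CASES and is_valid are hypothetical, and the two extra cases (a backslash and the empty string) are additions that follow from the description.

```python
import re

# Same pattern as the Java sample, as a raw string so only regex-level escapes remain.
PATTERN = re.compile(r"^[ a-zA-Z0-9!~`@#$%\\^\[\]]+$")

# (text, expected) pairs; the expectations follow the decomposition above.
CASES = [
    ("abc", True),
    ("2332", True),
    ("some@test", True),
    ("test [ and ] test end", True),
    ("back\\slash", True),   # contains one literal backslash
    ("äöüßµøł", False),      # none of these are in the character class
    ("", False),             # + requires at least one character
]


def is_valid(text: str) -> bool:
    """Hypothetical helper: True if every character of `text` is in the class."""
    return PATTERN.fullmatch(text) is not None


for text, expected in CASES:
    assert is_valid(text) == expected, text
print("all cases behave as expected")
```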

One thing I don't get though is why one would use the characters of American keyboards as criteria for validation.


