Escaping Special Characters in Java Regular Expressions


Is there any method in Java or any open source library for escaping (not quoting) a special character (meta-character), in order to use it as a regular expression?

If you are looking for a way to create constants that you can use in your regex patterns, then prepending each special character with "\\" should work, but there is no convenient Pattern.escape('.') function to help with this.

So if you are trying to match the literal string \d (rather than a decimal digit), you would write:

// this will match the two characters \d as opposed to a decimal digit
String matchBackslashD = "\\\\d";
// as opposed to
String matchDecimalDigit = "\\d";

The 4 backslashes in the Java string literal turn into 2 backslashes in the regex pattern, and 2 backslashes in a regex pattern match a single literal backslash. Prepending a backslash to any special character turns it into a normal character instead of a special one.

String matchPeriod = "\\.";
String matchPlus = "\\+";
String matchParens = "\\(\\)";
...

In your post you use the Pattern.quote(string) method. This method wraps your pattern between "\\Q" and "\\E" so that you can match a string literally even if it happens to contain special regex characters (+, ., \\d, etc.).
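As a rough illustration of the difference, the sketch below quotes a whole string with Pattern.quote and compares that with using the raw string as a pattern. The class name QuoteDemo, the helper matchesQuoted and the sample text are made up for this example.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Quotes 'literal' first, so every character in it is treated as plain text.
    static boolean matchesQuoted(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        String literal = "1+1=2 (approx.)";

        // The quoted form is the original text wrapped in \Q ... \E.
        System.out.println(Pattern.quote(literal));            // \Q1+1=2 (approx.)\E

        // Quoted: the string matches itself.
        System.out.println(matchesQuoted(literal, literal));   // true

        // Unquoted: + is a quantifier and the parentheses form a group,
        // so the raw string does not match itself.
        System.out.println(Pattern.matches(literal, literal)); // false
    }
}
```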

List of all special characters that need to be escaped in a regex

You can look at the javadoc of the Pattern class: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

You need to escape any char listed there if you want the regular char and not the special meaning.

A possibly simpler solution is to put the text between \Q and \E: everything between them is treated as literal text.
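If you would rather escape each character individually than quote the whole string, a small helper can prepend a backslash to every metacharacter. This is only a sketch: RegexEscaper, escape and META are hypothetical names (nothing like them exists in java.util.regex), and the metacharacter list is an assumption that you should check against the Pattern javadoc for your use case.

```java
import java.util.regex.Pattern;

class RegexEscaper {
    // Assumed set of metacharacters outside a character class (hypothetical list).
    private static final String META = "\\.[]{}()*+?^$|";

    // Returns 'text' with a backslash in front of every character listed in META.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 2);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (META.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String escaped = escape("a.b");
        System.out.println(escaped);                         // a\.b
        System.out.println(Pattern.matches(escaped, "a.b")); // true
        System.out.println(Pattern.matches(escaped, "axb")); // false
    }
}
```

Unlike the \Q...\E form, the escaped result is an ordinary pattern fragment, so it can be concatenated with other pattern pieces.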

Java Regular Expression special character escape

You only need to escape ^ when you want to match it literally, that is, you want to look for text containing the ^ character.

If you intend to use the ^ with its special meaning (the start of a line/string) then there is no need to escape it. Simply type

"^[a-zA-Z0-9!~`@#$%\\^]"

in your source code. The backslashes towards the end of this regular expression do not matter. You need to type 2 backslashes because the backslash has a special meaning in Java string literals, but that has nothing to do with its treatment in regular expressions. The regular expression engine receives a single backslash, which it uses to read the following character as a literal, but ^ is already a literal inside brackets unless it is the first character of the class.

To elaborate on your comment about [ and ]:

The brackets have a special meaning in regular expressions: they form the boundaries of the list of characters the pattern accepts at that position (the listed characters form a so-called character class). Let's decompose the regular expression from above to make things clear.

^ Matches the start of the text
[ Opening boundary of your character class
a-z Lower case letters of A to Z
A-Z Upper case letters of A to Z
0-9 Numbers from 0 to 9
! Exclamation mark, literally
~ Tilde, literally
` Backtick, literally
@ The @ character, literally
# Hash, literally
$ Dollar, literally
% Percent sign, literally
\\ Backslash. The regular expression engine only receives a single backslash, as the other one is consumed by Java's syntax for Strings. It would mark the following character as a literal, but ^ is a literal at this position in a character class anyway, so these backslashes have no effect.
^ Caret, literally
] Closing boundary of your character class

The order of the items within the character class definition is irrelevant.
The expression above matches if the first character of the examined text is part of your character class definition. Whether the other characters in the examined text matter depends on how you use the regular expression.
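To check both claims (the escaped and unescaped caret behave the same inside the class, and only the first character of the text is examined), a small sketch like the following can help. The class name CaretInClass, the helper startsWithAllowed and the sample strings are invented for illustration; find() is used here because String.matches would require the whole text to match.

```java
import java.util.regex.Pattern;

class CaretInClass {
    // The pattern from above, with the caret escaped inside the class.
    static final Pattern ESCAPED = Pattern.compile("^[a-zA-Z0-9!~`@#$%\\^]");
    // The same pattern without the backslash; ^ is not the first character of the class.
    static final Pattern UNESCAPED = Pattern.compile("^[a-zA-Z0-9!~`@#$%^]");

    // True if the first character of 'text' belongs to the character class.
    static boolean startsWithAllowed(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = { "^caret", "abc def", " leading space", "-dash" };
        for (String s : samples) {
            System.out.println("[" + s + "] escaped=" + startsWithAllowed(ESCAPED, s)
                    + " unescaped=" + startsWithAllowed(UNESCAPED, s));
        }
    }
}
```

Both patterns should report true for the first two samples and false for the last two, since a space and a dash are not part of the class.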

When you start with regular expressions you should always use multiple test texts to match against and verify the behaviour. It is also advisable to turn these test cases into unit tests to get high confidence in the correct behaviour of your program.

A simple code sample to test the expression is as follows:

public class Test {
    public static void main(String[] args) {
        String regexp = "^[ a-zA-Z0-9!~`@#$%\\\\^\\[\\]]+$";
        String[] testdata = new String[] {
            "abc",
            "2332",
            "some@test",
            "test [ and ] test end",
            // Following sample will not match the pattern.
            "äöüßµøł"
        };
        for (String toExamine : testdata) {
            if (toExamine.matches(regexp)) {
                System.out.println("Match: " + toExamine);
            } else {
                System.out.println("No match: " + toExamine);
            }
        }
    }
}

Note that I use a modified pattern here. It ensures that all characters in the examined string match your character class. I also extended the character class to allow a \, a space, [ and ].
The decomposed description is:

^ Matches the start of the text
[ Opening boundary of your character class
a-z Lower case letters of A to Z
A-Z Upper case letters of A to Z
0-9 Numbers from 0 to 9
! Exclamation mark, literally
~ Tilde, literally
` Backtick, literally
@ The @ character, literally
# Hash, literally
$ Dollar, literally
% Percent sign, literally
\\\\ Backslash, literally. The regular expression engine only receives 2 backslashes, as every other backslash is consumed by Java's syntax for Strings. The first backslash marks the second backslash as occurring literally in the string.
^ Caret, literally
\\[ Opening bracket, literally. The backslash makes the bracket lose its meaning as opening a character class definition.
\\] Closing bracket, literally. The backslash makes the bracket lose its meaning as closing a character class definition.
] Closing boundary of your character class
+ Means any number of characters matching your character class definition can occur, but at least 1 such character needs to be present for a match
$ Matches the end of the text

One thing I don't get though is why one would use the characters of American keyboards as criteria for validation.

Java Regex double backslash escaping special characters

Use 4 backslashes:

Pattern.compile("((([a-zA-Z0-9])([a-zA-Z0-9 ]*)\\\\?)+)")
                                               ^^^^
  1. You need to match a backslash char: \.
  2. A backslash is a special char for regexps (used for predefined classes such as \d for example), which needs to be escaped by another backslash: \\.
  3. As Java uses string literals for regexps, and a backslash also is a special char for string literals (used for the line feed char \n for example), each backslash needs to be escaped by another backslash: \\\\.
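The three levels above can be checked with a short sketch. The class name BackslashLevels and the two helpers are made up for this example; the second helper reuses the pattern from the answer unchanged.

```java
import java.util.regex.Pattern;

class BackslashLevels {
    // Four backslashes in Java source -> two in the regex -> matches one literal backslash.
    static boolean containsBackslash(String text) {
        return Pattern.compile("\\\\").matcher(text).find();
    }

    // The pattern from the answer: alphanumeric runs, each optionally followed by a backslash.
    static boolean matchesAnswerPattern(String text) {
        return Pattern.compile("((([a-zA-Z0-9])([a-zA-Z0-9 ]*)\\\\?)+)").matcher(text).matches();
    }

    public static void main(String[] args) {
        String regexSource = "\\\\";              // what you type in Java
        System.out.println(regexSource.length()); // 2: the regex engine sees \\

        String text = "abc\\def";                 // the 7 characters abc\def
        System.out.println(text.length());                 // 7
        System.out.println(containsBackslash(text));       // true
        System.out.println(matchesAnswerPattern(text));    // true
        System.out.println(matchesAnswerPattern("\\abc")); // false: must start with a letter or digit
    }
}
```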

How to escape special characters in a regex pattern in java?

You can use this regex with a capturing group:

String myString = "Patient:\n${ss.patient.howard.firstName} ${ss.patient.howard.lastName}\nGender: ${ss.patient.howard.sex}\nBirthdate: ${ss.patient.howard.dob}\n${ss.patient.howard.addressLine1}\nPhone: (801)546-4765";
myString = myString.replaceAll("\\$\\{[^}]+?\\.([^.}]+)}", "$1");

System.err.println(myString);

([^.}]+) is the capturing group before } and after the last DOT.


Output:

Patient:
firstName lastName
Gender: sex
Birthdate: dob
addressLine1
Phone: (801)546-4765

Regex pattern including all special characters

Please don't do that... little Unicode BABY ANGELs like this one are dying! ◕◡◕ (← these are not images) (nor is the arrow!)

And you are killing 20 years of DOS :-) (the last smiley is called WHITE SMILING FACE... Now it's at 263A... But in ancient times it was ALT-1)

and his friend

BLACK SMILING FACE... Now it's at 263B... But in ancient times it was ALT-2

Try a negative match:

Pattern regex = Pattern.compile("[^A-Za-z0-9]");

(This accepts only the "standard" letters A-Z and a-z and the "standard" digits 0-9; any other character matches the pattern and therefore counts as special.)
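A minimal sketch of how the negated class could be used, assuming you want to know whether a string contains anything outside A-Z, a-z and 0-9. The class name NonAlnumCheck and the helper hasSpecial are invented for this example.

```java
import java.util.regex.Pattern;

class NonAlnumCheck {
    // Matches any single character that is NOT an ASCII letter or digit.
    private static final Pattern NOT_ALNUM = Pattern.compile("[^A-Za-z0-9]");

    // True if 'text' contains at least one character outside A-Z, a-z, 0-9.
    static boolean hasSpecial(String text) {
        return NOT_ALNUM.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasSpecial("abc123"));      // false
        System.out.println(hasSpecial("abc 123"));     // true: the space counts as special
        System.out.println(hasSpecial("h\u00e9llo"));  // true: é is not an ASCII letter
    }
}
```

Note that with this approach accented letters and other non-ASCII characters are also reported as special, which may or may not be what you want.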

How to escape [] chars in regular expressions

You escape special characters with \. Note that \ is itself a special character in Java string literals, so it has to be doubled. So something like this should work:

.map(l -> l.replaceAll("[,.!?\\[\\]:;]", ""))
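For context, here is how that .map call might sit in a complete stream pipeline. The surrounding class StripPunctuation, the method strip and the sample lines are assumptions made for this sketch; only the replaceAll call comes from the answer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class StripPunctuation {
    // Removes , . ! ? [ ] : ; from every line; [ and ] are escaped inside the class.
    static List<String> strip(List<String> lines) {
        return lines.stream()
                .map(l -> l.replaceAll("[,.!?\\[\\]:;]", ""))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hello, world!", "[note]: done.");
        System.out.println(strip(lines)); // [Hello world, note done]
    }
}
```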

Skip some special character

Yes, you could do this with a Pattern and a regular expression. Like,

// Note that the literal [ and ] have to be escaped below.
String specialCharacters = "[!#$%&'()*+,.:;=?@\\[\\]^`{|}~]";
String val = "a{b}c";
Pattern p = Pattern.compile(specialCharacters);
System.out.println(p.matcher(val).replaceAll(""));

Which outputs

abc

