Escaping Regex String

Is there a RegExp.escape function in JavaScript?

The function linked in another answer is insufficient. It fails to escape ^ or $ (start and end of string), or -, which in a character group is used for ranges.

Use this function:

function escapeRegex(string) {
return string.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
}

While it may seem unnecessary at first glance, escaping - (as well as ^) makes the function suitable for escaping characters to be inserted into a character class as well as the body of the regex.

Escaping / makes the function suitable for escaping characters to be used in a JavaScript regex literal for later evaluation.

As there is no downside to escaping either of them, it makes sense to escape to cover wider use cases.

And yes, it is a disappointing failing that this is not part of standard JavaScript.

What does `escape a string` mean in Regex? (Javascript)

Many characters in regular expressions have special meanings. For instance, the dot character '.' means "any one character". There are a great deal of these specially-defined characters, and sometimes, you want to search for one, not use its special meaning.

See this example to search for any filename that contains a '.':

/^[^.]+\..+/

In the example, there are 3 dots, but our description says that we're only looking for one. Let's break it down by the dots:

  • Dot #1 is used inside a "character class" (the characters inside the square brackets), which tells the regex engine to search for "any one character" that is not a '.', and the "+" says to keep going until there are no more characters or the next character is the '.' that we're looking for.
  • Dot #2 is preceded by a backslash, which says that we're looking for a literal '.' in the string (without the backslash, it would be using its special meaning, which is looking for "any one character"). This dot is said to be "escaped", because it's special meaning is not being used in this context - the backslash immediately before it made that happen.
  • Dot #3 is simply looking for "any one character" again, and the '+' following it says to keep doing that until it runs out of characters.

So, the backslash is used to "escape" the character immediately following it; as such, it's called the "escape character". That just means that the character's special meaning is taken away in that one place.

Now, escaping a string (in regex terms) means finding all of the characters with special meaning and putting a backslash in front of them, including in front of other backslash characters. When you've done this one time on the string, you have officially "escaped the string".

Does regex syntax provide a way of escaping a whole string, rather than escaping characters one by one?

It appears that regex does not offer such a syntax, at least for Javascript. It is possible, however, to proceed as follows:

  1. use a string and automatically escape all the special characters in it,
    as indicated here: Javascript regular expression - string to RegEx object
  2. concatenate that string with strings representing the rest of the expression you want to create
  3. transform the string into a regex expression as indicated in Escape string for use in Javascript regex .

Escaping regex string

Use the re.escape() function for this:

4.2.3 re Module Contents

escape(string)

Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

A simplistic example, search any occurence of the provided string optionally followed by 's', and return the match object.

def simplistic_plural(word, text):
word_or_plural = re.escape(word) + 's?'
return re.match(word_or_plural, text)

Python Regex escape operator \ in substitutions & raw strings

First and foremost,

replacement patterns ≠ regular expression patterns

We use a regex pattern to search for matches, we use replacement patterns to replace matches found with regex.

NOTE: The only special character in a substitution pattern is a backslash, \. Only the backslash must be doubled.

Replacement pattern syntax in Python

The re.sub docs are confusing as they mention both string escape sequences that can be used in replacement patterns (like \n, \r) and regex escape sequences (\6) and those that can be used as both regex and string escape sequences (\&).

I am using the term regex escape sequence to denote an escape sequence consisting of a literal backslash + a character, that is, '\\X' or r'\X', and a string escape sequence to denote a sequence of \ and a char or some sequence that together form a valid string escape sequence. They are only recognized in regular string literals. In raw string literals, you can only escape " (and that is the reason why you can't end a raw string literal with \", but the backlash is still part of the string then).

So, in a replacement pattern, you may use backreferences:

re.sub(r'\D(\d)\D', r'\1', 'a1b')    # => 1
re.sub(r'\D(\d)\D', '\\1', 'a1b') # => 1
re.sub(r'\D(\d)\D', '\g<1>', 'a1b') # => 1
re.sub(r'\D(\d)\D', r'\g<1>', 'a1b') # => 1

You may see that r'\1' and '\\1' is the same replacement pattern, \1. If you use '\1', it will get parse as a string escape sequence, a character with octal value 001. If you forget to use r prefix with the unambiguous backreference, there is no problem because \g is not a valid string escape sequence, and there, \ escape character remains in the string. Read on the docs I linked to:

Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the result.

So, when you pass '\.' as a replacement string, you actually send \. two-char combination as the replacement string, and that is why you get \. in the result.

\ is a special character in Python replacement pattern

If you use re.sub(r'\s+\.', r'\\.', text), you will get the same result as in text2 and text3 cases, see this demo.

That happens because \\, two literal backslashes, denote a single backslash in the replacement pattern. If you have no Group 2 in your regex pattern, but pass r'\2' in the replacement to actually replace with \ and 2 char combination, you would get an error.

Thus, when you have dynamic, user-defined replacement patterns you need to double all backslashes in the replacement patterns that are meant to be passed as literal strings:

re.sub(some_regex, some_replacement.replace('\\', '\\\\'), input_string)

Escaping the Escape Character in Regex not working

Your regex is correct. Use raw string r before regex and it will work fine

re.compile(r'[\w!\-$%^&*()_+|~=`{}\[\]:";\'<>?,.@#\\/]+')

Check



Related Topics



Leave a reply



Submit