Python Regex escape operator \ in substitutions & raw strings
First and foremost,
replacement patterns ≠ regular expression patterns
A regex pattern is used to search for matches; a replacement pattern is used to replace the matches found by the regex.
NOTE: The only special character in a substitution pattern is the backslash, \. Only the backslash must be doubled.
Replacement pattern syntax in Python
The re.sub docs are confusing because they mention string escape sequences that can be used in replacement patterns (like \n, \r), regex escape sequences (\6), and those that can be used as both regex and string escape sequences (\&).
I am using the term regex escape sequence to denote an escape sequence consisting of a literal backslash + a character, that is, '\\X' or r'\X', and the term string escape sequence to denote a \ followed by a character or sequence that together form a valid string escape sequence. String escape sequences are only recognized in regular string literals. In raw string literals, a backslash can only escape the quote character (which is why a raw string literal cannot end with a single \), and even then the backslash remains part of the string.
So, in a replacement pattern, you may use backreferences:
re.sub(r'\D(\d)\D', r'\1', 'a1b') # => 1
re.sub(r'\D(\d)\D', '\\1', 'a1b') # => 1
re.sub(r'\D(\d)\D', '\g<1>', 'a1b') # => 1
re.sub(r'\D(\d)\D', r'\g<1>', 'a1b') # => 1
As you can see, r'\1' and '\\1' are the same replacement pattern, \1. If you use '\1', it gets parsed as a string escape sequence: a character with octal value 001. If you forget the r prefix with the unambiguous backreference, the result is still correct because \g is not a valid string escape sequence, so the \ remains in the string (although Python 3.6+ reports an invalid escape sequence warning for such literals: a DeprecationWarning up to 3.11, a SyntaxWarning from 3.12). The Python docs on string literals say:
Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the result.
So, when you pass '\.' as a replacement string, you actually send the two-char combination \. as the replacement string, and that is why you get \. in the result.
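A minimal sketch of the difference between the literals discussed above, reusing the same pattern and input as the earlier examples. The variable names backref and control are only illustrative.

```python
import re

# The raw literal and the doubled-backslash literal are the same two-character string.
assert r'\1' == '\\1'
assert len(r'\1') == 2

# Without the r prefix, '\1' is a string escape: one character with code point 1.
assert '\1' == '\x01'
assert len('\1') == 1

backref = re.sub(r'\D(\d)\D', r'\1', 'a1b')  # group 1 is substituted
control = re.sub(r'\D(\d)\D', '\1', 'a1b')   # the control character is substituted

print(repr(backref))  # '1'
print(repr(control))  # '\x01'
```

Printing with repr() makes the otherwise invisible control character visible.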
\ is a special character in Python replacement patterns
If you use re.sub(r'\s+\.', r'\\.', text), you will get the same result as in the text2 and text3 cases.
That happens because \\, two literal backslashes, denotes a single backslash in the replacement pattern. If there is no Group 2 in your regex pattern but you pass r'\2' as the replacement, intending to replace with the literal \ and 2 characters, you will get an error (re.error: invalid group reference).
Thus, when you have dynamic, user-defined replacement patterns, you need to double all backslashes in any replacement pattern that is meant to be used as a literal string:
re.sub(some_regex, some_replacement.replace('\\', '\\\\'), input_string)
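The following sketch shows the doubling approach on a concrete input. The helper names sub_literal and sub_literal_callable and the sample replacement are hypothetical. The second helper relies on the documented behavior that when the replacement is a function, its return value is inserted without any escape processing, which can be an alternative to doubling.

```python
import re

def sub_literal(pattern, replacement, text):
    """Replace matches with `replacement` taken verbatim (hypothetical helper)."""
    # Doubling each backslash stops re.sub from reading \1, \g<...>, \n as escapes.
    return re.sub(pattern, replacement.replace('\\', '\\\\'), text)

def sub_literal_callable(pattern, replacement, text):
    """Alternative: a callable's return value is inserted as-is."""
    return re.sub(pattern, lambda match: replacement, text)

user_replacement = r'C:\2'  # contains \2, but the pattern below has only one group

try:
    re.sub(r'(x)', user_replacement, 'axb')
    raised = False
except re.error:
    raised = True  # invalid group reference

safe = sub_literal(r'(x)', user_replacement, 'axb')
also_safe = sub_literal_callable(r'(x)', user_replacement, 'axb')

print(raised)     # True
print(safe)       # aC:\2b
print(also_safe)  # aC:\2b
```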
Escape special characters in a Python string
Use re.escape
>>> import re
>>> re.escape(r'\ a.*$')
'\\\\\\ a\\.\\*\\$'
>>> print(re.escape(r'\ a.*$'))
\\\ a\.\*\$
>>> re.escape('www.stackoverflow.com')
'www\\.stackoverflow\\.com'
>>> print(re.escape('www.stackoverflow.com'))
www\.stackoverflow\.com
Repeating it here:
re.escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
As of Python 3.7, re.escape() was changed to escape only characters that are meaningful to regex operations.
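To illustrate why escaping matters when matching an arbitrary literal string, here is a small sketch. The sample text is an assumption made up for the demonstration.

```python
import re

literal = 'www.stackoverflow.com'
# Assumed sample text containing a look-alike and the real literal.
text = 'visit wwwXstackoverflowXcom or www.stackoverflow.com'

# Unescaped, each "." matches any character, so the look-alike also matches.
unescaped_hits = re.findall(literal, text)

# Escaped, only the exact literal text matches.
escaped_hits = re.findall(re.escape(literal), text)

print(unescaped_hits)  # ['wwwXstackoverflowXcom', 'www.stackoverflow.com']
print(escaped_hits)    # ['www.stackoverflow.com']
```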
Understanding raw string for regular expressions in Python
Don't double the backslash when using a raw string:
>>> pattern3 = r'\n\n'
>>> pattern3
'\\n\\n'
>>> re.findall(pattern3, text)
['\n\n']
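The session above uses a text variable that is not shown. The sketch below assumes a sample value for it and compares the raw, doubled, and plain spellings of the pattern.

```python
import re

text = 'first paragraph\n\nsecond paragraph'  # assumed sample input

raw_pattern = r'\n\n'       # four characters: \ n \ n; the regex engine reads \n as a newline
doubled_pattern = '\\n\\n'  # the same four-character string, written without the r prefix
plain_pattern = '\n\n'      # two actual newline characters, matched literally

assert raw_pattern == doubled_pattern
print(len(raw_pattern), len(plain_pattern))  # 4 2

raw_hits = re.findall(raw_pattern, text)
plain_hits = re.findall(plain_pattern, text)
print(raw_hits)    # ['\n\n']
print(plain_hits)  # ['\n\n']
```

Both patterns find the same match here; they differ only in whether Python or the regex engine interprets the \n.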
Add a backslash to all non-word characters in a string
Your main problem is that string escapes take effect before the regex substitution escapes. Switching to raw strings (to inhibit string escapes) and escaping your backslash (because \\ is itself a substitution escape) will fix this:
>>> print(re.sub(r'(\W)', r'\\\1', '?:n.io/search?query=title++sub'))
\?\:n\.io\/search\?query\=title\+\+sub
Note that you may not need such extensive escaping. If you just want to escape regex special characters, re.escape will do this for you:
>>> print(re.escape('?:n.io/search?query=title++sub'))
\?:n\.io/search\?query=title\+\+sub
without adding unnecessary escapes (ones that aren't needed to despecialize regex characters).
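As a rough check of the two approaches, the sketch below builds both escaped forms and confirms that each one, used as a pattern, matches the original string. The variable names are illustrative only.

```python
import re

s = '?:n.io/search?query=title++sub'

# The replacement r'\\\1' is read as \\ (one literal backslash) followed by \1 (group 1).
escape_all = re.sub(r'(\W)', r'\\\1', s)

# re.escape decides by itself which characters need a backslash.
escape_min = re.escape(s)

print(escape_all)  # \?\:n\.io\/search\?query\=title\+\+sub

# Either result is a valid pattern that matches the original text exactly.
all_matches = re.fullmatch(escape_all, s) is not None
min_matches = re.fullmatch(escape_min, s) is not None
print(all_matches, min_matches)  # True True
```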
What does 'r' mean before a Regex pattern?
Placing r or R before a string literal creates what is known as a raw-string literal. Raw strings do not process escape sequences (\n, \b, etc.) and are thus commonly used for regex patterns, which often contain a lot of \ characters.
Below is a demonstration:
>>> print('\n') # Prints a newline character
>>> print(r'\n') # Escape sequence is not processed
\n
>>> print('\b') # Prints a backspace character
>>> print(r'\b') # Escape sequence is not processed
\b
>>>
The only other option would be to double every backslash:
>>> re.sub('def\\s+([a-zA-Z_][a-zA-Z_0-9]*)\\s*\\(\\s*\\):',
... 'static PyObject*\\npy_\\1(void)\\n{',
... 'def myfunc():')
which is just tedious.
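For comparison, here is a sketch that runs the doubled-backslash call next to an equivalent raw-string version. The variable names doubled and raw are only for illustration.

```python
import re

# Regular string literals: every backslash has to be doubled.
doubled = re.sub('def\\s+([a-zA-Z_][a-zA-Z_0-9]*)\\s*\\(\\s*\\):',
                 'static PyObject*\\npy_\\1(void)\\n{',
                 'def myfunc():')

# Raw string literals: the same pattern and replacement, written once.
raw = re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
             r'static PyObject*\npy_\1(void)\n{',
             'def myfunc():')

assert doubled == raw
print(raw)
# static PyObject*
# py_myfunc(void)
# {
```

In both versions the \n in the replacement reaches re.sub as a backslash plus n, and re.sub itself converts it to a newline.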
The way to unescape escaped regex pattern Python
Based on your update, I'm pretty sure you would get exactly your desired output if you just stopped trying to unescape it.
import re
s1 = "1234astring"
matches = re.search("\\d{4}", s1)
matches.group(0)
"1234"
matches.group()[0]
"1"