Confused about backslashes in regular expressions
The confusion is due to the fact that the backslash character \ is used as an escape at two different levels. First, the Python interpreter itself performs substitutions for \ before the re module ever sees your string. For instance, \n is converted to a newline character, \t is converted to a tab character, etc. To get an actual \ character, you can escape it as well, so \\ gives a single \ character. If the character following the \ isn't a recognized escape character, then the \ is treated like any other character and passed through, but I don't recommend depending on this (since Python 3.6 an unrecognized escape triggers a DeprecationWarning, and since 3.12 a SyntaxWarning). Instead, always escape your \ characters by doubling them, i.e. \\.
If you want to see how Python is expanding your string escapes, just print out the string. For example:
s = 'a\\b\tc'
print(s)
If s is part of an aggregate data type, e.g. a list or a tuple, and if you print that aggregate, Python will enclose the string in single quotes and will include the \ escapes (in a canonical form), so be aware of how your string is being printed. If you just type a quoted string into the interpreter, it will also display it enclosed in quotes with \ escapes.
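To make the difference between printing a string and printing a container that holds it concrete, here is a small sketch comparing str() and repr() on the same value. The variable names are only for illustration.

```python
s = 'a\\b\tc'        # 5 characters: a, backslash, b, tab, c

printed = str(s)      # what print(s) writes: the characters themselves
shown = repr(s)       # what the interpreter echoes, and how print([s]) shows each element

print(len(s))         # 5
print(shown)          # 'a\\b\tc'
print([s])            # ['a\\b\tc']
```

The doubled backslash in the last two lines is only how the string is displayed; the string itself still contains a single backslash.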
Once you know how your string is being encoded, you can then think about what the re module will do with it. For instance, if you want to escape \ in a string you pass to the re module, you will need to pass \\ to re, which means you will need to use \\\\ in your quoted Python string. The Python string will end up with \\ and the re module will treat this as a single literal \ character.
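A minimal sketch of the two levels, checking the length of the pattern string before handing it to re. The sample text is made up for illustration.

```python
import re

pattern = '\\\\'                      # four backslashes in the source code
print(len(pattern))                   # 2: Python collapsed each pair into one backslash

text = 'C:\\temp'                     # the actual text is C:\temp (one backslash)
matches = re.findall(pattern, text)   # re sees \\ and looks for one literal backslash
print(len(matches))                   # 1
```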
An alternative way to include \ characters in Python strings is to use raw strings, e.g. r'a\b' is equivalent to "a\\b".
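The equivalence is easy to check directly. This short sketch also shows what happens when the r prefix is left off (the names are arbitrary):

```python
plain = "a\\b"     # escaped backslash in a normal literal
raw = r'a\b'       # raw literal: the backslash is kept exactly as written
not_raw = 'a\b'    # without r, \b becomes the backspace character

print(plain == raw)      # True
print(len(raw))          # 3
print(len(not_raw))      # 2
print(raw == not_raw)    # False
```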
Can't escape the backslash with regex?
If you're putting this in a string within a program, you may actually need to use four backslashes (because the string parser will remove two of them when "de-escaping" it for the string, and then the regex needs two for an escaped regex backslash).
For instance:
regex("\\\\")
is interpreted as...
regex("\\" [escaped backslash] followed by "\\" [escaped backslash])
is interpreted as...
regex(\\)
is interpreted as a regex that matches a single backslash.
Depending on the language, you might be able to use a different form of quoting that doesn't parse escape sequences to avoid having to use as many - for instance, in Python:
re.compile(r'\\')
The r in front of the quotes makes it a raw string which doesn't parse backslash escapes.
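If the backslash comes from data rather than from a literal you type yourself, the standard library's re.escape can add the regex-level escaping for you. This is only a sketch of that idea, and the path is an invented example:

```python
import re

literal = 'C:\\ghs'                   # the text C:\ghs, containing one backslash
pattern = re.escape(literal)          # adds the regex-level escaping
print(re.escape('\\') == r'\\')       # True: one backslash becomes two

found = re.search(pattern, 'path is C:\\ghs\\comp')
print(found.group(0) == literal)      # True
```

You still have to get the backslash into the Python string correctly in the first place (by doubling it or using a raw string); re.escape only takes care of the second level.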
The backslash character in Regex for Python
Python only recognizes some sequences starting with \ as escape sequences. For example, \d is not a known escape sequence, so in this particular case there is no need to escape the backslash to keep it there.
(In Python 3.6) "\d" and "\\d" are equivalent:
>>> "\d" == "\\d"
True
>>> r"\d" == "\\d"
True
Here is a list of all the recognized escape sequences: https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
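Although the two spellings compare equal, newer interpreters complain about the unescaped one. The sketch below, which assumes Python 3.6 or later, compiles small pieces of source text and reports which warnings appear; escape_warnings is a helper name made up for this example.

```python
import warnings

def escape_warnings(source):
    """Compile a snippet of source text and return the names of any warnings raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<example>", "eval")
    return [w.category.__name__ for w in caught]

# The source text "\d" contains an unrecognized escape; "\\d" and r"\d" do not.
print(escape_warnings('"\\d"'))     # DeprecationWarning or SyntaxWarning, depending on version
print(escape_warnings('"\\\\d"'))   # []
print(escape_warnings('r"\\d"'))    # []
```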
Regex Python - Backslash
Must be:
re.sub('\\\\[A-Za-z]+',' ',text)
Otherwise, '\\' reaches the re module as a single backslash, which it treats as the regex escape character rather than as a literal backslash.
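For example, with a made-up input string containing a LaTeX-style command, the corrected call behaves like this:

```python
import re

text = r'x\alpha+y'                             # contains the text \alpha
cleaned = re.sub('\\\\[A-Za-z]+', ' ', text)    # regex: a backslash, then one or more letters
print(cleaned)                                  # x +y

# The raw-string spelling is the same pattern with fewer backslashes.
print('\\\\[A-Za-z]+' == r'\\[A-Za-z]+')        # True
```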
Confused about regular expression
Your regex is equivalent to \W*. It matches 0 or more non-word characters (anything other than letters, digits, and underscore).
Actually, you are using a regular Python string literal instead of a raw string. In a Python string literal, to get a literal backslash you need to escape it as \\, since a backslash has a special meaning there. The regex engine also needs the backslash escaped, so the pattern it receives must be \\, and each of those two backslashes has to be doubled in the string literal, giving \\\\.
So, to match \ followed by 0 or more W, you would need \\\\W* in a string literal. You can simplify this by using a raw string, where \\ will match a literal \. That's because backslashes are not handled in any special way when used inside a raw string.
The below example will help you understand this:
>>> import re
>>> s = "\\WWWW$$$$"  # a backslash, four W's, four dollar signs
# Without raw string
>>> splitter = re.compile('\\W*') # Match 0 or more non-word characters
>>> re.findall(splitter, s)
['\\', '', '', '', '', '$$$$', '']
>>> splitter = re.compile('\\\\W*') # Match `\` followed by 0 or more `W`
>>> re.findall(splitter, s)
['\\WWWW']
# With raw string
>>> splitter = re.compile(r'\W*') # Same as first one. You need a single `\`
>>> re.findall(splitter, s)
['\\', '', '', '', '', '$$$$', '']
>>> splitter = re.compile(r'\\W*') # Same as 2nd. Two `\\` needed.
>>> re.findall(splitter, s)
['\\WWWW']
Why are double backslashes for word boundary regular expressions, but single backslash works for other expressions?
\b has a special meaning in a Python string literal: it's the backspace character. \s and many other \<something> sequences have no special meaning, so the \ is then interpreted as a literal backslash.
>>> '\b'
'\x08'
>>> print('123\b456')
12456
>>> '\s'
'\\s'
>>> print('\s')
\s
>>> print('\b')
>>> # nothing visible printed above
To make things easier, you should usually use raw string literals when writing regexes. This generally prevents \ from being interpreted as an escape character in the Python string sense, so that it works properly in your regex. For example:
>>> r'\b'
'\\b'
>>> print(r'\b')
\b
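Here is what that means for an actual word-boundary search. The sample text is arbitrary.

```python
import re

text = 'a cat sat'

# Without r, Python turns \b into backspace characters before re sees the pattern.
print(re.findall('\bcat\b', text))     # []
# With r, re receives \b and treats it as a word boundary.
print(re.findall(r'\bcat\b', text))    # ['cat']
# Doubling the backslashes produces the same pattern as the raw string.
print('\\bcat\\b' == r'\bcat\b')       # True
```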
Python re.split() escaping backslash
Your first backslash is escaping the second at the level of the string literal. But the regex engine needs that backslash escaped also, since it's a special character for regex too.
Use a "raw" string literal (e.g. r' |/|\\') or quadruple backslash.
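Both spellings give the same pattern. A quick check, using an input string invented for the example:

```python
import re

pattern = r' |/|\\'                      # space, slash, or one literal backslash
print(pattern == ' |/|\\\\')             # True: the quadruple-backslash spelling

parts = re.split(pattern, 'a b/c\\d')    # the input text is: a b/c\d
print(parts)                             # ['a', 'b', 'c', 'd']
```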
How to quote backslash in Python code (four \ to quote one)?
You should only need to escape something once if you specify it to be a raw (r) string.
regex = r"C:\\ghs\\comp_201416\\([a-z]*)\.exe"
\ is escaped once, so it looks like \\; for .exe, only the . needs escaping, so \.
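As a quick sanity check, that pattern can be tried against a sample path. The file name abc.exe is hypothetical and only serves the example.

```python
import re

regex = r"C:\\ghs\\comp_201416\\([a-z]*)\.exe"
path = r"C:\ghs\comp_201416\abc.exe"       # hypothetical file name

match = re.match(regex, path)
print(match.group(1))                       # abc

# Because the dot is escaped, it only matches a real dot.
print(re.match(regex, r"C:\ghs\comp_201416\abcXexe"))   # None
```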