Confused About Backslashes in Regular Expressions

The confusion arises because the backslash character \ is used as an escape at two different levels. First, the Python interpreter itself processes \ escapes in string literals before the re module ever sees your string. For instance, \n is converted to a newline character, \t is converted to a tab character, etc. To get an actual \ character, you escape it as well, so \\ gives a single \ character. If the character following the \ isn't a recognized escape character, then the \ is kept in the string and passed through unchanged, but I don't recommend depending on this: Python 3.6 and later emit a warning for such unrecognized escapes. Instead, always escape your \ characters by doubling them, i.e. \\.

If you want to see how Python is expanding your string escapes, just print out the string. For example:

s = 'a\\b\tc'
print(s)

If s is part of an aggregate data type, e.g. a list or a tuple, and if you print that aggregate, Python will enclose the string in single quotes and will include the \ escapes (in a canonical form), so be aware of how your string is being printed. If you just type a quoted string into the interpreter, it will also display it enclosed in quotes with \ escapes.
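As a small sketch of the difference between the two displays, the snippet below prints the same string directly, through repr(), and inside a list. It reuses the string s from the example above and introduces nothing else.

```python
s = 'a\\b\tc'      # five characters: a, backslash, b, tab, c

print(len(s))      # 5
print(s)           # the characters themselves: a\b, a tab, then c
print(repr(s))     # quoted form with escapes: 'a\\b\tc'
print([s])         # containers show their items via repr: ['a\\b\tc']
```

The doubled backslash in the last two lines is only how the string is displayed. The string itself still holds a single backslash.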

Once you know how your string is being encoded, you can then think about what the re module will do with it. For instance, if you want to escape \ in a string you pass to the re module, you will need to pass \\ to re, which means you will need to use \\\\ in your quoted Python string. The Python string will end up with \\ and the re module will treat this as a single literal \ character.
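To make the two levels concrete, here is a minimal sketch. The names pattern and text are just illustrative, and the sample path is made up.

```python
import re

pattern = '\\\\'            # four backslashes in the source code
print(len(pattern))         # 2: the string itself holds two backslashes

text = 'C:\\temp'           # the text C:\temp, containing one backslash
match = re.search(pattern, text)
print(match.start())        # 2: position of the backslash in text
print(len(match.group()))   # 1: the regex matched one literal backslash
```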

An alternative way to include \ characters in Python strings is to use raw strings, e.g. r'a\b' is equivalent to "a\\b".
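A quick check of that equivalence, as a sketch with illustrative variable names:

```python
plain = "a\\b"     # backslash escaped by doubling
raw = r'a\b'       # raw string: the backslash is kept as written

print(plain == raw)        # True
print(len(raw))            # 3
print(list(raw))           # ['a', '\\', 'b']

# Without the r prefix, \b is the backspace character instead.
print('a\b' == 'a\x08')    # True
print(len('a\b'))          # 2
```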

Can't escape the backslash with regex?

If you're putting this in a string within a program, you may actually need to use four backslashes (because the string parser will remove two of them when "de-escaping" it for the string, and then the regex needs two for an escaped regex backslash).

For instance:

regex("\\\\")

is interpreted as...

regex("\\" [escaped backslash] followed by "\\" [escaped backslash])

is interpreted as...

regex(\\)

is interpreted as a regex that matches a single backslash.


Depending on the language, you might be able to use a different form of quoting that doesn't parse escape sequences to avoid having to use as many - for instance, in Python:

re.compile(r'\\')

The r in front of the quotes makes it a raw string which doesn't parse backslash escapes.
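A short sketch showing that the raw and the four-backslash spellings produce the same pattern. The sample path is made up for illustration.

```python
import re

backslash = re.compile(r'\\')     # raw string: two characters reach re
same = re.compile('\\\\')         # regular string: also two characters

sample = r'C:\Users\guest'        # hypothetical path with two backslashes

print(backslash.pattern == same.pattern)   # True
print(len(backslash.findall(sample)))      # 2
print(backslash.sub('/', sample))          # C:/Users/guest
```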

The backslash character in Regex for Python

Python only recognizes some sequences starting with \ as escape sequences. For example, \d is not a known escape sequence, so in this particular case there is no need to escape the backslash to keep it in the string.

(In Python 3.6) "\d" and "\\d" are equivalent:

>>> "\d" == "\\d"
True
>>> r"\d" == "\\d"
True

Here is a list of all the recognized escape sequences: https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
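Relying on unrecognized escapes is fragile, though. As far as I know, Python 3.6 and later emit a DeprecationWarning for them, and Python 3.12 and later a SyntaxWarning. The sketch below compiles the literal from a source string so that any such warning can be captured and printed. The exact warning category and message depend on your Python version, so treat the printed output as illustrative.

```python
import warnings

source = r"'\d'"    # source text of a non-raw literal containing \d

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = eval(compile(source, "<demo>", "eval"))

print(value == "\\d")    # True: the backslash was kept in the string
for w in caught:
    # Category and wording vary between Python versions.
    print(w.category.__name__, w.message)
```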

Regex Python - Backslash

Must be:

re.sub('\\\\[A-Za-z]+',' ',text)

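Otherwise, the Python string '\\' contains only a single backslash, and the regex engine treats that backslash as its own escape character (here it would escape the following [) instead of matching a literal backslash.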
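Here is a sketch of the difference, using a made-up text value that contains two backslash-prefixed words:

```python
import re

text = 'Energy \\alpha and \\beta decay'    # the text: Energy \alpha and \beta decay

# Four backslashes: the regex sees \\[A-Za-z]+ (a backslash, then letters).
cleaned = re.sub('\\\\[A-Za-z]+', ' ', text)
print(cleaned)

# Two backslashes: the regex sees \[A-Za-z]+ where \[ is a literal bracket,
# so nothing in this text matches.
unchanged = re.sub('\\[A-Za-z]+', ' ', text)
print(unchanged == text)                    # True

# A raw string needs only two backslashes for the same pattern.
print(r'\\[A-Za-z]+' == '\\\\[A-Za-z]+')    # True
```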

Confused about regular expression

Your regex is equivalent to \W*. It matches 0 or more non-alphanumeric characters.

You are using an ordinary Python string literal instead of a raw string. In an ordinary string literal, a backslash has a special meaning, so to get a literal backslash into the string you need to escape it: \\. The regex engine then also treats a backslash as special, so to match a literal backslash you need two of them in the pattern, which means writing \\\\ in the string literal.

So, to match \ followed by 0 or more W, you would need \\\\W* in a string literal. You can simplify this by using a raw string, where \\ is enough to match a literal \. That's because backslashes are not handled in any special way inside a raw string.

The below example will help you understand this:

>>> import re
>>> s = r"\WWWW$$$$"  # raw string, so the leading \ is kept without a warning

# Without raw string
>>> splitter = re.compile('\\W*') # Match non-alphanumeric characters
>>> re.findall(splitter, s)
['\\', '', '', '', '', '$$$$', '']

>>> splitter = re.compile('\\\\W*') # Match `\` followed by 0 or more `W`
>>> re.findall(splitter, s)
['\\WWWW']

# With raw string
>>> splitter = re.compile(r'\W*') # Same as first one. You need a single `\`
>>> re.findall(splitter, s)
['\\', '', '', '', '', '$$$$', '']

>>> splitter = re.compile(r'\\W*') # Same as 2nd. Two `\\` needed.
>>> re.findall(splitter, s)
['\\WWWW']

Why are double backslashes for word boundary regular expressions, but single backslash works for other expressions?

In a Python string literal, \b has a special meaning: it's the backspace character. \s and many other \<something> sequences have no special meaning there, so the \ is kept as a literal backslash.

>>> '\b'
'\x08'
>>> print('123\b456')
12456
>>> '\s'
'\\s'
>>> print('\s')
\s
>>> print('\b')

>>> # nothing visible printed above

To make things easier, you should usually use raw string literals when writing regexes. This generally prevents \ from being interpreted as an escape character in the Python string sense, so that it works properly in your regex. For example:

>>> r'\b'
'\\b'
>>> print(r'\b')
\b

Python re.split() escaping backslash

Your first backslash is escaping the second at the level of the string literal. But the regex engine needs that backslash escaped also, since it's a special character for regex too.

Use a "raw" string literal (e.g. r' |/|\\') or quadruple backslash.
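A sketch of both working spellings, plus the failing one. The path value is made up for illustration.

```python
import re

path = 'usr/local bin\\python'             # the text: usr/local bin\python

parts_raw = re.split(r' |/|\\', path)      # raw string
parts_quad = re.split(' |/|\\\\', path)    # quadruple backslash, same regex
print(parts_raw)                           # ['usr', 'local', 'bin', 'python']
print(parts_raw == parts_quad)             # True

# With only two backslashes, the regex ends in a single, dangling
# backslash, which the re module rejects.
try:
    re.split(' |/|\\', path)
except re.error as exc:
    failed = True
    print('invalid pattern:', exc)
else:
    failed = False
```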

How to quote backslash in Python code (four \ to quote one)?

If you write the pattern as a raw string (the r prefix), you only need to escape each backslash once, for the regex engine.

regex = r"C:\\ghs\\comp_201416\\([a-z]*)\.exe"

Each literal \ is escaped once, so it is written as \\. In .exe, only the . needs escaping, so it is written as \. in the pattern.
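A quick sketch of that pattern in use. The file name ccarm.exe is a hypothetical example, not something from the question.

```python
import re

regex = r"C:\\ghs\\comp_201416\\([a-z]*)\.exe"

path = r"C:\ghs\comp_201416\ccarm.exe"     # hypothetical file name
m = re.match(regex, path)
print(m.group(1))                          # ccarm

# Because the dot is escaped, it only matches a real dot.
print(re.match(regex, r"C:\ghs\comp_201416\ccarmXexe"))    # None

# The same pattern without a raw string needs four backslashes per \.
non_raw = "C:\\\\ghs\\\\comp_201416\\\\([a-z]*)\\.exe"
print(non_raw == regex)                    # True
```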


