Python Regex - R Prefix

What does the r in Python's re.compile(r' pattern flags') mean?

As @PauloBu stated, the r string prefix is not specific to regexes; it applies to string literals in Python generally.

Normal strings use the backslash character as an escape character for special characters (like newlines):

>>> print('this is \n a test')
this is
 a test

The r prefix tells the interpreter not to do this:

>>> print(r'this is \n a test')
this is \n a test
>>>

This matters for regular expressions because the backslash has to reach the re module intact. In particular, \b matches the empty string at the start or end of a word. The re module expects the two-character string \b, but in a normal string literal '\b' is converted to the ASCII backspace character, so you need to either escape the backslash explicitly ('\\b') or tell Python it is a raw string (r'\b').

>>> import re
>>> re.findall('\b', 'test') # the backslash gets consumed by the python string interpreter
[]
>>> re.findall('\\b', 'test') # backslash is explicitly escaped and is passed through to re module
['', '']
>>> re.findall(r'\b', 'test') # often this syntax is easier
['', '']

Python Regex escape operator \ in substitutions & raw strings

First and foremost,

replacement patterns ≠ regular expression patterns

We use a regex pattern to search for matches; we use a replacement pattern to replace the matches found with the regex.

NOTE: The only special character in a substitution pattern is a backslash, \. Only the backslash must be doubled.

Replacement pattern syntax in Python

The re.sub docs are confusing because they mention string escape sequences that can be used in replacement patterns (like \n, \r), regex escape sequences (like the backreference \6), and unknown escapes such as \&, which are left alone.

I use the term regex escape sequence for an escape sequence consisting of a literal backslash plus a character, that is, '\\X' or r'\X', and the term string escape sequence for a \ followed by a character or characters that together form a valid string escape sequence. String escape sequences are only recognized in regular string literals. In raw string literals, a backslash can only keep a quote character from ending the literal, and the backslash still remains part of the string; that is also why a raw string literal cannot end with a single backslash.
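The raw string quoting rule can be checked directly. This is a small sketch; the path value is a made-up example.

```python
# In a raw string, a backslash before a quote keeps the quote from ending
# the literal, but the backslash itself stays in the string.
quoted = r'\''
print(len(quoted))        # 2: a backslash and a single quote
print(quoted == "\\'")    # True

# A raw string literal cannot end in a single backslash (r'\' is a syntax
# error), so a trailing backslash has to be appended some other way.
path_prefix = r'C:\temp' + '\\'   # hypothetical example value
print(path_prefix[-1] == '\\')    # True
```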

So, in a replacement pattern, you may use backreferences:

re.sub(r'\D(\d)\D', r'\1', 'a1b')      # => '1'
re.sub(r'\D(\d)\D', '\\1', 'a1b')      # => '1'
re.sub(r'\D(\d)\D', '\g<1>', 'a1b')    # => '1'
re.sub(r'\D(\d)\D', r'\g<1>', 'a1b')   # => '1'

You can see that r'\1' and '\\1' are the same replacement pattern, \1. If you use '\1', it is parsed as a string escape sequence, the character with octal value 001. If you forget the r prefix with the unambiguous backreference \g<1>, the result is still correct, because \g is not a valid string escape sequence, so the \ remains in the string (recent Python versions do emit a warning for such unrecognized escapes). The Python docs on string literals say:

Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the result.

So, when you pass '\.' as a replacement string, you actually send the two-character combination \. as the replacement, and that is why you get \. in the result.
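The difference between '\1' and r'\1' can be verified with a short sketch that reuses the pattern from the examples above; the variable names are only for illustration.

```python
import re

# '\1' in a normal literal is an octal escape: one character, code point 1.
print('\1' == chr(1))           # True
print(len('\1'), len(r'\1'))    # 1 2

# Only the two-character form is a backreference for re.sub.
backref_result = re.sub(r'\D(\d)\D', r'\1', 'a1b')
octal_result = re.sub(r'\D(\d)\D', '\1', 'a1b')

print(backref_result)           # 1
print(octal_result == '\x01')   # True
```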

\ is a special character in a Python replacement pattern

If you use re.sub(r'\s+\.', r'\\.', text), you will get the same result as in the text2 and text3 cases.

That happens because \\, two literal backslashes, denotes a single backslash in the replacement pattern. If your regex pattern has no Group 2 but you pass r'\2' as the replacement, intending to insert a literal \ followed by 2, you will get an error.
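Both behaviours can be reproduced in a few lines. This is a sketch with made-up input strings, not the text from the original question.

```python
import re

# r'\\.' is backslash, backslash, dot; re.sub turns the doubled backslash
# into a single literal backslash.
result = re.sub(r'\s+\.', r'\\.', 'end .')
print(result)         # end\.
print(len(result))    # 5

# r'\2' is read as a reference to group 2, but the pattern has only one group.
try:
    re.sub(r'(a)', r'\2', 'abc')
    error_raised = False
except re.error:
    error_raised = True
print(error_raised)   # True
```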

Thus, when you have dynamic, user-defined replacement patterns that are meant to be inserted as literal strings, you need to double all backslashes in them:

re.sub(some_regex, some_replacement.replace('\\', '\\\\'), input_string)
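Here is a runnable sketch of that idea. The helper name literal_sub and the sample values are hypothetical; the second half shows an alternative that relies on re.sub accepting a function as the replacement, whose return value is inserted without any escape processing.

```python
import re

def literal_sub(pattern, replacement, text):
    """Replace matches with `replacement` taken literally (hypothetical helper)."""
    return re.sub(pattern, replacement.replace('\\', '\\\\'), text)

user_replacement = r'C:\1\new'   # made-up user input containing backslashes

escaped = literal_sub(r'PATH', user_replacement, 'dir=PATH')
print(escaped)                   # dir=C:\1\new

# Alternative: pass a function; its return value is used as is.
via_function = re.sub(r'PATH', lambda m: user_replacement, 'dir=PATH')
print(via_function == escaped)   # True
```

Without the doubling, \1 in the user input would be treated as a backreference and \n as a newline.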

Python - Should I be using string prefix r when looking for a period (full stop or .) using regex?

The raw string notation is just that, a notation to specify a string value. The notation results in different string values when it comes to backslash escapes recognized by the normal string notation. Because regular expressions also attach meaning to the backslash character, raw string notation is quite handy as it avoids having to use excessive escaping.

Quoting from the Python Regular Expression HOWTO:

The solution is to use Python’s raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.

The \. combination has no special meaning in regular Python strings, so '\.' and r'\.' produce the same value. Recent Python versions do warn about the unrecognized escape in '\.', so the raw form is the safer habit:

>>> len('\.')
2
>>> len(r'\.')
2

Raw strings only make a difference when the backslash + other characters do have special meaning in regular string notation:

>>> '\b'
'\x08'
>>> r'\b'
'\\b'
>>> len('\b')
1
>>> len(r'\b')
2

The \b combination has special meaning; in a regular string it is interpreted as the backspace character. But regular expressions see \b as a word boundary anchor, so you'd have to use \\b in your Python string every time you wanted to use this in a regular expression. Using r'\b' instead makes it much easier to read and write your expressions.

The regular expression functions are passed string values, that is, the result of Python interpreting your string literal. The functions do not know whether you used raw or normal string literal syntax.
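A quick way to see this is to compile the same pattern from both notations and compare what the re module received. The sample text is made up.

```python
import re

raw = re.compile(r'\bfoo\b')
escaped = re.compile('\\bfoo\\b')

# Both literals produce the same string value, so re sees the same pattern.
print(raw.pattern == escaped.pattern)   # True

matches = raw.findall('foo food foo')
print(matches)                          # ['foo', 'foo']
```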

Python regular expression r prefix followed by three single (or double) quotes

If your pattern is surrounded by triple quotes, it won't need escaping of quotes present inside the regex.

The simple version:

r'''foo"'b'a'r"buzz'''

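The harder version, which needs the quotes escaped (in a raw string the backslashes stay in the pattern, where \' still matches a quote):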

r'foo"\'b\'a\'r"buzz'

This is most helpful when your regex contains many quotes.
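The two literals above are not the same string, but they behave the same as regex patterns. A small sketch to confirm this:

```python
import re

triple = r'''foo"'b'a'r"buzz'''
single = r'foo"\'b\'a\'r"buzz'

# The single-quoted raw literal keeps its three backslashes, so the strings differ...
print(triple == single)            # False
print(len(triple), len(single))    # 15 18

# ...but as regex patterns both match the same text, because \' matches a quote.
target = triple   # the text itself contains no regex metacharacters
both_match = bool(re.fullmatch(triple, target)) and bool(re.fullmatch(single, target))
print(both_match)                  # True
```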

Why does Regex raw string prefix r not work as expected?

The reason "\d+" works is that "\d" is not a recognized escape sequence in Python strings, so Python simply treats it as a backslash followed by a "d" instead of raising a syntax error.

So "\d", "\\d" and r"\d" are all equivalent and represent a string containing one backslash and one d. The regex engine then sees this backslash + "d" and interprets it as "match any digit".
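The equivalence can be checked in code, together with the two-backslash form for contrast. This sketch uses a made-up sample string and avoids writing "\d" in a normal literal, since recent Python versions warn about it.

```python
import re

# One backslash + d, spelled two ways.
print("\\d" == r"\d")        # True
# Two backslashes + d, spelled two ways.
print("\\\\d" == r"\\d")     # True

sample = r'7 and \d'         # made-up text: a digit, then a literal backslash + d

digits = re.findall(r"\d", sample)      # regex sees \d: any digit
literal = re.findall(r"\\d", sample)    # regex sees \\d: a backslash, then d

print(digits)    # ['7']
print(literal)   # ['\\d']
```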

"\\\d", "\\\\d" and r"\\d", on the other hand, all contain two backslashes followed by a "d". This tells the regex engine to match a backslash followed by a "d".

What does 'r' mean before a Regex pattern?

Placing r or R before a string literal creates what is known as a raw-string literal. Raw strings do not process escape sequences (\n, \b, etc.) and are thus commonly used for Regex patterns, which often contain a lot of \ characters.

Below is a demonstration:

>>> print('\n') # Prints a newline character

>>> print(r'\n') # Escape sequence is not processed
\n
>>> print('\b') # Prints a backspace character

>>> print(r'\b') # Escape sequence is not processed
\b
>>>

The only other option would be to double every backslash:

>>> re.sub('def\\s+([a-zA-Z_][a-zA-Z_0-9]*)\\s*\\(\\s*\\):',
...        'static PyObject*\\npy_\\1(void)\\n{',
...        'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'

which is just tedious.
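For comparison, this sketch runs the doubled-backslash call next to its raw string equivalent and confirms they give the same result.

```python
import re

doubled = re.sub('def\\s+([a-zA-Z_][a-zA-Z_0-9]*)\\s*\\(\\s*\\):',
                 'static PyObject*\\npy_\\1(void)\\n{',
                 'def myfunc():')

raw = re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
             r'static PyObject*\npy_\1(void)\n{',
             'def myfunc():')

print(doubled == raw)   # True
print(repr(raw))        # 'static PyObject*\npy_myfunc(void)\n{'
```

In both calls re.sub itself converts the \n in the replacement into a newline character.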


