What Exactly Is a "Raw String Regex" and How to Use It

What exactly is a raw string regex and how can you use it?

Zarkonnen's response does answer your question, but not directly. Let me try to be more direct, and see if I can grab the bounty from Zarkonnen.

You will perhaps find this easier to understand if you stop using the terms "raw string regex" and "raw string patterns". These terms conflate two separate concepts: the representations of a particular string in Python source code, and what regular expression that string represents.

In fact, it's helpful to think of these as two different programming languages, each with their own syntax. The Python language has source code that, among other things, builds strings with certain contents, and calls the regular expression system. The regular expression system has source code that resides in string objects, and matches strings. Both languages use backslash as an escape character.

First, understand that a string is a sequence of characters (i.e. bytes or Unicode code points; the distinction doesn't much matter here). There are many ways to represent a string in Python source code. A raw string is simply one of these representations. If two representations result in the same sequence of characters, they produce equivalent behaviour.

Imagine a 2-character string, consisting of the backslash character followed by the n character. If you know that the character value for backslash is 92, and for n is 110, then this expression generates our string:

s = chr(92)+chr(110)
print len(s), s

2 \n

The conventional Python string notation "\n" does not generate this string. Instead it generates a one-character string with a newline character. The Python docs 2.4.1. String literals say, "The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character."

s = "\n"
print len(s), s

1
 

(Note that the newline isn't visible in this example, but if you look carefully, you'll see a blank line after the "1".)

To get our two-character string, we have to use another backslash character to escape the special meaning of the original backslash character:

s = "\\n"
print len(s), s

2 \n

What if you want to represent strings that have many backslash characters in them? Python docs 2.4.1. String literals continue, "String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences." Here is our two-character string, using raw string representation:

s = r"\n"
print len(s), s

2 \n

So we have three different string representations, all giving the same string, or sequence of characters:

print chr(92)+chr(110) == "\\n" == r"\n"
True
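For anyone on Python 3, here is roughly the same comparison as a small sketch (the variable names are mine, not from the docs):

```python
# Three source-code spellings of one two-character string: backslash, then n.
from_codes = chr(92) + chr(110)
escaped = "\\n"
raw = r"\n"

# All three expressions build the same sequence of characters.
print(from_codes == escaped == raw)
print(len(raw), [ord(c) for c in raw])
```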

Now, let's turn to regular expressions. The Python docs, 7.2. re - Regular expression operations, say, "Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals..."

If you want a Python regular expression object which matches a newline character, then you need a 2-character string, consisting of the backslash character followed by the n character. The following lines of code all set prog to a regular expression object which recognises a newline character:

prog = re.compile(chr(92)+chr(110))
prog = re.compile("\\n")
prog = re.compile(r"\n")
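A quick way to convince yourself, sketched in Python 3 syntax (the names patterns, compiled, same_source and all_match are only illustrative):

```python
import re

patterns = [chr(92) + chr(110), "\\n", r"\n"]
compiled = [re.compile(p) for p in patterns]

# Each compiled object carries the same two-character pattern text...
same_source = all(c.pattern == compiled[0].pattern for c in compiled)
# ...and each one matches a string consisting of a single newline.
all_match = all(c.fullmatch("\n") is not None for c in compiled)
print(same_source, all_match)
```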

So why is it that "Usually patterns will be expressed in Python code using this raw string notation."? Because regular expressions are frequently static strings, which are conveniently represented as string literals. And from the different string literal notations available, raw strings are a convenient choice, when the regular expression includes a backslash character.

Questions

Q: what about the expression re.compile(r"\s\tWord")? A: It's easier to understand by separating the string from the regular expression compilation, and understanding them separately.

s = r"\s\tWord"
prog = re.compile(s)

The string s contains eight characters: a backslash, an s, a backslash, a t, and then four characters Word.

Q: What happens to the tab and space characters? A: At the Python language level, string s contains no tab or space character. It starts with four characters: backslash, s, backslash, t. The regular expression system, meanwhile, treats that string as source code in the regular expression language, where it means "match a string consisting of a whitespace character, a tab character, and the four characters Word".

Q: How do you match those if that's being treated as backslash-s and backslash-t? A: Maybe the question is clearer if the words 'you' and 'that' are made more specific: how does the regular expression system match the expressions backslash-s and backslash-t? As 'any whitespace character' and as 'tab character'.
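To see both levels at once, here is a small sketch in Python 3 syntax (the sample subject strings are made up for illustration):

```python
import re

s = r"\s\tWord"
print(len(s))  # 8 characters: backslash, s, backslash, t, W, o, r, d

prog = re.compile(s)
# In the regex language, \s accepts any whitespace and \t accepts only a tab.
hit = prog.search("a \tWord")    # space, then tab, then Word
miss = prog.search("a  Word")    # two spaces, no tab
print(hit is not None, miss is None)
```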

Q: Or what if you have the 3-character string backslash-n-newline? A: In the Python language, the 3-character string backslash-n-newline can be represented as conventional string "\\n\n", or raw plus conventional string r"\n" "\n", or in other ways. Used as a pattern, the 3-character string backslash-n-newline matches any two consecutive newline characters.
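A sketch of that last case in Python 3 syntax, assuming the default flags (no re.VERBOSE); the subject strings are illustrative:

```python
import re

p1 = "\\n\n"        # conventional literal
p2 = r"\n" "\n"     # adjacent literals are concatenated by the compiler
print(p1 == p2, len(p1))

prog = re.compile(p1)
one = prog.search("a\nb")      # only one newline: no match
two = prog.search("a\n\nb")    # two consecutive newlines: match
print(one, two.span())
```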

N.B. All examples and document references are to Python 2.7.

Update: Incorporated clarifications from answers of @Vladislav Zorov and @m.buettner, and from follow-up question of @Aerovistae.

When to use raw strings in regex patterns?

Another example is sequences like \1 and \2, which are octal escapes in Python string literals but refer to captured groups in regular expressions.

>>> re.search(r"(\w+) \1", "the the")
<_sre.SRE_Match object; span=(0, 7), match='the the'>
>>> re.search("(\w+) \1", "the the")
>>>
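If it helps, you can inspect what each literal actually hands to re. This is only a sketch: I've written "\\w" in the non-raw literal so that newer Pythons don't warn about the unrecognized \w escape, which leaves \1 as the only difference.

```python
import re

plain = "(\\w+) \1"   # "\1" is an octal escape: it becomes the character chr(1)
raw = r"(\w+) \1"     # here backslash-1 survives and reaches the regex engine

print(len(plain), len(raw))   # 7 8
print(repr(plain[-1]))        # the control character, not a backreference

no_match = re.search(plain, "the the")
match = re.search(raw, "the the")
print(no_match, match.group(0))
```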

understanding raw string for regular expressions in python

Don't double the backslash when using raw string:

>>> pattern3 = r'\n\n'
>>> pattern3
'\\n\\n'
>>> re.findall(pattern3, text)
['\n\n']
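Since text isn't shown above, here is a self-contained sketch with a made-up text that contains both two real newlines and a literal backslash-n-backslash-n, so you can see what doubling the backslashes in a raw string does:

```python
import re

# Two real newlines in the middle; literal backslash-n-backslash-n at the end.
text = "first\n\nsecond, then a literal \\n\\n"

newlines = re.findall(r"\n\n", text)     # regex \n\n: two newline characters
literal = re.findall(r"\\n\\n", text)    # regex \\n\\n: backslash, n, backslash, n
print(repr(newlines), repr(literal))
```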

Raw string and regular expression in Python

You're getting confused by the difference between a string and a string literal.

A string literal is what you put between " or ' in your source code; the python interpreter parses it and puts the resulting string into memory. If you mark your string literal as a raw string literal (using the r prefix) then the python interpreter will not process backslash escapes in it before putting it into memory, but once they've been parsed both kinds are stored exactly the same way.

This means that in memory there is no such thing as a raw string. Both the following strings are stored identically in memory with no concept of whether they were raw or not.

r'a regex digit: \d'  # a regex digit: \d
'a regex digit: \\d' # a regex digit: \d

Both these strings contain \d and there is nothing to say that this came from a raw string. So when you pass this string to the re module it sees that there is a \d and sees it as a digit because the re module does not know that the string came from a raw string literal.
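A minimal check of this, as a sketch (the names a, b and m are just for illustration):

```python
import re

a = r'a regex digit: \d'
b = 'a regex digit: \\d'

# Same characters, same type: nothing records which literal form was used.
print(a == b, type(a) is type(b) is str)

# re sees backslash-d either way and treats it as "any digit".
m = re.search(b, 'a regex digit: 7')
print(m.group(0))
```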

In your specific example, to get a literal backslash followed by a literal d you would use \\d like so:

import re

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
text2_re = re.sub(r'(\\d+)/(\\d+)/(\\d+)', r'\3-\1-\2', text2)
print (text2_re) #output: Today is 11/27/2012. PyCon starts 3/13/2013.

Alternatively, without using raw strings:

import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
text_re = re.sub('(\\d+)/(\\d+)/(\\d+)', '\\3-\\1-\\2', text)
print (text_re) #output: Today is 2012-11-27. PyCon starts 2013-3-13.

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
text2_re = re.sub('(\\\\d+)/(\\\\d+)/(\\\\d+)', '\\3-\\1-\\2', text2)
print (text2_re) #output: Today is 11/27/2012. PyCon starts 3/13/2013.

I hope that helps somewhat.

Edit: I didn't want to complicate things but because \d is not a valid escape sequence python does not change it, so '\d' == r'\d' is true. Since \\ is a valid escape sequence it gets changed to \, so you get the behaviour '\d' == '\\d' == r'\d'. Strings get confusing sometimes.

Edit2: To answer your edit, let's look at each line specifically:

text2_re = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)

re.sub receives the two strings (\d+)/(\d+)/(\d+) and \3-\1-\2. Hopefully this behaves as you expect now.

text2_re1 = re.sub('(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)

Again (because \d is not a valid string escape it doesn't get changed, see my first edit) re.sub receives the two strings (\d+)/(\d+)/(\d+) and \3-\1-\2. Since \d doesn't get changed by the python interpreter r'(\d+)/(\d+)/(\d+)' == '(\d+)/(\d+)/(\d+)'. If you understand my first edit then hopefully you should understand why these two cases behave the same.

text2_re2 = re.sub(r'(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)

This case is a bit different because \1, \2 and \3 are all valid escape sequences: each is replaced with the character whose octal code is given by the number. That's quite complex but it basically boils down to:

\1  # stands for the ascii start-of-heading character
\2 # stands for the ascii start-of-text character
\3 # stands for the ascii end-of-text character

This means that re.sub receives the first string as it has done in the first two examples ((\d+)/(\d+)/(\d+)) but the second string is actually <end-of-text>-<start-of-heading>-<start-of-text>. So re.sub replaces the match with that second string exactly, but since none of the three (\1, \2 or \3) are printable characters, python just prints a stock place-holder character instead.

text2_re3 = re.sub('(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)

This behaves like the third example because r'(\d+)/(\d+)/(\d+)' == '(\d+)/(\d+)/(\d+)', as explained in the second example.
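To make the difference between the raw and non-raw replacement strings visible, here is a sketch that prints the repr of the result (I keep the pattern raw throughout, since that part behaves the same in all four cases):

```python
import re

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
pattern = r'(\d+)/(\d+)/(\d+)'

good = re.sub(pattern, r'\3-\1-\2', text2)   # group references reach re.sub
bad = re.sub(pattern, '\3-\1-\2', text2)     # control characters reach re.sub

print(good)
print(repr(bad))   # repr shows the control characters as \x03, \x01, \x02
```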

how to indicate raw string with regex() if my pattern come from another string?

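There is no raw string at run time, but you can use the str_to_raw function below to turn the control characters in an already declared plain string back into the backslash sequences a raw literal would have kept: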

import re
a = 'de la matière condensée'
pattern = '\bconden'

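# Maps each character produced by a Python escape sequence back to the
# two-character backslash form. Entries for \c, \8 and \9 are left out:
# they are not Python escapes, so they could never match a single character.
escape_dict = {
    '\a': r'\a',
    '\b': r'\b',
    '\f': r'\f',
    '\n': r'\n',
    '\r': r'\r',
    '\t': r'\t',
    '\v': r'\v',
    '\'': r'\'',
    '\"': r'\"',
    '\0': r'\0',
    '\1': r'\1',
    '\2': r'\2',
    '\3': r'\3',
    '\4': r'\4',
    '\5': r'\5',
    '\6': r'\6',
    '\7': r'\7',
}

def str_to_raw(s):
    # Replace each escaped character with its backslash form; keep the rest.
    return ''.join(escape_dict.get(c, c) for c in s)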

print(re.search(r'\bconden', a))
print(re.search(str_to_raw(pattern), a))

Output:

<re.Match object; span=(14, 20), match='conden'>
<re.Match object; span=(14, 20), match='conden'>

note: I got escape_dict from this page.
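To see why the plain pattern fails in the first place, here is a short sketch that looks at what '\bconden' actually contains. The single replace call is just an illustration for this one pattern, not a general substitute for the function above.

```python
import re

a = 'de la matière condensée'
pattern = '\bconden'

# Python already turned \b into a backspace character (code 8).
print(len(pattern), repr(pattern[0]))

# As a regex, a literal backspace only matches a backspace in the text.
plain_result = re.search(pattern, a)

# Putting backslash-b back gives the regex word-boundary assertion.
fixed = pattern.replace('\b', r'\b')
fixed_result = re.search(fixed, a)
print(plain_result, fixed_result.span())
```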

How to notate raw string regex with a called instance

Raw strings aren't a different data type - they are just an alternative way to write certain strings, making it less complex to express literal string values in your program code. Since regular expressions often contain backslashes, raw strings are frequently used as it avoids the need to write \\ for each backslash.

If you want to match arbitrary text fragments then you probably shouldn't be using regular expressions at all. I'd take a look at the startswith string method, since that just does a character-for-character comparison and is therefore much faster. And there's also the equivalent of re.search, should you need it, using the in keyword.
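For instance, with a hypothetical fragment that happens to be full of regex metacharacters, the plain string operations just compare characters:

```python
# Hypothetical user-supplied fragment; "." and "*" mean nothing special here.
fragment = "a.b*c"

starts = "a.b*c = 42".startswith(fragment)   # prefix test
contains = fragment in "x = a.b*c"           # substring test, like re.search
not_fooled = fragment in "aXbbbc"            # a regex a.b*c would match this
print(starts, contains, not_fooled)
```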

You might be interested in this article by a regular expression devotee. Regular expressions are indeed great, but they shouldn't be the first tool you reach for in string matching problems.

If it became necessary for some reason to use regexen then you'd be interested in the re.escape function, which will quote special characters so they are interpreted as standard characters rather than having their standard regex meaning.
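A small sketch of the difference re.escape makes, using a made-up fragment and subject strings:

```python
import re

fragment = "a.b*c"   # hypothetical text we want to find literally

# Unescaped, "." and "*" act as regex operators, so this matches.
unescaped = re.search(fragment, "aXbbbc")

# Escaped, only the literal five characters a . b * c can match.
escaped = re.search(re.escape(fragment), "aXbbbc")
literal = re.search(re.escape(fragment), "x = a.b*c")
print(unescaped is not None, escaped, literal.span())
```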

Python raw strings and unicode : how to use Web input as regexp patterns?

Apart from possibly having to encode Unicode properly (in Python 2.*), no processing is needed because there is no specific type for "raw strings" -- it's just a syntax for literals, i.e. for string constants, and you don't have any string constants in your code snippet, so there's nothing to "process".
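As a sketch of what that means in practice (Python 3 syntax; user_pattern stands in for whatever arrives from the web form, which I'm only simulating here):

```python
import re

# Stand-in for a pattern typed into a web form: it is built at run time,
# so no string-literal escape processing ever touches it.
user_pattern = chr(92) + "d+"   # backslash, d, plus

numbers = re.findall(user_pattern, "tel 555 1234")
print(numbers)
```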

Why do Python regex strings sometimes work without using raw strings?

The example above works because \s and \d are not escape sequences in python. According to the docs:

Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the string. 

But it's best to just use raw strings and not worry about what is or isn't a python escape, or worry about changing it later if you change the regex.
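Here is a sketch of the difference between an unrecognized and a recognized escape. Recent Python 3 releases emit a warning for unrecognized escapes such as \d in a non-raw literal, so I parse the literals from source text with ast.literal_eval and silence warnings; parse_literal is a helper I made up for this, and it assumes the interpreter still accepts such literals.

```python
import ast
import warnings

def parse_literal(source):
    """Evaluate a string literal given as source text, ignoring warnings."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return ast.literal_eval(source)

kept = parse_literal(r"'\d'")      # \d is not a Python escape: backslash stays
changed = parse_literal(r"'\b'")   # \b is a Python escape: becomes backspace

print(len(kept), len(changed))
print(kept == r"\d", changed == r"\b")
```

With a raw string you never have to remember which group a given letter falls into.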


