Do Regular Expressions from the re Module Support Word Boundaries (\b)

Do regular expressions from the re module support word boundaries (\b)?

You should be using raw strings in your code:

>>> import re
>>> x = 'one two three'
>>> y = re.search(r"\btwo\b", x)
>>> y
<_sre.SRE_Match object at 0x100418a58>
>>>

Also, why don't you try compiling the pattern from a variable:

word = 'two'
re.compile(r'\b%s\b' % word, re.I)

Output:

>>> word = 'two'
>>> k = re.compile(r'\b%s\b' % word, re.I)
>>> x = 'one two three'
>>> y = k.search( x)
>>> y
<_sre.SRE_Match object at 0x100418850>
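One caveat when building the pattern by interpolation: if word might contain regex metacharacters (a dot or a plus sign, say), it is probably safer to pass it through re.escape first. A minimal sketch, assuming Python 3; the helper name whole_word_pattern is made up for illustration and is not part of the re module:

```python
import re

def whole_word_pattern(word):
    """Compile a case-insensitive pattern matching `word` as a whole word.

    Hypothetical helper: re.escape neutralises any regex metacharacters
    in `word` before it is placed between the two \\b assertions.
    """
    return re.compile(r'\b%s\b' % re.escape(word), re.I)

k = whole_word_pattern('two')
print(k.search('one TWO three'))   # a match object (re.I ignores case)
print(k.search('network'))         # None: 'two' sits inside a longer word
```

Note that \b still expects a word character at each edge of the word, so this sketch suits words that start and end with a letter, digit or underscore.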

Why doesn't \b work in the Python re module?

You need to use a raw string, or else the \b is interpreted as a string escape. Use r'\baaa\b'. (Alternatively, you can write '\\b', but that is much more awkward for longer regexes.)
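To see the difference concretely, here is a small sketch (assuming Python 3; the sample text is made up) that compares the three spellings of the pattern:

```python
import re

text = 'xx aaa yy'

# Non-raw literal: each '\b' becomes the backspace character (\x08),
# so the regex looks for backspace, 'aaa', backspace.
not_raw = '\baaa\b'
# A raw literal and doubled backslashes both keep the two characters
# backslash and 'b', which the regex engine reads as a word boundary.
raw = r'\baaa\b'
doubled = '\\baaa\\b'

print(len(not_raw), len(raw))     # 5 7
print(raw == doubled)             # True
print(re.search(not_raw, text))   # None
print(re.search(raw, text))       # a match object for 'aaa'
```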

Python Regex Word Boundaries not working as expected

You need to use a raw string for your regex pattern (raw strings do not process escape sequences):

>>> import re
>>> a = 'Builders Club The Ohio State'
>>> re.sub(r'\bThe\b', '', a, flags=re.IGNORECASE)
'Builders Club Ohio State'
>>>

Otherwise, \b will be interpreted as a backspace character:

>>> print('x\by')
y
>>> print(r'x\by')
x\by
>>>

Word boundary regex issue

The word boundaries match in the following positions:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.
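These three rules can be checked in Python by listing where the empty pattern \b matches. This is only a sketch, assuming Python 3's re module; the sample strings are made up:

```python
import re

text = 'one two'
# Every match of r'\b' is empty, so m.start() is the boundary position.
positions = [m.start() for m in re.finditer(r'\b', text)]
print(positions)   # [0, 3, 4, 7]

# When the string starts and ends with non-word characters,
# there is no boundary at either end of the string.
print([m.start() for m in re.finditer(r'\b', ' hi! ')])   # [1, 3]
```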

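Now, you want to match s.l. when it is preceded by a word boundary and not followed by a word character. A trailing \b after the final dot would only match when a word character follows, which is the opposite of what you want, so replace it with a (?!\w) negative lookahead: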

\bs\.l\.(?!\w)


Use perl=TRUE if you are using base R functions; the pattern works as is in stringr functions, which are powered by the ICU regex library.
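The same pattern also works unchanged in Python's re module. The sketch below (assuming Python 3) compares it with a version that keeps the trailing \b; the sample strings are invented for illustration:

```python
import re

with_b = re.compile(r'\bs\.l\.\b')              # trailing \b, for comparison
with_lookahead = re.compile(r'\bs\.l\.(?!\w)')  # pattern suggested above

samples = ['Acme s.l. opened', 'Acme s.l.', 'Acme s.l.x', 'Acmes.l. opened']
for sample in samples:
    # Columns: sample | trailing \b matches? | lookahead version matches?
    print(sample,
          bool(with_b.search(sample)),
          bool(with_lookahead.search(sample)),
          sep=' | ')
```

The trailing \b version only matches 'Acme s.l.x', which is exactly the case you wanted to exclude, while the lookahead version matches the first two samples only.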

Regex anchor \< versus \b for word boundary

"Does not work" is not correct; one works in some regex dialects, the other in others.

Most "modern" regex dialects (Python, Perl, Ruby, etc) use \b as the word boundary, on both sides.

More traditional regex dialects, like the original egrep, use \< as the left word boundary operator, and \> on the right.

(Strictly speaking, Al Aho's original egrep did not have word boundaries; this feature was added later. Maybe see https://stackoverflow.com/a/39367415/874188 for a one-minute summary of regex history.)
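In Python specifically, \< and \> are not word boundary operators; the re module reads them as escaped literal angle brackets. If you need direction-specific boundaries there, lookarounds can play that role. A sketch, assuming Python 3; the names WORD_START and WORD_END are made up for this example:

```python
import re

text = 'one two three <two>'

# In Python's re, \< and \> are just escaped literal '<' and '>'.
print(re.findall(r'\<two\>', text))    # ['<two>']

# \b works on both sides of the word.
print(re.findall(r'\btwo\b', text))    # ['two', 'two']

# Direction-specific boundaries spelled with lookarounds.
WORD_START = r'(?<!\w)(?=\w)'   # plays the role of \<
WORD_END = r'(?<=\w)(?!\w)'     # plays the role of \>
print(re.findall(WORD_START + 'two' + WORD_END, text))   # ['two', 'two']
```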

How does \b work when using regular expressions?

\b is a zero width assertion. That means it does not match a character, it matches a position with one thing on the left side and another thing on the right side.

The word boundary \b matches on a change from a \w (a word character) to a \W (a non-word character), or from \W to \w. It also matches at the start or end of the string when the adjacent character is a word character.

Which characters are included in \w depends on your regex engine. At a minimum it contains all ASCII letters, all ASCII digits and the underscore. If your regex engine supports Unicode, \w may also contain every character that has the Unicode property letter or number.

\W matches every character that is NOT in \w.
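In Python 3, for example, \w is Unicode-aware by default for str patterns, and the re.ASCII flag restricts it to [a-zA-Z0-9_]. That also moves the positions where \b matches. A small sketch with a made-up sample string (the accented letters are written as escapes so the code does not depend on the file encoding):

```python
import re

# 'café_1 naïve', with the accented letters spelled as escapes
text = 'caf\u00e9_1 na\u00efve'

# Default for str patterns in Python 3: accented letters are word characters.
print(re.findall(r'\w+', text))             # ['café_1', 'naïve']

# With re.ASCII the accented letters become non-word characters,
# so each word is split where they occur.
print(re.findall(r'\w+', text, re.ASCII))   # ['caf', '_1', 'na', 've']
```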

\bbrown\s

will match here

The quick brown fox
         ^^

but not here

The quick bbbbrown fox

because there is no word boundary between b and brown, i.e. no change from a non-word character to a word character: both characters are included in \w.

When the regex engine reaches the \b, it looks at the characters on both sides of the current position. On the right is the b of brown, which is a word character. For the \b to succeed, the character on the left must then be a non-word character (or there must be no character at all, at the start of the string). If it is a space (which is not in \w), the \b before the b succeeds. But if it is another b, the \b fails, and so \bbrown does not match "bbrown".

The regex brown would match both strings "quick brown" and "bbrown", whereas the regex \bbrown matches only "quick brown" and not "bbrown".
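The same comparison, run through Python's re module (a sketch assuming Python 3):

```python
import re

pattern = re.compile(r'\bbrown\s')

# Space before 'brown': there is a boundary, so the pattern matches.
print(pattern.search('The quick brown fox'))       # match object for 'brown '
# 'b' before 'brown': both are word characters, so \b fails.
print(pattern.search('The quick bbbbrown fox'))    # None
# Without \b the plain pattern matches inside 'bbbbrown' as well.
print(re.search(r'brown', 'The quick bbbbrown fox'))   # match object for 'brown'
```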

For more details see here on www.regular-expressions.info

regular expression pattern about word boundary in python

You need to use a raw string here:

>>> import re
>>> re.sub(r"\bor", "*", "organization")
'*ganization'
>>>

Otherwise, Python sees \b as an escape sequence and translates it to \x08 (backspace):

>>> '\b'
'\x08'
>>>

Another solution would be to escape the backslash:

>>> import re
>>> re.sub("\\bor", "*", "organization")
'*ganization'
>>>

Yet another solution, if the match only ever needs to occur at the very start of the string, would be to use ^ in place of \b:

>>> import re
>>> re.sub("^or", "*", "organization")
'*ganization'
>>>

In regex, using ^ like this means "match at the start of the string". Unlike \b, it does not match at the start of words later in the string.
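The two are only interchangeable for a single word, though. With a longer input (made up here for illustration, assuming Python 3) the results differ:

```python
import re

text = 'organization or orbit'

# ^ anchors to the start of the string, so only the first 'or' is replaced.
print(re.sub(r'^or', '*', text))    # *ganization or orbit

# \b matches at the start of every word.
print(re.sub(r'\bor', '*', text))   # *ganization * *bit
```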

In Python, how to replace either an 'a' or 'an' with a number indicating more than one?

You need to escape the backslashes in the \b sequences or use a raw string, i.e.

pattern_fat = r'\ban\b|\ba\b'

I've also removed the superfluous format which I suspect caused this confusion!
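A quick way to try the pattern, assuming Python 3; the sample sentence and the replacement '2' are arbitrary choices for illustration:

```python
import re

pattern_fat = r'\ban\b|\ba\b'
sentence = 'a banana and an apple'

# Only the standalone words 'a' and 'an' are replaced; the letters
# inside 'banana', 'and' and 'apple' are left alone.
print(re.sub(pattern_fat, '2', sentence))        # 2 banana and 2 apple

# Without the raw prefix, \b is a backspace character and nothing matches.
print(re.sub('\ban\b|\ba\b', '2', sentence))     # a banana and an apple
```

The alternation could also be written as r'\ban?\b', which should behave the same way.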

What exactly is a raw string regex and how can you use it?

Zarkonnen's response does answer your question, but not directly. Let me try to be more direct, and see if I can grab the bounty from Zarkonnen.

You will perhaps find this easier to understand if you stop using the terms "raw string regex" and "raw string patterns". These terms conflate two separate concepts: the representation of a particular string in Python source code, and the regular expression that string represents.

In fact, it's helpful to think of these as two different programming languages, each with their own syntax. The Python language has source code that, among other things, builds strings with certain contents, and calls the regular expression system. The regular expression system has source code that resides in string objects, and matches strings. Both languages use backslash as an escape character.

First, understand that a string is a sequence of characters (i.e. bytes or Unicode code points; the distinction doesn't much matter here). There are many ways to represent a string in Python source code. A raw string is simply one of these representations. If two representations result in the same sequence of characters, they produce equivalent behaviour.

Imagine a 2-character string, consisting of the backslash character followed by the n character. If you know that the character value for backslash is 92, and for n is 110, then this expression generates our string:

s = chr(92)+chr(110)
print len(s), s

2 \n

The conventional Python string notation "\n" does not generate this string. Instead it generates a one-character string with a newline character. The Python docs 2.4.1. String literals say, "The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character."

s = "\n"
print len(s), s

1
 

(Note that the newline isn't visible in this example, but if you look carefully, you'll see a blank line after the "1".)

To get our two-character string, we have to use another backslash character to escape the special meaning of the original backslash character:

s = "\\n"
print len(s), s

2 \n

What if you want to represent strings that have many backslash characters in them? Python docs 2.4.1. String literals continue, "String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences." Here is our two-character string, using raw string representation:

s = r"\n"
print len(s), s

2 \n

So we have three different string representations, all giving the same string, or sequence of characters:

print chr(92)+chr(110) == "\\n" == r"\n"
True

Now, let's turn to regular expressions. The Python docs, 7.2. re - Regular expression operations, say, "Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals..."

If you want a Python regular expression object which matches a newline character, you can use a 2-character string, consisting of the backslash character followed by the n character. (A 1-character string holding an actual newline also works, because the regular expression language treats that character as a literal.) The following lines of code all set prog to a regular expression object which recognises a newline character:

prog = re.compile(chr(92)+chr(110))
prog = re.compile("\\n")
prog = re.compile(r"\n")

So why is it that "Usually patterns will be expressed in Python code using this raw string notation."? Because regular expressions are frequently static strings, which are conveniently represented as string literals. And from the different string literal notations available, raw strings are a convenient choice, when the regular expression includes a backslash character.
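For readers on Python 3, where print is a function, the equivalence of the three notations can be checked with a sketch like this (the sample text is made up):

```python
import re

# Three source-code spellings of the same 2-character string.
forms = [chr(92) + chr(110), "\\n", r"\n"]
print(forms[0] == forms[1] == forms[2])   # True
print([len(f) for f in forms])            # [2, 2, 2]

# Each one compiles to a regex that matches a newline character.
for f in forms:
    prog = re.compile(f)
    print(prog.search("line one\nline two") is not None)   # True each time
```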

Questions

Q: What about the expression re.compile(r"\s\tWord")? A: It's easier to understand by separating the string from the regular expression compilation, and understanding them separately.

s = r"\s\tWord"
prog = re.compile(s)

The string s contains eight characters: a backslash, an s, a backslash, a t, and then four characters Word.

Q: What happens to the tab and space characters? A: At the Python language level, string s doesn't contain a tab or a space character. It starts with four characters: backslash, s, backslash, t. The regular expression system, meanwhile, treats that string as source code in the regular expression language, where it means "match a string consisting of a whitespace character, a tab character, and the four characters Word".

Q: How do you match those if that's being treated as backslash-s and backslash-t? A: Maybe the question is clearer if the words 'you' and 'that' are made more specific: how does the regular expression system match the expressions backslash-s and backslash-t? As 'any whitespace character' and as 'tab character'.
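Both levels can be inspected separately. The following is a sketch assuming Python 3; the test strings are made up:

```python
import re

s = r"\s\tWord"

# Python level: eight characters, none of them a tab or a space.
print(len(s))                  # 8
print("\t" in s, " " in s)     # False False

# Regex level: whitespace character, tab character, then 'Word'.
prog = re.compile(s)
print(prog.search(" \tWord") is not None)    # True: space, then tab
print(prog.search("\n\tWord") is not None)   # True: newline is whitespace too
print(prog.search("  Word") is not None)     # False: no tab before 'Word'
```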

Q: Or what if you have the 3-character string backslash-n-newline? A: In the Python language, the 3-character string backslash-n-newline can be represented as conventional string "\\n\n", or raw plus conventional string r"\n" "\n", or in other ways. The regular expression system matches the 3-character string backslash-n-newline when it finds any two consecutive newline characters.
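Again as a Python 3 sketch with made-up test strings, this shows the 3-character string used as a pattern and what it does and does not match:

```python
import re

s = "\\n\n"                    # backslash, n, newline
print(len(s))                  # 3
print(s == r"\n" "\n")         # True: adjacent literals are concatenated

prog = re.compile(s)
print(prog.search("a\n\nb") is not None)    # True: two consecutive newlines
print(prog.search("a\nb") is not None)      # False: only one newline
# The pattern does not match its own text (backslash, n, newline).
print(prog.search("a\\n\nb") is not None)   # False
```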

N.B. All examples and document references are to Python 2.7.

Update: Incorporated clarifications from answers of @Vladislav Zorov and @m.buettner, and from follow-up question of @Aerovistae.


