Python Regex Engine - look-behind requires fixed-width pattern Error
Python lookbehind assertions need to be fixed width, but you can try this:
>>> s = '"It "does "not "make "sense", Well, "Does "it"'
>>> re.sub(r'\b\s*"(?!,|$)', '" "', s)
'"It" "does" "not" "make" "sense", Well, "Does" "it"'
Explanation:
\b # Start the match at the end of a "word"
\s* # Match optional whitespace
" # Match a quote
(?!,|$) # unless it's followed by a comma or end of string
Python regex error: look-behind requires fixed-width pattern
In Python, you may use this workaround to avoid the error:
(?:^|(?<=[\s:]))(:[^\s:]+:)(?=[\s:]|$)
The anchors ^ and $ are zero-width matchers anyway, so ^ can sit outside the look-behind in a plain alternation.
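A minimal sketch of the workaround in use. The sample text and the names rx and tokens are made up for illustration and do not come from the original question:

```python
import re

# ^ sits in a plain alternation; only the fixed-width class stays in the look-behind.
rx = re.compile(r"(?:^|(?<=[\s:]))(:[^\s:]+:)(?=[\s:]|$)")

# Hypothetical sample text (an assumption, not taken from the question).
text = ":hi: say :smile: and :wave::ok: but not a:b:c"

# With a single capturing group, findall returns the Group 1 values.
tokens = rx.findall(text)
print(tokens)
```

The colons inside a:b:c are skipped because they are preceded by a letter rather than by whitespace, a colon, or the start of the string.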
Regex - look-behind requires fixed-width pattern error
If you want to assert that what is on the left is not eq, it should be a negative lookbehind (?<! instead of a positive lookbehind.
You can write the pattern using 2 lookbehind assertions.
(?<!\()(?<!eq )'(?!\)|\Z)
Example code
import re
text = "('hel'lo') eq 'some 'variable he're'"
print(re.compile(r"(?<!\()(?<!eq )'(?!\)|\Z)").sub(string=text, repl="''"))
Output
('hel''lo') eq 'some ''variable he''re'
Python - error: look-behind requires fixed-width pattern
Here are 2 approaches that will solve the issue:
Chained Lookbehinds
Convert an alternation-based negative lookbehind into several chained negative lookbehinds; the logical relation between them stays the same (that of AND):
import re
phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'
c_except = [r"on\s",r"dinas\s"]
c_out = ["avon", "powys", "somerset","hampshire"]
rx = r"(?<!\b{0})({1})".format(r")(?<!\b".join(c_except), "|".join(c_out))
print(re.sub(rx, "", phrase))
Capturing Approach
Capture what you need to keep and match only what you need to remove, then use the \1 backreference to restore the Group 1 value:
import re
phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'
c_except = [r"on\s+",r"dinas\s+"]
c_out = ["avon", "powys", "somerset","hampshire"]
rx = r"(\b(?:{0})(?:{1}))|(?:{1})".format(r"|".join(c_except), "|".join(c_out))
print(re.sub(rx, r"\1", phrase))
Note that this approach is preferable since you may use variable-width patterns inside c_except.
The regex will look like
(\b(?:on\s+|dinas\s+)(?:avon|powys|somerset|hampshire))|(?:avon|powys|somerset|hampshire)
It will match on or dinas as whole words due to the \b word boundary, and then any of the terms you need to remove. Since that part is wrapped in a capturing group, you may refer to the capture with the \1 backreference. In all other contexts, the c_out terms will be removed by the |(?:avon|powys|somerset|hampshire) alternative.
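NOTE: The \1 replacement will work in Python 3.5+, where a backreference to a group that did not take part in the match is replaced with an empty string. For older versions, and Python 2.x, such a backreference raises an error, so you need to replace it with a lambda: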
re.sub(rx, lambda m: m.group(1) if m.group(1) else "", phrase)
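The capturing approach can be wrapped in a small helper, sketched below. The function name remove_terms, the whitespace clean-up step, and the second sample phrase are assumptions added for illustration:

```python
import re

c_out = ["avon", "powys", "somerset", "hampshire"]

def remove_terms(phrase, c_except):
    """Remove c_out terms unless one of the c_except prefixes precedes them."""
    rx = r"(\b(?:{0})(?:{1}))|(?:{1})".format("|".join(c_except), "|".join(c_out))
    # The lambda form also works on Python versions older than 3.5.
    cleaned = re.sub(rx, lambda m: m.group(1) or "", phrase)
    # Collapse leftover runs of whitespace (an extra step, not in the original answer).
    return " ".join(cleaned.split())

phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'
result = remove_terms(phrase, [r"on\s+", r"dinas\s+"])
print(result)

# Hypothetical input with several spaces after "on"; a variable-width prefix such
# as on\s+ still protects the following term, which a look-behind could not express.
spaced = remove_terms("bradford on   avon avon", [r"on\s+", r"dinas\s+"])
print(spaced)
```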
Python look-behind regex issue: Invalid regular expression: look-behind requires fixed-width pattern
The Python re module, like most regex engines (with the notable exception of .NET), doesn't support variable-length lookbehind.
Can't you use a capturing group instead?
“[^”]*(</p>\s*<p[^>]*>)
The data you want is then in the first capturing group.
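A rough sketch of reading that group, assuming the goal is to locate a paragraph break inside a curly-quoted passage. The sample html string and the decision to replace the break with a space are hypothetical and do not come from the original question:

```python
import re

# Hypothetical input: a quoted passage interrupted by a paragraph break.
html = '<p>He said “this is</p>\n<p class="x">one sentence” to me.</p>'

rx = re.compile(r"“[^”]*(</p>\s*<p[^>]*>)")
m = rx.search(html)

# Group 1 holds only the paragraph break, not the quoted text before it.
paragraph_break = m.group(1)
print(repr(paragraph_break))

# The group's offsets allow editing just that span of the original string.
start, end = m.span(1)
joined = html[:start] + " " + html[end:]
print(joined)
```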
Python look-behind regex fixed-width pattern error while looking for consecutive repeated words
Maybe regexes are not needed at all.
Using itertools.groupby does the job. It's designed to group equal occurrences of consecutive items.
- group by words (after splitting according to dots)
- convert each group to a list and emit a (value, count) tuple only if its length > 1
like this:
import itertools
s = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"
matches = [(l[0],len(l)) for l in (list(v) for k,v in itertools.groupby(s.split("."))) if len(l)>1]
result:
[('name', 2), ('father', 3)]
So basically we can do whatever we want with this list of tuples (filtering it on the number of occurrences for instance)
Bonus (as I misread the question at first, so I'm leaving it in): to remove the duplicates from the sentence
- group by words (after splitting according to dots) like above
- take only the key of each (key, group) pair in a list comprehension (we don't need the groups since we don't count)
- join back with dot
In one line (still using itertools):
new_s = ".".join([k for k,_ in itertools.groupby(s.split("."))])
result:
My.name.is.Inigo.Montoya.You.killed.my.father.Prepare.to.die
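If a regex is still preferred, a backreference avoids look-behind altogether. This is an alternative sketch, not the approach from the answer above; the pattern and the name rx are assumptions:

```python
import re

s = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# (\w+) captures a word; (?:\.\1\b)+ requires the same whole word to repeat after a dot.
rx = re.compile(r"\b(\w+)(?:\.\1\b)+")

# Each match spans the whole run, so the number of dots plus one is the repeat count.
matches = [(m.group(1), m.group(0).count(".") + 1) for m in rx.finditer(s)]
print(matches)

# Replacing every run with its captured word removes the duplicates.
new_s = rx.sub(r"\1", s)
print(new_s)
```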
Regex Pattern doesn't work using look behind without validating the fixed-width pattern
You may use
rx = r'(?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk)\s*(.+?)\s*\d{3}-\d{3}-\d{4}'
zagat['city'] = zagat['raw'].str.extract(rx, expand=False)
Details
- (?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk) - Ave, Rd, St, Blvd, Dr, Way, Pl, Ln or Ct followed with ., or Beach, Way or Walk
- \s* - 0+ whitespaces
- (.+?) - Group 1 (this value will be returned by .extract): any one or more chars other than line break chars, as few as possible
- \s* - 0+ whitespaces
- \d{3}-\d{3}-\d{4} - 3 digits, -, 3 digits, - and 4 digits.
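The pattern can be tried outside pandas with the standard re module, as in the sketch below. The sample rows and the helper extract_city are hypothetical, since the real zagat['raw'] values are not shown here:

```python
import re

rx = re.compile(
    r"(?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk)\s*(.+?)\s*\d{3}-\d{3}-\d{4}"
)

# Hypothetical rows standing in for zagat['raw'].
rows = [
    "414 Example Ave. Los Angeles 213-555-0100",
    "1 Ocean Walk Santa Monica 310-555-0101",
    "no street suffix or phone here",
]

def extract_city(raw):
    """Return Group 1 (the text between the street suffix and the phone) or None."""
    m = rx.search(raw)
    return m.group(1) if m else None

cities = [extract_city(raw) for raw in rows]
print(cities)
```

No look-behind is needed because the street suffix is consumed by the non-capturing group and only Group 1 is returned.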
Regex to extract unique string to new column, getting error look-behind requires fixed-width pattern
You may use
.*\s/(?:\s+XO[A-Z0-9\s]*\b)?\s+(.+)
Details
- .* - 0+ chars other than line break chars, as many as possible
- \s - a whitespace
- / - a / char
- (?:\s+XO[A-Z0-9\s]*\b)? - an optional pattern:
  - \s+ - 1+ whitespaces
  - XO - XO
  - [A-Z0-9\s]* - 0+ uppercase letters, digits or whitespaces
  - \b - a word boundary
- \s+ - 1+ whitespaces
- (.+) - Group 1 (what str.extract will return): any 1+ chars other than line break chars, as many as possible
In Pandas, use
df['Result'] = df['File Name'].str.extract(r'.*\s/(?:\s+XO[A-Z0-9\s]*\b)?\s+(.+)', expand=False).fillna('')
Result:
Result
0 File Name Type
1 Document Internal Only
2
3 Location Site 3: Park Triangle
4 Block 4 Beach/Dock Camp
5 Blue-print/Register Info Site (RISs)
6 Location Place 5: Drive Place (Active)
7 Area Place 1: Beach Drive
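To see how the optional XO part behaves, the pattern can be exercised with plain re. The file names and the helper extract_result below are invented for illustration; the empty-string fallback mirrors the .fillna('') call:

```python
import re

rx = re.compile(r".*\s/(?:\s+XO[A-Z0-9\s]*\b)?\s+(.+)")

# Hypothetical file names (the real df['File Name'] values are not shown).
names = [
    "Report 12 / XO AB12 Location Site 3",  # XO code is skipped
    "Folder / Document Internal Only",      # no XO code
    "name without separator",               # no " /" at all
]

def extract_result(name):
    """Return Group 1, or an empty string when the pattern does not match."""
    m = rx.search(name)
    return m.group(1) if m else ""

results = [extract_result(name) for name in names]
print(results)
```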
Python regex look-behind requires fixed-width pattern
If you just want to get the title tag,
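import urllib.request  # Python 3 replacement for urllib2

html = urllib.request.urlopen("http://somewhere").read().decode("utf-8", errors="replace")
for item in html.split("</title>"):
    if "<title>" in item:
        # Skip the 7 characters of "<title>" and print only the title text
        print(item[item.find("<title>") + 7:])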
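Another regex-free option is the standard library's html.parser. This is a swapped-in technique, not the one from the answer above, and the class TitleParser and function extract_title are hypothetical names:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text found inside <title>...</title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text seen while inside the title element.
        if self.in_title:
            self.title += data

def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()

# Hypothetical document used only to show the call.
sample = "<html><head><title>Hello World</title></head><body><p>x</p></body></html>"
print(extract_title(sample))
```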
Python regex look-behind strange behaviour with character '^'
The first regex does not work in Python because ^ is a zero-width match, and the Python regex engine doesn't support mixing zero-width and non-zero-width alternatives inside a lookbehind assertion.
This is however supported in other engines such as Java, PHP, Perl, C# etc.
To solve this problem, you can use this regex:
(?:^|(?<=b))[0-9]
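The difference can be checked directly. The failing pattern below is an assumed reconstruction of the kind of regex described (the original first regex is not shown), and the sample string is made up:

```python
import re

# Assumed form of the failing pattern: ^ (width 0) and b (width 1) in one look-behind.
try:
    re.compile(r"(?<=^|b)[0-9]")
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message)

# The rewritten pattern moves ^ out of the look-behind, so it compiles.
digits = re.findall(r"(?:^|(?<=b))[0-9]", "1a2b3")
print(digits)
```

Only the digit at the start of the string and the digit right after b are returned; the digit after a is skipped.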