Using a regex pattern to filter rows from a pandas dataframe
Demo:
In [2]: df
Out[2]:
        Word  Ratings
0   TLYSFFPK        1
1  SVLENFVGR        2
2  SVFNHAIRH        3
3  KAGEVFIHK        4
In [3]: pat = r'\b.[VIFY][MLFYIA]\w+[LIYVF].[KR]\b'
In [4]: df.Word.str.contains(pat)
Out[4]:
0 False
1 True
2 False
3 False
Name: Word, dtype: bool
In [5]: df[df.Word.str.contains(pat)]
Out[5]:
        Word  Ratings
1  SVLENFVGR        2
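If the column can contain missing values, str.contains returns NaN for those rows and the mask can no longer be used directly for indexing. A minimal sketch of one way around this, assuming the demo data plus one extra row with a missing Word (that row is made up for illustration):

```python
import pandas as pd

# Words and ratings from the demo above; the last row (None, 5) is invented
# here to show how a missing value behaves.
df = pd.DataFrame({
    "Word": ["TLYSFFPK", "SVLENFVGR", "SVFNHAIRH", "KAGEVFIHK", None],
    "Ratings": [1, 2, 3, 4, 5],
})
pat = r'\b.[VIFY][MLFYIA]\w+[LIYVF].[KR]\b'

# na=False treats missing values as "no match", so the mask is purely boolean.
mask = df["Word"].str.contains(pat, na=False)
filtered = df[mask]
print(filtered)
```

With na=False the row with the missing value is simply dropped from the result instead of raising an error.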
How to filter rows that satisfy a regular expression via pandas
One way is to read the CSV into a pandas dataframe and then use str.contains to create a boolean mask column:
df['mask'] = df[0].str.contains(r'\d+[A-Z]+\d+')  # 0 is the column name
df = df[df['mask']].drop('mask', axis=1)
This gives the desired dataframe, shown below with its original index. To renumber the rows, use df = df.reset_index(drop=True).
0
0 5;4Z13H;;L
3 5;3LPH14;4567;;O
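The same filter can be written without a temporary mask column. This is only a sketch: rows 0 and 3 come from the output above, while rows 1 and 2 are invented non-matching lines, since the original file is not shown.

```python
import pandas as pd

# Rows 0 and 3 are from the output above; rows 1 and 2 are hypothetical.
df = pd.DataFrame({0: ["5;4Z13H;;L", "5;ABC;;L", "7;;123;;X", "5;3LPH14;4567;;O"]})

# Keep the mask in a variable instead of adding it to the dataframe.
mask = df[0].str.contains(r"\d+[A-Z]+\d+")

# drop=True discards the old index instead of keeping it as a new column.
filtered = df[mask].reset_index(drop=True)
print(filtered)
```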
A second way is to read the CSV line by line, write only the matching rows to an edited file, and then read that filtered CSV into a dataframe:
import csv
import re
import pandas as pd

with open('filteredData.csv', 'r') as f_in:
    with open('filteredData_edit.csv', 'w', newline='') as f_outfile:
        f_out = csv.writer(f_outfile)
        for line in f_in:
            line = line.strip()
            # write the line only if it matches the pattern
            if re.search(r"\d+[A-Z]+\d+", line):
                f_out.writerow([line])
df = pd.read_csv('filteredData_edit.csv', header=None)
You get
0
0 5;4Z13H;;L
1 5;3LPH14;4567;;O
In my experience the second method is preferable, especially for large files, because the undesired rows are filtered out before the dataframe is created.
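If the intermediate file is not needed, the filtered lines can be passed to read_csv through an in-memory buffer instead. This is a variation on the second method, not the code from the answer, and the list raw_lines is a hypothetical stand-in for the lines of the CSV file:

```python
import io
import re
import pandas as pd

# Hypothetical stand-in for the lines of the input file.
raw_lines = ["5;4Z13H;;L", "5;ABC;;L", "7;;123;;X", "5;3LPH14;4567;;O"]

pattern = re.compile(r"\d+[A-Z]+\d+")

# Filter before pandas ever sees the data.
kept = [line for line in raw_lines if pattern.search(line)]

# StringIO lets read_csv parse the filtered text as if it were a file.
df = pd.read_csv(io.StringIO("\n".join(kept)), header=None)
print(df)
```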
Filtering rows of a pandas dataframe according to regex values of a column in Python
Your problem seems simple enough to be solved by str.contains.
foreign_states = ['AB', 'BC', 'MB', 'NB', 'NL', 'NS', 'NT', 'ON', 'PE', 'QC', 'SK', 'YT', 'MEX']
foreign_states_precise = [", " + i + "," for i in foreign_states]
df = df[~df.Address.str.contains('|'.join(foreign_states_precise), regex=True)]
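When the alternatives are joined into one pattern, it can be safer to pass each one through re.escape so that any regex metacharacters in the list are matched literally. A small sketch, with a shortened list and made-up addresses (the real Address column is not shown in the question):

```python
import re
import pandas as pd

foreign_states = ['AB', 'BC', 'MEX']  # shortened list for the example

# Hypothetical addresses, invented for illustration.
df = pd.DataFrame({"Address": [
    "12 Main St, Calgary, AB, T2P 1J9",
    "500 Elm St, Austin, TX, 78701",
    "Av. Juarez 10, Monterrey, MEX, 64000",
    "1 ABBEY RD, Boston, MA, 02101",
]})

# Escape each alternative, then join them with | into a single pattern.
pattern = '|'.join(re.escape(", " + s + ",") for s in foreign_states)

# ~ inverts the mask, keeping only addresses that match none of the codes.
kept = df[~df["Address"].str.contains(pattern, regex=True)]
print(kept)
```

The surrounding ", " and "," are what stop "AB" inside "ABBEY" from counting as a match.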
Pandas filtering rows with regex pattern present in the row itself
After a bit of modification, here is the result:
df[df.apply(lambda row: re.compile(row['pattern']).match(row['data']) is not None, axis=1)]
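A self-contained sketch of the same idea, using a list comprehension over the two columns instead of apply. The column names pattern and data follow the line above; the values are made up. Note that re.match only anchors at the start of the string, so a pattern may match a prefix of the data.

```python
import re
import pandas as pd

# Hypothetical data: each row carries its own regex and the string to test.
df = pd.DataFrame({
    "pattern": [r"\d+", r"[a-z]+", r"ab?c"],
    "data": ["123abc", "123abc", "acx"],
})

# Test each row's data against that row's own pattern.
mask = [re.match(p, s) is not None for p, s in zip(df["pattern"], df["data"])]
matched = df[mask]
print(matched)
```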
How to filter rows from pandas data frame where the specific value matches a RegEx
Use str.contains with word boundary \b:
df = pd.DataFrame({"Name":["Mr A","Mrs B","Mrs C","Mr D"]})
print (df[df["Name"].str.contains(r"\bMr\b")])
Name
0 Mr A
3 Mr D
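To see why the word boundary matters, compare the masks with and without \b on the same data. Without it, "Mr" is also found inside "Mrs":

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Mr A", "Mrs B", "Mrs C", "Mr D"]})

# Plain substring-style pattern: also matches the "Mr" in "Mrs".
without_boundary = df["Name"].str.contains(r"Mr")

# \b requires a word boundary on both sides, so "Mrs" is excluded.
with_boundary = df["Name"].str.contains(r"\bMr\b")

print(without_boundary.tolist())
print(with_boundary.tolist())
```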
Use regex to filter pandas rows with ~ at beginning AND at end of string
Use pandas.Series.str.match and invert the mask with ~ to drop the matching rows:
df[~df.Unit.str.match(r'^~.*~$')]
Unit line
0 LF 1
1 LS~ 2
2 ~~SF 3
3 CY 4
5 PC 6
7 ~LF 8
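A runnable sketch of this filter. Rows 0, 1, 2, 3, 5 and 7 are taken from the output above; rows 4 and 6 were removed by the filter and are not shown in the source, so the values used for them here are assumptions. Since str.match already anchors at the start, the $ is what forces the trailing ~. In pandas 1.1 and later, str.fullmatch expresses the same condition without explicit anchors.

```python
import pandas as pd

# Rows 4 ("~SF~") and 6 ("~~") are hypothetical; the others are from the output above.
df = pd.DataFrame({
    "Unit": ["LF", "LS~", "~~SF", "CY", "~SF~", "PC", "~~", "~LF"],
    "line": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Drop rows whose Unit both starts and ends with ~.
result = df[~df.Unit.str.match(r'^~.*~$')]

# Same condition with fullmatch, which anchors at both ends by itself.
result_fullmatch = df[~df.Unit.str.fullmatch(r'~.*~')]

print(result)
```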