How to Filter Rows in Pandas by Regex

Using a regex pattern to filter rows from a pandas DataFrame

Demo:

In [2]: df
Out[2]:
        Word  Ratings
0   TLYSFFPK        1
1  SVLENFVGR        2
2  SVFNHAIRH        3
3  KAGEVFIHK        4

In [3]: pat = r'\b.[VIFY][MLFYIA]\w+[LIYVF].[KR]\b'

In [4]: df.Word.str.contains(pat)
Out[4]:
0    False
1     True
2    False
3    False
Name: Word, dtype: bool

In [5]: df[df.Word.str.contains(pat)]
Out[5]:
        Word  Ratings
1  SVLENFVGR        2
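
The session above can be reproduced as a standalone script. This is only a minimal sketch that rebuilds the four-row frame from the demo; the `na=False` argument is an addition that is not in the original session, included so that missing values would count as non-matches instead of putting NaN into the mask.

```python
import pandas as pd

# Rebuild the demo frame shown in the session above.
df = pd.DataFrame({
    "Word": ["TLYSFFPK", "SVLENFVGR", "SVFNHAIRH", "KAGEVFIHK"],
    "Ratings": [1, 2, 3, 4],
})

pat = r"\b.[VIFY][MLFYIA]\w+[LIYVF].[KR]\b"

# str.contains returns a boolean Series aligned with df's index;
# na=False (an addition here) treats missing values as non-matches.
mask = df["Word"].str.contains(pat, na=False)

# Boolean indexing keeps only the rows where the mask is True.
filtered = df[mask]
print(filtered)
```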

How to filter rows that satisfy a regular expression via pandas

One way is to read the CSV into a pandas DataFrame and then use str.contains to create a mask column:

# Raw string so \d is not treated as a string escape; no capture group, which str.contains would warn about.
df['mask'] = df[0].str.contains(r'\d+[A-Z]+\d+')  # 0 is the column name
df = df[df['mask']].drop('mask', axis=1)

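This gives the desired DataFrame, shown below with its original index. If you wish, you can renumber the rows with df = df.reset_index(drop=True); without drop=True, the old index is kept as an extra column.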

                  0
0        5;4Z13H;;L
3  5;3LPH14;4567;;O
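
The temporary mask column is optional, since the boolean Series can index the frame directly. Here is a rough sketch of that variant. Rows 0 and 3 are taken from the output above, while rows 1 and 2 are invented non-matching lines, because the original input file is not shown; `io.StringIO` stands in for the CSV file.

```python
import io

import pandas as pd

# Rows 0 and 3 come from the output above; rows 1 and 2 are made-up
# non-matching lines, since the original input file is not shown.
raw = "5;4Z13H;;L\n7;ABC;;X\n2;1234;;Y\n5;3LPH14;4567;;O\n"

# The lines contain no commas, so each one is read as a single column named 0.
df = pd.read_csv(io.StringIO(raw), header=None)

# Index directly with the boolean Series, then renumber the rows from 0.
filtered = df[df[0].str.contains(r"\d+[A-Z]+\d+")].reset_index(drop=True)
print(filtered)
```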

The second way is to read the CSV line by line, write only the matching rows to a new file, and then read that filtered CSV to create the DataFrame:

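import csv
import re

import pandas as pd

# Copy only the lines that match the pattern into a new file.
with open('filteredData.csv', 'r') as f_in:
    with open('filteredData_edit.csv', 'w', newline='') as f_outfile:
        f_out = csv.writer(f_outfile)
        for line in f_in:
            line = line.strip()
            # Write the whole line as a single field when it matches.
            if re.search(r'\d+[A-Z]+\d+', line):
                f_out.writerow([line])

# Load the filtered file; it has no header row.
df = pd.read_csv('filteredData_edit.csv', header=None)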

You get

                  0
0        5;4Z13H;;L
1  5;3LPH14;4567;;O

In my experience, the second method is preferable: filtering out the undesired rows before creating the DataFrame means they never have to be loaded into it.

Filtering rows of a pandas dataframe according to regex values of a column in Python

Your problem seems simple enough to be solved by str.contains.

foreign_states = ['AB', 'BC', 'MB', 'NB', 'NL', 'NS', 'NT', 'ON', 'PE', 'QC', 'SK', 'YT', 'MEX']
foreign_states_precise = [", " + i + "," for i in foreign_states]

df = df[~df.Address.str.contains('|'.join(foreign_states_precise), regex=True)]
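
To see the filter in action, here is a small sketch using the same state list. The three addresses are invented for illustration, and `re.escape` is an extra precaution that the answer above does not use; it only matters if a code ever contains a regex metacharacter.

```python
import re

import pandas as pd

foreign_states = ['AB', 'BC', 'MB', 'NB', 'NL', 'NS', 'NT', 'ON', 'PE', 'QC', 'SK', 'YT', 'MEX']

# Wrap each code as ", XX," so that it only matches a standalone address field.
# re.escape is a precaution (an addition here) against regex metacharacters.
pattern = "|".join(re.escape(", " + code + ",") for code in foreign_states)

# Hypothetical addresses, invented for illustration.
df = pd.DataFrame({"Address": [
    "12 Main St, Springfield, IL, 62701",
    "99 King St, Toronto, ON, M5H 1A1",
    "5 Ontario Ave, Dayton, OH, 45402",
]})

# ~ inverts the mask, so rows that mention a foreign state are dropped.
domestic = df[~df["Address"].str.contains(pattern, regex=True)]
print(domestic)
```

The third address is kept even though it contains "Ontario", because the pattern requires the exact code between a comma-space and a comma.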

Pandas filtering rows with regex pattern present in the row itself

After a bit of modification, here is the result:

df[df.apply(lambda row: re.compile(row['pattern']).match(row['data']) is not None, axis=1)]
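
The one-liner above assumes `re` is imported and that the frame has `pattern` and `data` columns. A self-contained sketch might look like the following; the sample rows and the helper name `row_matches` are made up for illustration. Note that `re.match` only matches at the start of the string, so `re.search` would be the choice if the pattern may occur anywhere.

```python
import re

import pandas as pd

# Hypothetical frame: each row carries its own pattern and the text to test.
df = pd.DataFrame({
    "pattern": [r"\d+", r"[a-z]+", r"foo"],
    "data": ["123abc", "123abc", "foobar"],
})


def row_matches(row):
    # re.match anchors at the start of the string;
    # swap in re.search to allow a match anywhere.
    return re.match(row["pattern"], row["data"]) is not None


# axis=1 passes each row to the function, producing one boolean per row.
mask = df.apply(row_matches, axis=1)
matched = df[mask]
print(matched)
```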

How to filter rows from pandas data frame where the specific value matches a RegEx

Use str.contains with word boundary \b:

df = pd.DataFrame({"Name":["Mr A","Mrs B","Mrs C","Mr D"]})

print (df[df["Name"].str.contains(r"\bMr\b")])

   Name
0  Mr A
3  Mr D
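
To illustrate why the word boundary matters, this short sketch compares the pattern with and without `\b` on the same frame. The variable names `loose` and `strict` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Mr A", "Mrs B", "Mrs C", "Mr D"]})

# Without word boundaries, "Mr" also matches the start of "Mrs".
loose = df[df["Name"].str.contains("Mr")]

# With \b on both sides, "Mr" must stand alone as a whole word.
strict = df[df["Name"].str.contains(r"\bMr\b")]

print(len(loose), len(strict))
```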

Use regex to filter pandas rows with ~ at beginning AND at end of string

Use pandas.Series.str.match

df[~df.Unit.str.match('^~.*~$')]

   Unit  line
0    LF     1
1   LS~     2
2  ~~SF     3
3    CY     4
5    PC     6
7   ~LF     8
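
A runnable sketch of the same filter follows. The output above only shows that rows 4 and 6 were dropped, so the `Unit` values used for those two rows here are assumptions chosen to start and end with `~`.

```python
import pandas as pd

# Units at index 4 and 6 are assumed; the output above only shows
# that those two rows were removed by the filter.
df = pd.DataFrame({
    "Unit": ["LF", "LS~", "~~SF", "CY", "~CY~", "PC", "~PC~", "~LF"],
    "line": [1, 2, 3, 4, 5, 6, 7, 8],
})

# ^~ requires a leading tilde and ~$ a trailing one;
# ~ in front of the mask keeps every row that does NOT match.
kept = df[~df["Unit"].str.match(r"^~.*~$")]
print(kept)
```

Values with a tilde on only one side, such as `LS~` or `~LF`, are kept because the pattern requires both.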

