Keeping Only Certain Rows of a Data Frame Based on a Set of Values

How to select only the rows that have a certain value in one column in R?

There are a few ways to do this:

Base R

dfNintendo[dfNintendo$Platform %in% c("GBA", "Wii", "WiiU"), ]

or

subset(dfNintendo, Platform %in% c("GBA", "Wii", "WiiU"))

dplyr package

dplyr::filter(dfNintendo, Platform %in% c("GBA", "Wii", "WiiU"))

Any of these should do what you want.
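
For completeness, the same "keep rows whose column value is in a set" filter in pandas uses isin. This is a minimal sketch with made-up data; only the Platform column and the three platform values come from the question:

import pandas as pd

# hypothetical stand-in for dfNintendo; the numbers are made up
dfNintendo = pd.DataFrame({'Platform': ['GBA', 'DS', 'Wii', '3DS', 'WiiU'],
                           'Sales': [10, 20, 30, 40, 50]})

# pandas equivalent of R's %in%
print(dfNintendo[dfNintendo['Platform'].isin(['GBA', 'Wii', 'WiiU'])])
#   Platform  Sales
# 0      GBA     10
# 2      Wii     30
# 4     WiiU     50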

Keep only those rows in a Pandas DataFrame equal to a certain value (paired multiple columns)

I think what you need is the & operator:

df[(df['B']=='Blue') & (df['C']=='Green')]
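
The column names B and C come from the question; here is a minimal runnable sketch with made-up values showing the combined filter:

import pandas as pd

# hypothetical data; only columns B and C and the values 'Blue'/'Green' come from the question
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['Blue', 'Red', 'Blue', 'Blue'],
                   'C': ['Green', 'Green', 'Green', 'Yellow']})

# keep rows where B is 'Blue' AND C is 'Green'
print(df[(df['B'] == 'Blue') & (df['C'] == 'Green')])
#    A     B      C
# 0  1  Blue  Green
# 2  3  Blue  Green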

How to only keep DataFrame rows that share the same value in a specific column

use isin:

df1.loc[df1.A.isin(df2.A)]

   A   B   C   D   E
0  a  10   5  18  20
1  b   9  18  11  13

isin returns a boolean Series, which you use to filter:

df1.A.isin(df2.A)
0     True
1     True
2    False
3    False
Name: A, dtype: bool

For the rows that were deleted (i.e. keys present in only one of the two frames):

df1 = df1.set_index('A')
df2 = df2.set_index('A')
deleted = df1.index.symmetric_difference(df2.index)
pd.concat([df1, df2]).loc[deleted]
   B  C   D   E
A
c  8  7  12   5
e  8  7  12   5
f  6  5   3  90
z  6  5   3  90
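
df1 and df2 themselves are not shown above; the following sketch uses made-up frames whose values are chosen to be consistent with the outputs shown, so the two steps can be run end to end:

import pandas as pd

# hypothetical frames; only the role of column A as the key comes from the answer
df1 = pd.DataFrame({'A': ['a', 'b', 'c', 'e'],
                    'B': [10, 9, 8, 8], 'C': [5, 18, 7, 7],
                    'D': [18, 11, 12, 12], 'E': [20, 13, 5, 5]})
df2 = pd.DataFrame({'A': ['a', 'b', 'f', 'z'],
                    'B': [10, 9, 6, 6], 'C': [5, 18, 5, 5],
                    'D': [18, 11, 3, 3], 'E': [20, 13, 90, 90]})

# rows of df1 whose key also appears in df2
print(df1.loc[df1.A.isin(df2.A)])

# keys that appear in only one of the two frames ("deleted" rows)
d1, d2 = df1.set_index('A'), df2.set_index('A')
deleted = d1.index.symmetric_difference(d2.index)
print(pd.concat([d1, d2]).loc[deleted])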

How do I select rows from a DataFrame based on column values?

To select rows whose column value equals a scalar, some_value, use ==:

df.loc[df['column_name'] == some_value]

To select rows whose column value is in an iterable, some_values, use isin:

df.loc[df['column_name'].isin(some_values)]

Combine multiple conditions with &:

df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses

df['column_name'] >= A & df['column_name'] <= B

is parsed as

df['column_name'] >= (A & df['column_name']) <= B

which raises a "The truth value of a Series is ambiguous" error.
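
A quick way to see both behaviours with a throwaway frame (the column name and bounds are placeholders):

import pandas as pd

df = pd.DataFrame({'column_name': [1, 5, 10]})
A, B = 2, 8

# parentheses force each comparison to be evaluated before &
print(df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)])
#    column_name
# 1            5

# df.loc[df['column_name'] >= A & df['column_name'] <= B]
# raises ValueError: The truth value of a Series is ambiguous. ...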


To select rows whose column value does not equal some_value, use !=:

df.loc[df['column_name'] != some_value]

isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:

df.loc[~df['column_name'].isin(some_values)]

For example,

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
#      A      B  C   D
# 0  foo    one  0   0
# 1  bar    one  1   2
# 2  foo    two  2   4
# 3  bar  three  3   6
# 4  foo    two  4   8
# 5  bar    two  5  10
# 6  foo    one  6  12
# 7  foo  three  7  14

print(df.loc[df['A'] == 'foo'])

yields

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

If you have multiple values you want to include, put them in a
list (or more generally, any iterable) and use isin:

print(df.loc[df['B'].isin(['one','three'])])

yields

     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14

Note, however, that if you wish to do this many times, it is more efficient to
make an index first, and then use df.loc:

df = df.set_index(['B'])
print(df.loc['one'])

yields

       A  C   D
B
one  foo  0   0
one  bar  1   2
one  foo  6  12

or, to include multiple values from the index, use df.index.isin:

df.loc[df.index.isin(['one','two'])]

yields

       A  C   D
B
one  foo  0   0
one  bar  1   2
two  foo  2   4
two  foo  4   8
two  bar  5  10
one  foo  6  12
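
The worked example above only exercises == and isin; for completeness, the negation forms from earlier look like this on the same data (rebuilding the frame first, since df was re-indexed above):

import pandas as pd
import numpy as np

# same example frame as above, before the set_index call
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

print(df.loc[df['A'] != 'foo'])
#      A      B  C   D
# 1  bar    one  1   2
# 3  bar  three  3   6
# 5  bar    two  5  10

print(df.loc[~df['B'].isin(['one', 'three'])])
#      A    B  C   D
# 2  foo  two  2   4
# 4  foo  two  4   8
# 5  bar  two  5  10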

How to keep rows in a DataFrame based on column unique sets?

The method you are looking for is .drop_duplicates().
Assuming your dataframe variable is df, you can use

df.drop_duplicates()

or pass a list of column names if you only want uniqueness within those columns:

df.drop_duplicates(subset=[column_list])  # column_list: the names of the columns you want to compare

Edit:

If that's the case, I guess you could just do

df.groupby([column_list]).first()  # first() keeps the first value of each remaining column

And then you could just use df.reset_index() if you want the unique sets as columns again.
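
Since column_list above is just a placeholder, here is a small runnable sketch with made-up data, using the hypothetical columns 'B' and 'C' as the subset:

import pandas as pd

# hypothetical data; 'B' and 'C' stand in for the column_list placeholder above
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['x', 'x', 'y', 'y'],
                   'C': ['u', 'u', 'v', 'v']})

# keep the first row of each unique (B, C) combination
print(df.drop_duplicates(subset=['B', 'C']))
#    A  B  C
# 0  1  x  u
# 2  3  y  v

# equivalent groupby form; reset_index() turns B and C back into columns
print(df.groupby(['B', 'C']).first().reset_index())
#    B  C  A
# 0  x  u  1
# 1  y  v  3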

Filter certain rows in data frame based on time

Updated Version: For Multiple IDs

This solution is inspired by the responses from this thread

import pandas as pd
df = pd.DataFrame({'ID': ['001']*10 + ['002']*10,
                   'Event': ['event-1','event-2','event-3','event-final','event-1',
                             'event-2','event-3','event-final','event-1','event-2',
                             'event-1','event-2','event-3','event-final','event-1',
                             'event-2','event-final','event-1','event-2','event-3'],
                   'time': pd.date_range('2021-03-22 09:00:00', periods=20, freq="T")
                  })

#converting time to string format to match your data
df['time'] = df['time'].dt.strftime("%H:%M")

#check for 'event-final', reverse the Series, and take a per-ID groupby cumsum
#a cumsum of 0 means the row comes after the last 'event-final' for that ID
#selecting those rows gives the desired result

print (df[df.Event.eq('event-final')[::-1].astype(int).groupby(df.ID).cumsum().eq(0)])

print (df)

The output will be:

     ID    Event   time
8   001  event-1  09:08
9   001  event-2  09:09
17  002  event-1  09:17
18  002  event-2  09:18
19  002  event-3  09:19

For the following input DataFrame:

     ID        Event   time
0   001      event-1  09:00
1   001      event-2  09:01
2   001      event-3  09:02
3   001  event-final  09:03
4   001      event-1  09:04
5   001      event-2  09:05
6   001      event-3  09:06
7   001  event-final  09:07
8   001      event-1  09:08
9   001      event-2  09:09
10  002      event-1  09:10
11  002      event-2  09:11
12  002      event-3  09:12
13  002  event-final  09:13
14  002      event-1  09:14
15  002      event-2  09:15
16  002  event-final  09:16
17  002      event-1  09:17
18  002      event-2  09:18
19  002      event-3  09:19

Previous Answer for Single ID

You can find the index of the last occurrence of event-final, then keep all the rows after that point. And yes, you need to sort_values by time and reset_index before you do this.

import pandas as pd
df = pd.DataFrame({'ID': ['001']*10,
                   'Event': ['event-1','event-2','event-3','event-final','event-1',
                             'event-2','event-3','event-final','event-1','event-2'],
                   'time': pd.date_range('2021-03-22 09:00:00', periods=10, freq="T")})

#converting time to string format to match your data

df['time'] = df['time'].dt.strftime("%H:%M")

#sorting time in ascending order (assuming everything is within the same day;
#if the data spans more than 24 hrs, keep df['time'] in datetime format)

df = df.sort_values(by='time').reset_index(drop=True)

print (df)

#find out the index of all events that have `event-final`
#and get only the last one using [-1]

idx = df.index[df['Event']=='event-final'][-1]

#using iloc or loc, you can get all records after the last `event-final` row
print (df.loc[idx+1:])

The output of this will be:

Original DataFrame:

   ID        Event   time
0  001      event-1  09:00
1  001      event-2  09:01
2  001      event-3  09:02
3  001  event-final  09:03
4  001      event-1  09:04
5  001      event-2  09:05
6  001      event-3  09:06
7  001  event-final  09:07
8  001      event-1  09:08
9  001      event-2  09:09

Filtered DataFrame containing only the rows after the last event-final:

   ID    Event   time
8  001  event-1  09:08
9  001  event-2  09:09

How to keep only a certain set of rows by index in a pandas DataFrame with a rule

In this case, anchoring the pattern with "^" at the start and "$" at the end makes sure the whole index label has to match.

import pandas as pd

gooddf = pd.DataFrame({"K":[2,3,4,6,2]}, index=["01", "101", "98", "201", "032"])

match_in = gooddf.index.str.match("^[0-9]{1,2}$")
gooddf[match_in]
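
For the example index above, only the one- and two-digit labels pass the anchored pattern, so (assuming the snippet has been run) this is what comes back:

# match_in is True only for '01' and '98' (one- or two-digit labels)
print(gooddf[match_in])
#     K
# 01  2
# 98  4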

How to keep only a certain set of rows by index in a pandas DataFrame

This will do the trick:

gooddf.loc[indices]

An important note: .iloc and .loc are doing slightly different things, which is why you may be getting unexpected results.

You can read deeper into the details of indexing in the pandas documentation, but the key thing to understand is that .iloc returns rows according to the positions specified, whereas .loc returns rows according to the index labels specified. So if your index labels do not coincide with the row positions, .loc and .iloc will return different rows.
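
A tiny illustration of the difference, using a made-up frame whose index labels do not match the row positions:

import pandas as pd

# hypothetical frame; the labels 10, 20, 30 differ from the positions 0, 1, 2
gooddf = pd.DataFrame({'K': [2, 3, 4]}, index=[10, 20, 30])

print(gooddf.loc[[10, 30]])    # by label: the rows labelled 10 and 30
#     K
# 10  2
# 30  4

print(gooddf.iloc[[0, 2]])     # by position: the first and third rows
#     K
# 10  2
# 30  4

# gooddf.iloc[[10, 30]] would raise IndexError, because there is no row at position 10 or 30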


