Keeping Only Certain Rows of a Data Frame Based on a Set of Values

How to select only the rows that have a certain value in one column in R?

There are a few ways to do this:

Base R

dfNintendo[dfNintendo$Platform %in% c("GBA", "Wii", "WiiU"), ]

or

subset(dfNintendo, Platform %in% c("GBA", "Wii", "WiiU"))

dplyr package

dplyr::filter(dfNintendo, Platform %in% c("GBA", "Wii", "WiiU"))

Any of these should do what you want.
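
For completeness, the same "keep rows whose column value is in a set" filter in pandas uses isin. This is a minimal sketch with made-up data; only the Platform column and the three platform values come from the question:

import pandas as pd

# hypothetical stand-in for dfNintendo; the numbers are made up
dfNintendo = pd.DataFrame({'Platform': ['GBA', 'DS', 'Wii', '3DS', 'WiiU'],
                           'Sales': [10, 20, 30, 40, 50]})

# pandas equivalent of R's %in%
print(dfNintendo[dfNintendo['Platform'].isin(['GBA', 'Wii', 'WiiU'])])
#   Platform  Sales
# 0      GBA     10
# 2      Wii     30
# 4     WiiU     50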

Keep only those rows in a Pandas DataFrame equal to a certain value (paired multiple columns)

I think what you need is the & operator:

df[(df['B']=='Blue') & (df['C']=='Green')]
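
The column names B and C come from the question; here is a minimal runnable sketch with made-up values showing the combined filter:

import pandas as pd

# hypothetical data; only columns B and C and the values 'Blue'/'Green' come from the question
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['Blue', 'Red', 'Blue', 'Blue'],
                   'C': ['Green', 'Green', 'Green', 'Yellow']})

# keep rows where B is 'Blue' AND C is 'Green'
print(df[(df['B'] == 'Blue') & (df['C'] == 'Green')])
#    A     B      C
# 0  1  Blue  Green
# 2  3  Blue  Green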

How to only keep DataFrame rows that share the same value in a specific column

use isin:

df1.loc[df1.A.isin(df2.A)]

   A   B   C   D   E
0  a  10   5  18  20
1  b   9  18  11  13

isin returns a boolean Series, which you use to filter:

df1.A.isin(df2.A)
0     True
1     True
2    False
3    False
Name: A, dtype: bool

For the rows that were deleted (i.e. keys present in only one of the two frames):

df1 = df1.set_index('A')
df2 = df2.set_index('A')
deleted = df1.index.symmetric_difference(df2.index)
pd.concat([df1, df2]).loc[deleted]
   B  C   D   E
A
c  8  7  12   5
e  8  7  12   5
f  6  5   3  90
z  6  5   3  90
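
df1 and df2 themselves are not shown above; the following sketch uses made-up frames whose values are chosen to be consistent with the outputs shown, so the two steps can be run end to end:

import pandas as pd

# hypothetical frames; only the role of column A as the key comes from the answer
df1 = pd.DataFrame({'A': ['a', 'b', 'c', 'e'],
                    'B': [10, 9, 8, 8], 'C': [5, 18, 7, 7],
                    'D': [18, 11, 12, 12], 'E': [20, 13, 5, 5]})
df2 = pd.DataFrame({'A': ['a', 'b', 'f', 'z'],
                    'B': [10, 9, 6, 6], 'C': [5, 18, 5, 5],
                    'D': [18, 11, 3, 3], 'E': [20, 13, 90, 90]})

# rows of df1 whose key also appears in df2
print(df1.loc[df1.A.isin(df2.A)])

# keys that appear in only one of the two frames ("deleted" rows)
d1, d2 = df1.set_index('A'), df2.set_index('A')
deleted = d1.index.symmetric_difference(d2.index)
print(pd.concat([d1, d2]).loc[deleted])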

How do I select rows from a DataFrame based on column values?

To select rows whose column value equals a scalar, some_value, use ==:

df.loc[df['column_name'] == some_value]

To select rows whose column value is in an iterable, some_values, use isin:

df.loc[df['column_name'].isin(some_values)]

Combine multiple conditions with &:

df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses

df['column_name'] >= A & df['column_name'] <= B

is parsed as

df['column_name'] >= (A & df['column_name']) <= B

which raises a "The truth value of a Series is ambiguous" error.
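
A quick way to see both behaviours with a throwaway frame (the column name and bounds are placeholders):

import pandas as pd

df = pd.DataFrame({'column_name': [1, 5, 10]})
A, B = 2, 8

# parentheses force each comparison to be evaluated before &
print(df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)])
#    column_name
# 1            5

# df.loc[df['column_name'] >= A & df['column_name'] <= B]
# raises ValueError: The truth value of a Series is ambiguous. ...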


To select rows whose column value does not equal some_value, use !=:

df.loc[df['column_name'] != some_value]

isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:

df.loc[~df['column_name'].isin(some_values)]

For example,

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
#      A      B  C   D
# 0  foo    one  0   0
# 1  bar    one  1   2
# 2  foo    two  2   4
# 3  bar  three  3   6
# 4  foo    two  4   8
# 5  bar    two  5  10
# 6  foo    one  6  12
# 7  foo  three  7  14

print(df.loc[df['A'] == 'foo'])

yields

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

If you have multiple values you want to include, put them in a
list (or more generally, any iterable) and use isin:

print(df.loc[df['B'].isin(['one','three'])])

yields

     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14

Note, however, that if you wish to do this many times, it is more efficient to
make an index first, and then use df.loc:

df = df.set_index(['B'])
print(df.loc['one'])

yields

       A  C   D
B
one  foo  0   0
one  bar  1   2
one  foo  6  12

or, to include multiple values from the index, use df.index.isin:

df.loc[df.index.isin(['one','two'])]

yields

       A  C   D
B
one  foo  0   0
one  bar  1   2
two  foo  2   4
two  foo  4   8
two  bar  5  10
one  foo  6  12
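
The worked example above only exercises == and isin; for completeness, the negation forms from earlier look like this on the same data (rebuilding the frame first, since df was re-indexed above):

import pandas as pd
import numpy as np

# same example frame as above, before the set_index call
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

print(df.loc[df['A'] != 'foo'])
#      A      B  C   D
# 1  bar    one  1   2
# 3  bar  three  3   6
# 5  bar    two  5  10

print(df.loc[~df['B'].isin(['one', 'three'])])
#      A    B  C   D
# 2  foo  two  2   4
# 4  foo  two  4   8
# 5  bar  two  5  10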

How to keep rows in a DataFrame based on column unique sets?

The method you are looking for is .drop_duplicates().
Assuming your dataframe variable is df, you can use

df.drop_duplicates()

or pass a list of column names if you only want uniqueness within those columns:

df.drop_duplicates(subset=[column_list])  # column_list: the names of the columns you want to compare

Edit:

If that's the case, I guess you could just do

df.groupby([column_list]).first()  # first() keeps the first value of each remaining column

And then you could just use df.reset_index() if you want the unique sets as columns again.
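
Since column_list above is just a placeholder, here is a small runnable sketch with made-up data, using the hypothetical columns 'B' and 'C' as the subset:

import pandas as pd

# hypothetical data; 'B' and 'C' stand in for the column_list placeholder above
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['x', 'x', 'y', 'y'],
                   'C': ['u', 'u', 'v', 'v']})

# keep the first row of each unique (B, C) combination
print(df.drop_duplicates(subset=['B', 'C']))
#    A  B  C
# 0  1  x  u
# 2  3  y  v

# equivalent groupby form; reset_index() turns B and C back into columns
print(df.groupby(['B', 'C']).first().reset_index())
#    B  C  A
# 0  x  u  1
# 1  y  v  3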

Filter certain rows in data frame based on time

Updated Version: For Multiple IDs

This solution is inspired by the responses from this thread

import pandas as pd
df = pd.DataFrame({'ID': ['001']*10 + ['002']*10,
                   'Event': ['event-1','event-2','event-3','event-final','event-1',
                             'event-2','event-3','event-final','event-1','event-2',
                             'event-1','event-2','event-3','event-final','event-1',
                             'event-2','event-final','event-1','event-2','event-3'],
                   'time': pd.date_range('2021-03-22 09:00:00', periods=20, freq="T")
                  })

#converting time to string format to match your data
df['time'] = df['time'].dt.strftime("%H:%M")

#check for 'event-final', reverse the Series, and take a per-ID groupby cumsum
#a cumsum of 0 means the row comes after the last 'event-final' for that ID
#selecting those rows gives the desired result

print (df[df.Event.eq('event-final')[::-1].astype(int).groupby(df.ID).cumsum().eq(0)])

print (df)

The output will be:

     ID    Event   time
8   001  event-1  09:08
9   001  event-2  09:09
17  002  event-1  09:17
18  002  event-2  09:18
19  002  event-3  09:19

For the following input DataFrame:

     ID        Event   time
0   001      event-1  09:00
1   001      event-2  09:01
2   001      event-3  09:02
3   001  event-final  09:03
4   001      event-1  09:04
5   001      event-2  09:05
6   001      event-3  09:06
7   001  event-final  09:07
8   001      event-1  09:08
9   001      event-2  09:09
10  002      event-1  09:10
11  002      event-2  09:11
12  002      event-3  09:12
13  002  event-final  09:13
14  002      event-1  09:14
15  002      event-2  09:15
16  002  event-final  09:16
17  002      event-1  09:17
18  002      event-2  09:18
19  002      event-3  09:19

Previous Answer for Single ID

You can find the index of the last occurrence of event-final, then keep all the rows after that point. And yes, you need to sort_values by time and reset_index before you do this.

import pandas as pd
df = pd.DataFrame({'ID': ['001']*10,
                   'Event': ['event-1','event-2','event-3','event-final','event-1',
                             'event-2','event-3','event-final','event-1','event-2'],
                   'time': pd.date_range('2021-03-22 09:00:00', periods=10, freq="T")})

#converting time to string format to match your data

df['time'] = df['time'].dt.strftime("%H:%M")

#sorting time in ascending order (assuming everything is within the same day;
#if the data spans more than 24 hrs, keep df['time'] in datetime format)

df = df.sort_values(by='time').reset_index(drop=True)

print (df)

#find out the index of all events that have `event-final`
#and get only the last one using [-1]

idx = df.index[df['Event']=='event-final'][-1]

#using iloc or loc, you can get all records after the last `event-final` row
print (df.loc[idx+1:])

The output of this will be:

Original DataFrame:

   ID        Event   time
0  001      event-1  09:00
1  001      event-2  09:01
2  001      event-3  09:02
3  001  event-final  09:03
4  001      event-1  09:04
5  001      event-2  09:05
6  001      event-3  09:06
7  001  event-final  09:07
8  001      event-1  09:08
9  001      event-2  09:09

Filtered DataFrame containing only the rows after the last event-final:

   ID    Event   time
8  001  event-1  09:08
9  001  event-2  09:09

How to keep only a certain set of rows by index in a pandas DataFrame with a rule

In this case, anchoring the pattern with "^" at the start and "$" at the end makes sure the whole index label has to match.

import pandas as pd

gooddf = pd.DataFrame({"K":[2,3,4,6,2]}, index=["01", "101", "98", "201", "032"])

match_in = gooddf.index.str.match("^[0-9]{1,2}$")
gooddf[match_in]
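
For the example index above, only the one- and two-digit labels pass the anchored pattern, so (assuming the snippet has been run) this is what comes back:

# match_in is True only for '01' and '98' (one- or two-digit labels)
print(gooddf[match_in])
#     K
# 01  2
# 98  4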

How to keep only a certain set of rows by index in a pandas DataFrame

This will do the trick:

gooddf.loc[indices]

An important note: .iloc and .loc are doing slightly different things, which is why you may be getting unexpected results.

You can read deeper into the details of indexing in the pandas documentation, but the key thing to understand is that .iloc returns rows according to the positions specified, whereas .loc returns rows according to the index labels specified. So if your index labels do not coincide with the row positions, .loc and .iloc will return different rows.
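
A tiny illustration of the difference, using a made-up frame whose index labels do not match the row positions:

import pandas as pd

# hypothetical frame; the labels 10, 20, 30 differ from the positions 0, 1, 2
gooddf = pd.DataFrame({'K': [2, 3, 4]}, index=[10, 20, 30])

print(gooddf.loc[[10, 30]])    # by label: the rows labelled 10 and 30
#     K
# 10  2
# 30  4

print(gooddf.iloc[[0, 2]])     # by position: the first and third rows
#     K
# 10  2
# 30  4

# gooddf.iloc[[10, 30]] would raise IndexError, because there is no row at position 10 or 30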


