How to select only the rows that have a certain value in one column in R?
There are a few ways to do this:
Base R
dfNintendo[dfNintendo$Platform %in% c("GBA", "Wii", "WiiU"), ]
or
subset(dfNintendo, Platform %in% c("GBA", "Wii", "WiiU"))
dplyr package
dplyr::filter(dfNintendo, Platform %in% c("GBA", "Wii", "WiiU"))
These should do what you want.
Keep only those rows in a Pandas DataFrame equal to a certain value (paired multiple columns)
I think what you need is the & operator:
df[(df['B']=='Blue') & (df['C']=='Green')]
How to only keep dataframe rows that shares same value on a specific column
Use isin:
df1.loc[df1.A.isin(df2.A)]
A B C D E
0 a 10 5 18 20
1 b 9 18 11 13
isin returns a boolean Series which you use to filter:
df1.A.isin(df2.A)
0 True
1 True
2 False
3 False
Name: A, dtype: bool
For deleted rows:
df1 = df1.set_index('A')
df2 = df2.set_index('A')
deleted = df1.index.symmetric_difference(df2.index)
pd.concat([df1, df2]).loc[deleted]
B C D E
A
c 8 7 12 5
e 8 7 12 5
f 6 5 3 90
z 6 5 3 90
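To make the steps above reproducible, here is a minimal self-contained sketch; the df1/df2 contents are assumptions reconstructed to match the output shown:

```python
import pandas as pd

# hypothetical frames reconstructed to match the output above
df1 = pd.DataFrame({'A': ['a', 'b', 'c', 'e'],
                    'B': [10, 9, 8, 8],
                    'C': [5, 18, 7, 7],
                    'D': [18, 11, 12, 12],
                    'E': [20, 13, 5, 5]})
df2 = pd.DataFrame({'A': ['a', 'b', 'f', 'z'],
                    'B': [10, 9, 6, 6],
                    'C': [5, 18, 5, 5],
                    'D': [18, 11, 3, 3],
                    'E': [20, 13, 90, 90]})

# rows of df1 whose key in column A also appears in df2
common = df1.loc[df1.A.isin(df2.A)]

# keys present in exactly one of the two frames ("deleted" rows)
d1 = df1.set_index('A')
d2 = df2.set_index('A')
deleted_keys = d1.index.symmetric_difference(d2.index)
deleted = pd.concat([d1, d2]).loc[deleted_keys]
```

symmetric_difference keeps the labels that appear in only one of the two indexes, so concatenating both frames and selecting those labels recovers the rows that were added or removed.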
How do I select rows from a DataFrame based on column values?
To select rows whose column value equals a scalar some_value, use ==:
df.loc[df['column_name'] == some_value]
To select rows whose column value is in an iterable some_values, use isin:
df.loc[df['column_name'].isin(some_values)]
Combine multiple conditions with &:
df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses
df['column_name'] >= A & df['column_name'] <= B
is parsed as
df['column_name'] >= (A & df['column_name']) <= B
which raises a "Truth value of a Series is ambiguous" error.
To select rows whose column value does not equal some_value, use !=:
df.loc[df['column_name'] != some_value]
isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:
df.loc[~df['column_name'].isin(some_values)]
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
# A B C D
# 0 foo one 0 0
# 1 bar one 1 2
# 2 foo two 2 4
# 3 bar three 3 6
# 4 foo two 4 8
# 5 bar two 5 10
# 6 foo one 6 12
# 7 foo three 7 14
print(df.loc[df['A'] == 'foo'])
yields
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
If you have multiple values you want to include, put them in a list (or, more generally, any iterable) and use isin:
print(df.loc[df['B'].isin(['one','three'])])
yields
A B C D
0 foo one 0 0
1 bar one 1 2
3 bar three 3 6
6 foo one 6 12
7 foo three 7 14
Note, however, that if you wish to do this many times, it is more efficient to make an index first, and then use df.loc:
df = df.set_index(['B'])
print(df.loc['one'])
yields
A C D
B
one foo 0 0
one bar 1 2
one foo 6 12
or, to include multiple values from the index, use df.index.isin:
df.loc[df.index.isin(['one','two'])]
yields
A C D
B
one foo 0 0
one bar 1 2
two foo 2 4
two foo 4 8
two bar 5 10
one foo 6 12
How to keep rows in a DataFrame based on column unique sets?
The code you are looking for is .drop_duplicates().
Assuming your dataframe variable is df, you can use
df.drop_duplicates()
or include a list of column names if you're only looking for unique values within specified columns:
df.drop_duplicates(subset=column_list)  # column_list: the names of the columns you want to compare
Edit:
If that's the case, I guess you could just do
df.groupby(column_list).first()  # first() takes the first values of the other columns
And then you could just use df.reset_index() if you want the unique sets as columns again.
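A minimal self-contained sketch of both approaches; the frame and its column names here are made up for illustration:

```python
import pandas as pd

# hypothetical frame with a duplicated (A, B) pair in the first two rows
df = pd.DataFrame({'A': ['x', 'x', 'y', 'y'],
                   'B': [1, 1, 2, 3],
                   'C': [10, 20, 30, 40]})

# keep the first row of each unique (A, B) combination
unique_ab = df.drop_duplicates(subset=['A', 'B'])

# equivalent via groupby: first() takes the first values of the other
# columns; as_index=False keeps A and B as regular columns, which saves
# the separate reset_index() step
first_per_set = df.groupby(['A', 'B'], as_index=False).first()
```

Both keep one row per (A, B) set; the groupby version sorts by the group keys, while drop_duplicates preserves the original row order.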
Filter certain rows in data frame based on time
Updated Version: For Multiple IDs
This solution is inspired by the responses from this thread.
import pandas as pd
df = pd.DataFrame({'ID':['001']*10 + ['002']*10,
'Event':['event-1','event-2','event-3','event-final','event-1',
'event-2','event-3','event-final','event-1','event-2',
'event-1','event-2','event-3','event-final','event-1',
'event-2','event-final','event-1','event-2','event-3'],
'time':pd.date_range('2021-03-22 09:00:00', periods=20, freq="T")
})
#converting time to string format to match your data
df['time'] = df['time'].dt.strftime("%H:%M")
#checking for values of 'event-final' and reversing the dataframe to find groupby cumsum
#A value of 0 indicates that its after 'event-final'
#Picking those values will give you the desired results
print (df[df.Event.eq('event-final')[::-1].astype(int).groupby(df.ID).cumsum().eq(0)])
print (df)
The output will be:
ID Event time
8 001 event-1 09:08
9 001 event-2 09:09
17 002 event-1 09:17
18 002 event-2 09:18
19 002 event-3 09:19
For a Dataframe:
ID Event time
0 001 event-1 09:00
1 001 event-2 09:01
2 001 event-3 09:02
3 001 event-final 09:03
4 001 event-1 09:04
5 001 event-2 09:05
6 001 event-3 09:06
7 001 event-final 09:07
8 001 event-1 09:08
9 001 event-2 09:09
10 002 event-1 09:10
11 002 event-2 09:11
12 002 event-3 09:12
13 002 event-final 09:13
14 002 event-1 09:14
15 002 event-2 09:15
16 002 event-final 09:16
17 002 event-1 09:17
18 002 event-2 09:18
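The one-liner above packs several steps together; unpacked, the same reversed-cumsum idea looks like this (using a smaller, made-up frame for brevity):

```python
import pandas as pd

# hypothetical smaller frame with two IDs
df = pd.DataFrame({'ID': ['001']*4 + ['002']*4,
                   'Event': ['event-1', 'event-final', 'event-1', 'event-2',
                             'event-1', 'event-final', 'event-1', 'event-2']})

# step 1: flag the 'event-final' rows
is_final = df.Event.eq('event-final')

# step 2: reverse the flags and take a cumulative sum per ID;
# a running total of 0 means no 'event-final' occurs at or after that row
after_final = is_final[::-1].astype(int).groupby(df.ID).cumsum().eq(0)

# step 3: keep only the rows after the last 'event-final' of each ID
# (boolean indexing aligns the mask back to df's index labels)
result = df[after_final]
```

The alignment in step 3 is what makes the reversal safe: pandas matches the mask to df by index label, not by position.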
Previous Answer for Single ID
You can find the index of the last occurrence of event-final, then list all the values from that point on. And yes, you need to sort_values by time and reset_index before you do this.
import pandas as pd
df = pd.DataFrame({'ID':['001']*10,
'Event':['event-1','event-2','event-3','event-final','event-1',
'event-2','event-3','event-final','event-1','event-2'],
'time':pd.date_range('2021-03-22 09:00:00', periods=10, freq="T")})
#converting time to string format to match your data
df['time'] = df['time'].dt.strftime("%H:%M")
#sorting time in ascending order (assume this is within same day
#if date goes beyond 24 hrs, then you should keep df['time'] in datetime format
df = df.sort_values(by='time').reset_index(drop=True)
print (df)
#find out the index of all events that have `event-final`
#and get only the last one using [-1]
idx = df.index[df['Event']=='event-final'][-1]
#using iloc or loc, you can get all records after the last `event-final` row
print (df.loc[idx+1:])
The output of this will be:
Original DataFrame:
ID Event time
0 001 event-1 09:00
1 001 event-2 09:01
2 001 event-3 09:02
3 001 event-final 09:03
4 001 event-1 09:04
5 001 event-2 09:05
6 001 event-3 09:06
7 001 event-final 09:07
8 001 event-1 09:08
9 001 event-2 09:09
Final DataFrame, containing only the rows after the last event-final:
ID Event time
8 001 event-1 09:08
9 001 event-2 09:09
How to keep only a certain set of rows by index in a pandas DataFrame with rule
In this case, anchoring the pattern with "^" for the start and "$" for the end of the match is appropriate.
import pandas as pd
gooddf = pd.DataFrame({"K":[2,3,4,6,2]}, index=["01", "101", "98", "201", "032"])
match_in = gooddf.index.str.match("^[0-9]{1,2}$")
gooddf[match_in]
How to keep only a certain set of rows by index in a pandas DataFrame
This will do the trick:
gooddf.loc[indices]
An important note: .iloc and .loc do slightly different things, which is why you may be getting unexpected results. You can read deeper into the details of indexing here, but the key thing to understand is that .iloc returns rows according to the positions specified, whereas .loc returns rows according to the index labels specified. So if your index labels don't match the row positions, .loc and .iloc will return different rows.
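A quick illustration of the difference, using a made-up frame whose index labels deliberately don't match their positions:

```python
import pandas as pd

# hypothetical frame: the label of the first row is 2, not 0
df = pd.DataFrame({'K': [10, 20, 30]}, index=[2, 0, 1])

# .loc selects by label: label 0 is the second row, label 1 the third
by_label = df.loc[[0, 1]]

# .iloc selects by position: positions 0 and 1 are the first two rows
by_position = df.iloc[[0, 1]]
```

Here by_label contains K values 20 and 30, while by_position contains 10 and 20, even though both were asked for "0 and 1".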