group by and filter data management using dplyr
Try
d %>%
group_by(c) %>%
filter(any(b == 1))
Which gives:
#Source: local data frame [6 x 3]
#Groups: c
#
# a b c
#1 1 1 1
#2 2 2 1
#3 3 2 1
#4 4 1 2
#5 5 2 2
#6 6 2 2
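In the output above both groups contain a row with b == 1, so nothing is removed. As a minimal sketch of the filter actually dropping a group, the data below reproduces those six rows and adds a hypothetical third group (c == 3, an assumption that is not in the original data) where b is never 1.

```r
library(dplyr)

# Hypothetical data: rows 1-6 match the output above; rows 7-8 are an
# added group (c == 3) in which b is never 1.
d <- data.frame(
  a = 1:8,
  b = c(1, 2, 2, 1, 2, 2, 2, 2),
  c = c(1, 1, 1, 2, 2, 2, 3, 3)
)

kept <- d %>%
  group_by(c) %>%
  filter(any(b == 1)) %>%  # any() is evaluated once per group
  ungroup()

# Groups 1 and 2 are kept in full; group 3 is dropped entirely.
unique(kept$c)
```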
How do you filter out data in the first group based on data in the second group in dplyr and/or tidyverse
Here are a couple of ways to keep the horses that have raced more than 3 times as 2-year-olds.
- Using filter
library(dplyr)
df %>%
group_by(horse) %>%
filter(sum(age == 2) > 3) %>%
ungroup()
# horse age value
# <chr> <dbl> <dbl>
# 1 a 2 20
# 2 a 2 21
# 3 a 2 19
# 4 a 2 23
# 5 a 2 20
# 6 a 3 17
# 7 a 3 16
# 8 a 3 23
# 9 a 4 24
#10 a 4 14
# … with 12 more rows
- Using join
df %>%
filter(age == 2) %>%
count(horse) %>%
filter(n > 3) %>%
select(-n) %>%
left_join(df, by = 'horse')
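A possible variation on the join approach, sketched here on made-up data (the horses, ages and values below are assumptions, not the asker's data): dplyr's semi_join() keeps the rows of df that have a match in the table of qualifying horses and never adds columns, so the select(-n) step is not needed.

```r
library(dplyr)

# Hypothetical data: horse "a" ran 4 times as a 2-year-old, horse "b" only twice.
df <- data.frame(
  horse = c(rep("a", 6), rep("b", 4)),
  age   = c(2, 2, 2, 2, 3, 3, 2, 2, 3, 4),
  value = c(20, 21, 19, 23, 17, 16, 18, 22, 15, 14),
  stringsAsFactors = FALSE
)

# Horses with more than 3 races at age 2
qualifying <- df %>%
  filter(age == 2) %>%
  count(horse) %>%
  filter(n > 3)

# Filtering join: keeps matching rows of df, adds no columns
via_join <- semi_join(df, qualifying, by = "horse")

# The grouped filter from the first approach selects the same horses
via_filter <- df %>%
  group_by(horse) %>%
  filter(sum(age == 2) > 3) %>%
  ungroup()
```

Both results contain all rows for horse "a", including its races at other ages.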
Data Filter based on object and variable using R
From the dataset structure it seems you have some leading or trailing whitespace in your data. You can use trimws to remove it.
dplyr::filter(DfUse, trimws(InstanceType) == 'a1.2xlarge')
With base R subset:
subset(DfUse, trimws(InstanceType) == 'a1.2xlarge')
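To illustrate why the plain comparison can miss rows, here is a small base R sketch. The data frame is hypothetical (the Cost column and the exact whitespace are assumptions); only the column name InstanceType comes from the question.

```r
# Hypothetical data: two of the values carry stray whitespace.
DfUse <- data.frame(
  InstanceType = c("a1.2xlarge ", " a1.2xlarge", "a1.large", "a1.2xlarge"),
  Cost = c(10, 12, 5, 11),
  stringsAsFactors = FALSE
)

# A plain comparison matches only the value with no surrounding whitespace
exact_matches <- sum(DfUse$InstanceType == "a1.2xlarge")

# trimws() strips leading and trailing whitespace before comparing
matched <- subset(DfUse, trimws(InstanceType) == "a1.2xlarge")
nrow(matched)
```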
Filter in a dplyr group only when the condition is met else do not
Actually, I found the answer in another related question. It uses a data.table one-liner, which in my case was:
library(data.table)
test <- setDT(test)[, if(any(is.na(stamp_score))) .SD[is.na(stamp_score)] else .SD, .(hit, indx)]
Essentially, this code subsets a group to its NA rows only if there is an NA in the "stamp_score" column; otherwise it leaves the group unchanged.
Thanks to everyone who tried to help and also helped me improve my question over time.
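For readers who want to stay in dplyr, the same logic can probably be written as a single grouped filter. This is a sketch on hypothetical data (the values below are assumptions; only the column names hit, indx and stamp_score come from the code above), not the method from the linked answer.

```r
library(dplyr)

# Hypothetical data: group (1, "x") contains an NA score, group (2, "y") does not.
test <- data.frame(
  hit  = c(1, 1, 1, 2, 2),
  indx = c("x", "x", "x", "y", "y"),
  stamp_score = c(NA, 0.4, 0.7, 0.5, 0.9),
  stringsAsFactors = FALSE
)

result <- test %>%
  group_by(hit, indx) %>%
  # Keep a row if it is NA, or if its group has no NA at all
  filter(is.na(stamp_score) | !any(is.na(stamp_score))) %>%
  ungroup()
```

Group (1, "x") is reduced to its single NA row, while group (2, "y") is returned unchanged.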
Want to group data set and filter it twice in r
Update:
With group_split from the dplyr package you get a list of data frames, one per group, each containing only the rows for that group. Once you have your groups, you can apply your regression analysis to each list element.
Once you are comfortable with this kind of operation, please ask the second part of your question as a new question. :-)
To access a list element, use df1[[1]] and so on; see the example:
library(dplyr)
df1 <- df %>%
group_by(SIC, Year) %>%
group_split()
df1[[1]]
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company C USA 2002 90000 1900000 100
df1[[2]]
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company C USA 2008 78000 2000000 100
2 Company D USA 2008 69420 964220 100
etc...
[[1]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company C USA 2002 90000 1900000 100
[[2]]
# A tibble: 2 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company C USA 2008 78000 2000000 100
2 Company D USA 2008 69420 964220 100
[[3]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company A USA 2000 50000 1500000 9997
[[4]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company B USA 2001 100000 1000000 9997
[[5]]
# A tibble: 2 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company A USA 2002 50000 1500000 9997
2 Company B USA 2002 110000 1100000 9997
[[6]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company A USA 2003 80000 1800000 9997
[[7]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company B USA 2005 90000 1200000 9997
[[8]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company B USA 2006 100000 1200000 9997
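As a rough sketch of the "apply your regression to each list element" step: the question does not say which model is wanted, so Sales ~ Assets below is an assumption, and the data are hypothetical (the groups printed above have only one or two rows each, too few for a meaningful fit).

```r
library(dplyr)

# Hypothetical data: two SIC/Year groups with four companies each.
df <- data.frame(
  SIC    = rep(c(100, 9997), each = 4),
  Year   = rep(c(2008, 2002), each = 4),
  Sales  = c(10, 20, 30, 45, 5, 9, 16, 20),
  Assets = c(100, 200, 300, 400, 50, 100, 150, 200)
)

groups <- df %>%
  group_by(SIC, Year) %>%
  group_split()

# Fit the (assumed) model separately within each group and keep the coefficients
fits <- lapply(groups, function(g) coef(lm(Sales ~ Assets, data = g)))
length(fits)
```

Each element of fits is a named vector with the intercept and the Assets slope for one group.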
We could first use group_by and then summarise():
library(dplyr)
df %>%
group_by(Company, name, Year, SIC) %>%
summarise()
Company name Year SIC
<chr> <chr> <int> <int>
1 Company A 2000 9997
2 Company A 2002 9997
3 Company A 2003 9997
4 Company B 2001 9997
5 Company B 2002 9997
6 Company B 2005 9997
7 Company B 2006 9997
8 Company C 2002 100
9 Company C 2008 100
10 Company D 2008 100
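An empty summarise() simply returns one row per group, so distinct() on the same columns should give the same combinations. A minimal sketch on hypothetical data (the companies and figures below are assumptions):

```r
library(dplyr)

# Hypothetical data with one repeated Company/Year/SIC combination.
df <- data.frame(
  Company = c("A", "A", "B", "B"),
  Year    = c(2000, 2000, 2001, 2002),
  SIC     = c(9997, 9997, 9997, 100),
  Sales   = c(50, 60, 70, 80),
  stringsAsFactors = FALSE
)

# An empty summarise() returns one row per group
by_summarise <- df %>%
  group_by(Company, Year, SIC) %>%
  summarise(.groups = "drop")

# distinct() on the same columns yields the same combinations
by_distinct <- df %>%
  distinct(Company, Year, SIC)
```

The two results can differ in row order: summarise() sorts by the grouping columns, while distinct() keeps the order of first appearance.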
Filtering out groups that only have one type of value in R
You can use any to keep only the groups that have at least one value which is not 'NotActive'.
In dplyr, you can use:
library(dplyr)
example %>% group_by(UserID) %>% filter(any(Status != 'NotActive'))
# UserID Status
# <chr> <chr>
# 1 AAA Cluster 1
# 2 AAA Cluster 1
# 3 AAA Cluster 1
# 4 AAA NotActive
# 5 AAA NotActive
# 6 AAA Cluster 1
# 7 AAA Cluster 2
# 8 AAA Cluster 2
# 9 AAA Cluster 2
#10 CCC NotActive
#11 CCC NotActive
#12 CCC NotActive
#13 CCC NotActive
#14 CCC Cluster 1
#15 CCC Cluster 1
#16 CCC NotActive
The same in base R and data.table:
#Base R
subset(example, ave(Status != 'NotActive', UserID, FUN = any))
#data.table
library(data.table)
setDT(example)[, .SD[any(Status != 'NotActive')], UserID]
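A small sketch of what gets dropped, using hypothetical data (the user BBB and the rows below are assumptions): a group is removed only when every one of its rows is 'NotActive', which is the complementary all() condition.

```r
library(dplyr)

# Hypothetical data: BBB is the only user whose rows are all 'NotActive'.
example <- data.frame(
  UserID = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC"),
  Status = c("Cluster 1", "NotActive", "NotActive", "NotActive",
             "NotActive", "Cluster 1"),
  stringsAsFactors = FALSE
)

# Groups with at least one status other than 'NotActive'
kept <- example %>%
  group_by(UserID) %>%
  filter(any(Status != "NotActive")) %>%
  ungroup()

# The complementary condition selects exactly the groups that were dropped
dropped <- example %>%
  group_by(UserID) %>%
  filter(all(Status == "NotActive")) %>%
  ungroup()
```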
R: Using dplyr to find and filter for a string in a whole data frame
We could use str_detect
library(dplyr)
library(stringr)
find_text_filter <- function(df, tt){
df %>%
filter(if_any(where(is.character), ~str_detect(.x, tt)))
}
-testing
df %>%
find_text_filter("gj")
# A tibble: 1 x 4
# a b d e
# <int> <chr> <chr> <chr>
#1 2 gjgkjguk hhh " kjihi"
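Note that str_detect treats the pattern as a regular expression by default, so characters such as "." are wildcards. If the search text should be matched literally, one option is stringr's fixed() modifier. The variant below is a sketch: find_text_fixed is a hypothetical name and the data are made up.

```r
library(dplyr)
library(stringr)

# Hypothetical variant of the function above that matches `tt` literally
find_text_fixed <- function(df, tt) {
  df %>%
    filter(if_any(where(is.character), ~ str_detect(.x, fixed(tt))))
}

# Hypothetical data
df <- data.frame(
  a = 1:3,
  b = c("a.b", "axb", "zzz"),
  d = c("hhh", "kkk", "a.b!"),
  stringsAsFactors = FALSE
)

# As a regex, "a.b" also matches "axb" because "." matches any character
regex_hits <- df %>%
  filter(if_any(where(is.character), ~ str_detect(.x, "a.b")))

# As a fixed string, only a literal "a.b" matches
fixed_hits <- find_text_fixed(df, "a.b")
```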