Group by and Filter Data Management Using Dplyr

Try

d %>%
  group_by(c) %>%
  filter(any(b == 1))

Which gives:

#Source: local data frame [6 x 3]
#Groups: c
#
# a b c
#1 1 1 1
#2 2 2 1
#3 3 2 1
#4 4 1 2
#5 5 2 2
#6 6 2 2
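
To see why whole groups are kept or dropped, here is a minimal sketch on made-up data. The data frame below is hypothetical (in particular, the third group is an assumption added for illustration), and `kept` is just an illustrative name.

```r
library(dplyr)

# Hypothetical data: groups 1 and 2 contain b == 1, group 3 does not.
d <- data.frame(
  a = 1:9,
  b = c(1, 2, 2, 1, 2, 2, 2, 2, 2),
  c = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)

kept <- d %>%
  group_by(c) %>%
  filter(any(b == 1)) %>%   # any() yields a single TRUE/FALSE per group
  ungroup()

# Group 3 has no b == 1, so all of its rows are dropped.
unique(kept$c)
```

Because any() returns one value per group, filter() keeps or drops each group as a whole rather than row by row.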

How do you filter out data in the first group based on data in the second group in dplyr and/or tidyverse

Here are a couple of ways to keep the horses that have raced more than 3 times as 2-year-olds.

  1. Using filter -
library(dplyr)

df %>%
  group_by(horse) %>%
  filter(sum(age == 2) > 3) %>%
  ungroup()

# horse age value
# <chr> <dbl> <dbl>
# 1 a 2 20
# 2 a 2 21
# 3 a 2 19
# 4 a 2 23
# 5 a 2 20
# 6 a 3 17
# 7 a 3 16
# 8 a 3 23
# 9 a 4 24
#10 a 4 14
# … with 12 more rows

  2. Using join -
df %>%
filter(age == 2) %>%
count(horse) %>%
filter(n > 3) %>%
select(-n) %>%
left_join(df, by = 'horse')
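
As a quick check that the two approaches agree, here is a sketch on a small made-up data set. The values in df are hypothetical, and by_filter and by_join are illustrative names.

```r
library(dplyr)

# Hypothetical data: horse "a" ran 5 times at age 2, horse "b" only twice.
df <- data.frame(
  horse = c(rep("a", 7), rep("b", 4)),
  age   = c(2, 2, 2, 2, 2, 3, 3, 2, 2, 3, 4),
  value = c(20, 21, 19, 23, 20, 17, 16, 18, 22, 15, 14)
)

# Approach 1: grouped filter.
by_filter <- df %>%
  group_by(horse) %>%
  filter(sum(age == 2) > 3) %>%
  ungroup()

# Approach 2: find the qualifying horses, then join their rows back.
by_join <- df %>%
  filter(age == 2) %>%
  count(horse) %>%
  filter(n > 3) %>%
  select(-n) %>%
  left_join(df, by = "horse")

# Both keep every row of horse "a" (including its age 3 races) and drop "b".
nrow(by_filter)
nrow(by_join)
```

Note that both versions return all rows of a qualifying horse, not just its races as a 2-year-old.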

Data Filter based on object and variable using R

From the dataset structure it seems you have some whitespace in your data. You can use trimws to remove it.

dplyr::filter(DfUse, trimws(InstanceType) == 'a1.2xlarge')

With base R subset -

subset(DfUse, trimws(InstanceType) == 'a1.2xlarge')
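
To illustrate the difference, here is a small sketch using base R only. The contents of DfUse below are made up (the Cost column and the stray spaces are assumptions for the example).

```r
# Hypothetical data: the first and third values carry stray spaces.
DfUse <- data.frame(
  InstanceType = c("a1.2xlarge ", "a1.2xlarge", " a1.2xlarge", "m5.large"),
  Cost = c(10, 12, 11, 30),
  stringsAsFactors = FALSE
)

# A direct comparison matches only the clean value.
exact <- subset(DfUse, InstanceType == "a1.2xlarge")

# trimws() strips leading and trailing whitespace before comparing.
trimmed <- subset(DfUse, trimws(InstanceType) == "a1.2xlarge")

nrow(exact)
nrow(trimmed)
```

If the whitespace is never wanted, it may be simpler to clean the column once with DfUse$InstanceType <- trimws(DfUse$InstanceType) and filter normally afterwards.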

Filter in a dplyr group only when the condition is met else do not

Actually, I found the answer in another related question.

It uses a data.table one-liner, which in my case was:

library(data.table)

test <- setDT(test)[, if(any(is.na(stamp_score))) .SD[is.na(stamp_score)] else .SD, .(hit, indx)]

Essentially, this code reduces a group to its NA rows only if the group has an NA in the "stamp_score" column; otherwise the group is kept unchanged.

Thanks to everyone who tried to help and also helped me improve my question over time.
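
For anyone who wants to stay in dplyr, the same rule can probably be written as a grouped filter. This is a sketch of an equivalent, not the data.table answer itself; the data in test is hypothetical and res is an illustrative name.

```r
library(dplyr)

# Hypothetical data: group (hit = 1, indx = 1) contains an NA, group (2, 1) does not.
test <- data.frame(
  hit = c(1, 1, 1, 2, 2),
  indx = c(1, 1, 1, 1, 1),
  stamp_score = c(NA, 5, 7, 3, 4)
)

res <- test %>%
  group_by(hit, indx) %>%
  # Keep NA rows; in groups without any NA the second term is TRUE for every row.
  filter(is.na(stamp_score) | !any(is.na(stamp_score))) %>%
  ungroup()

res
```

The first group is reduced to its NA row, while the second group, which has no NA, passes through untouched.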

Want to group data set and filter it twice in r

Update:
With group_split from the dplyr package you get a list of data frames, one per group, each containing only the rows for that group. Once you have your groups, you can apply your regression analysis to each list element.
To access a list element, use df1[[1]] and so on; see the example below.

Once you understand this kind of operation, please post the second part of your question as a new question. :-)

library(dplyr)
df1 <- df %>%
  group_by(SIC, Year) %>%
  group_split()


df1[[1]]
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company C USA 2002 90000 1900000 100

df1[[2]]
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company C USA 2008 78000 2000000 100
2 Company D USA 2008 69420 964220 100

etc...
[[1]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company C USA 2002 90000 1900000 100

[[2]]
# A tibble: 2 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company C USA 2008 78000 2000000 100
2 Company D USA 2008 69420 964220 100

[[3]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company A USA 2000 50000 1500000 9997

[[4]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company B USA 2001 100000 1000000 9997

[[5]]
# A tibble: 2 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company A USA 2002 50000 1500000 9997
2 Company B USA 2002 110000 1100000 9997

[[6]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company A USA 2003 80000 1800000 9997

[[7]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company B USA 2005 90000 1200000 9997

[[8]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company B USA 2006 100000 1200000 9997
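
Applying an analysis to each list element could look roughly like the sketch below. Everything here is an assumption for illustration: the data is made up, the model Sales ~ Assets is only a placeholder, and the grouping is by SIC alone because a group with a single row cannot support a fitted slope.

```r
library(dplyr)

# Hypothetical data with enough rows per group to fit a line.
df <- data.frame(
  SIC    = rep(c(100, 9997), each = 4),
  Assets = c(10, 20, 30, 40, 15, 25, 35, 45),
  Sales  = c(12, 19, 33, 41, 30, 52, 69, 91)
)

groups <- df %>%
  group_by(SIC) %>%
  group_split()

# Fit the placeholder model separately for each list element.
fits <- lapply(groups, function(g) lm(Sales ~ Assets, data = g))

# One row of coefficients per group.
coefs <- t(sapply(fits, coef))
coefs
```

Each element of fits is an ordinary lm object, so summary(fits[[1]]) works the same way df1[[1]] does for the data.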

We could first use group_by and then summarise()

library(dplyr)
df %>%
  group_by(Company, name, Year, SIC) %>%
  summarise()
   Company name   Year   SIC
<chr> <chr> <int> <int>
1 Company A 2000 9997
2 Company A 2002 9997
3 Company A 2003 9997
4 Company B 2001 9997
5 Company B 2002 9997
6 Company B 2005 9997
7 Company B 2006 9997
8 Company C 2002 100
9 Company C 2008 100
10 Company D 2008 100
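
If only the unique combinations are needed, distinct() should give the same rows without grouping first. This is a sketch on made-up data, offered as an alternative rather than the method above; via_summarise and via_distinct are illustrative names.

```r
library(dplyr)

# Hypothetical data with repeated Company/Year/SIC combinations.
df <- tibble(
  Company = c("A", "A", "B", "B", "B"),
  Year    = c(2000L, 2000L, 2001L, 2001L, 2002L),
  SIC     = c(9997L, 9997L, 9997L, 9997L, 100L)
)

# summarise() with no summary expressions returns one row per group.
via_summarise <- df %>%
  group_by(Company, Year, SIC) %>%
  summarise(.groups = "drop")

# distinct() returns the unique combinations directly.
via_distinct <- df %>%
  distinct(Company, Year, SIC)

nrow(via_summarise)
nrow(via_distinct)
```

One difference to keep in mind: summarise() orders its result by the grouping keys, while distinct() keeps rows in order of first appearance.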

Filtering out groups that only have one type of value in R

You can use any to keep only the groups that have at least one value which is not 'NotActive'.

In dplyr, you can use -

library(dplyr)
example %>% group_by(UserID) %>% filter(any(Status != 'NotActive'))

# UserID Status
# <chr> <chr>
# 1 AAA Cluster 1
# 2 AAA Cluster 1
# 3 AAA Cluster 1
# 4 AAA NotActive
# 5 AAA NotActive
# 6 AAA Cluster 1
# 7 AAA Cluster 2
# 8 AAA Cluster 2
# 9 AAA Cluster 2
#10 CCC NotActive
#11 CCC NotActive
#12 CCC NotActive
#13 CCC NotActive
#14 CCC Cluster 1
#15 CCC Cluster 1
#16 CCC NotActive

The same in base R and data.table.

#Base R
subset(example, ave(Status != 'NotActive', UserID, FUN = any))


#data.table
library(data.table)
setDT(example)[, .SD[any(Status != 'NotActive')], UserID]
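
Here is a small reproducible sketch showing a group that is actually removed. The data is hypothetical (user "BBB" is an assumption added so that one group is entirely 'NotActive'), and with_dplyr and with_base are illustrative names.

```r
library(dplyr)

# Hypothetical data: user "BBB" is 'NotActive' in every row.
example <- data.frame(
  UserID = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC"),
  Status = c("Cluster 1", "NotActive", "NotActive", "NotActive",
             "NotActive", "Cluster 2"),
  stringsAsFactors = FALSE
)

with_dplyr <- example %>%
  group_by(UserID) %>%
  filter(any(Status != "NotActive")) %>%
  ungroup()

# Same rule in base R: ave() broadcasts any() back over each group's rows.
with_base <- subset(example, ave(Status != "NotActive", UserID, FUN = any))

unique(with_dplyr$UserID)
unique(with_base$UserID)
```

The 'NotActive' rows of AAA and CCC are kept, because the condition is evaluated per group and not per row.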

R: Using dplyr to find and filter for a string in a whole data frame

We could use str_detect

library(dplyr)
library(stringr)
find_text_filter <- function(df, tt) {
  df %>%
    filter(if_any(where(is.character), ~ str_detect(.x, tt)))
}

Testing:

df %>%
  find_text_filter("gj")
# A tibble: 1 x 4
# a b d e
# <int> <chr> <chr> <chr>
#1 2 gjgkjguk hhh " kjihi"
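
One thing to watch: str_detect treats the pattern as a regular expression, so a search string such as "." matches almost everything. If literal matching is wanted, a variant using stringr's fixed() might look like the sketch below. The name find_text_fixed and the data are hypothetical.

```r
library(dplyr)
library(stringr)

# Variant of find_text_filter that matches tt literally rather than as a regex.
find_text_fixed <- function(df, tt) {
  df %>%
    filter(if_any(where(is.character), ~ str_detect(.x, fixed(tt))))
}

# Hypothetical data.
df <- tibble(
  a = 1:3,
  b = c("abc", "gjgkjguk", "x.y"),
  d = c("hhh", "hhh", "axy")
)

hits_gj  <- find_text_fixed(df, "gj")  # matches the row where b contains "gj"
hits_dot <- find_text_fixed(df, ".")   # "." is a literal dot here, not "any character"
```

With the regex version, searching for "." would return every row that has a non-empty string, which is rarely what a text search intends.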

