Group by and Filter Data Management Using Dplyr

Try

d %>% 
  group_by(c) %>% 
  filter(any(b == 1))

Which gives:

#Source: local data frame [6 x 3]
#Groups: c
#
#  a b c
#1 1 1 1
#2 2 2 1
#3 3 2 1
#4 4 1 2
#5 5 2 2
#6 6 2 2
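
The original `d` is not shown, so here is a minimal sketch with a made-up data frame (the third group is an assumption added for illustration). It shows that any() is evaluated once per group, so a group is either kept whole or dropped whole:

```r
library(dplyr)

# Hypothetical data: groups 1 and 2 contain b == 1, group 3 does not
d <- data.frame(
  a = 1:9,
  b = c(1, 2, 2, 1, 2, 2, 2, 2, 2),
  c = rep(1:3, each = 3)
)

kept <- d %>%
  group_by(c) %>%
  filter(any(b == 1)) %>%   # one TRUE/FALSE per group, recycled to every row
  ungroup()

# Group 3 is dropped as a whole; groups 1 and 2 are kept intact
unique(kept$c)
```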

How do you filter out data in the first group based on data in the second group in dplyr and/or tidyverse

Here are a couple of ways to keep only the horses that have raced more than 3 times as 2-year-olds.

Using filter -

library(dplyr)

df %>%
  group_by(horse) %>%
  filter(sum(age == 2) > 3) %>%
  ungroup()

#   horse   age value
#   <chr> <dbl> <dbl>
# 1 a         2    20
# 2 a         2    21
# 3 a         2    19
# 4 a         2    23
# 5 a         2    20
# 6 a         3    17
# 7 a         3    16
# 8 a         3    23
# 9 a         4    24
#10 a         4    14
# … with 12 more rows

Using join

df %>%
  filter(age == 2) %>%
  count(horse) %>%
  filter(n > 3) %>%
  select(-n) %>%
  left_join(df, by = 'horse')
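
A possible variation on the join approach, sketched with made-up data (the values in `df` and the name `qualifying` are assumptions): semi_join() keeps the rows of `df` that have a match in the lookup table and never adds columns, so the select(-n) step is not needed.

```r
library(dplyr)

# Hypothetical data: horse "a" raced 4 times at age 2, horse "b" only twice
df <- data.frame(
  horse = c(rep("a", 6), rep("b", 4)),
  age   = c(2, 2, 2, 2, 3, 3, 2, 2, 3, 3),
  value = c(20, 21, 19, 23, 17, 16, 18, 22, 15, 14)
)

# Horses with more than 3 races at age 2
qualifying <- df %>%
  filter(age == 2) %>%
  count(horse) %>%
  filter(n > 3)

# Keep every row of df whose horse appears in `qualifying`
result <- semi_join(df, qualifying, by = "horse")
```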

Data Filter based on object and variable using R

From the dataset structure it seems you have some whitespace in your data. You can use trimws to remove it.

dplyr::filter(DfUse, trimws(InstanceType) == 'a1.2xlarge')

With base R subset -

subset(DfUse, trimws(InstanceType) == 'a1.2xlarge')
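
If the column is filtered more than once, it may be simpler to clean it a single time. A small sketch, assuming a made-up `DfUse` (the `Cost` column and all values are hypothetical):

```r
# Hypothetical data: the first and third values carry stray whitespace
DfUse <- data.frame(
  InstanceType = c("a1.2xlarge ", "a1.large", " a1.2xlarge"),
  Cost = c(10, 5, 12)
)

# A plain comparison finds nothing because of the padding
raw_matches <- sum(DfUse$InstanceType == "a1.2xlarge")

# Trim once, then filter without repeating trimws()
DfUse$InstanceType <- trimws(DfUse$InstanceType)
matched <- subset(DfUse, InstanceType == "a1.2xlarge")
```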

Filter in a dplyr group only when the condition is met else do not

Actually, I found the answer in another related question.

This uses a data.table one liner which in my case was:

library(data.table)

test <- setDT(test)[, if(any(is.na(stamp_score))) .SD[is.na(stamp_score)] else .SD, .(hit, indx)]

Essentially, this code subsets a group to its NA rows only if the group has an NA in the "stamp_score" column; otherwise it keeps the whole group.

Thanks to everyone who tried to help and also helped me improve my question over time.
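
For readers who prefer to stay in dplyr, the same logic can probably be written as a single filter condition. This is a sketch on made-up data (the values in `test` are assumptions), not the original poster's code:

```r
library(dplyr)

# Hypothetical data: group (1, 1) has an NA score, group (2, 1) has none
test <- data.frame(
  hit  = c(1, 1, 1, 2, 2),
  indx = c(1, 1, 1, 1, 1),
  stamp_score = c(NA, 0.5, 0.7, 0.2, 0.9)
)

result <- test %>%
  group_by(hit, indx) %>%
  # Keep NA rows; in groups without any NA the second term keeps every row
  filter(is.na(stamp_score) | !any(is.na(stamp_score))) %>%
  ungroup()
```

Group (1, 1) is reduced to its NA row, while group (2, 1) is returned unchanged.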

Want to group a data set and filter it twice in R

Update:
With group_split from the dplyr package you get a list of data frames, one per group, each containing only the rows for that group. Once you have your groups, you can apply your regression analysis to each list element.
To access a list element, use df1[[1]] and so on; see the example below.

Once you are comfortable with this kind of operation, please ask the second part of your question as a new question. :-)

library(dplyr)
df1 <- df %>% 
  group_by(SIC, Year) %>% 
  group_split()
  
  
df1[[1]]
  Company name  Location  Year Sales  Assets   SIC
  <chr>   <chr> <chr>    <int> <int>   <int> <int>
1 Company C     USA       2002 90000 1900000   100

df1[[2]]
 Company name  Location  Year Sales  Assets   SIC
  <chr>   <chr> <chr>    <int> <int>   <int> <int>
1 Company C     USA       2008 78000 2000000   100
2 Company D     USA       2008 69420  964220   100

etc...

[[1]]
# A tibble: 1 x 7
  Company name  Location  Year Sales  Assets   SIC
  <chr>   <chr> <chr>    <int> <int>   <int> <int>
1 Company C     USA       2002 90000 1900000   100

[[2]]
# A tibble: 2 x 7
  Company name  Location  Year Sales  Assets   SIC
  <chr>   <chr> <chr>    <int> <int>   <int> <int>
1 Company C     USA       2008 78000 2000000   100
2 Company D     USA       2008 69420  964220   100

[[3]]
# A tibble: 1 x 7
  Company name  Location  Year Sales  Assets   SIC
  <chr>   <chr> <chr>    <int> <int>   <int> <int>
1 Company A     USA       2000 50000 1500000  9997

[[4]]
# A tibble: 1 x 7
  Company name  Location  Year  Sales  Assets   SIC
  <chr>   <chr> <chr>    <int>  <int>   <int> <int>
1 Company B     USA       2001 100000 1000000  9997

[[5]]
# A tibble: 2 x 7
  Company name  Location  Year  Sales  Assets   SIC
  <chr>   <chr> <chr>    <int>  <int>   <int> <int>
1 Company A     USA       2002  50000 1500000  9997
2 Company B     USA       2002 110000 1100000  9997

[[6]]
# A tibble: 1 x 7
  Company name  Location  Year Sales  Assets   SIC
  <chr>   <chr> <chr>    <int> <int>   <int> <int>
1 Company A     USA       2003 80000 1800000  9997

[[7]]
# A tibble: 1 x 7
  Company name  Location  Year Sales  Assets   SIC
  <chr>   <chr> <chr>    <int> <int>   <int> <int>
1 Company B     USA       2005 90000 1200000  9997

[[8]]
# A tibble: 1 x 7
  Company name  Location  Year  Sales  Assets   SIC
  <chr>   <chr> <chr>    <int>  <int>   <int> <int>
1 Company B     USA       2006 100000 1200000  9997
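
As a rough illustration of applying a model to each list element, here is a sketch on made-up data. The numbers, the grouping by SIC alone, and the formula Sales ~ Assets are all assumptions; groups with a single row, like several of those above, are too small to fit a regression.

```r
library(dplyr)

# Hypothetical data: two SIC groups with four observations each
df <- data.frame(
  SIC    = rep(c(100, 9997), each = 4),
  Assets = c(10, 20, 30, 40, 15, 25, 35, 45),
  Sales  = c(12, 19, 33, 41, 20, 24, 39, 44)
)

groups <- df %>%
  group_by(SIC) %>%
  group_split()

# Fit one linear model per list element; lm() is only a stand-in here
models <- lapply(groups, function(g) lm(Sales ~ Assets, data = g))

# Extract one slope per group
slopes <- sapply(models, function(m) coef(m)[["Assets"]])
```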

We could first use group_by and then summarise()

library(dplyr)
df %>% 
  group_by(Company, name, Year, SIC) %>% 
  summarise()

   Company name   Year   SIC
   <chr>   <chr> <int> <int>
 1 Company A      2000  9997
 2 Company A      2002  9997
 3 Company A      2003  9997
 4 Company B      2001  9997
 5 Company B      2002  9997
 6 Company B      2005  9997
 7 Company B      2006  9997
 8 Company C      2002   100
 9 Company C      2008   100
10 Company D      2008   100
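
Calling summarise() with no summary expressions returns one row per group, which is close to what distinct() does on the same columns. A sketch comparing the two on made-up data (the rows in `df` are assumptions):

```r
library(dplyr)

# Hypothetical data with a repeated (Company, name, Year, SIC) combination
df <- data.frame(
  Company = "Company",
  name    = c("A", "A", "B"),
  Year    = c(2000L, 2000L, 2001L),
  SIC     = c(9997L, 9997L, 9997L),
  Sales   = c(50000L, 51000L, 100000L)
)

via_summarise <- df %>%
  group_by(Company, name, Year, SIC) %>%
  summarise(.groups = "drop")   # .groups = "drop" returns an ungrouped result

via_distinct <- df %>%
  distinct(Company, name, Year, SIC)
```

One difference to keep in mind: summarise() returns rows sorted by the grouping columns, while distinct() keeps the order of first appearance.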

Filtering out groups that only have one type of value in R

You can use any to keep only the groups that have at least one value which is not 'NotActive'.

In dplyr, you can use -

library(dplyr)
example %>%  group_by(UserID) %>% filter(any(Status != 'NotActive'))

#   UserID Status   
#   <chr>  <chr>    
# 1 AAA    Cluster 1
# 2 AAA    Cluster 1
# 3 AAA    Cluster 1
# 4 AAA    NotActive
# 5 AAA    NotActive
# 6 AAA    Cluster 1
# 7 AAA    Cluster 2
# 8 AAA    Cluster 2
# 9 AAA    Cluster 2
#10 CCC    NotActive
#11 CCC    NotActive
#12 CCC    NotActive
#13 CCC    NotActive
#14 CCC    Cluster 1
#15 CCC    Cluster 1
#16 CCC    NotActive

The same in base R and data.table.

#Base R
subset(example, ave(Status != 'NotActive', UserID, FUN = any))


#data.table
library(data.table)
setDT(example)[, .SD[any(Status != 'NotActive')], UserID]
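
To check which groups were removed, the condition can be inverted with all(). A sketch on made-up data (the user BBB and all values are assumptions):

```r
library(dplyr)

# Hypothetical data: user BBB has only 'NotActive' rows
example <- data.frame(
  UserID = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC"),
  Status = c("Cluster 1", "NotActive", "NotActive", "NotActive",
             "NotActive", "Cluster 2")
)

# Groups with at least one status other than 'NotActive'
kept <- example %>%
  group_by(UserID) %>%
  filter(any(Status != "NotActive")) %>%
  ungroup()

# The complement: groups made up entirely of 'NotActive' rows
dropped <- example %>%
  group_by(UserID) %>%
  filter(all(Status == "NotActive")) %>%
  ungroup()
```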

R: Using dplyr to find and filter for a string in a whole data frame

We could use str_detect

library(dplyr)
library(stringr)
find_text_filter <- function(df, tt) {
  df %>%
    filter(if_any(where(is.character), ~ str_detect(.x, tt)))
}

Testing:

df %>%
  find_text_filter("gj")
# A tibble: 1 x 4
#      a b        d     e        
#  <int> <chr>    <chr> <chr>    
#1     2 gjgkjguk hhh   "  kjihi"
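
Note that str_detect treats its pattern as a regular expression by default. If the search text should be matched literally, wrapping it in fixed() is one option. A sketch with a hypothetical variant named find_text_fixed and made-up data:

```r
library(dplyr)
library(stringr)

# Hypothetical variant that treats `tt` as a literal string, not a regex
find_text_fixed <- function(df, tt) {
  df %>%
    filter(if_any(where(is.character), ~ str_detect(.x, fixed(tt))))
}

# Hypothetical data: as a regex, "a.c" would also match "abc"
df <- tibble(
  a = 1:3,
  b = c("abc", "gjgkjguk", "a.c"),
  d = c("xyz", "hhh", "zzz")
)

# Only the row containing the literal text "a.c" is returned
hits_literal <- find_text_fixed(df, "a.c")
```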
