Remove Groups with Fewer Than Three Unique Observations

With data.table you could do:

library(data.table)
DT[, if(uniqueN(Day) >= 3) .SD, by = Group]

which gives:

   Group Day
1:     1   1
2:     1   3
3:     1   5
4:     1   5
5:     3   1
6:     3   2
7:     3   3

Or with dplyr:

library(dplyr)
DT %>%
  group_by(Group) %>%
  filter(n_distinct(Day) >= 3)

which gives the same result.
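
The question's DT isn't shown above; a table like the following (the values for the dropped group are an assumed reconstruction) reproduces that output:

library(data.table)
# Group 2 has only two distinct Day values, so it is the one removed
DT <- data.table(Group = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                 Day   = c(1, 3, 5, 5, 1, 2, 1, 2, 3))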

How to delete groups containing fewer than 3 rows of data in R?

One way to do it is to use the magic n() function within filter:

library(dplyr)

my_data <- data.frame(Year = 1996, Site = "A", Brood = c(1, 1, 2, 2, 2))

my_data %>%
  group_by(Year, Site, Brood) %>%
  filter(n() >= 3)

The n() function gives the number of rows in the current group (or the total number of rows if there is no grouping).
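
For this data, only Brood 2 appears at least three times, so the filter keeps just those rows and prints roughly:

# A tibble: 3 x 3
# Groups:   Year, Site, Brood [1]
   Year Site  Brood
  <dbl> <chr> <dbl>
1  1996 A         2
2  1996 A         2
3  1996 A         2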

Remove groups based on number of observations below a certain value using dplyr

Try creating a helper variable that counts how many rows in each group meet the condition, then filter on it:

library(dplyr)
new <- df %>%
  group_by(Group) %>%
  mutate(Var = sum(Count > 0)) %>%
  filter(Var > 1) %>%
  select(-Var)

Output:

# A tibble: 5 x 3
# Groups:   Group [1]
  Group  Year Count
  <chr> <dbl> <dbl>
1 B         1    10
2 B         2    15
3 B         3     8
4 B         4     0
5 B         5     6
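
The input df isn't included in the question; a frame like this one (group A's values are an assumption) yields the output above:

df <- data.frame(Group = rep(c("A", "B"), each = 5),
                 Year  = rep(1:5, 2),
                 Count = c(0, 0, 3, 0, 0,    # A: one positive year, dropped
                           10, 15, 8, 0, 6)) # B: four positive years, kept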

Remove all groups with more than N observations

Using head, which keeps only the first two rows of each group:

df.groupby('Name').head(2)
Out[375]:
  Name  Num
0    X    1
1    X    2
2    Y    3
3    Y    4

Or, to drop the oversized groups entirely rather than truncating them, filter on the group sizes:

s = df.groupby('Name').size() <= 2
df.loc[df.Name.isin(s[s].index)]
Out[380]:
  Name  Num
2    Y    3
3    Y    4
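
For reference, a DataFrame like this one (an assumed reconstruction) produces both outputs:

import pandas as pd

# 'X' appears three times (too many), 'Y' twice
df = pd.DataFrame({'Name': ['X', 'X', 'Y', 'Y', 'X'],
                   'Num':  [1, 2, 3, 4, 5]})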

Removing groups by number of rows in a pandas DataFrame

You can use boolean indexing on the transformed group sizes:

df = df[df.groupby('token')['active'].transform('count').ge(3)]

output:

   token  active
2     63       5
3     63       9
4     63       0
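
Here too the source df isn't shown; something like this (the dropped token's rows are an assumption) matches the output:

import pandas as pd

# token 21 has only two rows, token 63 has three
df = pd.DataFrame({'token':  [21, 21, 63, 63, 63],
                   'active': [1, 0, 5, 9, 0]})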

Delete a group in a data frame if all its values are the same

We group by 'ID' and filter the groups where 'Reading' has more than one unique element (n_distinct):

library(dplyr)
df %>%
  group_by(ID) %>%
  filter(n_distinct(Reading) > 1)
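
A minimal df for this answer could look like the following (assumed values); group 1 has a single distinct Reading, so it is dropped:

df <- data.frame(ID      = c(1, 1, 1, 2, 2, 2),
                 Reading = c(5, 5, 5, 3, 7, 3))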

Removing groups from a data frame if a variable has repeated values

To test for consecutive identical values, you can compare each value to the previous value in its column. In dplyr, this is possible with lag. (You could do the same thing by comparing to the next value with lead; the result comes out the same.)

Group the data by variable1, get the lag of variable2, then add up how many of these duplicates there are in that group. Then filter for just the groups with no duplicates. After that, feel free to remove the dupesInGroup column.

library(tidyverse)

df %>%
  group_by(variable1) %>%
  mutate(dupesInGroup = sum(variable2 == lag(variable2), na.rm = TRUE)) %>%
  filter(dupesInGroup == 0)
#> # A tibble: 5 x 3
#> # Groups:   variable1 [2]
#>   variable1 variable2 dupesInGroup
#>       <int> <chr>            <int>
#> 1         1 a                    0
#> 2         1 b                    0
#> 3         3 a                    0
#> 4         3 c                    0
#> 5         3 a                    0

Created on 2018-05-10 by the reprex package (v0.2.0).
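
The df isn't included in the answer; this reconstruction (group 2's values are an assumption) produces the printed result:

df <- data.frame(variable1 = c(1, 1, 2, 2, 2, 3, 3, 3),
                 variable2 = c("a", "b",        # group 1: no repeats, kept
                               "a", "a", "b",   # group 2: consecutive "a"s, dropped
                               "a", "c", "a"))  # group 3: repeats, but not consecutive, kept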

Remove group from data.frame if at least one group member meets condition

Try

library(dplyr)
df2 %>%
  group_by(group) %>%
  filter(!any(world == "AF"))

Or, as mentioned by @akrun, with data.table:

setDT(df2)[, if(!any(world == "AF")) .SD, group]

Or

setDT(df2)[, if(all(world != "AF")) .SD, group]

Which gives:

#Source: local data frame [7 x 3]
#Groups: group
#
#  world place group
#1    AB     1     1
#2    AC     1     1
#3    AD     2     1
#4    AB     1     3
#5    AE     2     3
#6    AC     3     3
#7    AE     1     3
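
A df2 consistent with that output would be (group 2, the one containing "AF", is an assumption):

df2 <- data.frame(world = c("AB", "AC", "AD", "AF", "AE", "AB", "AE", "AC", "AE"),
                  place = c(1, 1, 2, 1, 2, 1, 2, 3, 1),
                  group = c(1, 1, 1, 2, 2, 3, 3, 3, 3))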

Remove groups which do not have non-consecutive NA values in R

How about using the differences between the indices of the NA values within each group?

library(dplyr)
df %>% group_by(group) %>% filter(any(diff(which(is.na(D))) > 1))

## A tibble: 8 x 2
## Groups: group [2]
#  group     D
#  <dbl> <dbl>
#1     2    NA
#2     2     2
#3     2    NA
#4     2    NA
#5     4    NA
#6     4     2
#7     4     3
#8     4    NA

I'm not sure this would catch all potential edge cases but it seems to work for the given example.
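
To make the example concrete, here is one possible df (the two removed groups are assumptions):

df <- data.frame(group = rep(c(1, 2, 3, 4), each = 4),
                 D = c(1, NA, NA, 4,    # group 1: consecutive NAs, dropped
                       NA, 2, NA, NA,   # group 2: NA positions 1, 3, 4, kept
                       NA, 2, 3, 4,     # group 3: single NA, dropped
                       NA, 2, 3, NA))   # group 4: NA positions 1 and 4, kept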

How to remove individuals with fewer than 5 observations from a data frame

An example using group_by and filter from the dplyr package:

library(dplyr)
df <- data.frame(id  = c(rep("a", 2), rep("b", 5), rep("c", 8)),
                 foo = runif(15))

> df
   id       foo
1   a 0.8717067
2   a 0.9086262
3   b 0.9962453
4   b 0.8980123
5   b 0.1535324
6   b 0.2802848
7   b 0.9366375
8   c 0.8109557
9   c 0.6945285
10  c 0.1012925
11  c 0.6822955
12  c 0.3757085
13  c 0.7348635
14  c 0.3026395
15  c 0.9707223

df %>% group_by(id) %>% filter(n() >= 5) %>% ungroup()
Source: local data frame [13 x 2]

       id       foo
   (fctr)     (dbl)
1       b 0.9962453
2       b 0.8980123
3       b 0.1535324
4       b 0.2802848
5       b 0.9366375
6       c 0.8109557
7       c 0.6945285
8       c 0.1012925
9       c 0.6822955
10      c 0.3757085
11      c 0.7348635
12      c 0.3026395
13      c 0.9707223

or with base R:

> df[df$id %in% names(which(table(df$id) >= 5)), ]
   id       foo
3   b 0.9962453
4   b 0.8980123
5   b 0.1535324
6   b 0.2802848
7   b 0.9366375
8   c 0.8109557
9   c 0.6945285
10  c 0.1012925
11  c 0.6822955
12  c 0.3757085
13  c 0.7348635
14  c 0.3026395
15  c 0.9707223

Still in base R, using with is a more elegant way to do the very same thing:

df[with(df, id %in% names(which(table(id) >= 5))), ]

or:

subset(df, with(df, id %in% names(which(table(id) >= 5))))
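
Another base R option, not from the original answer but an equivalent idiom, computes the group sizes row-wise with ave:

# keep rows whose id occurs at least 5 times
df[ave(seq_along(df$id), df$id, FUN = length) >= 5, ]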

