Remove groups with fewer than three unique observations
With data.table you could do:
library(data.table)
DT[, if(uniqueN(Day) >= 3) .SD, by = Group]
which gives:
Group Day
1: 1 1
2: 1 3
3: 1 5
4: 1 5
5: 3 1
6: 3 2
7: 3 3
Or with dplyr:
library(dplyr)
DT %>%
  group_by(Group) %>%
  filter(n_distinct(Day) >= 3)
which gives the same result.
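For comparison, a rough pandas sketch of the same idea. The input data here is reconstructed from the output shown above (the dropped Group 2 rows are an assumption), so treat the exact values as illustrative:

```python
import pandas as pd

# Data reconstructed from the example output; the Group 2 rows are assumed
DT = pd.DataFrame({"Group": [1, 1, 1, 1, 2, 2, 3, 3, 3],
                   "Day":   [1, 3, 5, 5, 1, 1, 1, 2, 3]})

# Keep groups whose Day column has at least 3 distinct values,
# mirroring uniqueN(Day) >= 3 / n_distinct(Day) >= 3
out = DT[DT.groupby("Group")["Day"].transform("nunique") >= 3]
```

`transform("nunique")` broadcasts the per-group distinct count back to every row, so the boolean mask keeps whole groups.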
How to delete groups containing fewer than 3 rows of data in R?
One way to do it is to use the magic n() function within filter:
library(dplyr)
my_data <- data.frame(Year=1996, Site="A", Brood=c(1,1,2,2,2))
my_data %>%
  group_by(Year, Site, Brood) %>%
  filter(n() >= 3)
The n() function gives the number of rows in the current group (or the total number of rows if there is no grouping).
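A minimal pandas sketch of the same pattern, using the example data from this answer; `len(g)` plays the role of dplyr's n():

```python
import pandas as pd

# Same toy data as the R example: Brood 1 has 2 rows, Brood 2 has 3
my_data = pd.DataFrame({"Year": 1996, "Site": "A", "Brood": [1, 1, 2, 2, 2]})

# groupby().filter keeps or drops whole groups; len(g) is the group's row count
kept = my_data.groupby(["Year", "Site", "Brood"]).filter(lambda g: len(g) >= 3)
```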
Remove groups based on number of observations below a certain value using dplyr
Try creating a new variable that counts the values meeting the condition, then filter on it:
library(dplyr)
#Code
new <- df %>%
  group_by(Group) %>%
  mutate(Var = sum(Count > 0)) %>%
  filter(Var > 1) %>%
  select(-Var)
Output:
# A tibble: 5 x 3
# Groups: Group [1]
Group Year Count
<chr> <dbl> <dbl>
1 B 1 10
2 B 2 15
3 B 3 8
4 B 4 0
5 B 5 6
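A pandas sketch of the same approach. The group A rows are an assumption (only the kept group B appears in the output above), chosen so that A has a single non-zero Count and gets dropped:

```python
import pandas as pd

# Hypothetical data: group A has one non-zero Count, group B has four
df = pd.DataFrame({"Group": ["A"] * 5 + ["B"] * 5,
                   "Year": list(range(1, 6)) * 2,
                   "Count": [0, 0, 10, 0, 0, 10, 15, 8, 0, 6]})

# Count the non-zero values per group and keep groups with more than one
new = df[df.groupby("Group")["Count"].transform(lambda s: (s > 0).sum()) > 1]
```

Using transform avoids the mutate-then-drop step: the helper count never becomes a column.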
Remove all groups with more than N observations
Using head:
df.groupby('Name').head(2)
Out[375]:
Name Num
0 X 1
1 X 2
2 Y 3
3 Y 4
Or keep only the groups whose size is at most 2:
s = df.groupby('Name').size() <= 2
df.loc[df.Name.isin(s[s].index)]
Out[380]:
Name Num
2 Y 3
3 Y 4
removing groups by group number of rows in pandas dataframe
You can use boolean indexing with a grouped transform:
df = df[df.groupby('token')['active'].transform('count').ge(3)]
output:
token active
2 63 5
3 63 9
4 63 0
Delete a group in data frame if they have the same values
We group by 'ID' and filter where 'Reading' has more than one unique element (n_distinct):
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(n_distinct(Reading) > 1)
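The pandas analogue of n_distinct is nunique. A small sketch with hypothetical data (every Reading for ID 1 identical, so that group is dropped):

```python
import pandas as pd

# Hypothetical data: ID 1 has all-identical Readings, ID 2 does not
df = pd.DataFrame({"ID": [1, 1, 2, 2, 2],
                   "Reading": [5, 5, 3, 4, 3]})

# transform("nunique") mirrors n_distinct(Reading) per group
out = df[df.groupby("ID")["Reading"].transform("nunique") > 1]
```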
Removing groups from dataframe if variable has repeated values
To test for consecutive identical values, you can compare a value to the previous value in that column. In dplyr, this is possible with lag. (You could do the same thing by comparing to the next value, using lead; the result comes out the same.)
Group the data by variable1, get the lag of variable2, then add up how many of these duplicates there are in that group. Then filter for just the groups with no duplicates. After that, feel free to remove the dupesInGroup column.
library(tidyverse)
df %>%
  group_by(variable1) %>%
  mutate(dupesInGroup = sum(variable2 == lag(variable2), na.rm = TRUE)) %>%
  filter(dupesInGroup == 0)
#> # A tibble: 5 x 3
#> # Groups: variable1 [2]
#> variable1 variable2 dupesInGroup
#> <int> <chr> <int>
#> 1 1 a 0
#> 2 1 b 0
#> 3 3 a 0
#> 4 3 c 0
#> 5 3 a 0
Created on 2018-05-10 by the reprex package (v0.2.0).
Remove group from data.frame if at least one group member meets condition
Try
library(dplyr)
df2 %>%
  group_by(group) %>%
  filter(!any(world == "AF"))
Or, as mentioned by @akrun:
setDT(df2)[, if(!any(world == "AF")) .SD, group]
Or
setDT(df2)[, if(all(world != "AF")) .SD, group]
Which gives:
#Source: local data frame [7 x 3]
#Groups: group
#
# world place group
#1 AB 1 1
#2 AC 1 1
#3 AD 2 1
#4 AB 1 3
#5 AE 2 3
#6 AC 3 3
#7 AE 1 3
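A pandas sketch of the any-member-matches pattern. The kept rows are reconstructed from the output above; the dropped group 2 (containing "AF") is an assumption:

```python
import pandas as pd

# Groups 1 and 3 come from the output above; group 2 (with "AF") is assumed
df2 = pd.DataFrame({"world": ["AB", "AC", "AD", "AF", "AG", "AB", "AE", "AC", "AE"],
                    "place": [1, 1, 2, 2, 3, 1, 2, 3, 1],
                    "group": [1, 1, 1, 2, 2, 3, 3, 3, 3]})

# Drop every group in which any row has world == "AF"
out = df2[~df2.groupby("group")["world"].transform(lambda s: (s == "AF").any())]
```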
Remove groups which do not have non-consecutive NA values in R
How about using the difference between the indices of the NA values per group?
library(dplyr)
df %>% group_by(group) %>% filter(any(diff(which(is.na(D))) > 1))
## A tibble: 8 x 2
## Groups: group [2]
# group D
# <dbl> <dbl>
#1 2. NA
#2 2. 2.
#3 2. NA
#4 2. NA
#5 4. NA
#6 4. 2.
#7 4. 3.
#8 4. NA
I'm not sure this would catch all potential edge cases but it seems to work for the given example.
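The same index-difference idea in a pandas sketch. The kept groups 2 and 4 are reconstructed from the output above; the dropped group 1, whose NAs are all consecutive, is an assumption:

```python
import pandas as pd
import numpy as np

# Groups 2 and 4 come from the output above; group 1 is assumed
df = pd.DataFrame({"group": [1, 1, 1, 2, 2, 2, 2, 4, 4, 4, 4],
                   "D": [np.nan, np.nan, 1,
                         np.nan, 2, np.nan, np.nan,
                         np.nan, 2, 3, np.nan]})

def has_nonconsecutive_na(s):
    # Positions of NA values within the group; any gap > 1 means
    # the NAs are not all consecutive
    idx = np.flatnonzero(s.isna().to_numpy())
    return bool((np.diff(idx) > 1).any())

out = df.groupby("group").filter(lambda g: has_nonconsecutive_na(g["D"]))
```

Like the R version, a group with zero or one NA produces no differences at all and is dropped.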
How to remove individuals with fewer than 5 observations from a data frame
An example using group_by and filter from the dplyr package:
library(dplyr)
df <- data.frame(id = c(rep("a", 2), rep("b", 5), rep("c", 8)),
                 foo = runif(15))
> df
id foo
1 a 0.8717067
2 a 0.9086262
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223
df %>% group_by(id) %>% filter(n() >= 5) %>% ungroup()
Source: local data frame [13 x 2]
id foo
(fctr) (dbl)
1 b 0.9962453
2 b 0.8980123
3 b 0.1535324
4 b 0.2802848
5 b 0.9366375
6 c 0.8109557
7 c 0.6945285
8 c 0.1012925
9 c 0.6822955
10 c 0.3757085
11 c 0.7348635
12 c 0.3026395
13 c 0.9707223
or with base R:
> df[df$id %in% names(which(table(df$id)>=5)), ]
id foo
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223
Still in base R, using with is a more elegant way to do the very same thing:
df[with(df, id %in% names(which(table(id)>=5))), ]
or:
subset(df, id %in% names(which(table(id) >= 5)))
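The table()-based counting approach translates to pandas via value_counts(). A sketch using the same id layout as the example (the random foo column is omitted since only the counts matter):

```python
import pandas as pd

# Same id layout as the R example: 2 a's, 5 b's, 8 c's
df = pd.DataFrame({"id": ["a"] * 2 + ["b"] * 5 + ["c"] * 8})

# value_counts() plays the role of table(); keep ids appearing at least 5 times
counts = df["id"].value_counts()
out = df[df["id"].isin(counts[counts >= 5].index)]
```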