Remove IDs with fewer than 9 unique observations
We can use n_distinct to remove IDs with fewer than 9 unique observations:
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(n_distinct(data.month) >= 9) %>%
  pull(ID) %>%
  unique()
#[1] 2 4
Or
df %>%
  group_by(ID) %>%
  filter(n_distinct(data.month) >= 9) %>%
  distinct(ID)
# ID
# <int>
#1 2
#2 4
For the unique counts of each ID:
df %>%
  group_by(ID) %>%
  summarise(count = n_distinct(data.month))
# ID count
# <int> <int>
#1 2 12
#2 4 12
#3 5 2
#4 7 1
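The input df is never shown in this answer; a hypothetical data frame consistent with the counts above (IDs 2 and 4 with 12 distinct months, ID 5 with 2, ID 7 with 1) could be sketched as:

```r
library(dplyr)

# Hypothetical input: the original df is not shown, so this is an
# assumption reverse-engineered from the counts printed above.
df <- data.frame(
  ID         = rep(c(2L, 4L, 5L, 7L), times = c(12, 12, 2, 1)),
  data.month = c(1:12, 1:12, 3L, 7L, 5L)
)

df %>%
  group_by(ID) %>%
  filter(n_distinct(data.month) >= 9) %>%
  pull(ID) %>%
  unique()
#> [1] 2 4
```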
How to remove individuals with fewer than 5 observations from a data frame
An example using group_by and filter from the dplyr package:
library(dplyr)
df <- data.frame(id=c(rep("a", 2), rep("b", 5), rep("c", 8)),
foo=runif(15))
> df
id foo
1 a 0.8717067
2 a 0.9086262
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223
df %>% group_by(id) %>% filter(n()>= 5) %>% ungroup()
Source: local data frame [13 x 2]
id foo
(fctr) (dbl)
1 b 0.9962453
2 b 0.8980123
3 b 0.1535324
4 b 0.2802848
5 b 0.9366375
6 c 0.8109557
7 c 0.6945285
8 c 0.1012925
9 c 0.6822955
10 c 0.3757085
11 c 0.7348635
12 c 0.3026395
13 c 0.9707223
or with base R:
> df[df$id %in% names(which(table(df$id)>=5)), ]
id foo
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223
Still in base R, using with is a more elegant way to do the very same thing:
df[with(df, id %in% names(which(table(id)>=5))), ]
or:
subset(df, with(df, id %in% names(which(table(id)>=5))))
Remove groups with fewer than three unique observations
With data.table you could do:
library(data.table)
DT[, if(uniqueN(Day) >= 3) .SD, by = Group]
which gives:
Group Day
1: 1 1
2: 1 3
3: 1 5
4: 1 5
5: 3 1
6: 3 2
7: 3 3
Or with dplyr:
library(dplyr)
DT %>%
  group_by(Group) %>%
  filter(n_distinct(Day) >= 3)
which gives the same result.
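DT itself is not shown; a hypothetical data.table consistent with the result above (Group 2 is dropped because it has fewer than 3 distinct Days) could be:

```r
library(data.table)

# Hypothetical input, an assumption matching the printed result:
# Groups 1 and 3 have >= 3 distinct Days, Group 2 has only 1.
DT <- data.table(
  Group = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L),
  Day   = c(1L, 3L, 5L, 5L, 4L, 4L, 1L, 2L, 3L)
)

DT[, if (uniqueN(Day) >= 3) .SD, by = Group]
```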
How do I remove all unique values with certain amount of observations?
We can group by 'subproduct' and filter the groups whose number of observations (n()) is greater than or equal to 10:
library(dplyr)
dfone %>%
  group_by(subproduct) %>%
  filter(n() >= 10) %>%
  ungroup()
Or without any package dependency:
subset(dfone, subproduct %in% names(which(table(subproduct) >= 10)))
Omit ID's which occur less than x times with a combination of vectors
Using base R, we calculate the number of unique IndID values per SpeciesID and keep only those SpeciesID with at least 5 unique individuals.
df[ave(df$IndID, df$SpeciesID, FUN = function(x) length(unique(x))) >= 5, ]
# SpeciesID IndID
#6 100 14-005
#7 100 14-005
#8 100 14-005
#9 100 14-006
#10 100 14-007
#11 100 14-007
#12 100 14-008
#13 100 14-009
#14 500 16-001
#15 500 16-001
#16 500 16-002
#17 500 16-002
#18 500 16-002
#19 500 16-003
#20 500 16-003
#21 500 16-004
#22 500 16-004
#23 500 16-005
#24 500 16-006
#25 500 16-006
#26 500 16-007
length(unique(x)) can also be replaced by n_distinct from dplyr:
library(dplyr)
df[ave(df$IndID, df$SpeciesID, FUN = n_distinct) >= 5, ]
Or a complete, more verbose dplyr solution:
library(dplyr)
df %>%
  group_by(SpeciesID) %>%
  filter(n_distinct(IndID) >= 5)
Remove IDs that occur fewer than x times in R
You can use table like this:
df[df$names %in% names(table(df$names))[table(df$names) >= 5],]
R: remove multiple rows based on missing values in fewer rows
You can use the ddply function from the plyr package to 1) subset your data by id, 2) apply a function that returns NULL if the sub-data.frame contains NA in the columns of your choice, or the data.frame itself otherwise, and 3) concatenate everything back into a data.frame.
allData <- data.frame(id = rep(1:4, 3),
session = rep(1:3, each = 4),
measure1 = sample(c(NA, 1:11)),
measure2 = sample(c(NA, 1:11)),
measure3 = sample(c(NA, 1:11)),
measure4 = sample(c(NA, 1:11)))
allData
# id session measure1 measure2 measure3 measure4
# 1 1 1 3 7 10 6
# 2 2 1 4 4 9 9
# 3 3 1 6 6 7 10
# 4 4 1 1 5 2 3
# 5 1 2 NA NA 5 11
# 6 2 2 7 10 6 5
# 7 3 2 9 8 4 2
# 8 4 2 2 9 1 7
# 9 1 3 5 1 3 8
# 10 2 3 8 3 8 1
# 11 3 3 11 11 11 4
# 12 4 3 10 2 NA NA
# Which columns to check for NA's in
probeColumns = c('measure1','measure4')
library(plyr)
ddply(allData, "id",
function(df)if(any(is.na(df[, probeColumns]))) NULL else df)
# id session measure1 measure2 measure3 measure4
# 1 2 1 4 4 9 9
# 2 2 2 7 10 6 5
# 3 2 3 8 3 8 1
# 4 3 1 6 6 7 10
# 5 3 2 9 8 4 2
# 6 3 3 11 11 11 4
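plyr has since been superseded by dplyr; a sketch of the same group-wise NA filter, assuming dplyr >= 1.0.4 for if_any(). The data frame below is rebuilt from the values printed above (sample() is random, so the values are transcribed) to make the example reproducible:

```r
library(dplyr)

# Deterministic copy of the allData shown above.
allData <- data.frame(
  id       = rep(1:4, 3),
  session  = rep(1:3, each = 4),
  measure1 = c(3, 4, 6, 1, NA, 7, 9, 2, 5, 8, 11, 10),
  measure2 = c(7, 4, 6, 5, NA, 10, 8, 9, 1, 3, 11, 2),
  measure3 = c(10, 9, 7, 2, 5, 6, 4, 1, 3, 8, 11, NA),
  measure4 = c(6, 9, 10, 3, 11, 5, 2, 7, 8, 1, 4, NA)
)

probeColumns <- c("measure1", "measure4")

# if_any() flags rows with an NA in a probe column; all(!...) is
# evaluated per id, so a single NA drops every row for that id.
allData %>%
  group_by(id) %>%
  filter(all(!if_any(all_of(probeColumns), is.na))) %>%
  ungroup()
```

As above, ids 1 and 4 are dropped and the six rows for ids 2 and 3 remain.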
Remove duplicates based on second column
I think you're looking for something like this:
Example data:
> bind <- data.frame(ABN = rep(1:3, 3),
+ data.month = sample(1:12, 9),
+ other.inf = runif(9))
>
> bind
ABN data.month other.inf
1 1 10 0.8102867
2 2 4 0.2919716
3 3 8 0.3391790
4 1 2 0.3698933
5 2 6 0.9155280
6 3 1 0.2680165
7 1 9 0.7541168
8 2 7 0.2018796
9 3 11 0.1546079
Solution:
> bind %>%
+ group_by(ABN) %>%
+ filter(data.month == max(data.month))
# A tibble: 3 x 3
# Groups: ABN [3]
ABN data.month other.inf
<int> <int> <dbl>
1 1 10 0.810
2 2 7 0.202
3 3 11 0.155
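With dplyr >= 1.0, slice_max() is a tidier way to keep the row with the latest data.month per ABN (it also controls tie handling via its with_ties argument). The data below is rebuilt from the printout above so the sketch is reproducible:

```r
library(dplyr)

# Deterministic copy of the bind data shown above.
bind <- data.frame(
  ABN        = rep(1:3, 3),
  data.month = c(10L, 4L, 8L, 2L, 6L, 1L, 9L, 7L, 11L),
  other.inf  = c(0.8102867, 0.2919716, 0.3391790,
                 0.3698933, 0.9155280, 0.2680165,
                 0.7541168, 0.2018796, 0.1546079)
)

# Keep the row with the maximum data.month within each ABN.
bind %>%
  group_by(ABN) %>%
  slice_max(data.month, n = 1) %>%
  ungroup()
```

This returns one row per ABN: months 10, 7, and 11, matching the filter(data.month == max(data.month)) result.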
How can I remove rows where frequency of the value is less than 5? Python, Pandas
Global Counts
Use stack + value_counts + replace:
v = df[['Col2', 'Col3']]
df[v.replace(v.stack().value_counts()).gt(5).all(1)]
Col1 Col2 Col3 Col4
0 1 apple tomato banana
2 1 apple tomato banana
3 1 apple tomato banana
4 1 apple tomato banana
5 1 apple tomato banana
(Update)
Columnwise Counts
Call apply with pd.Series.value_counts on your columns of interest, and filter in the same manner as before:
v = df[['Col2', 'Col3']]
df[v.replace(v.apply(pd.Series.value_counts)).gt(5).all(1)]
Col1 Col2 Col3 Col4
0 1 apple tomato banana
2 1 apple tomato banana
3 1 apple tomato banana
4 1 apple tomato banana
5 1 apple tomato banana
Details
Use value_counts to count values in your dataframe:
c = v.apply(pd.Series.value_counts)
c
Col2 Col3
apple 6.0 NaN
grape 1.0 NaN
lemon 1.0 NaN
pear 1.0 NaN
potato NaN 1.0
tomato NaN 8.0
Call replace to replace values in the DataFrame with their counts:
i = v.replace(c)
i
Col2 Col3
0 6 8
1 6 1
2 6 8
3 6 8
4 6 8
5 6 8
6 1 8
7 1 8
8 1 8
From that point,
m = i.gt(5).all(1)
0 True
1 False
2 True
3 True
4 True
5 True
6 False
7 False
8 False
dtype: bool
Use the mask to index df.
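Putting it together as a runnable sketch: the original df is not shown, so the frame below is an assumption reconstructed from the counts above (six apples in Col2, eight tomatoes in Col3).

```python
import pandas as pd

# Hypothetical data matching the counts shown above.
df = pd.DataFrame({
    "Col1": [1] * 9,
    "Col2": ["apple"] * 6 + ["grape", "lemon", "pear"],
    "Col3": ["tomato"] * 8 + ["potato"],
    "Col4": ["banana"] * 9,
})

v = df[["Col2", "Col3"]]

# Replace each value with its global count, then keep rows where
# every count exceeds 5.
counts = v.stack().value_counts()        # apple -> 6, tomato -> 8, ...
m = v.replace(counts).gt(5).all(axis=1)
out = df[m]                              # rows 0-5 survive
```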