Remove IDs With Fewer Than 9 Unique Observations

We can use n_distinct to remove IDs with fewer than 9 unique observations:

library(dplyr)

df %>%
  group_by(ID) %>%
  filter(n_distinct(data.month) >= 9) %>%
  pull(ID) %>%
  unique()

#[1] 2 4

Or

df %>%
  group_by(ID) %>%
  filter(n_distinct(data.month) >= 9) %>%
  distinct(ID)

#     ID
#  <int>
#1     2
#2     4

For the unique counts of each ID:

df %>%
  group_by(ID) %>%
  summarise(count = n_distinct(data.month))

#     ID count
#  <int> <int>
#1     2    12
#2     4    12
#3     5     2
#4     7     1
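To drop the rows themselves rather than just list the surviving IDs, the same threshold can be applied without any package dependency. A sketch on made-up data (the IDs and month counts here are assumptions chosen to mirror the counts above):

```r
# Toy data (assumed): IDs 2 and 4 have 12 distinct months each, 5 and 7 fewer
df <- data.frame(ID         = rep(c(2, 4, 5, 7), times = c(12, 12, 2, 1)),
                 data.month = c(1:12, 1:12, 3, 7, 5))

# Keep only rows whose ID has at least 9 distinct months
keep    <- ave(df$data.month, df$ID, FUN = function(x) length(unique(x))) >= 9
trimmed <- df[keep, ]
unique(trimmed$ID)
# [1] 2 4
```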

How to remove individuals with fewer than 5 observations from a data frame

An example using group_by and filter from the dplyr package:

library(dplyr)
df <- data.frame(id = c(rep("a", 2), rep("b", 5), rep("c", 8)),
                 foo = runif(15))

> df
id foo
1 a 0.8717067
2 a 0.9086262
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223

df %>% group_by(id) %>% filter(n()>= 5) %>% ungroup()
Source: local data frame [13 x 2]

id foo
(fctr) (dbl)
1 b 0.9962453
2 b 0.8980123
3 b 0.1535324
4 b 0.2802848
5 b 0.9366375
6 c 0.8109557
7 c 0.6945285
8 c 0.1012925
9 c 0.6822955
10 c 0.3757085
11 c 0.7348635
12 c 0.3026395
13 c 0.9707223

or with base R:

> df[df$id %in% names(which(table(df$id)>=5)), ]
id foo
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223

Still in base R, using with is a more elegant way to do the very same thing:

df[with(df, id %in% names(which(table(id)>=5))), ]

or, since subset already evaluates its condition within df, more simply:

subset(df, id %in% names(which(table(id) >= 5)))

Remove groups with fewer than three unique observations

With data.table you could do:

library(data.table)
DT[, if (uniqueN(Day) >= 3) .SD, by = Group]

Here .SD is the subset of rows belonging to each Group; the if returns it only when the group has at least three unique Day values, so smaller groups contribute no rows. This gives:

   Group Day
1:     1   1
2:     1   3
3:     1   5
4:     1   5
5:     3   1
6:     3   2
7:     3   3

Or with dplyr:

library(dplyr)
DT %>%
  group_by(Group) %>%
  filter(n_distinct(Day) >= 3)

which gives the same result.
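For reproducibility, here is toy data consistent with the output above (a Group 2 with fewer than three unique Days is assumed to be the one that was dropped), together with the same filter in package-free R:

```r
# Toy data (assumed): Group 2 has only 2 distinct Days and should be dropped
DT <- data.frame(Group = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                 Day   = c(1, 3, 5, 5, 2, 2, 1, 2, 3))

# Unique Days per Group, then keep the qualifying groups
n_unique <- tapply(DT$Day, DT$Group, function(x) length(unique(x)))
res <- DT[DT$Group %in% names(n_unique)[n_unique >= 3], ]
```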

How do I remove all unique values with certain amount of observations?

We can group by 'subproduct' and keep only the groups whose number of observations (n()) is greater than or equal to 10:

library(dplyr)
dfone %>%
  group_by(subproduct) %>%
  filter(n() >= 10) %>%
  ungroup()

Or without any package dependency

subset(dfone, subproduct %in% names(which(table(subproduct) >= 10)))

Omit IDs which occur fewer than x times with a combination of vectors

Using base R, we calculate the number of unique IndID values per SpeciesID and keep only the SpeciesID that occur at least 5 times. Because ave returns a vector of the same type as its first argument, the character IndID makes the counts come back as character, so wrap the call in as.numeric before comparing:

df[as.numeric(ave(df$IndID, df$SpeciesID, FUN = function(x) length(unique(x)))) >= 5, ]

# SpeciesID IndID
#6 100 14-005
#7 100 14-005
#8 100 14-005
#9 100 14-006
#10 100 14-007
#11 100 14-007
#12 100 14-008
#13 100 14-009
#14 500 16-001
#15 500 16-001
#16 500 16-002
#17 500 16-002
#18 500 16-002
#19 500 16-003
#20 500 16-003
#21 500 16-004
#22 500 16-004
#23 500 16-005
#24 500 16-006
#25 500 16-006
#26 500 16-007

length(unique(x)) can also be replaced by n_distinct from dplyr (wrapped in as.numeric, since ave returns a vector of the same type as the character IndID):

library(dplyr)
df[as.numeric(ave(df$IndID, df$SpeciesID, FUN = n_distinct)) >= 5, ]

Or a complete, if more verbose, dplyr solution:

library(dplyr)
df %>%
  group_by(SpeciesID) %>%
  filter(n_distinct(IndID) >= 5)

Remove IDs that occur fewer than x times in R

You can use table like this:

df[df$names %in% names(table(df$names))[table(df$names) >= 5],]
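A quick sketch on made-up data (the column name names follows the answer; the values are assumptions) showing each piece of that expression:

```r
df <- data.frame(names = rep(c("ann", "bob", "cat"), times = c(6, 5, 2)))

tab     <- table(df$names)          # ann: 6, bob: 5, cat: 2
keep    <- names(tab)[tab >= 5]     # "ann" "bob"
trimmed <- df[df$names %in% keep, , drop = FALSE]
```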

R: remove multiple rows based on missing values in fewer rows

You can use the ddply function from the plyr package to 1) split your data by id, 2)
apply a function that returns NULL if the sub-data.frame contains NA in the columns of your choice, or the data.frame itself otherwise, and 3) recombine everything back into a data.frame.

allData <- data.frame(id       = rep(1:4, 3),
                      session  = rep(1:3, each = 4),
                      measure1 = sample(c(NA, 1:11)),
                      measure2 = sample(c(NA, 1:11)),
                      measure3 = sample(c(NA, 1:11)),
                      measure4 = sample(c(NA, 1:11)))
allData
# id session measure1 measure2 measure3 measure4
# 1 1 1 3 7 10 6
# 2 2 1 4 4 9 9
# 3 3 1 6 6 7 10
# 4 4 1 1 5 2 3
# 5 1 2 NA NA 5 11
# 6 2 2 7 10 6 5
# 7 3 2 9 8 4 2
# 8 4 2 2 9 1 7
# 9 1 3 5 1 3 8
# 10 2 3 8 3 8 1
# 11 3 3 11 11 11 4
# 12 4 3 10 2 NA NA

# Which columns to check for NAs in
probeColumns <- c('measure1', 'measure4')

library(plyr)
ddply(allData, "id",
      function(df) if (any(is.na(df[, probeColumns]))) NULL else df)
# id session measure1 measure2 measure3 measure4
# 1 2 1 4 4 9 9
# 2 2 2 7 10 6 5
# 3 2 3 8 3 8 1
# 4 3 1 6 6 7 10
# 5 3 2 9 8 4 2
# 6 3 3 11 11 11 4
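The same per-id screening can also be done without plyr. A base R sketch, using a small deterministic frame in place of the sampled one (the column names echo the example above):

```r
allData <- data.frame(id       = rep(1:3, each = 2),
                      measure1 = c(1, 2, NA, 4, 5, 6),
                      measure4 = c(7, 8, 9, 10, NA, 12))
probeColumns <- c("measure1", "measure4")

# Flag rows with an NA in the probe columns, then drop every row of a flagged id
bad   <- rowSums(is.na(allData[probeColumns])) > 0
keep  <- !ave(bad, allData$id, FUN = any)
clean <- allData[keep, ]   # only id 1 has no NA anywhere
```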

Remove duplicates based on second column

I think you're looking for something like this:

Example data:

> bind <- data.frame(ABN = rep(1:3, 3),
+                    data.month = sample(1:12, 9),
+                    other.inf = runif(9))
>
> bind
  ABN data.month other.inf
1   1         10 0.8102867
2   2          4 0.2919716
3   3          8 0.3391790
4   1          2 0.3698933
5   2          6 0.9155280
6   3          1 0.2680165
7   1          9 0.7541168
8   2          7 0.2018796
9   3         11 0.1546079

Solution:

> bind %>%
+   group_by(ABN) %>%
+   filter(data.month == max(data.month))
# A tibble: 3 x 3
# Groups:   ABN [3]
    ABN data.month other.inf
  <int>      <int>     <dbl>
1     1         10     0.810
2     2          7     0.202
3     3         11     0.155
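A base R equivalent, keeping each ABN's single row with the largest data.month, sketched on freshly simulated data of the same shape (the seed is an arbitrary assumption, fixed only for reproducibility):

```r
set.seed(42)
bind <- data.frame(ABN        = rep(1:3, 3),
                   data.month = sample(1:12, 9),
                   other.inf  = runif(9))

# For each ABN, pick the row with the maximum data.month
latest <- do.call(rbind, lapply(split(bind, bind$ABN),
                                function(d) d[which.max(d$data.month), ]))
```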

How can I remove rows where frequency of the value is less than 5? Python, Pandas

Global Counts

Use stack + value_counts + replace -

v = df[['Col2', 'Col3']]
df[v.replace(v.stack().value_counts()).gt(5).all(1)]

Col1 Col2 Col3 Col4
0 1 apple tomato banana
2 1 apple tomato banana
3 1 apple tomato banana
4 1 apple tomato banana
5 1 apple tomato banana

(Update)

Columnwise Counts

Call apply with pd.Series.value_counts on your columns of interest, and filter in the same manner as before -

v = df[['Col2', 'Col3']]
df[v.replace(v.apply(pd.Series.value_counts)).gt(5).all(1)]

Col1 Col2 Col3 Col4
0 1 apple tomato banana
2 1 apple tomato banana
3 1 apple tomato banana
4 1 apple tomato banana
5 1 apple tomato banana

Details

Use value_counts to count values in your dataframe -

c = v.apply(pd.Series.value_counts)
c

Col2 Col3
apple 6.0 NaN
grape 1.0 NaN
lemon 1.0 NaN
pear 1.0 NaN
potato NaN 1.0
tomato NaN 8.0

Call replace to substitute each value in the DataFrame with its count -

i = v.replace(c)
i

Col2 Col3
0 6 8
1 6 1
2 6 8
3 6 8
4 6 8
5 6 8
6 1 8
7 1 8
8 1 8

From that point,

m = i.gt(5).all(1)

0 True
1 False
2 True
3 True
4 True
5 True
6 False
7 False
8 False
dtype: bool

Use the mask to index df.
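Putting that final step together, a self-contained sketch; the frame below is reconstructed to be consistent with the counts shown above (an assumption, with Col1/Col4 as fillers), and it builds the same mask with map rather than replace:

```python
import pandas as pd

# Toy frame assumed to match the walkthrough (apple x6 in Col2, tomato x8 in Col3)
df = pd.DataFrame({
    'Col1': 1,
    'Col2': ['apple'] * 6 + ['grape', 'lemon', 'pear'],
    'Col3': ['tomato', 'potato'] + ['tomato'] * 7,
    'Col4': 'banana',
})

v = df[['Col2', 'Col3']]

# Map every value to its within-column count, then require all counts > 5
m = v.apply(lambda s: s.map(s.value_counts())).gt(5).all(1)

filtered = df[m]   # rows 0, 2, 3, 4, 5 survive
```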


