subset() a Factor by Its Number of Observations

subset() a factor by its number of observations

You can use the table function as follows:

subset(df, table(FACTOR)[FACTOR] >= 3)
# FACTOR VALUE
# 1 ANTONIO 5
# 2 ANTONIO 8
# 3 ANTONIO 7

To help you understand, see what these return:

table(df$FACTOR)
table(df$FACTOR)[df$FACTOR]
table(df$FACTOR)[df$FACTOR] >= 3
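
A small worked example may make these three expressions concrete. The data frame below is invented for illustration, since the original df is not shown; only the column names FACTOR and VALUE come from the answer.

```r
# Hypothetical sample data; the original df is not shown in the answer.
df <- data.frame(
  FACTOR = factor(c("ANTONIO", "ANTONIO", "ANTONIO", "MARIA", "MARIA")),
  VALUE  = c(5, 8, 7, 2, 4)
)

# 1. Number of rows per level
counts <- table(df$FACTOR)

# 2. Indexing the table by the factor repeats each level's count once per row
per_row <- counts[df$FACTOR]

# 3. Logical vector with one element per row, usable as a row filter
keep <- as.vector(per_row >= 3)

result <- df[keep, ]
print(result)
```

The third expression is what subset() receives: rows where it is TRUE are kept.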

You could also use the ave function to compute the number of observations:

subset(df, ave(VALUE, FACTOR, FUN = length) >= 3)

This last method may be a little more flexible if you have multiple factors, as you asked in your comment and updated question. You can do:

subset(df, ave(VALUE, NAME, CLASS, COLOR, FUN = length) >= 3)
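
As a rough sketch of how the multi-factor version behaves, assuming a data frame with the columns NAME, CLASS, COLOR and VALUE (the values below are made up):

```r
# Hypothetical data; column names follow the answer, values are invented.
df <- data.frame(
  NAME  = c("A", "A", "A", "A", "B", "B", "B"),
  CLASS = c("x", "x", "x", "y", "x", "x", "x"),
  COLOR = c("red", "red", "red", "red", "blue", "blue", "red"),
  VALUE = c(1, 2, 3, 4, 5, 6, 7)
)

# For every row, ave() returns the size of the NAME/CLASS/COLOR
# combination that the row belongs to
group_size <- ave(df$VALUE, df$NAME, df$CLASS, df$COLOR, FUN = length)

# Keep only rows whose combination occurs at least 3 times
result <- subset(df, group_size >= 3)
print(result)
```

Here only the A/x/red combination has three rows, so only those rows survive.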

subsetting based on number of observations in a factor variable

Build a table of counts, subset it to the levels you want to keep, and match rows against the names of that subset. You will probably want to call droplevels afterwards.


EDIT

Some sample data:

set.seed(1234)
data <- data.frame(factor = factor(sample(10000:12999, 1000000,
TRUE, prob=rexp(3000))))

Some categories have very few cases:

> min(table(data$factor))
[1] 1

Remove the records whose value of factor occurs fewer than 100 times:

tbl <- table(data$factor)
data <- droplevels(data[data$factor %in% names(tbl)[tbl >= 100],,drop=FALSE])

Check:

> min(table(data$factor))
[1] 100
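
To see why droplevels matters here, a scaled-down sketch with a hypothetical six-row data frame (d and f are illustrative names, not from the answer) shows that the filtered-out level would otherwise linger with a count of zero:

```r
# Hypothetical small factor: "a" occurs 3 times, "b" once, "c" twice
d <- data.frame(f = factor(c("a", "a", "a", "b", "c", "c")))

tbl <- table(d$f)
kept <- d[d$f %in% names(tbl)[tbl >= 2], , drop = FALSE]

# Without droplevels(), "b" is still a level, now with zero rows
levels_before <- levels(kept$f)
print(table(kept$f))

# droplevels() removes the now-empty level
kept <- droplevels(kept)
levels_after <- levels(kept$f)
print(table(kept$f))
```

Without that step, min(table(...)) on the filtered data would report 0 rather than the smallest retained group size.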

Note that data and factor are not very good names, since they are also the names of built-in functions.

Subsetting a factor on amount of observations in R

Using the data.table package one gets

library(data.table)
setDT(pcol)

Find the authors with more than 100 occurrences

author_sel <- pcol[, .N, by = .(author)][N > 100]
pcol[author %in% author_sel$author]

Subsetting observations by factor levels with more than x observations

Consider building a boolean vector from your table call using Filter and isTRUE, and then use %in% in the subset argument:

boolean_vec <- Filter(isTRUE, table(DT$some_NA_factor) > 16)
boolean_vec
# 1 2 4 5
# TRUE TRUE TRUE TRUE

lm(Happiness ~ Income + some_NA_factor, data=DT,
subset=(Income > 50 & Happiness < 5 & some_NA_factor %in% names(boolean_vec)))
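
For comparison, here is a minimal sketch on a made-up factor (f is a hypothetical name) suggesting that the Filter approach and direct logical indexing of the table names select the same levels:

```r
# Hypothetical factor: level "2" occurs only once, so it should be excluded
f <- factor(c(1, 1, 1, 2, 3, 3))
tt <- table(f)

# Approach from the answer: keep only the TRUE entries of the comparison
boolean_vec <- Filter(isTRUE, tt > 1)

# Direct indexing of the level names by the same comparison
direct <- names(tt)[tt > 1]

print(names(boolean_vec))
print(direct)
```

Either vector of names can then be used on the right-hand side of %in%.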

R: Subset factor levels that co-occur with two levels from another factor

Here is one idea. You define groups with Gene. In each group, you want to check if there is more than one unique value.

library(dplyr)

group_by(df, Gene) %>%
  filter(n_distinct(Tissue) >= 2)

Gene Tissue
<fct> <fct>
1 GeneA TissueA
2 GeneA TissueB
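
If dplyr is not available, the same idea can be sketched in base R with ave. This is an alternative to the answer's method, not a restatement of it, and the sample data below is assumed because the original df is not shown:

```r
# Hypothetical data: GeneA appears in two tissues, GeneB in only one
df <- data.frame(
  Gene   = factor(c("GeneA", "GeneA", "GeneB")),
  Tissue = factor(c("TissueA", "TissueB", "TissueA"))
)

# For every row, count the distinct Tissue values within its Gene group
n_tissues <- ave(as.integer(df$Tissue), df$Gene,
                 FUN = function(x) length(unique(x)))

# Keep genes observed in at least two different tissues
result <- df[n_tissues >= 2, ]
print(result)
```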

Subset data frame based on number of rows per group

First, two base alternatives. One relies on table, and the other on ave and length. Then, two data.table ways.


1. table

tt <- table(df$name)

df2 <- subset(df, name %in% names(tt[tt < 3]))
# or
df2 <- df[df$name %in% names(tt[tt < 3]), ]

If you want to walk it through step by step:

# count each 'name', assign result to an object 'tt'
tt <- table(df$name)

# which 'name' in 'tt' occur fewer than three times?
# Result is a logical vector that can be used to subset the table 'tt'
tt < 3

# from the table, select 'name' that occur < 3 times
tt[tt < 3]

# ...their names
names(tt[tt < 3])

# rows of the data frame whose 'name' matches one of "the < 3 names"
# the result is a logical vector that can be used to subset the data frame 'df'
df$name %in% names(tt[tt < 3])

# subset data frame by a logical vector
# 'TRUE' rows are kept, 'FALSE' rows are removed.
# assign the result to a data frame with a new name
df2 <- subset(df, name %in% names(tt[tt < 3]))
# or
df2 <- df[df$name %in% names(tt[tt < 3]), ]

2. ave and length

As suggested by @flodel:

df[ave(df$x, df$name, FUN = length) < 3, ]
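
As a quick sanity check, the two base approaches can be compared on a toy data frame. The data is hypothetical; the column names name and x follow the snippets above:

```r
# Hypothetical data; 'name' and 'x' follow the column names used above
df <- data.frame(
  name = c("a", "a", "a", "b", "b", "c"),
  x    = c(10, 20, 30, 40, 50, 60)
)

# Method 1: table() plus %in%
tt <- table(df$name)
by_table <- df[df$name %in% names(tt[tt < 3]), ]

# Method 2: ave() with length
by_ave <- df[ave(df$x, df$name, FUN = length) < 3, ]

print(by_table)
print(identical(by_table, by_ave))
```

Both keep the rows for the names that occur fewer than three times.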

3. data.table: .N and .SD:

library(data.table)
setDT(df)[, if (.N < 3) .SD, by = name]

4. data.table: .N and .I:

setDT(df)
df[df[, .I[.N < 3], name]$V1]

See also the related Q&A Count number of observations/rows per group and add result to data frame.


