subset() a factor by its number of observations
You can use the table function as follows:
subset(df, table(FACTOR)[FACTOR] >= 3)
# FACTOR VALUE
# 1 ANTONIO 5
# 2 ANTONIO 8
# 3 ANTONIO 7
To help you understand, see what these return:
table(df$FACTOR)
table(df$FACTOR)[df$FACTOR]
table(df$FACTOR)[df$FACTOR] >= 3
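As a rough illustration of those three steps, here is a minimal sketch on a made-up data frame. The ANTONIO rows mirror the output shown above; the BRUNO and CARLA rows are assumptions added only so that some groups fall below the threshold.

```r
# Hypothetical data: only the ANTONIO rows come from the answer above
df <- data.frame(
  FACTOR = factor(c("ANTONIO", "ANTONIO", "ANTONIO", "BRUNO", "BRUNO", "CARLA")),
  VALUE  = c(5, 8, 7, 1, 2, 3)
)

# Step 1: one count per level
counts <- table(df$FACTOR)

# Step 2: index the counts by the factor, giving one count per row
per_row <- counts[df$FACTOR]

# Step 3: turn the per-row counts into a plain logical vector
keep <- as.vector(per_row >= 3)

# Keep only rows whose level occurs at least 3 times
result <- df[keep, ]
print(result)
```

Indexing the table by the factor works because the table is ordered by the factor's levels, so each row picks up the count of its own level.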
You could also use the ave function to compute the number of observations:
subset(df, ave(VALUE, FACTOR, FUN = length) >= 3)
This last method may be a little more flexible if you have multiple factors, as in your comment and updated question. You can do:
subset(df, ave(VALUE, NAME, CLASS, COLOR, FUN = length) >= 3)
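To see what ave returns with several grouping variables, the following sketch may help. The column names NAME, CLASS, COLOR and VALUE come from the call above, but the data values are invented for illustration.

```r
# Hypothetical data with three grouping columns
df <- data.frame(
  NAME  = c("a", "a", "a", "a", "b", "b"),
  CLASS = c("x", "x", "x", "y", "x", "x"),
  COLOR = c("red", "red", "red", "red", "red", "red"),
  VALUE = 1:6
)

# ave() applies FUN within each NAME/CLASS/COLOR combination and
# returns a vector as long as VALUE, so each row carries its group size
n_per_group <- ave(df$VALUE, df$NAME, df$CLASS, df$COLOR, FUN = length)

# Keep rows that belong to a combination with at least 3 observations
result <- df[n_per_group >= 3, ]
print(result)
```

Here only the combination a/x/red has three rows, so the other three rows are dropped.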
subsetting based on number of observations in a factor variable
Use table, subset that, and match based on the names of that subset. You will probably want to call droplevels afterwards.
EDIT
Some sample data:
set.seed(1234)
data <- data.frame(factor = factor(sample(10000:12999, 1000000,
TRUE, prob=rexp(3000))))
Some categories have very few cases:
> min(table(data$factor))
[1] 1
Remove the records whose value of factor occurs fewer than 100 times:
tbl <- table(data$factor)
data <- droplevels(data[data$factor %in% names(tbl)[tbl >= 100],,drop=FALSE])
Check:
> min(table(data$factor))
[1] 100
Note that data and factor are not very good names since they are also builtin functions.
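A small sketch of the same steps with names that do not shadow builtins may make the role of droplevels clearer. The names obs and grp, the data, and the threshold of 2 are all assumptions chosen to keep the example tiny.

```r
# Hypothetical data: level "b" occurs only once
obs <- data.frame(grp = factor(c("a", "a", "a", "b", "c", "c")))

# Count each level, then keep rows whose level occurs at least twice
tbl  <- table(obs$grp)
kept <- obs[obs$grp %in% names(tbl)[tbl >= 2], , drop = FALSE]

# The unused level "b" is still recorded in the factor at this point
levels_before <- levels(kept$grp)

# droplevels() removes levels that no longer occur in the data
kept <- droplevels(kept)
levels_after <- levels(kept$grp)

print(levels_before)
print(levels_after)
```

Without the droplevels call, a later table on the subset would still report the removed level with a count of zero.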
Subsetting a factor on amount of observations in R
Using the data.table package one gets
require(data.table)
setDT(pcol)
Find the authors with more than 100 occurrences
author_sel <- pcol[, .N, by = .(author)][N > 100]
pcol[author %in% author_sel$author]
Subsetting observations by factor levels with more than x observations
Consider building a boolean vector from your table call using Filter and isTRUE, and then use %in% in the subset argument:
boolean_vec <- Filter(isTRUE, table(DT$some_NA_factor) > 16)
boolean_vec
# 1 2 4 5
# TRUE TRUE TRUE TRUE
lm(Happiness ~ Income + some_NA_factor, data=DT,
subset=(Income > 50 & Happiness < 5 & some_NA_factor %in% names(boolean_vec)))
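The Filter step can be tried in isolation on a toy factor. The sketch below reuses the column name some_NA_factor from the answer, but the data and the threshold of 2 are made up; it also shows that rows where the factor is NA are dropped by %in%.

```r
# Hypothetical data: level "2" is rare and one value is missing
DT <- data.frame(some_NA_factor = factor(c(1, 1, 1, 2, NA, 3, 3, 3, 3)))

# Logical table: TRUE for levels with more than 2 observations
is_frequent <- table(DT$some_NA_factor) > 2

# Filter() keeps only the TRUE entries, names included
keep <- Filter(isTRUE, is_frequent)
print(names(keep))

# NA never matches a level name, so the NA row is excluded as well
kept_rows <- subset(DT, some_NA_factor %in% names(keep))
print(nrow(kept_rows))
```

The same set of names could also be obtained with names(is_frequent)[is_frequent]; Filter is simply the form used in the answer.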
R: Subset factor levels that co-occur with two levels from another factor
Here is one idea. You define groups with Gene. In each group, you want to check whether there is more than one unique value of Tissue.
library(dplyr)
group_by(df, Gene) %>%
filter(n_distinct(Tissue) >= 2)
Gene Tissue
<fct> <fct>
1 GeneA TissueA
2 GeneA TissueB
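For readers without dplyr, the same per-group check can be sketched in base R. This swaps the dplyr pipeline for ave and is not the answer's own method. The GeneA rows mirror the output above; the GeneB and GeneC rows are assumptions added so that some groups fail the check.

```r
# Hypothetical data: only the GeneA rows come from the output above
df <- data.frame(
  Gene   = factor(c("GeneA", "GeneA", "GeneB", "GeneC", "GeneC")),
  Tissue = factor(c("TissueA", "TissueB", "TissueA", "TissueB", "TissueB"))
)

# Number of distinct Tissue values within each Gene, repeated per row.
# Tissue is converted to integer codes so that ave() returns numbers.
n_tissues <- ave(as.integer(df$Tissue), df$Gene,
                 FUN = function(v) length(unique(v)))

# Keep genes that occur with at least two different tissues
result <- df[n_tissues >= 2, ]
print(result)
```

GeneC has two rows but only one distinct tissue, so it is dropped, which is the difference between counting rows and counting distinct values.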
Subset data frame based on number of rows per group
First, two base alternatives. One relies on table, and the other on ave and length. Then, two data.table ways.
1. table
tt <- table(df$name)
df2 <- subset(df, name %in% names(tt[tt < 3]))
# or
df2 <- df[df$name %in% names(tt[tt < 3]), ]
If you want to walk it through step by step:
# count each 'name', assign result to an object 'tt'
tt <- table(df$name)
# which 'name' in 'tt' occur fewer than three times?
# Result is a logical vector that can be used to subset the table 'tt'
tt < 3
# from the table, select 'name' that occur < 3 times
tt[tt < 3]
# ...their names
names(tt[tt < 3])
# rows of 'name' in the data frame that matches "the < 3 names"
# the result is a logical vector that can be used to subset the data frame 'df'
df$name %in% names(tt[tt < 3])
# subset data frame by a logical vector
# 'TRUE' rows are kept, 'FALSE' rows are removed.
# assign the result to a data frame with a new name
df2 <- subset(df, name %in% names(tt[tt < 3]))
# or
df2 <- df[df$name %in% names(tt[tt < 3]), ]
2. ave and length
As suggested by @flodel:
df[ave(df$x, df$name, FUN = length) < 3, ]
3. data.table: .N and .SD:
library(data.table)
setDT(df)[, if (.N < 3) .SD, by = name]
4. data.table: .N and .I:
setDT(df)
df[df[, .I[.N < 3], name]$V1]
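To compare the two data.table variants side by side, here is a minimal sketch on invented data. It assumes the data.table package is installed; the column names name and x follow the answer, the values are made up.

```r
library(data.table)

# Hypothetical data: "a" has 3 rows, "b" has 2, "c" has 1
dt <- data.table(name = c("a", "a", "a", "b", "b", "c"), x = 1:6)

# Variant 3: return the group's rows (.SD) only when the group is small
small_sd <- dt[, if (.N < 3) .SD, by = name]

# Variant 4: collect the row numbers (.I) of small groups, then index by them
row_ids <- dt[, .I[.N < 3], by = name]$V1
small_i <- dt[row_ids]

print(small_sd)
print(small_i)
```

Both variants keep the rows of "b" and "c" here. The .I form indexes the original table once, which is often mentioned as the faster option on large data, though that is not measured in this sketch.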
See also the related Q&A Count number of observations/rows per group and add result to data frame.