Subset data frame based on number of rows per group
First, two base
alternatives. One relies on table
, and the other on ave
and length
. Then, two data.table
ways.
1. table
tt <- table(df$name)
df2 <- subset(df, name %in% names(tt[tt < 3]))
# or
df2 <- df[df$name %in% names(tt[tt < 3]), ]
If you want to walk it through step by step:
# count each 'name', assign result to an object 'tt'
tt <- table(df$name)
# which 'name' in 'tt' occur more than three times?
# Result is a logical vector that can be used to subset the table 'tt'
tt < 3
# from the table, select 'name' that occur < 3 times
tt[tt < 3]
# ...their names
names(tt[tt < 3])
# rows of 'name' in the data frame that matches "the < 3 names"
# the result is a logical vector that can be used to subset the data frame 'df'
df$name %in% names(tt[tt < 3])
# subset data frame by a logical vector
# 'TRUE' rows are kept, 'FALSE' rows are removed.
# assign the result to a data frame with a new name
df2 <- subset(df, name %in% names(tt[tt < 3]))
# or
df2 <- df[df$name %in% names(tt[tt < 3]), ]
2. ave
and length
As suggested by @flodel:
df[ave(df$x, df$name, FUN = length) < 3, ]
3. data.table
: .N
and .SD
:
library(data.table)
setDT(df)[, if (.N < 3) .SD, by = name]
4. data.table
: .N
and .I
:
setDT(df)
df[df[, .I[.N < 3], name]$V1]
See also the related Q&A Count number of observations/rows per group and add result to data frame.
count number of rows in a data frame in R based on group
Here's an example that shows how table(.)
(or, more closely matching your desired output, data.frame(table(.))
does what it sounds like you are asking for.
Note also how to share reproducible sample data in a way that others can copy and paste into their session.
Here's the (reproducible) sample data:
mydf <- structure(list(ID = c(110L, 111L, 121L, 131L, 141L),
MONTH.YEAR = c("JAN. 2012", "JAN. 2012",
"FEB. 2012", "FEB. 2012",
"MAR. 2012"),
VALUE = c(1000L, 2000L, 3000L, 4000L, 5000L)),
.Names = c("ID", "MONTH.YEAR", "VALUE"),
class = "data.frame", row.names = c(NA, -5L))
mydf
# ID MONTH.YEAR VALUE
# 1 110 JAN. 2012 1000
# 2 111 JAN. 2012 2000
# 3 121 FEB. 2012 3000
# 4 131 FEB. 2012 4000
# 5 141 MAR. 2012 5000
Here's the calculation of the number of rows per group, in two output display formats:
table(mydf$MONTH.YEAR)
#
# FEB. 2012 JAN. 2012 MAR. 2012
# 2 2 1
data.frame(table(mydf$MONTH.YEAR))
# Var1 Freq
# 1 FEB. 2012 2
# 2 JAN. 2012 2
# 3 MAR. 2012 1
Subset a dataframe - grouped and multiple values R
Please do not add data as images, provide data in a reproducible format
You can select the Shop
where both 0 and 1 are present.
library(dplyr)
df %>%
group_by(Shop) %>%
filter(all(c(0, 1) %in% Auto)) %>%
ungroup
# Shop Order Auto
# <dbl> <dbl> <dbl>
#1 1 1 0
#2 1 2 0
#3 1 3 1
data
df <- structure(list(Shop = c(1, 1, 1, 2, 2, 2, 3, 3, 3), Order = c(1,
2, 3, 1, 2, 3, 1, 2, 3), Auto = c(0, 0, 1, 1, 1, 1, 0, 0, 0)),
class = "data.frame", row.names = c(NA, -9L))
Subsetting a data frame based on a minimum number of responses per group of measures
We may use if_any
- loop over the 'Measure1', columns, check for non-NA elements (complete.cases
) and (&
) loop separately over the 'Measure2', do the same, both of the conditions return a single TRUE/FALSE
with if_any
, which will be TRUE only if both are TRUE i.e. if there is at least one non-NA in both sets of columns
library(dplyr)
df %>%
filter(if_any(ends_with('Measure1'), complete.cases ) &
if_any(ends_with('Measure2'), complete.cases))
-output
tester_ID Phase Item1Measure1 Item2Measure1 Item3Measure1 Item1Measure2 Item2Measure2 Item3Measure2
1 A1 Phase1 5 5 NA NA 5 5
2 A3 Phase1 3 NA NA NA NA 3
3 A4 Phase1 4 NA NA NA 4 4
4 A5 Phase1 4 4 4 NA 4 4
5 A6 Phase1 1 1 1 1 1 1
6 A7 Phase1 4 NA NA NA 4 4
7 A1 Phase2 4 4 4 4 NA NA
8 A2 Phase2 5 5 5 5 5 5
9 A3 Phase2 NA NA 1 NA 2 NA
10 A4 Phase2 NA NA 3 NA 4 NA
11 A5 Phase2 NA 3 5 NA 1 NA
12 A7 Phase2 NA 1 NA NA 4 NA
dplyr: subset rows based on count of a column
You can do this:
library(dplyr)
df <- planes %>%
dplyr::group_by(manufacturer) %>%
dplyr::filter(n() > 10)
Summary of Output
df %>% count(manufacturer)
# A tibble: 8 × 2
# Groups: manufacturer [8]
manufacturer n
<chr> <int>
1 AIRBUS 336
2 AIRBUS INDUSTRIE 400
3 BOEING 1630
4 BOMBARDIER INC 368
5 EMBRAER 299
6 MCDONNELL DOUGLAS 120
7 MCDONNELL DOUGLAS AIRCRAFT CO 103
8 MCDONNELL DOUGLAS CORPORATION 14
subset a dataframe based on a matrix of row numbers and save the result in one list
You could try it something like this. The result should be of length ncol(index)
and each element should hold two list elements, training and testing datasets each.
apply(index, MARGIN = 2, FUN = function(x, data) {
# is is "demoted" from a column to a vector
list(train = data[x, ], test = data[-x, ])
}, data = df)
regex to subset rows with a certain range in R data frame
You could use this.
d[grepl("\\d{5}([4-5][0-9]|[3][1-9])", d$V1), ]
# V1 x
# 5 1988131 0.4214841
# 6 1988140 0.6267195
# 7 1988150 0.9193493
# 16 1989140 0.4670599
# 17 1989150 0.7745778
# 18 1989159 0.5258432
# 23 2012250 0.6576061
Alternatively as raw string.
d[grepl(r"{\d{5}([4-5][0-9]|[3][1-9])}", d$V1), ]
See demo.
d <- structure(list(V1 = c(1988100L, 1988110L, 1988120L, 1988130L,
1988131L, 1988140L, 1988150L, 1988160L, 1988170L, 1988180L, 1988190L,
1989100L, 1989110L, 1989120L, 1989130L, 1989140L, 1989150L, 1989159L,
1989160L, 1989170L, 1989180L, 1989190L, 2012250L, 2012260L, 2012270L,
2012280L, 2012290L), x = c(0.161905547836795, 0.995759425219148,
0.0691612137015909, 0.352196655003354, 0.421484081307426, 0.626719528576359,
0.919349306030199, 0.273645391920581, 0.859082776354626, 0.479915214236826,
0.703950216760859, 0.738272070884705, 0.51616885769181, 0.625624869018793,
0.129869017750025, 0.467059890506789, 0.77457784046419, 0.525843213545159,
0.568261774023995, 0.995282813441008, 0.195851961849257, 0.188985655317083,
0.65760614303872, 0.965418175794184, 0.513075938913971, 0.288330526091158,
0.833630389068276)), class = "data.frame", row.names = c(NA,
-27L))
Related Topics
Adding a New Column Based Upon Values in Another Column Using Dplyr
Select the N Most Frequent Values in a Variable
Ggplot2: Setting Geom_Bar Baseline to 1 Instead of Zero
Count Number of Rows Per Group and Add Result to Original Data Frame
Select the Row With the Maximum Value in Each Group
How to Prevent Ifelse() from Turning Date Objects into Numeric Objects
Extract the Maximum Value Within Each Group in a Dataframe
Interpreting "Condition Has Length ≫ 1" Warning from 'If' Function
Combine (Rbind) Data Frames and Create Column With Name of Original Data Frames
Divide All Columns by the Value from the 2Nd Column - Apply for All Rows
Remove Quotes from a Character Vector in R
How to Add a Diagonal Line to a Plot
Minimum (Or Maximum) Value of Each Row Across Multiple Columns
Converting Year and Month ("Yyyy-Mm" Format) to a Date
Subset Rows Corresponding to Max Value by Group Using Data.Table