Subset Data Frame Based on Number of Rows Per Group

Subset data frame based on number of rows per group

First, two base alternatives. One relies on table, and the other on ave and length. Then, two data.table ways.


1. table

tt <- table(df$name)

df2 <- subset(df, name %in% names(tt[tt < 3]))
# or
df2 <- df[df$name %in% names(tt[tt < 3]), ]

If you want to walk it through step by step:

# count each 'name', assign result to an object 'tt'
tt <- table(df$name)

# which 'name' in 'tt' occur more than three times?
# Result is a logical vector that can be used to subset the table 'tt'
tt < 3

# from the table, select 'name' that occur < 3 times
tt[tt < 3]

# ...their names
names(tt[tt < 3])

# rows of 'name' in the data frame that matches "the < 3 names"
# the result is a logical vector that can be used to subset the data frame 'df'
df$name %in% names(tt[tt < 3])

# subset data frame by a logical vector
# 'TRUE' rows are kept, 'FALSE' rows are removed.
# assign the result to a data frame with a new name
df2 <- subset(df, name %in% names(tt[tt < 3]))
# or
df2 <- df[df$name %in% names(tt[tt < 3]), ]

2. ave and length

As suggested by @flodel:

df[ave(df$x, df$name, FUN = length) < 3, ]

3. data.table: .N and .SD:

library(data.table)
setDT(df)[, if (.N < 3) .SD, by = name]

4. data.table: .N and .I:

setDT(df)
df[df[, .I[.N < 3], name]$V1]

See also the related Q&A Count number of observations/rows per group and add result to data frame.

count number of rows in a data frame in R based on group

Here's an example that shows how table(.) (or, more closely matching your desired output, data.frame(table(.)) does what it sounds like you are asking for.

Note also how to share reproducible sample data in a way that others can copy and paste into their session.

Here's the (reproducible) sample data:

mydf <- structure(list(ID = c(110L, 111L, 121L, 131L, 141L), 
MONTH.YEAR = c("JAN. 2012", "JAN. 2012",
"FEB. 2012", "FEB. 2012",
"MAR. 2012"),
VALUE = c(1000L, 2000L, 3000L, 4000L, 5000L)),
.Names = c("ID", "MONTH.YEAR", "VALUE"),
class = "data.frame", row.names = c(NA, -5L))

mydf
# ID MONTH.YEAR VALUE
# 1 110 JAN. 2012 1000
# 2 111 JAN. 2012 2000
# 3 121 FEB. 2012 3000
# 4 131 FEB. 2012 4000
# 5 141 MAR. 2012 5000

Here's the calculation of the number of rows per group, in two output display formats:

table(mydf$MONTH.YEAR)
#
# FEB. 2012 JAN. 2012 MAR. 2012
# 2 2 1

data.frame(table(mydf$MONTH.YEAR))
# Var1 Freq
# 1 FEB. 2012 2
# 2 JAN. 2012 2
# 3 MAR. 2012 1

Subset a dataframe - grouped and multiple values R

Please do not add data as images, provide data in a reproducible format

You can select the Shop where both 0 and 1 are present.

library(dplyr)

df %>%
group_by(Shop) %>%
filter(all(c(0, 1) %in% Auto)) %>%
ungroup

# Shop Order Auto
# <dbl> <dbl> <dbl>
#1 1 1 0
#2 1 2 0
#3 1 3 1

data

df <- structure(list(Shop = c(1, 1, 1, 2, 2, 2, 3, 3, 3), Order = c(1, 
2, 3, 1, 2, 3, 1, 2, 3), Auto = c(0, 0, 1, 1, 1, 1, 0, 0, 0)),
class = "data.frame", row.names = c(NA, -9L))

Subsetting a data frame based on a minimum number of responses per group of measures

We may use if_any - loop over the 'Measure1', columns, check for non-NA elements (complete.cases) and (&) loop separately over the 'Measure2', do the same, both of the conditions return a single TRUE/FALSE with if_any, which will be TRUE only if both are TRUE i.e. if there is at least one non-NA in both sets of columns

library(dplyr)
df %>%
filter(if_any(ends_with('Measure1'), complete.cases ) &
if_any(ends_with('Measure2'), complete.cases))

-output

 tester_ID  Phase Item1Measure1 Item2Measure1 Item3Measure1 Item1Measure2 Item2Measure2 Item3Measure2
1 A1 Phase1 5 5 NA NA 5 5
2 A3 Phase1 3 NA NA NA NA 3
3 A4 Phase1 4 NA NA NA 4 4
4 A5 Phase1 4 4 4 NA 4 4
5 A6 Phase1 1 1 1 1 1 1
6 A7 Phase1 4 NA NA NA 4 4
7 A1 Phase2 4 4 4 4 NA NA
8 A2 Phase2 5 5 5 5 5 5
9 A3 Phase2 NA NA 1 NA 2 NA
10 A4 Phase2 NA NA 3 NA 4 NA
11 A5 Phase2 NA 3 5 NA 1 NA
12 A7 Phase2 NA 1 NA NA 4 NA

dplyr: subset rows based on count of a column

You can do this:

library(dplyr)

df <- planes %>%
dplyr::group_by(manufacturer) %>%
dplyr::filter(n() > 10)

Summary of Output

df %>% count(manufacturer)

# A tibble: 8 × 2
# Groups: manufacturer [8]
manufacturer n
<chr> <int>
1 AIRBUS 336
2 AIRBUS INDUSTRIE 400
3 BOEING 1630
4 BOMBARDIER INC 368
5 EMBRAER 299
6 MCDONNELL DOUGLAS 120
7 MCDONNELL DOUGLAS AIRCRAFT CO 103
8 MCDONNELL DOUGLAS CORPORATION 14

subset a dataframe based on a matrix of row numbers and save the result in one list

You could try it something like this. The result should be of length ncol(index) and each element should hold two list elements, training and testing datasets each.

apply(index, MARGIN = 2, FUN = function(x, data) {
# is is "demoted" from a column to a vector
list(train = data[x, ], test = data[-x, ])
}, data = df)

regex to subset rows with a certain range in R data frame

You could use this.

d[grepl("\\d{5}([4-5][0-9]|[3][1-9])", d$V1), ]
# V1 x
# 5 1988131 0.4214841
# 6 1988140 0.6267195
# 7 1988150 0.9193493
# 16 1989140 0.4670599
# 17 1989150 0.7745778
# 18 1989159 0.5258432
# 23 2012250 0.6576061

Alternatively as raw string.

d[grepl(r"{\d{5}([4-5][0-9]|[3][1-9])}", d$V1), ]

See demo.



d <- structure(list(V1 = c(1988100L, 1988110L, 1988120L, 1988130L, 
1988131L, 1988140L, 1988150L, 1988160L, 1988170L, 1988180L, 1988190L,
1989100L, 1989110L, 1989120L, 1989130L, 1989140L, 1989150L, 1989159L,
1989160L, 1989170L, 1989180L, 1989190L, 2012250L, 2012260L, 2012270L,
2012280L, 2012290L), x = c(0.161905547836795, 0.995759425219148,
0.0691612137015909, 0.352196655003354, 0.421484081307426, 0.626719528576359,
0.919349306030199, 0.273645391920581, 0.859082776354626, 0.479915214236826,
0.703950216760859, 0.738272070884705, 0.51616885769181, 0.625624869018793,
0.129869017750025, 0.467059890506789, 0.77457784046419, 0.525843213545159,
0.568261774023995, 0.995282813441008, 0.195851961849257, 0.188985655317083,
0.65760614303872, 0.965418175794184, 0.513075938913971, 0.288330526091158,
0.833630389068276)), class = "data.frame", row.names = c(NA,
-27L))


Related Topics



Leave a reply



Submit