Subset Data Frame Based on Number of Rows Per Group

Subset data frame based on number of rows per group

First, two base alternatives. One relies on table, and the other on ave and length. Then, two data.table ways.

1. `table`

tt <- table(df$name)

df2 <- subset(df, name %in% names(tt[tt < 3]))
# or
df2 <- df[df$name %in% names(tt[tt < 3]), ]

If you want to walk it through step by step:

# count each 'name', assign result to an object 'tt'
tt <- table(df$name)

# which 'name' in 'tt' occur more than three times?
# Result is a logical vector that can be used to subset the table 'tt'
tt < 3

# from the table, select 'name' that occur < 3 times
tt[tt < 3]

# ...their names
names(tt[tt < 3])

# rows of 'name' in the data frame that matches "the < 3 names"
# the result is a logical vector that can be used to subset the data frame 'df'
df$name %in% names(tt[tt < 3])

# subset data frame by a logical vector
# 'TRUE' rows are kept, 'FALSE' rows are removed.
# assign the result to a data frame with a new name
df2 <- subset(df, name %in% names(tt[tt < 3]))
# or
df2 <- df[df$name %in% names(tt[tt < 3]), ]

2. `ave` and `length`

As suggested by @flodel:

df[ave(df$x, df$name, FUN = length) < 3, ]

3. `data.table`: `.N` and `.SD`:

library(data.table)
setDT(df)[, if (.N < 3) .SD, by = name]

4. `data.table`: `.N` and `.I`:

setDT(df)
df[df[, .I[.N < 3], name]$V1]

See also the related Q&A Count number of observations/rows per group and add result to data frame.

count number of rows in a data frame in R based on group

Here's an example that shows how table(.) (or, more closely matching your desired output, data.frame(table(.)) does what it sounds like you are asking for.

Note also how to share reproducible sample data in a way that others can copy and paste into their session.

Here's the (reproducible) sample data:

mydf <- structure(list(ID = c(110L, 111L, 121L, 131L, 141L), 
                       MONTH.YEAR = c("JAN. 2012", "JAN. 2012", 
                                      "FEB. 2012", "FEB. 2012", 
                                      "MAR. 2012"), 
                       VALUE = c(1000L, 2000L, 3000L, 4000L, 5000L)), 
                  .Names = c("ID", "MONTH.YEAR", "VALUE"), 
                  class = "data.frame", row.names = c(NA, -5L))

mydf
#    ID MONTH.YEAR VALUE
# 1 110  JAN. 2012  1000
# 2 111  JAN. 2012  2000
# 3 121  FEB. 2012  3000
# 4 131  FEB. 2012  4000
# 5 141  MAR. 2012  5000

Here's the calculation of the number of rows per group, in two output display formats:

table(mydf$MONTH.YEAR)
# 
# FEB. 2012 JAN. 2012 MAR. 2012 
#         2         2         1

data.frame(table(mydf$MONTH.YEAR))
#        Var1 Freq
# 1 FEB. 2012    2
# 2 JAN. 2012    2
# 3 MAR. 2012    1

Subset a dataframe - grouped and multiple values R

Please do not add data as images, provide data in a reproducible format

You can select the Shop where both 0 and 1 are present.

library(dplyr)

df %>%
  group_by(Shop) %>%
  filter(all(c(0, 1) %in% Auto)) %>%
  ungroup

#   Shop Order  Auto
#  <dbl> <dbl> <dbl>
#1     1     1     0
#2     1     2     0
#3     1     3     1

data

df <- structure(list(Shop = c(1, 1, 1, 2, 2, 2, 3, 3, 3), Order = c(1, 
2, 3, 1, 2, 3, 1, 2, 3), Auto = c(0, 0, 1, 1, 1, 1, 0, 0, 0)), 
class = "data.frame", row.names = c(NA, -9L))

Subsetting a data frame based on a minimum number of responses per group of measures

We may use if_any - loop over the 'Measure1', columns, check for non-NA elements (complete.cases) and (&) loop separately over the 'Measure2', do the same, both of the conditions return a single TRUE/FALSE with if_any, which will be TRUE only if both are TRUE i.e. if there is at least one non-NA in both sets of columns

library(dplyr)
df %>% 
  filter(if_any(ends_with('Measure1'), complete.cases ) & 
         if_any(ends_with('Measure2'), complete.cases))

-output

 tester_ID  Phase Item1Measure1 Item2Measure1 Item3Measure1 Item1Measure2 Item2Measure2 Item3Measure2
1         A1 Phase1             5             5            NA            NA             5             5
2         A3 Phase1             3            NA            NA            NA            NA             3
3         A4 Phase1             4            NA            NA            NA             4             4
4         A5 Phase1             4             4             4            NA             4             4
5         A6 Phase1             1             1             1             1             1             1
6         A7 Phase1             4            NA            NA            NA             4             4
7         A1 Phase2             4             4             4             4            NA            NA
8         A2 Phase2             5             5             5             5             5             5
9         A3 Phase2            NA            NA             1            NA             2            NA
10        A4 Phase2            NA            NA             3            NA             4            NA
11        A5 Phase2            NA             3             5            NA             1            NA
12        A7 Phase2            NA             1            NA            NA             4            NA

dplyr: subset rows based on count of a column

You can do this:

library(dplyr)

df <- planes %>%
  dplyr::group_by(manufacturer) %>%
  dplyr::filter(n() > 10)

Summary of Output

df %>% count(manufacturer)

# A tibble: 8 × 2
# Groups:   manufacturer [8]
  manufacturer                      n
  <chr>                         <int>
1 AIRBUS                          336
2 AIRBUS INDUSTRIE                400
3 BOEING                         1630
4 BOMBARDIER INC                  368
5 EMBRAER                         299
6 MCDONNELL DOUGLAS               120
7 MCDONNELL DOUGLAS AIRCRAFT CO   103
8 MCDONNELL DOUGLAS CORPORATION    14

subset a dataframe based on a matrix of row numbers and save the result in one list

You could try it something like this. The result should be of length ncol(index) and each element should hold two list elements, training and testing datasets each.

apply(index, MARGIN = 2, FUN = function(x, data) {
  # is is "demoted" from a column to a vector
  list(train = data[x, ], test = data[-x, ])
}, data = df)

regex to subset rows with a certain range in R data frame

You could use this.

d[grepl("\\d{5}([4-5][0-9]|[3][1-9])", d$V1), ]
#         V1         x
# 5  1988131 0.4214841
# 6  1988140 0.6267195
# 7  1988150 0.9193493
# 16 1989140 0.4670599
# 17 1989150 0.7745778
# 18 1989159 0.5258432
# 23 2012250 0.6576061

Alternatively as raw string.

d[grepl(r"{\d{5}([4-5][0-9]|[3][1-9])}", d$V1), ]

See demo.

d <- structure(list(V1 = c(1988100L, 1988110L, 1988120L, 1988130L, 
1988131L, 1988140L, 1988150L, 1988160L, 1988170L, 1988180L, 1988190L, 
1989100L, 1989110L, 1989120L, 1989130L, 1989140L, 1989150L, 1989159L, 
1989160L, 1989170L, 1989180L, 1989190L, 2012250L, 2012260L, 2012270L, 
2012280L, 2012290L), x = c(0.161905547836795, 0.995759425219148, 
0.0691612137015909, 0.352196655003354, 0.421484081307426, 0.626719528576359, 
0.919349306030199, 0.273645391920581, 0.859082776354626, 0.479915214236826, 
0.703950216760859, 0.738272070884705, 0.51616885769181, 0.625624869018793, 
0.129869017750025, 0.467059890506789, 0.77457784046419, 0.525843213545159, 
0.568261774023995, 0.995282813441008, 0.195851961849257, 0.188985655317083, 
0.65760614303872, 0.965418175794184, 0.513075938913971, 0.288330526091158, 
0.833630389068276)), class = "data.frame", row.names = c(NA, 
-27L))

Subset Data Frame Based on Number of Rows Per Group