Using R - Delete Rows When a Value Is Repeated Fewer Than 3 Times

Delete rows in data frame if entry appears fewer than x times

You can use ave like this:

df[as.numeric(ave(df$Name, df$Name, FUN=length)) >= 2, ]
# Name Age ZipCode
# 1 Joe 16 60559
# 3 Bob 64 94127
# 4 Joe 23 94122
# 5 Bob 45 25462

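This answer assumes that df$Name is a character vector, not a factor. On a character vector, ave returns the counts as character strings, which is why the result is wrapped in as.numeric.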
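
As a rough sketch of what ave is doing here, the snippet below rebuilds a small data frame and inspects the counts. Rows 1, 3, 4 and 5 are taken from the output above; row 2 ("Jim") is a made-up placeholder, since the original input is not shown. Passing seq_along(df$Name) as the first argument is an alternative that returns integer counts and should also work when Name is a factor.

```r
# Rows 1, 3, 4, 5 match the output above; row 2 is hypothetical.
df <- data.frame(
  Name    = c("Joe", "Jim", "Bob", "Joe", "Bob"),
  Age     = c(16, 20, 64, 23, 45),
  ZipCode = c(60559, 60637, 94127, 94122, 25462),
  stringsAsFactors = FALSE
)

# ave() keeps the type of its first argument, so counts come back as character.
counts_chr <- ave(df$Name, df$Name, FUN = length)

# With an integer first argument the counts stay integer; no as.numeric() needed.
counts <- ave(seq_along(df$Name), df$Name, FUN = length)

# Keep rows whose Name appears at least twice.
kept <- df[counts >= 2, ]
print(kept)
```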


You can also continue with table as follows:

x <- table(df$Name)
df[df$Name %in% names(x[x >= 2]), ]
# Name Age ZipCode
# 1 Joe 16 60559
# 3 Bob 64 94127
# 4 Joe 23 94122
# 5 Bob 45 25462
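
To make the threshold (the "x times" in the question) explicit, the table approach can be wrapped in a small helper. This is only a sketch: the name keep_frequent and its arguments are hypothetical, and row 2 of the sample data is made up because the original input is not shown.

```r
# Hypothetical helper: keep rows whose value in `col` occurs at least `min_n` times.
keep_frequent <- function(data, col, min_n) {
  counts <- table(data[[col]])
  frequent <- names(counts)[counts >= min_n]
  # as.character() makes the comparison behave the same for factors and strings.
  data[as.character(data[[col]]) %in% frequent, , drop = FALSE]
}

# Rows 1, 3, 4, 5 match the output above; row 2 is hypothetical.
df <- data.frame(
  Name    = c("Joe", "Jim", "Bob", "Joe", "Bob"),
  Age     = c(16, 20, 64, 23, 45),
  ZipCode = c(60559, 60637, 94127, 94122, 25462),
  stringsAsFactors = FALSE
)

at_least_two   <- keep_frequent(df, "Name", 2)
at_least_three <- keep_frequent(df, "Name", 3)
print(at_least_two)
```

With this sample data no name occurs three times, so a threshold of 3 returns an empty data frame.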

Delete columns that have more than 30% repeated values or more than 1% of values outside the range defined by the mean ± 2.5 SD in R

Write a function which incorporates all the rules you want to use to delete a column.

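# Despite its name, this returns TRUE for columns to keep, so it can be passed to Filter().
remove_col <- function(x) {
  tab <- table(x)
  s <- sd(x)
  mn <- mean(x)

  # Reject a column if more than 30% of its values are repeated, or if more
  # than 1% of its values fall outside mean +/- 2.5 SD.
  !(mean(x %in% names(tab[tab > 1])) > 0.3 ||
      sum(x > mn + 2.5 * s | x < mn - 2.5 * s) > 0.01 * length(x))
}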

Use it with Filter.

Filter(remove_col, DF)

# x3
#1 4.0
#2 2.0
#3 3.0
#4 4.0
#5 5.0
#6 4.2
#7 4.6
#8 2.2
#9 2.7
#10 2.8
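
The input DF is not shown above, so here is a self-contained sketch on made-up data to illustrate both rules. Column x3 is copied from the output above; x1 (40% repeated values) and x2 (one extreme value) are hypothetical columns chosen to trigger each rule. The function is repeated so the snippet runs on its own.

```r
# Same rules as above: TRUE means "keep this column".
remove_col <- function(x) {
  tab <- table(x)
  s <- sd(x)
  mn <- mean(x)
  !(mean(x %in% names(tab[tab > 1])) > 0.3 ||
      sum(x > mn + 2.5 * s | x < mn - 2.5 * s) > 0.01 * length(x))
}

DF <- data.frame(
  x1 = c(1, 1, 1, 1, 2, 3, 4, 5, 6, 7),                     # hypothetical: 4 of 10 values repeat
  x2 = c(1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 100), # hypothetical: 100 is beyond mean + 2.5 SD
  x3 = c(4.0, 2.0, 3.0, 4.0, 5.0, 4.2, 4.6, 2.2, 2.7, 2.8)  # from the output above
)

# Filter() applies remove_col to each column and keeps those returning TRUE.
result <- Filter(remove_col, DF)
print(names(result))
```

One caveat with this sketch: x %in% names(tab[...]) compares numbers against their character labels, which is fine for short decimals like these but may be fragile for values with many significant digits.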

How to delete groups containing fewer than 3 rows of data in R?

One way to do it is to use the magic n() function within filter:

library(dplyr)

my_data <- data.frame(Year=1996, Site="A", Brood=c(1,1,2,2,2))

my_data %>%
  group_by(Year, Site, Brood) %>%
  filter(n() >= 3)

The n() function gives the number of rows in the current group (or the number of rows total if there is no grouping).
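
If dplyr is not available, roughly the same filter can be sketched in base R by computing the per-group row count with ave. This is a swapped-in technique, not the dplyr approach itself; the variable names group_size and kept are only illustrative.

```r
my_data <- data.frame(Year = 1996, Site = "A", Brood = c(1, 1, 2, 2, 2))

# For every row, count how many rows share its Year/Site/Brood combination.
group_size <- ave(seq_len(nrow(my_data)),
                  my_data$Year, my_data$Site, my_data$Brood,
                  FUN = length)

# Keep only rows that belong to groups with at least 3 rows.
kept <- my_data[group_size >= 3, ]
print(kept)
```

Here Brood 1 has two rows and Brood 2 has three, so only the Brood 2 rows remain.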

Remove rows which have fewer than 4 strings in a specific column

If you want to use tidyverse packages you could use:

library(dplyr)
library(stringr)

dd %>% filter(str_count(text, " ") >= 3)

Here we assume that "less than 4 strings" means fewer than 3 spaces. Counting characters is much more efficient than going through the work of splitting the string up and allocating memory for the separate pieces when you don't really need them.
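
For comparison, the same space-counting idea can be sketched in base R without stringr. The data frame dd and its text column follow the names used above, but the sample strings are made up. Note that this count treats every single space as a separator, so doubled or leading/trailing spaces would inflate it.

```r
# Hypothetical sample data; only the names dd and text come from the answer above.
dd <- data.frame(
  text = c("one two", "one two three four", "a b c d e"),
  stringsAsFactors = FALSE
)

# gregexpr() finds every space; regmatches() extracts them; lengths() counts them per string.
n_spaces <- lengths(regmatches(dd$text, gregexpr(" ", dd$text, fixed = TRUE)))

# At least 3 spaces means at least 4 space-separated pieces.
result <- dd[n_spaces >= 3, , drop = FALSE]
print(result)
```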

Delete rows in R if a cell contains a value larger than x

rowSums of the logical matrix df > 7 gives the number of TRUE values in each row, which is 0 when no value in that row exceeds 7. Negating the result turns 0 into TRUE and every nonzero count into FALSE, so it can be used directly for subsetting.

df[!rowSums(df >7),]
# a b c
#2 6 6 5
#4 7 4 7

For the 'V2' case, we use the same principle, except that the logical matrix is computed on a subset of 'df', i.e., with the first column dropped so that only the second and third columns are checked.

df[!rowSums(df[-1] >7),]
# a b c
#2 6 6 5
#3 99 3 6
#4 7 4 7
#6 9 6 3
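
The input df is not shown above, so the sketch below uses a reconstructed one: rows 2, 3, 4 and 6 are taken from the outputs, while rows 1 and 5 are hypothetical values chosen so that each has a value above 7 in column b or c. It also spells out the intermediate counts and uses == 0 instead of !, which may read more clearly.

```r
# Rows 2, 3, 4, 6 come from the outputs above; rows 1 and 5 are hypothetical.
df <- data.frame(
  a = c(1, 6, 99, 7, 5, 9),
  b = c(8, 6, 3, 4, 5, 6),
  c = c(2, 5, 6, 7, 8, 3)
)

# Number of values greater than 7 in each row, over all columns.
over_all <- rowSums(df > 7)

# Same count, but ignoring the first column.
over_bc <- rowSums(df[-1] > 7)

keep_all <- df[over_all == 0, ]  # equivalent to df[!rowSums(df > 7), ]
keep_bc  <- df[over_bc == 0, ]   # equivalent to df[!rowSums(df[-1] > 7), ]
print(keep_all)
print(keep_bc)
```

If the data may contain NA, rowSums returns NA for the affected rows unless na.rm = TRUE is passed, so the subsetting step would need to handle that case.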

