How to Remove Rows with NAs Only If They Are Present in More Than a Certain Percentage of Columns

How to remove rows with NAs only if they are present in more than a certain percentage of columns?

You can subset based on the row sums of NA values:

test[!rowSums(is.na(test)) > ncol(test)*.3,]

C1 C2 C3 C4 C5
Gene1 0.07 NA 0.05 0.07 0.07
Gene2 0.20 0.18 0.16 0.15 0.15
Gene4 0.32 0.05 0.12 0.13 0.05
Gene5 0.44 0.53 0.46 0.03 0.47
Gene7 0.49 0.55 0.67 0.49 0.89
Gene9 0.10 0.10 0.05 NA 0.09
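
The test object itself is not shown in the answer. A minimal sketch with made-up values of the same shape (gene rows, five numeric columns) reproduces the idea: rowSums(is.na(test)) counts the NAs per row, and the comparison drops any row where that count exceeds 30% of the columns.

set.seed(1)
m <- matrix(runif(50), nrow = 10,
            dimnames = list(paste0("Gene", 1:10), paste0("C", 1:5)))
m[sample(length(m), 12)] <- NA        # sprinkle 12 NAs at random cell positions
test <- as.data.frame(m)

# drop rows where the NA count exceeds 30% of the columns (here: 2 or more NAs)
test[!rowSums(is.na(test)) > ncol(test) * 0.3, ]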

Remove NAs if the gap is greater than a certain time interval or a certain number of rows

Here's a tidyverse solution that uses rleid() from data.table:

library(data.table)
library(tidyverse)

df %>%
  unite(datetime, date, time, sep = ' ') %>%
  mutate(datetime = lubridate::ymd_hms(datetime),
         new = rleid(is.na(lat))) %>%   # run id for consecutive NA / non-NA stretches of lat
  group_by(lat, lon, new) %>%
  filter(n() < 3) %>%                   # keep groups of fewer than 3 rows, i.e. drop runs of 3+ NAs
  select(-new)

This gives us (the result is still grouped by lat, lon and new, so select() keeps the helper column new):

# A tibble: 15 x 5
new id datetime lat lon
<int> <chr> <dttm> <dbl> <dbl>
1 1 A 2011-10-03 05:00:00 35 -53.4
2 1 A 2011-10-03 06:00:00 35.1 -53.4
3 2 A 2011-10-03 07:00:00 NA NA
4 2 A 2011-10-03 08:00:00 NA NA
5 3 A 2011-10-03 09:00:00 35.1 -53.4
6 3 A 2011-10-03 10:00:00 36.2 -53.6
7 3 A 2011-10-03 23:00:00 36.6 -53.6
8 3 B 2012-11-08 05:00:00 35.8 -53.4
9 4 B 2012-11-08 06:00:00 NA NA
10 5 B 2012-11-08 07:00:00 36 -53.4
11 6 B 2012-11-08 08:00:00 NA NA
12 6 B 2012-11-08 09:00:00 NA NA
13 7 B 2012-11-08 10:00:00 36.5 -53.4
14 7 B 2012-11-08 23:00:00 36.6 -53.4
15 9 B 2012-11-09 05:00:00 36.6 -53.5
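
rleid() assigns a new integer id every time the value of is.na(lat) changes, so each run of consecutive NA (or non-NA) rows gets its own id. A quick illustration on a toy vector (not the df from the question):

library(data.table)

lat <- c(35.0, 35.1, NA, NA, NA, 35.2, NA, 35.3)
rleid(is.na(lat))
# [1] 1 1 2 2 2 3 4 5

Grouping by lat, lon and new and keeping groups with n() < 3 therefore removes only runs of three or more consecutive NA rows.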

Remove rows with missing data in select columns, only if they don't have missing data in all columns (preferably use complete.cases)

Try using rowSums like this:

cols <- 11:103
vals <- rowSums(is.na(data_set1[cols]))                       # NAs per row within the selected columns
data_set2 <- data_set1[!(vals > 0 & vals < length(cols)), ]   # drop rows that are only partially missing there

Or with complete.cases and rowSums:

data_set1[complete.cases(data_set1[cols]) |
            rowSums(is.na(data_set1[cols])) == length(cols), ]

With a reproducible example:

df <- data.frame(a = c(1, 2, 3, NA, 1), b = c(NA, 2, 3, NA, NA), c = 1:5)
cols <- 1:2

vals <- rowSums(is.na(df[cols]))
df[!(vals > 0 & vals < length(cols)), ]

# a b c
#2 2 2 2
#3 3 3 3
#4 NA NA 4
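
For completeness, the complete.cases() variant returns the same rows on this example:

df[complete.cases(df[cols]) |
     rowSums(is.na(df[cols])) == length(cols), ]

#    a  b c
# 2  2  2 2
# 3  3  3 3
# 4 NA NA 4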

How to replace all cases in columns with NA if there are more than x numbers or more than x letters in the string?

Following your code, you can set the values to NA if col3 does not have exactly 4 characters:

library(dplyr)
library(stringr)

df %>%
  mutate(col2 = gsub('\\s+', '', toupper(col2)),       # uppercase and strip whitespace
         col3 = str_extract(col2, "^[0-9]{4,}"),       # leading run of 4 or more digits
         col4 = str_extract(col2, "[A-Za-z].*$"),      # everything from the first letter onward
         across(c(col2, col3, col4), ~ ifelse(nchar(col3) == 4, .x, NA)))

col1 col2 col3 col4
1 1 1042AZ 1042 AZ
2 2 9523PA 9523 PA
3 3 <NA> <NA> <NA>
4 4 <NA> <NA> <NA>
5 5 <NA> <NA> <NA>
6 6 <NA> <NA> <NA>
7 7 1052 1052 <NA>
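
To see what the two str_extract() patterns do on their own, here they are applied to a few of the cleaned-up strings (a standalone illustration, not part of the original answer): ^[0-9]{4,} grabs a leading run of at least four digits, and [A-Za-z].*$ grabs everything from the first letter onward.

library(stringr)

x <- c("1042AZ", "9523PA", "20000(USA)", "1052")
str_extract(x, "^[0-9]{4,}")
# [1] "1042"  "9523"  "20000" "1052"
str_extract(x, "[A-Za-z].*$")
# [1] "AZ"   "PA"   "USA)" NA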

data

df <- read.table(header = T, text = 'col1   col2
1 "1042AZ"
2 "9523 pa"
3 "dog"
4 "New York"
5 "20000 (usa)"
6 "Outside the country"
7 "1052"')

How to remove all columns that contain more than 2000 NA values?

One base R option could be:

dat[, colMeans(is.na(dat)) <= 0.5]

X1 X2 X4 X5 X6 X8 X10
1 NA 11 NA NA NA 71 NA
2 NA 12 32 NA 52 72 NA
3 3 NA 33 NA 53 73 93
4 4 14 NA 44 NA NA 94
5 5 15 35 NA 55 75 95
6 NA NA 36 46 NA 76 NA
7 NA NA NA 47 57 NA 97
8 8 18 NA 48 NA 78 98
9 9 NA 39 NA 59 79 99
10 NA NA 40 50 NA 80 100

Or using a specified number:

dat[, colSums(is.na(dat)) <= 5]
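
For the threshold in the question title (drop columns containing more than 2000 NAs), the same pattern would be:

dat[, colSums(is.na(dat)) <= 2000]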

Or using half of the rows as the criterion:

dat[, colSums(is.na(dat)) <= nrow(dat)/2]

And the same idea with dplyr:

dat %>%
  select_if(~ mean(is.na(.)) <= 0.5)

Or using a specified number:

dat %>%
  select_if(~ sum(is.na(.)) <= 5)

Similarly, using half of the rows as the criterion:

dat %>%
  select_if(~ sum(is.na(.)) <= length(.)/2)
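
select_if() still works but is superseded in current dplyr; on dplyr 1.0.0 or later, the equivalent spelling with select() and where() would be:

dat %>%
  select(where(~ mean(is.na(.x)) <= 0.5))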

How to remove rows in a dataframe with more than x number of Null values?

If I understand correctly, you need to remove rows only if the total number of NaNs in a row is more than 7:

df = df[df.isnull().sum(axis=1) < 7]

This keeps only the rows with fewer than 7 NaNs; if rows should be dropped only when they contain more than 7 NaNs, use <= 7 instead.
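
Since the rest of this page is in R, the same row filter in base R (assuming a data frame df and the same threshold of 7) would be:

# keep only rows with fewer than 7 NAs
df <- df[rowSums(is.na(df)) < 7, ]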

Delete columns/rows with more than x% missing

To remove columns with more than a given share of NA, you can use the proportion of non-missing values, colMeans(!is.na(...)):

## Some sample data
set.seed(0)
dat <- matrix(1:100, 10, 10)
dat[sample(1:100, 50)] <- NA
dat <- data.frame(dat)

## Remove columns with more than 50% NA
dat[, which(colMeans(!is.na(dat)) > 0.5)]

## Remove rows with more than 50% NA
dat[which(rowMeans(!is.na(dat)) > 0.5), ]

## Remove columns and rows with more than 50% NA
dat[which(rowMeans(!is.na(dat)) > 0.5), which(colMeans(!is.na(dat)) > 0.5)]
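
The 0.5 can be swapped for any proportion; for example, with a made-up stricter cutoff that drops anything more than 20% missing:

## Keep only rows and columns that are at most 20% NA (0.2 is an arbitrary example threshold)
p <- 0.2
dat[rowMeans(is.na(dat)) <= p, colMeans(is.na(dat)) <= p]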

Loop to remove columns with NAs by row and run function (do this for each row)

Is your dataset a species x site matrix? I.e., are the rows sites/samples and the columns species?

If so, it is normal to have zeroes to indicate that a species was not detected at that site, rather than NA. I understand that just because you did not detect a species does not mean it wasn't there, but most analyses of ecological community data take that into account.

R - How to delete rows by a value, when NAs are present

The which() approach does not run into the problem that the subset approach has when NAs are present. For example:

ID = c("R1","R2","R3","R4","R5","R6")
col1 = c(1.2,2.35,5,4.3,2.22,1.35)
sp = c("F","F",NA,NA,"T","F")

data = data.frame(ID,col1,sp)
data1 = data[-which(data$sp=="T"),]

Which yields:

> data
ID col1 sp
1 R1 1.20 F
2 R2 2.35 F
3 R3 5.00 <NA>
4 R4 4.30 <NA>
5 R5 2.22 T
6 R6 1.35 F

> data1
ID col1 sp
1 R1 1.20 F
2 R2 2.35 F
3 R3 5.00 <NA>
4 R4 4.30 <NA>
6 R6 1.35 F

Just to make sure you fully understand, which() finds the indices at which "T" occurs in the vector sp, i.e.:

> which(data$sp=="T")
[1] 5
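
The problem becomes visible with plain logical subsetting, where the NA comparisons turn into all-NA rows; a sketch on the same data object, together with an explicit-NA condition that keeps those rows intact:

# Logical subsetting propagates the NA comparisons as all-NA rows:
data[data$sp != "T", ]
#      ID col1   sp
# 1    R1 1.20    F
# 2    R2 2.35    F
# NA   <NA>   NA <NA>
# NA.1 <NA>   NA <NA>
# 6    R6 1.35    F

# Handling the NAs explicitly keeps rows 3 and 4 and removes only sp == "T":
data[is.na(data$sp) | data$sp != "T", ]

One caveat with the negative which() form: if nothing matches, which() returns integer(0), and data[-integer(0), ] returns zero rows, so it is worth guarding against an empty match.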

