Counting non NAs in a data frame; getting answer as a vector
Try this:
# define "demo" dataset
ZZZ <- data.frame(n=c(1,2,NA),m=c(6,NA,NA),o=c(7,8,8))
# apply the counting function per columns
apply(ZZZ, 2, function(x) length(which(!is.na(x))))
Running it gives:
> apply(ZZZ, 2, function(x) length(which(!is.na(x))))
n m o
2 1 3
Note that apply already returns a named vector here. If you really want a plain, unnamed vector, you can wrap the call in as.vector, e.g. by defining this function:
nonNAs <- function(x) {
  as.vector(apply(x, 2, function(x) length(which(!is.na(x)))))
}
Then you can simply run nonNAs(ZZZ):
> nonNAs(ZZZ)
[1] 2 1 3
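As an alternative sketch (not part of the answer above), the same counts can be obtained with sum(!is.na(x)) inside vapply, which works column by column without converting the data frame to a matrix first. The names counts and plain_counts are made up for this illustration, and ZZZ is re-created so the snippet runs on its own.

```r
# same demo data as above
ZZZ <- data.frame(n = c(1, 2, NA), m = c(6, NA, NA), o = c(7, 8, 8))

# count non-NA values per column; integer(1) declares one integer per column
counts <- vapply(ZZZ, function(col) sum(!is.na(col)), integer(1))

# drop the column names to get a plain vector
plain_counts <- unname(counts)
```

Here counts keeps the column names, while plain_counts is the bare vector of counts.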
Simple method of counting non-NAs in column of data frame
For a data.frame, you can get it using colSums and is.na:
set.seed(45)
df <- data.frame(matrix(sample(c(NA,1:5), 50, replace=TRUE), ncol=5))
# X1 X2 X3 X4 X5
# 1 3 2 NA 2 NA
# 2 1 5 1 1 4
# 3 1 1 3 2 3
# 4 2 2 3 5 3
# 5 2 2 5 2 2
# 6 1 2 NA 3 3
# 7 1 5 5 5 2
# 8 3 NA 4 1 5
# 9 1 2 3 NA 1
# 10 NA 1 1 2 2
colSums(!is.na(df))
# X1 X2 X3 X4 X5
# 9 9 8 9 9
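If only one column is of interest, the same idea reduces to sum(!is.na(...)) on that column. The sketch below uses a small hypothetical data frame (toy, with columns a and b) that is not from the answer, so that the result does not depend on the random seed.

```r
# hypothetical small data frame, used only for illustration
toy <- data.frame(a = c(1, NA, 3, NA), b = c("x", "y", NA, "z"))

# non-NA count for a single column
n_a <- sum(!is.na(toy$a))

# non-NA counts for all columns at once, as in the answer above
n_all <- colSums(!is.na(toy))
```

This works for any column type, because is.na returns a logical vector regardless of whether the column is numeric or character.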
Count number of non-NA values for every column in a dataframe
You can also call is.na on the entire data frame (implicitly coercing it to a logical matrix) and call colSums on the negated result:
# make sample data
set.seed(47)
df <- as.data.frame(matrix(sample(c(0:1, NA), 100*5, TRUE), 100))
str(df)
#> 'data.frame': 100 obs. of 5 variables:
#> $ V1: int NA 1 NA NA 1 NA 1 1 1 NA ...
#> $ V2: int NA NA NA 1 NA 1 0 1 0 NA ...
#> $ V3: int 1 1 0 1 1 NA NA 1 NA NA ...
#> $ V4: int NA 0 NA 0 0 NA 1 1 NA NA ...
#> $ V5: int NA NA NA 0 0 0 0 0 NA NA ...
colSums(!is.na(df))
#> V1 V2 V3 V4 V5
#> 69 55 62 60 70
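To make the intermediate step visible, here is a minimal sketch on a hypothetical four-row data frame (toy, not from the answer). It checks that is.na on a data frame really yields a matrix, and shows that colMeans on the same mask gives the share of non-NA values if proportions are more useful than counts.

```r
# hypothetical small data frame, used only for illustration
toy <- data.frame(V1 = c(NA, 1, 0, NA), V2 = c(1, 1, NA, 0))

# is.na() on a data frame returns a logical matrix of the same shape
na_mask <- is.na(toy)
mask_is_matrix <- is.matrix(na_mask)

# counts and proportions of non-NA values per column
non_na_counts <- colSums(!na_mask)
non_na_share <- colMeans(!na_mask)
```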
Count number of non-NA values by group
Or if you wanted to use data.table:
library(data.table)
dt[,sum(!is.na(X2)),by=.(Color)]
Color V1
1: Red 2
2: Blue 0
3: Green 1
Also, it's easy enough to use an ifelse() in your data.table to get an NA for Blue instead of 0. See:
dt[,ifelse(sum(!is.na(X2))==0,as.integer(NA),sum(!is.na(X2))),by=.(Color)]
Color V1
1: Red 2
2: Blue NA
3: Green 1
Data:
dt <- as.data.table(fread("Color X1 X2 X3 X4
Red 1 1 0 2
Blue 0 NA 4 1
Red 3 4 3 1
Green 2 2 1 0"))
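If counts are needed for several columns at once, one possible extension is to apply the same expression over .SD. This is a sketch under the assumption that the data.table package is installed; the table is rebuilt directly with data.table() rather than fread so the snippet is self-contained, and counts is a name chosen for this illustration.

```r
library(data.table)

# same data as above, built directly instead of via fread
dt <- data.table(
  Color = c("Red", "Blue", "Red", "Green"),
  X1 = c(1, 0, 3, 2),
  X2 = c(1, NA, 4, 2),
  X3 = c(0, 4, 3, 1),
  X4 = c(2, 1, 1, 0)
)

# non-NA count per group for every column listed in .SDcols
counts <- dt[, lapply(.SD, function(x) sum(!is.na(x))),
             by = Color, .SDcols = c("X1", "X2", "X3", "X4")]
```

The result has one row per Color and one count column per input column, so the X2 column of counts matches the single-column result shown earlier.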
Efficiently counting non-NA elements in data.table
Yes, option 3 seems to be the best one. I've added another one (option 4), which is valid only if you are willing to change the key of your data.table from id to var, but option 3 is still the fastest on your data.
library(microbenchmark)
library(data.table)
dt<-data.table(id=(1:100)[sample(10,size=1e6,replace=T)],var=c(1,0,NA)[sample(3,size=1e6,replace=T)],key=c("var"))
dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)
dt4 <- copy(dt)
microbenchmark(times=10L,
dt1[!is.na(var),.N,by=id][,max(N,na.rm=T),by=id],
dt2[,length(var[!is.na(var)]),by=id],
dt3[,sum(!is.na(var)),by=id],
dt4[.(c(1,0)),.N,id,nomatch=0L])
# Unit: milliseconds
# expr min lq mean median uq max neval
# dt1[!is.na(var), .N, by = id][, max(N, na.rm = T), by = id] 95.14981 95.79291 105.18515 100.16742 112.02088 131.87403 10
# dt2[, length(var[!is.na(var)]), by = id] 83.17203 85.91365 88.54663 86.93693 89.56223 100.57788 10
# dt3[, sum(!is.na(var)), by = id] 45.99405 47.81774 50.65637 49.60966 51.77160 61.92701 10
# dt4[.(c(1, 0)), .N, id, nomatch = 0L] 78.50544 80.95087 89.09415 89.47084 96.22914 100.55434 10
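Speed aside, the variants do not all return the same rows. The following sketch uses a tiny hypothetical table (small, not the benchmark data) to compare three of the approaches: filtering out NA rows before grouping drops any id whose values are all NA, whereas the sum and length forms report a count of 0 for it.

```r
library(data.table)

# hypothetical data: id 3 has only NA values
small <- data.table(id = c(1, 1, 2, 2, 3), var = c(1, NA, 0, 1, NA))

# options 2 and 3 from the benchmark, with a named result column
by_length <- small[, .(n = length(var[!is.na(var)])), by = id]
by_sum    <- small[, .(n = sum(!is.na(var))), by = id]

# filtering first: groups that contain only NA disappear from the result
by_filter <- small[!is.na(var), .(n = .N), by = id]
```

So by_sum and by_length agree and keep id 3 with a count of 0, while by_filter has no row for id 3. Which behaviour is preferable depends on whether all-NA groups should show up in the output.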
Count non-NA values by group
You can use this:
mydf %>% group_by(col_1) %>% summarise(non_na_count = sum(!is.na(col_2)))
# A tibble: 2 x 2
col_1 non_na_count
<fctr> <int>
1 A 1
2 B 2
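The answer does not show mydf itself, so the sketch below invents a small data frame that is consistent with the output above (the values are an assumption). It uses base R's tapply in place of the dplyr pipeline, purely to show the same grouped count without extra packages.

```r
# hypothetical data, chosen to reproduce the counts shown above
mydf <- data.frame(
  col_1 = factor(c("A", "A", "B", "B")),
  col_2 = c(1, NA, 2, 3)
)

# sum the logical "is not NA" flags within each level of col_1
non_na_count <- tapply(!is.na(mydf$col_2), mydf$col_1, sum)
```

The result is a named array with one entry per level of col_1, rather than a tibble.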