Test for NA and select values based on result
Use is.na:
DF <- within(DF,
C <- ifelse(!is.na(A),A,B)
)
where DF is your data frame.
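A small worked example may make the fallback behaviour concrete. The data frame below is made up for illustration; only the column names A, B and C come from the answer above.

```r
# Hypothetical data: A has a gap that B should fill
DF <- data.frame(A = c(1, NA, 3), B = c(10, 20, 30))

# Take A where it is present, otherwise fall back to B
DF <- within(DF, C <- ifelse(!is.na(A), A, B))

print(DF$C)
```

Here the second row of C is taken from B because A is NA there; the other rows keep the value from A.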
Select only rows if its value in a particular column is 'NA' in R
You can also do it without subset(). To select NA values, use the function is.na().
data[is.na(data$ColWtCL_6),]
Or with subset()
subset(data,is.na(ColWtCL_6))
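As a quick check that both forms select the same rows, here is a sketch on an invented data frame (the id column and the values are assumptions; only the column name ColWtCL_6 comes from the question):

```r
# Hypothetical data with two missing values in ColWtCL_6
data <- data.frame(id = 1:4, ColWtCL_6 = c(0.5, NA, 1.2, NA))

# Logical indexing: keep rows where the column is NA
by_index <- data[is.na(data$ColWtCL_6), ]

# Same selection with subset()
by_subset <- subset(data, is.na(ColWtCL_6))

print(by_index)
print(by_subset)
```

Note that comparing with `== NA` would not work here, because any comparison with NA yields NA rather than TRUE or FALSE.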
'IF' in 'SELECT' statement - choose output value based on column values
SELECT id,
IF(type = 'P', amount, amount * -1) as amount
FROM report
See http://dev.mysql.com/doc/refman/5.0/en/control-flow-functions.html.
Additionally, you could handle when the condition is null. In the case of a null amount:
SELECT id,
IF(type = 'P', IFNULL(amount,0), IFNULL(amount,0) * -1) as amount
FROM report
The part IFNULL(amount,0) means: when amount is not null, return amount; otherwise return 0.
Comparing Column Values With NA
To simplify things, let's first redefine the data frame with stringsAsFactors=FALSE:
df <- read.table(header = TRUE, text = "A B
NA TEST
TEST TEST
Abaxasdas Test", stringsAsFactors=FALSE)
You can compare the columns in an NA-safe way using identical:
mapply(identical, df$A, df$B)
To get the output with "YES" and "NO" instead of TRUE and FALSE:
ifelse(mapply(identical, df$A, df$B), "YES", "NO")
Output
> df$Output <- ifelse(mapply(identical, df$A, df$B), "YES", "NO")
> df
          A    B Output
1      <NA> TEST     NO
2      TEST TEST    YES
3 Abaxasdas Test     NO
An alternative
As joran suggested in a comment, replacing NAs with a value would make the comparison easier. If you don't want to change the values in the data frame (but maybe you should!), you could use a helper function like this:
rna <- function(x) replace(x, is.na(x), "")
ifelse(rna(df$A)==rna(df$B), "YES", "NO")
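To see why an NA-safe comparison is needed at all, the sketch below puts plain `==` next to the two approaches above, using the same three value pairs as plain vectors (the names by_identical and by_helper are just labels for this illustration):

```r
A <- c(NA, "TEST", "Abaxasdas")
B <- c("TEST", "TEST", "Test")

# Plain == propagates NA, so the first comparison is NA rather than FALSE
plain <- A == B

# identical() compares each pair without propagating NA
by_identical <- unname(mapply(identical, A, B))

# The helper replaces NA with "" before comparing
rna <- function(x) replace(x, is.na(x), "")
by_helper <- rna(A) == rna(B)

print(plain)
print(by_identical)
print(by_helper)
```

One difference worth keeping in mind: the helper treats NA and an empty string as equal, whereas identical() does not.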
How to select the last one test without NA in r
You can use this solution:
> t(apply(d[-1],1,function(rw) rw[range(which(!is.na(rw)))]))
[,1] [,2]
[1,] 62 59
[2,] 49 60
[3,] 59 34
where d is your data set.
How it works: for each row of d (rows are scanned using apply(d[-1],1,...), where d[-1] excludes the first column), get the indices of non-NA test results (which(!is.na(rw))), then get the lowest and highest of those indices by using range(), and obtain the test scores that correspond to those indices (rw[...]). The final result is transposed using t().
Note that this solution will work properly even in the case of NAs in the middle of the test scores, e.g. c(NA, 57, NA, 52, NA).
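The per-row step can be traced on a single vector. This sketch uses the example row mentioned above; the variable names idx and first_last are only for illustration.

```r
# One row of test scores with NAs at the ends and in the middle
rw <- c(NA, 57, NA, 52, NA)

# Positions of the non-NA scores
idx <- which(!is.na(rw))

# range() gives the smallest and largest position; index back into the row
first_last <- rw[range(idx)]

print(idx)
print(first_last)
```

The result holds the first and the last non-NA score of the row, in that order. A row with no non-NA values at all is not handled by this sketch.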
Detect change from previous rows with missing values - speed up for loop - R
Here's an alternative approach, which removes any rows with NAs, performs some calculations and joins back the NA rows in the right place.
library(tidyverse)
library(zoo)
# example data
test <- data.frame(resp = c(9, NA, NA, 11, NA, NA, 6, 16, NA, 12, 0, 0, 0, 0, 0, NA, 0, 11, NA, NA, NA, NA, NA, NA, 14))
# add an id for each row
test = test %>% mutate(id = row_number())
test %>%
  na.omit() %>% # exclude rows with NAs
  mutate(flag = case_when(resp == lag(resp, default = first(resp)) ~ 0,
                          resp > lag(resp, default = first(resp)) ~ 1,
                          resp < lag(resp, default = first(resp)) ~ -1)) %>% # check relationship between current and previous value
  mutate(g = cumsum(flag != lag(flag, default = first(flag)))) %>% # create a grouping based on change in flag column
  group_by(g) %>% # for each group
  mutate(change = ifelse(flag != 0, flag * row_number(), flag)) %>% # calculate the change column
  ungroup() %>% # forget the grouping
  select(id, change) %>% # keep useful columns
  right_join(test, by="id") %>% # join back to get NA rows in the right place
  select(resp, change) # keep useful columns
As a result you'll get:
# resp change
# 1 9 0
# 2 NA NA
# 3 NA NA
# 4 11 1
# 5 NA NA
# 6 NA NA
# 7 6 -1
# 8 16 1
# 9 NA NA
# 10 12 -1
# 11 0 -2
# 12 0 0
# 13 0 0
# 14 0 0
# 15 0 0
# 16 NA NA
# 17 0 0
# 18 11 1
# 19 NA NA
# 20 NA NA
# 21 NA NA
# 22 NA NA
# 23 NA NA
# 24 NA NA
# 25 14 2
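For comparison, the same idea (drop the NAs, compute the direction of change, count consecutive runs, put the results back in place) can be sketched in base R without any packages. This is an alternative formulation rather than the code above; the names keep, vals, flag and run_pos are illustrative.

```r
resp <- c(9, NA, NA, 11, NA, NA, 6, 16, NA, 12, 0, 0, 0, 0, 0,
          NA, 0, 11, NA, NA, NA, NA, NA, NA, 14)

keep <- !is.na(resp)                 # positions of non-missing values
vals <- resp[keep]

# Direction of change versus the previous non-missing value (first one is 0)
flag <- c(0, sign(diff(vals)))

# Position of each value within its run of identical flags
run_pos <- sequence(rle(flag)$lengths)

# Write the results back, leaving NA where resp was NA
change <- rep(NA_real_, length(resp))
change[keep] <- flag * run_pos

print(data.frame(resp, change))
```

Because the flag is 0 for unchanged values, multiplying by the run position leaves those at 0 and counts up (or down) only within runs of increases (or decreases).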
Selecting non `NA` values from duplicate rows with `data.table` -- when having more than one grouping variable
Here are some data.table-based solutions.
setDT(df_id_year_and_type)
method 1
na.omit(df_id_year_and_type, cols="type") drops NA rows based on column type.
unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE) finds all the groups.
And by joining them (using the last match: mult="last"), we obtain the desired output.
na.omit(df_id_year_and_type, cols="type")[
  unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE),
  on=c('id', 'year'),
  mult="last"]
# id year type
# <num> <num> <char>
# 1: 1 2002 A
# 2: 2 2008 B
# 3: 3 2010 D
# 4: 3 2013 <NA>
# 5: 4 2020 C
# 6: 5 2009 A
# 7: 6 2010 B
# 8: 6 2012 <NA>
method 2
df_id_year_and_type[df_id_year_and_type[, .I[which.max(cumsum(!is.na(type)))], .(id, year)]$V1,]
method 3 (likely slower because of [ overhead)
df_id_year_and_type[, .SD[which.max(cumsum(!is.na(type)))], .(id, year)]
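Methods 2 and 3 both rely on which.max(cumsum(!is.na(type))) to pick one row per group. The base R sketch below shows what that expression returns on plain vectors; pick_last_non_na is a hypothetical helper name used only here.

```r
# Hypothetical helper wrapping the expression used in methods 2 and 3
pick_last_non_na <- function(x) which.max(cumsum(!is.na(x)))

# The running count of non-NA values first reaches its maximum at the last non-NA element
mixed <- pick_last_non_na(c("A", NA, "B", NA))

# With only NAs the running count stays at 0, so the first position is returned
all_na <- pick_last_non_na(c(NA, NA))

print(mixed)
print(all_na)
```

This is why groups whose type is entirely NA still contribute one row (with type NA) to the output.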
Select NA in a data.table in R
Fortunately, DT[is.na(x),] is nearly as fast as (e.g.) DT["a",], so in practice, this may not really matter much:
library(data.table)
library(rbenchmark)
DT = data.table(x=rep(c("a","b",NA),each=3e6), y=c(1,3,6), v=1:9)
setkey(DT,x)
benchmark(DT["a",],
DT[is.na(x),],
replications=20)
# test replications elapsed relative user.self sys.self user.child
# 1 DT["a", ] 20 9.18 1.000 7.31 1.83 NA
# 2 DT[is.na(x), ] 20 10.55 1.149 8.69 1.85 NA
===
Addition from Matthew (won't fit in comment):
The data above has 3 very large groups, though. So the speed advantage of binary search is dominated here by the time to create the large subset (1/3 of the data is copied).
benchmark(DT["a",], # repeat select of large subset on my netbook
DT[is.na(x),],
replications=3)
test replications elapsed relative user.self sys.self
DT["a", ] 3 2.406 1.000 2.357 0.044
DT[is.na(x), ] 3 3.876 1.611 3.812 0.056
benchmark(DT["a",which=TRUE], # isolate search time
DT[is.na(x),which=TRUE],
replications=3)
test replications elapsed relative user.self sys.self
DT["a", which = TRUE] 3 0.492 1.000 0.492 0.000
DT[is.na(x), which = TRUE] 3 2.941 5.978 2.932 0.004
As the size of the subset returned decreases (e.g. adding more groups), the difference becomes apparent. Vector scans on a single column aren't too bad, but on 2 or more columns it quickly degrades.
Maybe NAs should be joinable too. I seem to remember a gotcha with that, though. Here's some history linked from FR#1043 Allow or disallow NA in keys?. It mentions there that NA_integer_ is internally a negative integer. That trips up radix/counting sort (iirc), resulting in setkey going slower. But it's on the list to revisit.