Test for NA and Select Values Based on Result

Test for NA and select values based on result

Use is.na():

DF <- within(DF,
  C <- ifelse(!is.na(A), A, B)
)

where DF is your data frame.
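For example, with a small made-up data frame (column C gets the value of A where A is not NA, and the value of B otherwise):

DF <- data.frame(A = c(1, NA, 3), B = c(10, 20, 30))
DF <- within(DF, C <- ifelse(!is.na(A), A, B))
DF
#    A  B  C
# 1  1 10  1
# 2 NA 20 20
# 3  3 30  3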

Select only rows whose value in a particular column is NA in R

You can also do it without subset(). To select NA values you should use the function is.na().

data[is.na(data$ColWtCL_6),]

Or with subset():

subset(data,is.na(ColWtCL_6))
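For instance, with a hypothetical data frame that has a ColWtCL_6 column, both forms return the rows where that column is NA:

data <- data.frame(id = 1:4, ColWtCL_6 = c(2.5, NA, 3.1, NA))

data[is.na(data$ColWtCL_6),]
#   id ColWtCL_6
# 2  2        NA
# 4  4        NA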

'IF' in 'SELECT' statement - choose output value based on column values

SELECT id,
       IF(type = 'P', amount, amount * -1) AS amount
FROM report

See http://dev.mysql.com/doc/refman/5.0/en/control-flow-functions.html.

Additionally, you can handle the case where the value is NULL. For a NULL amount:

SELECT id,
       IF(type = 'P', IFNULL(amount, 0), IFNULL(amount, 0) * -1) AS amount
FROM report

The expression IFNULL(amount, 0) returns amount when it is not NULL, and 0 otherwise.
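For comparison, here is the same logic in R with a made-up report data frame; ifelse() plays the role of IF, and replacing NAs with 0 stands in for IFNULL:

report <- data.frame(id = 1:3, type = c("P", "N", "P"), amount = c(100, 50, NA))

amount0 <- ifelse(is.na(report$amount), 0, report$amount) # IFNULL(amount, 0)
report$amount <- ifelse(report$type == "P", amount0, -amount0)
report
#   id type amount
# 1  1    P    100
# 2  2    N    -50
# 3  3    P      0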

Comparing Column Values With NA

To simplify things, let's first redefine the data frame with stringsAsFactors=FALSE:

df <- read.table(header = TRUE, text = "A B
NA TEST
TEST TEST
Abaxasdas Test", stringsAsFactors=FALSE)

You can compare the columns in a NA-safe way using identical:

mapply(identical, df$A, df$B)

To get the output with "YES" and "NO" instead of TRUE and FALSE:

ifelse(mapply(identical, df$A, df$B), "YES", "NO")

Output

> df$Output <- ifelse(mapply(identical, df$A, df$B), "YES", "NO")
> df
          A    B Output
1      <NA> TEST     NO
2      TEST TEST    YES
3 Abaxasdas Test     NO

An alternative

As joran suggested in a comment, replacing NAs with a value would make the comparison easier. If you don't want to change the values in the data frame (but maybe you should!), you could use a helper function like this:

rna <- function(x) replace(x, is.na(x), "")
ifelse(rna(df$A)==rna(df$B), "YES", "NO")
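Applied to the data frame above, this reproduces the earlier result:

ifelse(rna(df$A) == rna(df$B), "YES", "NO")
# [1] "NO"  "YES" "NO"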

How to select the last test without NA in R

You can use this solution:

> t(apply(d[-1], 1, function(rw) rw[range(which(!is.na(rw)))]))
     [,1] [,2]
[1,]   62   59
[2,]   49   60
[3,]   59   34

where d is your data set.

How it works: for each row of d (rows are scanned using apply(d[-1], 1, ...), where d[-1] excludes the first column), get the indices of the non-NA test results (which(!is.na(rw))), take the lowest and highest of those indices with range(), and extract the test scores at those positions (rw[...]). The final result is transposed with t().

Note that this solution will work properly even in the case of NAs in the middle of the test scores, e.g. c(NA, 57, NA, 52, NA).
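For reference, here is a hypothetical d that reproduces the output above (an id column followed by test scores that may contain NAs):

d <- data.frame(id = 1:3,
                t1 = c(62, NA, 59),
                t2 = c(NA, 49, NA),
                t3 = c(59, 60, 34))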

Detect change from previous rows with missing values - speed up for loop - R

Here's an alternative approach, which removes any rows with NAs, performs some calculations and joins back the NA rows in the right place.

library(tidyverse)

# example data
test <- data.frame(resp = c(9, NA, NA, 11, NA, NA, 6, 16, NA, 12, 0, 0, 0, 0, 0, NA, 0, 11, NA, NA, NA, NA, NA, NA, 14))

# add an id for each row
test <- test %>% mutate(id = row_number())

test %>%
  na.omit() %>% # exclude rows with NAs
  mutate(flag = case_when(resp == lag(resp, default = first(resp)) ~ 0,
                          resp > lag(resp, default = first(resp)) ~ 1,
                          resp < lag(resp, default = first(resp)) ~ -1)) %>% # check relationship between current and previous value
  mutate(g = cumsum(flag != lag(flag, default = first(flag)))) %>% # create a grouping based on changes in the flag column
  group_by(g) %>% # for each group
  mutate(change = ifelse(flag != 0, flag * row_number(), flag)) %>% # calculate the change column
  ungroup() %>% # forget the grouping
  select(id, change) %>% # keep useful columns
  right_join(test, by = "id") %>% # join back to get NA rows in the right place
  select(resp, change) # keep useful columns

As a result you'll get:

#    resp change
# 1     9      0
# 2    NA     NA
# 3    NA     NA
# 4    11      1
# 5    NA     NA
# 6    NA     NA
# 7     6     -1
# 8    16      1
# 9    NA     NA
# 10   12     -1
# 11    0     -2
# 12    0      0
# 13    0      0
# 14    0      0
# 15    0      0
# 16   NA     NA
# 17    0      0
# 18   11      1
# 19   NA     NA
# 20   NA     NA
# 21   NA     NA
# 22   NA     NA
# 23   NA     NA
# 24   NA     NA
# 25   14      2

Selecting non `NA` values from duplicate rows with `data.table` -- when having more than one grouping variable

Here are some data.table-based solutions.
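The question's data frame isn't shown here; for a reproducible sketch, assume something like this, with duplicated id/year groups in which type is sometimes NA:

df_id_year_and_type <- data.frame(
  id   = c(1, 1, 2, 3, 3, 4, 4, 5, 6, 6),
  year = c(2002, 2002, 2008, 2010, 2013, 2020, 2020, 2009, 2010, 2012),
  type = c(NA, "A", "B", "D", NA, "C", NA, "A", "B", NA))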

setDT(df_id_year_and_type)

method 1

na.omit(df_id_year_and_type, cols="type") drops NA rows based on column type.
unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE) finds all the groups.
And by joining them (using the last match: mult="last"), we obtain the desired output.

na.omit(df_id_year_and_type, cols = "type")[
  unique(df_id_year_and_type[, .(id, year)], fromLast = TRUE),
  on = c("id", "year"),
  mult = "last"]

#       id  year   type
#    <num> <num> <char>
# 1:     1  2002      A
# 2:     2  2008      B
# 3:     3  2010      D
# 4:     3  2013   <NA>
# 5:     4  2020      C
# 6:     5  2009      A
# 7:     6  2010      B
# 8:     6  2012   <NA>

method 2

df_id_year_and_type[df_id_year_and_type[, .I[which.max(cumsum(!is.na(type)))], .(id, year)]$V1,]

method 3

(likely slower because of [ overhead)

df_id_year_and_type[, .SD[which.max(cumsum(!is.na(type)))], .(id, year)]
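The trick shared by methods 2 and 3 is which.max(cumsum(!is.na(type))): because which.max() returns the first index of the maximum, it picks the position of the last non-NA value in a group, or the first row when the whole group is NA. For example:

type <- c(NA, "A", NA)
cumsum(!is.na(type))            # 0 1 1
which.max(cumsum(!is.na(type))) # 2, the position of the last non-NA value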

Select NA in a data.table in R

Fortunately, DT[is.na(x),] is nearly as fast as (e.g.) DT["a",], so in practice, this may not really matter much:

library(data.table)
library(rbenchmark)

DT = data.table(x=rep(c("a","b",NA),each=3e6), y=c(1,3,6), v=1:9)
setkey(DT,x)

benchmark(DT["a",],
DT[is.na(x),],
replications=20)
# test replications elapsed relative user.self sys.self user.child
# 1 DT["a", ] 20 9.18 1.000 7.31 1.83 NA
# 2 DT[is.na(x), ] 20 10.55 1.149 8.69 1.85 NA

===

Addition from Matthew (won't fit in a comment):

The data above has 3 very large groups, though. So the speed advantage of binary search is dominated here by the time to create the large subset (1/3 of the data is copied).

benchmark(DT["a",],  # repeat select of large subset on my netbook
DT[is.na(x),],
replications=3)
test replications elapsed relative user.self sys.self
DT["a", ] 3 2.406 1.000 2.357 0.044
DT[is.na(x), ] 3 3.876 1.611 3.812 0.056

benchmark(DT["a",which=TRUE], # isolate search time
DT[is.na(x),which=TRUE],
replications=3)
test replications elapsed relative user.self sys.self
DT["a", which = TRUE] 3 0.492 1.000 0.492 0.000
DT[is.na(x), which = TRUE] 3 2.941 5.978 2.932 0.004

As the size of the subset returned decreases (e.g. adding more groups), the difference becomes apparent. Vector scans on a single column aren't too bad, but on 2 or more columns it quickly degrades.
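For instance (a hypothetical two-column lookup, assuming DT is keyed on both x and y):

setkey(DT, x, y)
DT[x == "a" & y == 3] # vector scan over two columns
DT[.("a", 3)]         # binary search on the two-column key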

Maybe NAs should be joinable too. I seem to remember a gotcha with that, though. Here's some history linked from FR#1043 Allow or disallow NA in keys?. It mentions there that NA_integer_ is internally a negative integer. That trips up radix/counting sort (IIRC), resulting in setkey going slower. But it's on the list to revisit.


