What is the difference between <NA> and NA?
When you are dealing with factors, an NA wrapped in angle brackets (<NA>) indicates that it is in fact a missing value. When NA appears without brackets, it is not a missing value, but rather a proper factor level whose label is "NA".
# Note a 'real' NA and a string with the word "NA"
x <- factor(c("hello", NA, "world", "NA"))
x
[1] hello <NA> world NA
Levels: hello NA world <~~ The string appears as a level, the actual NA does not.
as.numeric(x)
[1] 1 NA 3 2 <~~ The string has a numeric value (here, 2, alphabetically); the actual NA's numeric value is just NA
Edit to answer @Arun's question:
R is simply trying to distinguish between a string whose value is the two letters "NA" and an actual missing value, NA. That is the difference you see when displaying df versus df$y. Example:
df <- data.frame(x=1:4, y=c("a", NA_character_, "c", "NA"), stringsAsFactors=FALSE)
Note the two different styles of NA:
> df
x y
1 1 a
2 2 <NA>
3 3 c
4 4 NA
However, if we look at just 'df$y'
[1] "a" NA "c" "NA"
But, if we remove the quotation marks (similar to what we see when printing a data.frame to the console):
print(df$y, quote=FALSE)
[1] a <NA> c NA
And thus, we once again have the distinction of NA via the angled brackets.
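Rather than relying on how the values are printed, the two cases can also be told apart programmatically. The following is a minimal sketch that reuses the factor x from above; the names real_missing and string_na are only illustrative:

```r
# A real missing value and the string "NA" in one factor
x <- factor(c("hello", NA, "world", "NA"))

# is.na() flags only the real missing value
real_missing <- is.na(x)
real_missing
# [1] FALSE  TRUE FALSE FALSE

# Comparing against the label finds the string "NA";
# the comparison itself is NA for the real missing value
string_na <- x == "NA"
string_na
# [1] FALSE    NA FALSE  TRUE

# The string is a level, the missing value is not
"NA" %in% levels(x)   # TRUE
anyNA(levels(x))      # FALSE
```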
What is the difference between na.omit and is.na?
In the call to equal.count, the object na.omit(algae$mn02) will be those values in algae$mn02 that are not NA.
Now, say that you have this code for the plot:
stripplot(season ~ a3|minO2,data=na.omit(algae))
If there are any columns of algae that contain NA in rows where algae$mn02 is not NA, the rows will not line up, and the plot will not be as expected.
Here's an example where this will happen:
algae<- data.frame(a3=c(NA,1,2), mn02=c(1,2,NA))
algae
## a3 mn02
## 1 NA 1
## 2 1 2
## 3 2 NA
Note the difference between the following two expressions:
na.omit(algae)
## a3 mn02
## 2 1 2
algae[!is.na(algae$mn02),]
## a3 mn02
## 1 NA 1
## 2 1 2
The latter will line up with the shingle produced by equal.count(na.omit(algae$mn02)) but the former will not. The first expression here has one fewer row because there is an incomplete case where mn02 is not NA.
Note:
equal.count(na.omit(algae$mn02))
##
## Data:
## [1] 1 2
...
There are two elements here. This expression does not check for NA in columns other than mn02.
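The mismatch can be checked without plotting by comparing lengths. This is a small sketch on the toy algae data frame from above (not the real data set); the variable names n_shingle, n_omit and n_subset are made up for illustration:

```r
# Toy data from the example above
algae <- data.frame(a3 = c(NA, 1, 2), mn02 = c(1, 2, NA))

# Number of values the shingle would be built from
n_shingle <- length(na.omit(algae$mn02))

# Rows kept by each approach
n_omit   <- nrow(na.omit(algae))                # drops rows with an NA in ANY column
n_subset <- nrow(algae[!is.na(algae$mn02), ])   # drops rows with an NA in mn02 only

c(n_shingle = n_shingle, n_omit = n_omit, n_subset = n_subset)
# n_shingle    n_omit  n_subset
#         2         1         2
```

Only the is.na subset has the same number of rows as the conditioning variable, which is what the plot needs.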
Difference between NA_real_ and NaN
Well. First off, remember that NA is an R concept that has no equivalent in C. So, by necessity, NA needs to be represented differently in C. The fact that .Internal(inspect()) does not make this distinction doesn’t mean it isn’t made elsewhere. In fact, it so happens that .Internal(inspect()) uses Rprintf to print the value’s internal double floating point representation. And, indeed, R NAs are encoded as an NaN value in a C floating point type.
Secondly, you observe that “their only difference is the memory address.” So what? At least conceptually, distinct memory addresses are fully sufficient to distinguish NA and NaN; nothing more is required.
But as a matter of fact R distinguishes these values by a different route. This is possible because the IEEE 754 double precision floating point format has multiple different representations of NaN, and R reserves a specific one for NAs:
static double R_ValueOfNA(void)
{
    /* The gcc shipping with Fedora 9 gets this wrong without
     * the volatile declaration. Thanks to Marc Schwartz. */
    volatile ieee_double x;
    x.word[hw] = 0x7ff00000;
    x.word[lw] = 1954;
    return x.value;
}
and:
/* is a value known to be a NaN also an R NA? */
int attribute_hidden R_NaN_is_R_NA(double x)
{
    ieee_double y;
    y.value = x;
    return (y.word[lw] == 1954);
}

int R_IsNA(double x)
{
    return isnan(x) && R_NaN_is_R_NA(x);
}

int R_IsNaN(double x)
{
    return isnan(x) && ! R_NaN_is_R_NA(x);
}
(src/main/arithmetic.c)
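The effect of these C functions is visible from R itself. Here is a rough sketch: the is.na/is.nan results follow directly from R_IsNA and R_IsNaN above, while reading the payload through writeBin is only my own way of peeking at the bytes, not something R's sources do:

```r
# is.na() is TRUE for both NA and NaN; is.nan() is TRUE for NaN only
is.na(NA_real_)    # TRUE
is.nan(NA_real_)   # FALSE
is.na(NaN)         # TRUE
is.nan(NaN)        # TRUE

# Serialise NA_real_ as 8 big-endian bytes and read the last two,
# which form the low end of the lower 32-bit word
bits <- writeBin(NA_real_, raw(), endian = "big")
na_payload <- as.integer(bits[7]) * 256L + as.integer(bits[8])
na_payload         # 1954, the value R_ValueOfNA stores in word[lw]
```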
Difference between first non-NA and last non-NA in each row
A vectorized way would be using max.col, where we get the "first" and "last" non-NA value using the ties.method parameter:
#Get the positions of the "x" columns
x_cols <- grep("^x", names(df))
#Subset the data frame to include only the "x" columns
new_df <- as.data.frame(df[x_cols])
#Get the column number of the first and last non-NA value in each row
first_col <- max.col(!is.na(new_df), ties.method = "first")
last_col <- max.col(!is.na(new_df), ties.method = "last")
#Subtract the first non-NA value from the last one
df$new_calc <- new_df[cbind(seq_len(nrow(df)), last_col)] -
  new_df[cbind(seq_len(nrow(df)), first_col)]
Using apply you could do
x_cols <- grep("^x", names(df))
df$new_calc <- apply(df[x_cols], 1, function(x) {
  new_x <- x[!is.na(x)]
  if (length(new_x) > 0)
    new_x[length(new_x)] - new_x[1L]
  else NA
})
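To see what either approach produces, here is a small self-contained sketch of the max.col variant. The data frame is made up for illustration (the original question's data is not shown here), and x_mat and rows are hypothetical helper names:

```r
# Hypothetical data: an id column plus three "x" columns with gaps
df <- data.frame(id = 1:3,
                 x1 = c(1, NA, NA),
                 x2 = c(4, 2, NA),
                 x3 = c(NA, 7, NA))

x_cols <- grep("^x", names(df))
x_mat  <- as.matrix(df[x_cols])

# Per row, the column index of the first and last non-NA value
first_col <- max.col(!is.na(x_mat), ties.method = "first")
last_col  <- max.col(!is.na(x_mat), ties.method = "last")

# Matrix indexing with (row, column) pairs, then last minus first
rows <- seq_len(nrow(df))
df$new_calc <- x_mat[cbind(rows, last_col)] - x_mat[cbind(rows, first_col)]
df$new_calc
# [1]  3  5 NA
```

A row that is entirely NA ends up as NA, because both indices then point at NA cells.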
Difference between complete.cases and !is.na
For an atomic vector, complete.cases and !is.na will be identical. For more complex objects this will not be the case.
E.g., for a data.frame, is.na.data.frame will return a logical matrix of the same dimension as the input, whereas complete.cases returns a single logical value per row.
test <- data.frame(a, b =1)
is.na(test)
#          a     b
# [1,] FALSE FALSE
# [2,] FALSE FALSE
# [3,] FALSE FALSE
# [4,]  TRUE FALSE
# [5,] FALSE FALSE
# [6,] FALSE FALSE
complete.cases(test)
# [1] TRUE TRUE TRUE FALSE TRUE TRUE
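Both statements can be verified directly. In the sketch below the values of a are an assumption (the original definition is not shown; only the position of the NA matches the output above), and same_on_vector and row_wise are illustrative names:

```r
# Hypothetical vector with a missing value in position 4
a <- c(1, 2, 3, NA, 5, 6)

# On an atomic vector the two agree element by element
same_on_vector <- identical(complete.cases(a), !is.na(a))
same_on_vector
# [1] TRUE

# On a data frame the shapes differ: a matrix versus one value per row
test <- data.frame(a, b = 1)
dim(is.na(test))               # 6 2
length(complete.cases(test))   # 6

# complete.cases() matches "no NA anywhere in the row"
row_wise <- rowSums(is.na(test)) == 0
all(complete.cases(test) == row_wise)
# [1] TRUE
```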