How does R represent NA internally?
R uses a NaN value, as defined for IEEE 754 doubles, to represent NA_real_. We can make this explicit, and compare it with Inf and -Inf, using a simple C++ function:
Rcpp::cppFunction('void print_hex(double x) {
    uint64_t y;
    static_assert(sizeof x == sizeof y, "Size does not match!");
    std::memcpy(&y, &x, sizeof y);
    Rcpp::Rcout << std::hex << y << std::endl;
}', plugins = "cpp11", includes = "#include <cstdint>\n#include <cstring>")
print_hex(NA_real_)
#> 7ff80000000007a2
print_hex(Inf)
#> 7ff0000000000000
print_hex(-Inf)
#> fff0000000000000
The exponent (bits 2 through 12) is all ones; this is the definition of an IEEE NaN. But while the mantissa is all zeros for Inf, this is not the case for NA_real_, which carries the payload 0x7a2 (1954 in decimal) in its low word. Here are some source code references.
Internal representation of int NA
R uses the minimum integer value to represent NA. With 4-byte integers, the representable values would usually be -2,147,483,648 to 2,147,483,647, but in R:
> .Machine$integer.max
[1] 2147483647
> -.Machine$integer.max
[1] -2147483647
> -.Machine$integer.max - 1L
[1] NA
Warning message:
In -.Machine$integer.max - 1L : NAs produced by integer overflow
Also,
> .Internal(inspect(NA_integer_))
@7fe69bbb79c0 13 INTSXP g0c1 [NAM(7)] (len=1, tl=0) -2147483648
Difference between NA_real_ and NaN
Well. First off, remember that NA is an R concept that has no equivalent in C. So, by necessity, NA needs to be represented differently in C. The fact that .Internal(inspect()) does not make this distinction doesn't mean it isn't made elsewhere. In fact, it so happens that .Internal(inspect()) uses Rprintf to print the value's internal double floating-point representation. And, indeed, R NAs are encoded as a NaN value in a C floating-point type.
Secondly, you observe that "their only difference is the memory address." So what? At least conceptually, distinct memory addresses are fully sufficient to distinguish NA and NaN; nothing more is required.
But as a matter of fact R distinguishes these values by a different route. This is possible because the IEEE 754 double precision floating point format has multiple different representations of NaN, and R reserves a specific one for NAs:
static double R_ValueOfNA(void)
{
    /* The gcc shipping with Fedora 9 gets this wrong without
     * the volatile declaration. Thanks to Marc Schwartz. */
    volatile ieee_double x;
    x.word[hw] = 0x7ff00000;
    x.word[lw] = 1954;
    return x.value;
}
and:
/* is a value known to be a NaN also an R NA? */
int attribute_hidden R_NaN_is_R_NA(double x)
{
    ieee_double y;
    y.value = x;
    return (y.word[lw] == 1954);
}

int R_IsNA(double x)
{
    return isnan(x) && R_NaN_is_R_NA(x);
}

int R_IsNaN(double x)
{
    return isnan(x) && ! R_NaN_is_R_NA(x);
}
(src/main/arithmetic.c)
R value of as.character(NA)
According to Rinternals.h, NA_STRING is a CHARSXP, which is a "scalar string type (internal only)".
internal NA time series, zoo, R
I gather that every day is represented in your data, including weekdays and weekends, but the days for which no data is present are NA (as opposed to not being present at all). In the future, please provide some test data for better clarity.
Aside from your solution, if you have enough data you could fit an ar model on weekly data only, by extracting the last non-missing value on or before each Friday:
library(zoo)
# test data
library(chron) # is.weekend
z <- zoo(100:130, as.Date("2000-01-01") + 0:30)
z[is.weekend(time(z))] <- NA
# extract Fridays
zfri <- na.locf(z)[format(time(z), "%w") == 5]
(If there are no missing Fridays, this can be shortened by replacing na.locf(z) with z.)
Another possibility is to use 1, 2, ... for the times but give them names in which case you could always find out what date a point belongs to by checking the name of its time.
z1 <- na.omit(z)
time(z1) <- setNames(seq_along(z1), time(z1))
What does the . (dot) mean in ~replace_na(., 0)
This one-sided formula is purrr-style lambda notation: a shorthand for writing simple anonymous functions, with the argument referred to as . or .x. I personally prefer .x, as . is already used by the magrittr pipe for its left-hand side, which might cause confusion.
In this context (inside mutate_at), ~replace_na(., 0) and ~replace_na(.x, 0) are the same as function(x) replace_na(x, 0).
You can try this with the same result:
df <- df %>% mutate_at(vars(var1, var2, var5, var6), function(x) replace_na(x, 0))
Besides, please note that mutate_at is superseded as of dplyr 1.0. You might want to use the new syntax with the across function:
df <- df %>% mutate(across(c(var1, var2, var5, var6), ~replace_na(.x, 0)))
`as.na` function
Why not use is.na<- as directed in ?is.na?
> l <- list(integer(10), numeric(10), character(10), logical(10), complex(10))
> str(lapply(l, function(x) {is.na(x) <- seq_along(x); x}))
List of 5
$ : int [1:10] NA NA NA NA NA NA NA NA NA NA
$ : num [1:10] NA NA NA NA NA NA NA NA NA NA
$ : chr [1:10] NA NA NA NA ...
$ : logi [1:10] NA NA NA NA NA NA ...
$ : cplx [1:10] NA NA NA ...
Understanding how .Internal C functions are handled in R
CAR and CDR are how you access pairlist objects, as explained in section 2.1.11 of the R Language Definition. CAR contains the first element, and CDR contains the remaining elements. An example is given in section 5.10.2 of Writing R Extensions:
#include <R.h>
#include <Rinternals.h>
SEXP convolveE(SEXP args)
{
    int i, j, na, nb, nab;
    double *xa, *xb, *xab;
    SEXP a, b, ab;
    a = PROTECT(coerceVector(CADR(args), REALSXP));
    b = PROTECT(coerceVector(CADDR(args), REALSXP));
    ...
}
/* The macros: */
first = CADR(args);
second = CADDR(args);
third = CADDDR(args);
fourth = CAD4R(args);
/* provide convenient ways to access the first four arguments.
* More generally we can use the CDR and CAR macros as in: */
args = CDR(args); a = CAR(args);
args = CDR(args); b = CAR(args);
There's also a TAG macro to access the names given to the actual arguments.
checkArity ensures that the number of arguments passed to the function is correct. args are the actual arguments passed to the function. op is an offset pointer, "used for C functions that deal with more than one R function" (quoted from src/main/names.c, which also contains the table showing the offset and arity for each function).
For example, do_colsum handles col/rowSums and col/rowMeans.
/* Table of .Internal(.) and .Primitive(.) R functions
* ===== ========= ==========
* Each entry is a line with
*
* printname c-entry offset eval arity pp-kind precedence rightassoc
* --------- ------- ------ ---- ----- ------- ---------- ----------
{"colSums", do_colsum, 0, 11, 4, {PP_FUNCALL, PREC_FN, 0}},
{"colMeans", do_colsum, 1, 11, 4, {PP_FUNCALL, PREC_FN, 0}},
{"rowSums", do_colsum, 2, 11, 4, {PP_FUNCALL, PREC_FN, 0}},
{"rowMeans", do_colsum, 3, 11, 4, {PP_FUNCALL, PREC_FN, 0}},
Note that arity in the above table is 4 because (even though rowSums et al. only have 3 arguments) do_colsum has 4, which you can see from the .Internal call in rowSums:
> rowSums
function (x, na.rm = FALSE, dims = 1L)
{
    if (is.data.frame(x))
        x <- as.matrix(x)
    if (!is.array(x) || length(dn <- dim(x)) < 2L)
        stop("'x' must be an array of at least two dimensions")
    if (dims < 1L || dims > length(dn) - 1L)
        stop("invalid 'dims'")
    p <- prod(dn[-(1L:dims)])
    dn <- dn[1L:dims]
    z <- if (is.complex(x))
        .Internal(rowSums(Re(x), prod(dn), p, na.rm)) + (0+1i) *
            .Internal(rowSums(Im(x), prod(dn), p, na.rm))
    else .Internal(rowSums(x, prod(dn), p, na.rm))
    if (length(dn) > 1L) {
        dim(z) <- dn
        dimnames(z) <- dimnames(x)[1L:dims]
    }
    else names(z) <- dimnames(x)[[1L]]
    z
}
dealing with NA in seasonal cycle analysis R
My first solution simply calculates the seasonal cycle manually, converts to a data frame to subtract the vector, and then transforms back:
# seasonal cycle
scycle <- tapply(c2, cycle(c2), mean, na.rm = TRUE)
# convert to a data frame
df <- tapply(c2, list(year = floor(time(c2)), month = cycle(c2)), c)
# subtract seasonal cycle
for (i in 1:nrow(df)) df[i, ] <- df[i, ] - scycle
# convert back to time series
anomco2 <- ts(c(t(df)), start = start(c2), freq = 12)
Not very pretty, and not very efficient either.
The comment by missuse led me to a near-duplicate question I had missed, "Seasonal decompose of monthly data including NA in r", which suggested the package zoo. It seems to work really well for additive series:
library(zoo)
c2 <- co2
c2[c2 > 330 & c2 < 350] <- NA
d <- decompose(na.StructTS(c2))
plot(co2)
lines(d$x, col = "red")
shows that the series is very well reconstructed through the missing period.
The output of decompose has the trend and seasonal cycle available. I wish I could transfer my bounty to user https://stackoverflow.com/users/516548/g-grothendieck for this helpful response. Thanks to user missuse too.
However, if the missing portion is at the end of the series, the software has to extrapolate the trend and has more difficulty. The original series (in black) maintains the trend, while the reconstructed series (red) underestimates it:
c2 <- co2
c2[c2 > 350] <- NA
d <- decompose(na.StructTS(c2))
plot(co2)
lines(d$x, col = "red")
Lastly, if instead the missing portion is at the start of the series, the software is unable to extrapolate backwards in time and throws an error... I feel another SO question coming on...
c2 <- co2
c2[c2 < 330] <- NA
d <- decompose(na.StructTS(c2))
Error in StructTS(y) :
the first value of the time series must not be missing
How to see the source code of R .Internal or .Primitive function?
The R source code of pnorm
is:
function (q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
.Call(C_pnorm, q, mean, sd, lower.tail, log.p)
So, technically speaking, typing "pnorm" does show you the source code. However, more usefully: the guts of pnorm are coded in C, so the advice in the previous question, "view source code in R", is only peripherally useful (most of it concentrates on functions hidden in namespaces etc.).
Uwe Ligges's article in R news, Accessing the Sources (p. 43), is a good general reference. From that document:
When looking at R source code, sometimes calls to one of the following functions show up: .C(), .Call(), .Fortran(), .External(), or .Internal() and .Primitive(). These functions are calling entry points in compiled code such as shared objects, static libraries or dynamic link libraries. Therefore, it is necessary to look into the sources of the compiled code, if complete understanding of the code is required.
...
The first step is to look up the entry point in file '$R_HOME/src/main/names.c', if the calling R function is either .Primitive() or .Internal(). This is done in the following example for the code implementing the 'simple' R function sum().
(Emphasis added because the precise function you asked about, sum, is covered in Ligges's article.)
Depending on how seriously you want to dig into the code, it may be worth downloading and
unpacking the source code as Ligges suggests (for example, then you can use command-line tools
such as grep
to search through the source code). For more casual inspection, you can view
the sources online via the R Subversion server or Winston Chang's github mirror (links here are specifically to src/nmath/pnorm.c
). (Guessing the right place to look, src/nmath/pnorm.c
, takes some familiarity with the structure of the R source code.)
mean and sum are both implemented in summary.c.