How Does R Represent NA Internally

How does R represent NA internally?

R represents NA_real_ as one particular NaN value as defined for IEEE floats. Inf and -Inf share the same all-ones exponent but are not NaNs. We can use a simple C++ function to make this explicit:

Rcpp::cppFunction('void print_hex(double x) {
  uint64_t y;
  static_assert(sizeof x == sizeof y, "Size does not match!");
  std::memcpy(&y, &x, sizeof y);  // copy the raw bits without any conversion
  Rcpp::Rcout << std::hex << y << std::dec << std::endl;  // reset to decimal afterwards
}', plugins = "cpp11", includes = c("#include <cstdint>", "#include <cstring>"))
print_hex(NA_real_)
#> 7ff00000000007a2
print_hex(Inf)
#> 7ff0000000000000
print_hex(-Inf)
#> fff0000000000000

The exponent (bits 2 through 12) is all ones in every case. With an all-ones exponent, a zero mantissa encodes an infinity and a non-zero mantissa encodes a NaN: for Inf the mantissa is all zero, whereas for NA_real_ it is not. Here are some source code references.
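
The same bits can be checked from base R without compiling anything. The following is only a sketch, assuming a platform with IEEE 754 doubles; low_word() is a hypothetical helper introduced here for illustration, and 1954 is the payload that R's sources store in the low word of NA_real_.

```r
# Return the low 32 bits of a double as an integer.
# low_word() is a hypothetical helper, not part of R.
low_word <- function(x) {
  bytes <- writeBin(x, raw(), endian = "big")  # 8 bytes, most significant first
  readBin(bytes[5:8], what = "integer", size = 4L, endian = "big")
}

low_word(NA_real_)  # 1954: the payload that marks this NaN as NA
low_word(Inf)       # 0: the mantissa of Inf is all zero
```

Only the low word is inspected here because that is the part R itself compares when it tells NA apart from other NaN values.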

Internal representation of int NA

R uses the minimum integer value to represent NA. With 4-byte integers, the valid values are usually -2,147,483,648 to 2,147,483,647, but in R the lower bound is reserved:

> .Machine$integer.max
[1] 2147483647
> -.Machine$integer.max
[1] -2147483647
> -.Machine$integer.max - 1L
[1] NA
Warning message:
In -.Machine$integer.max - 1L : NAs produced by integer overflow

Also,

> .Internal(inspect(NA_integer_))
@7fe69bbb79c0 13 INTSXP g0c1 [NAM(7)] (len=1, tl=0) -2147483648
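
The sentinel can also be seen byte by byte from base R. This is a small sketch, assuming the usual 4-byte two's-complement integers; the variable names are made up for the example.

```r
# Bytes of NA_integer_, most significant first
na_bytes <- writeBin(NA_integer_, raw(), endian = "big")
na_bytes                                 # 80 00 00 00, i.e. -2^31

# The smallest integer R can actually store sits one above that sentinel
min_int <- -.Machine$integer.max
is.na(min_int)                           # FALSE
is.na(suppressWarnings(min_int - 1L))    # TRUE: the overflow yields NA
```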

Difference between NA_real_ and NaN

Well. First off, remember that NA is an R concept that has no equivalent in C. So, by necessity, NA needs to be represented differently in C. The fact that .Internal(inspect()) does not make this distinction doesn’t mean it isn’t made elsewhere. In fact, it so happens that .Internal(inspect()) uses Rprintf to print the value’s internal double floating point representation. And, indeed, R NAs are encoded as an NaN value in a C floating point type.

Secondly, you observe that “their only difference is the memory address.” — So what? At least conceptually, distinct memory addresses are fully sufficient to distinguish NA and NaN, nothing more is required.

But as a matter of fact R distinguishes these values by a different route. This is possible because the IEEE 754 double precision floating point format has multiple different representations of NaN, and R reserves a specific one for NAs:

static double R_ValueOfNA(void)
{
/* The gcc shipping with Fedora 9 gets this wrong without
* the volatile declaration. Thanks to Marc Schwartz. */
volatile ieee_double x;
x.word[hw] = 0x7ff00000;
x.word[lw] = 1954;
return x.value;
}

and:

/* is a value known to be a NaN also an R NA? */
int attribute_hidden R_NaN_is_R_NA(double x)
{
ieee_double y;
y.value = x;
return (y.word[lw] == 1954);
}

int R_IsNA(double x)
{
return isnan(x) && R_NaN_is_R_NA(x);
}

int R_IsNaN(double x)
{
return isnan(x) && ! R_NaN_is_R_NA(x);
}

(src/main/arithmetic.c)
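
The distinction is visible from the R side as well. As a quick illustration (not taken from the R sources): per ?is.na, is.na() is TRUE for both NA and NaN, while is.nan() is TRUE for NaN alone, so combining the two isolates a genuine NA.

```r
x <- c(NA_real_, NaN, Inf, 1)

is.na(x)                # TRUE  TRUE FALSE FALSE
is.nan(x)               # FALSE TRUE FALSE FALSE
is.na(x) & !is.nan(x)   # TRUE FALSE FALSE FALSE: NA only
```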

R value of as.character(NA)

According to Rinternals.h, NA_STRING is a CHARSXP, which is a "scalar string type (internal only)".
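
From the R prompt this shows up as NA_character_, which is a different thing from the two-letter string "NA". A short check, with na_chr as an illustrative name:

```r
na_chr <- as.character(NA)

identical(na_chr, NA_character_)  # TRUE
typeof(na_chr)                    # "character"
is.na(na_chr)                     # TRUE
is.na("NA")                       # FALSE: ordinary text, not a missing value
```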

internal NA time series, zoo, R

I gather that every day is represented in your data including weekdays and weekends but the days for which no data is present are NA (as opposed to not being present at all). In the future please provide some test data for better clarity.

Aside from your solution, if you have enough data you could perform an ar on weekly data only by extracting the last non-missing value on or before Friday:

library(zoo)

# test data
library(chron) # is.weekend
z <- zoo(100:130, as.Date("2000-01-01") + 0:30)
z[is.weekend(time(z))] <- NA

# extract Fridays
zfri <- na.locf(z)[format(time(z), "%w") == 5]

(If there are no missing Fridays it can be shortened by replacing na.locf(z) with z.)

Another possibility is to use 1, 2, ... for the times but give them names in which case you could always find out what date a point belongs to by checking the name of its time.

z1 <- na.omit(z)
time(z1) <- setNames(seq_along(z1), time(z1))

What does the . (dot) mean in ~replace_na(., 0)

This one-sided formula is called a lambda-function.

It is a faster way to write simple anonymous functions, using the internal variable as . or .x. I personally prefer .x, as . is already used by dplyr as the left-hand variable of the pipe, which might cause confusion.

In this context (inside mutate_at), ~replace_na(., 0) and ~replace_na(.x, 0) are the same as function(x) replace_na(x, 0).

You can try this with the same result:

df <- df %>% mutate_at(vars(var1, var2, var5, var6), function(x) replace_na(x, 0))

Besides, please note that mutate_at is superseded as of dplyr 1.0. You might want to use the new syntax with the across function:

df <- df %>% mutate(across(c(var1, var2, var5, var6), ~replace_na(.x, 0)))
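
To see why the formula can stand in for a function, it may help to look at it without the tidyverse loaded. The sketch below uses base R only; replace_zero() and the small df are hypothetical stand-ins, and the formula is never evaluated, so tidyr is not needed to run it.

```r
# A one-sided formula is plain, unevaluated R code until something converts it
f <- ~ replace_na(.x, 0)
class(f)    # "formula"
length(f)   # 2: the `~` and a right-hand side, no left-hand side

# Base R equivalent of the function the lambda stands for (hypothetical helper)
replace_zero <- function(x) {
  x[is.na(x)] <- 0
  x
}

df <- data.frame(var1 = c(1, NA, 3), var2 = c(NA, 5, 6), id = c("a", "b", "c"))
cols <- c("var1", "var2")
df[cols] <- lapply(df[cols], replace_zero)  # same effect as the mutate() call
df
```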

`as.na` function

Why not use is.na<- as directed in ?is.na?

> l <- list(integer(10), numeric(10), character(10), logical(10), complex(10))
> str(lapply(l, function(x) {is.na(x) <- seq_along(x); x}))
List of 5
$ : int [1:10] NA NA NA NA NA NA NA NA NA NA
$ : num [1:10] NA NA NA NA NA NA NA NA NA NA
$ : chr [1:10] NA NA NA NA ...
$ : logi [1:10] NA NA NA NA NA NA ...
$ : cplx [1:10] NA NA NA ...
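
On a smaller scale, and assuming nothing beyond base R, the replacement form accepts any valid index and keeps the vector's type:

```r
x <- c(a = 1.5, b = 2.5, c = 3.5)
is.na(x) <- "b"       # index by name; positions and logical vectors also work
x

y <- c("u", "v", "w")
is.na(y) <- c(1, 3)
typeof(y)             # still "character"
```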

Understanding how .Internal C functions are handled in R

CAR and CDR are how you access pairlist objects, as explained in section 2.1.11 of R Language Definition. CAR contains the first element, and CDR contains the remaining elements. An example is given in section 5.10.2 of Writing R Extensions:

#include <R.h>
#include <Rinternals.h>

SEXP convolveE(SEXP args)
{
int i, j, na, nb, nab;
double *xa, *xb, *xab;
SEXP a, b, ab;

a = PROTECT(coerceVector(CADR(args), REALSXP));
b = PROTECT(coerceVector(CADDR(args), REALSXP));
...
}
/* The macros: */
first = CADR(args);
second = CADDR(args);
third = CADDDR(args);
fourth = CAD4R(args);
/* provide convenient ways to access the first four arguments.
* More generally we can use the CDR and CAR macros as in: */
args = CDR(args); a = CAR(args);
args = CDR(args); b = CAR(args);

There's also a TAG macro to access the names given to the actual arguments.

checkArity ensures that the number of arguments passed to the function is correct. args are the actual arguments passed to the function. op is the offset pointer "used for C functions that deal with more than one R function" (quoted from src/main/names.c, which also contains the table showing the offset and arity for each function).

For example, do_colsum handles col/rowSums and col/rowMeans.

/* Table of  .Internal(.) and .Primitive(.)  R functions
* ===== ========= ==========
* Each entry is a line with
*
* printname c-entry offset eval arity pp-kind precedence rightassoc
* --------- ------- ------ ---- ----- ------- ---------- ----------
{"colSums", do_colsum, 0, 11, 4, {PP_FUNCALL, PREC_FN, 0}},
{"colMeans", do_colsum, 1, 11, 4, {PP_FUNCALL, PREC_FN, 0}},
{"rowSums", do_colsum, 2, 11, 4, {PP_FUNCALL, PREC_FN, 0}},
{"rowMeans", do_colsum, 3, 11, 4, {PP_FUNCALL, PREC_FN, 0}},

Note that arity in the above table is 4 because (even though rowSums et al only have 3 arguments) do_colsum has 4, which you can see from the .Internal call in rowSums:

> rowSums
function (x, na.rm = FALSE, dims = 1L)
{
if (is.data.frame(x))
x <- as.matrix(x)
if (!is.array(x) || length(dn <- dim(x)) < 2L)
stop("'x' must be an array of at least two dimensions")
if (dims < 1L || dims > length(dn) - 1L)
stop("invalid 'dims'")
p <- prod(dn[-(1L:dims)])
dn <- dn[1L:dims]
z <- if (is.complex(x))
.Internal(rowSums(Re(x), prod(dn), p, na.rm)) + (0+1i) *
.Internal(rowSums(Im(x), prod(dn), p, na.rm))
else .Internal(rowSums(x, prod(dn), p, na.rm))
if (length(dn) > 1L) {
dim(z) <- dn
dimnames(z) <- dimnames(x)[1L:dims]
}
else names(z) <- dimnames(x)[[1L]]
z
}
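
The shared entry point can be glimpsed from the R side too. This is just an illustrative check, assuming a stock R installation; fns, uses_internal and m are names made up for the example.

```r
fns <- list(colSums = colSums, colMeans = colMeans,
            rowSums = rowSums, rowMeans = rowMeans)

# Each wrapper should contain a .Internal() call in its deparsed source
uses_internal <- vapply(
  fns,
  function(f) any(grepl(".Internal(", deparse(f), fixed = TRUE)),
  logical(1)
)
uses_internal

# Sums and means agree up to the divisor, as expected from one shared routine
m <- matrix(1:6, nrow = 2)
all.equal(rowMeans(m), rowSums(m) / ncol(m))
```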

dealing with NA in seasonal cycle analysis R

My first solution was to calculate the seasonal cycle manually, convert the series to a year-by-month table (df below) to subtract the vector, and then transform back.

# seasonal cycle
scycle=tapply(c2,cycle(c2),mean,na.rm=T)
# converting to df
df=tapply(c2, list(year=floor(time(c2)), month = cycle(c2)), c)
# subtract seasonal cycle
for (i in 1:nrow(df)){df[i,]=df[i,]-scycle}
# convert back to timeseries
anomco2=ts(c(t(df)),start=start(c2),freq=12)

Not very pretty, and not very efficient either.
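
A shorter variant is possible with plain indexing. This is a sketch of an alternative, not the approach used above, and it assumes c2 is a monthly ts object such as the built-in co2 series with some values set to NA; anom is an illustrative name.

```r
c2 <- co2                        # built-in monthly CO2 series
c2[c2 > 330 & c2 < 350] <- NA    # knock out a chunk of values

# mean of each calendar month, ignoring missing values
scycle <- tapply(c2, cycle(c2), mean, na.rm = TRUE)

# cycle(c2) gives the month index (1..12) of every observation, so indexing
# scycle with it expands the 12 means to the full length of the series
anom <- c2 - as.numeric(scycle[cycle(c2)])
```

Because the subtraction is done on the ts object itself, the result keeps its time attributes and no conversion back is needed.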

The comment from missuse led me to another question, "Seasonal decompose of monthly data including NA in r", which I had missed when posting this near-duplicate. It suggested the package zoo, which seems to work really well for additive series:

library(zoo)
c2=co2
c2[c2>330&c2<350]=NA
d=decompose(na.StructTS(c2))
plot(co2)
lines(d$x,col="red")

shows that the series is very well reconstructed through the missing period.

Figure: the black line shows the CO2 series with the missing chunk, and the red line is the reconstructed series.

The output of decompose has the trend and seasonal cycle available. I wish I could transfer my bounty to user https://stackoverflow.com/users/516548/g-grothendieck for this helpful response. Thanks to user missuse too.

However, if the missing portion is at the end of the series, the software has to extrapolate the trend and has more difficulties. The original series (in black) maintains the trend, while the trend is smaller in the reconstructed series (red):

c2=co2
c2[c2>350]=NA
d=decompose(na.StructTS(c2))
plot(co2)
lines(d$x,col="red")

Figure: extrapolation of data using zoo.

Lastly, if instead the missing portion is at the start of the series, the software is unable to extrapolate backwards in time and throws an error... I feel another SO question coming on...

c2=co2
c2[c2<330]=NA
d=decompose(na.StructTS(c2))

Error in StructTS(y) :
the first value of the time series must not be missing

How to see the source code of R .Internal or .Primitive function?

The R source code of pnorm is:

function (q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) 
.Call(C_pnorm, q, mean, sd, lower.tail, log.p)

So, technically speaking, typing "pnorm" does show you the source code. More usefully, though: the guts of pnorm are coded in C, so the advice in the previous question, "view source code in R", is only peripherally useful (most of it concentrates on functions hidden in namespaces etc.).

Uwe Ligges's article in R news, Accessing the Sources (p. 43), is a good general reference. From that document:

When looking at R source code, sometimes calls to one of the following functions show up: .C(), .Call(), .Fortran(), .External(), or .Internal() and .Primitive(). These functions are calling entry points in compiled code such as shared objects, static libraries or dynamic link libraries. Therefore, it is necessary to look into the sources of the compiled code, if complete understanding of the code is required.
...
The first step is to look up the entry point in file ‘$R_HOME/src/main/names.c’, if the calling R function is either .Primitive() or .Internal(). This is done in the following example for the code implementing the ‘simple’ R function sum().

(Emphasis added because the precise function you asked about (sum) is covered in Ligges's article.)

Depending on how seriously you want to dig into the code, it may be worth downloading and unpacking the source code as Ligges suggests (for example, then you can use command-line tools such as grep to search through the source code). For more casual inspection, you can view the sources online via the R Subversion server or Winston Chang's github mirror (links here are specifically to src/nmath/pnorm.c). (Guessing the right place to look, src/nmath/pnorm.c, takes some familiarity with the structure of the R source code.)

mean and sum are both implemented in summary.c.
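
Before going to the C sources, it can help to check from the prompt which kind of function you are dealing with. A small sketch, assuming a stock R installation:

```r
is.primitive(sum)     # TRUE: look it up in names.c
is.primitive(pnorm)   # FALSE: an ordinary closure
body(pnorm)[[1]]      # .Call, so the work happens in compiled code

# mean() dispatches to methods; the default method is where .Internal() appears
any(grepl(".Internal(mean(", deparse(mean.default), fixed = TRUE))
```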


