What is integer overflow in R and how can it happen?
You can answer many of your questions by reading the help page ?integer. It says:
R uses 32-bit integers for integer vectors, so the range of
representable integers is restricted to about +/-2*10^9.
Expanding to larger integers is under consideration by R Core but it's not going to happen in the near future.
If you want "bignum" capacity, then install Martin Maechler's Rmpfr package. I recommend the 'Rmpfr' package because of its author's reputation: Martin Maechler is also heavily involved in the development of the Matrix package, and is a member of R Core as well. There are alternatives, including arbitrary-precision packages such as 'gmp' and 'Brobdingnag', and the 'Ryacas' package (the latter also offers a symbolic math interface).
Next, to respond to the critical comments in the answer you linked to, and how to assess the relevance to your work, consider this: If there were the same statistical functionality available in one of those "modern" languages as there is in R, you would probably see a user migration in that direction. But I would say that migration, and certainly growth, is in the R direction at the moment. R was built by statisticians for statistics.
There was at one time a Lisp variant with a statistics package, Xlisp-Stat, but its main developer and proponent is now a member of R Core. On the other hand, one of the earliest R developers, Ross Ihaka, has suggested working toward development in a Lisp-like language. There is also a compiled language called Clojure (pronounced as English speakers would say "closure") with an experimental R interface, Rincanter.
Update:
The newer versions of R (3.0.0+) have 53-bit integers of a sort, stored in the mantissa of a numeric. When an "integer" vector element is assigned a value in excess of .Machine$integer.max, the entire vector is coerced to "numeric", a.k.a. "double". The maximum value for integers remains as it was; however, integer vectors may be coerced to doubles to preserve accuracy in cases that would formerly have generated an overflow. Unfortunately, the length of lists, matrix and array dimensions, and vectors is still capped at integer.max.
When reading in large values from files, it is probably safer to read them in as character and manipulate them afterwards. If coercion produces NA values, there will be a warning.
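A minimal sketch of that approach (the oversized value here is just an illustration): read the field as character, then decide how to convert.

```r
## A value too large for a 32-bit integer, read in as character
x_chr <- "3000000000"

## Coercing straight to integer overflows and yields NA
## (a coercion warning is also emitted)
x_int <- as.integer(x_chr)
x_int
#[1] NA

## Coercing to double preserves the value (within double precision)
x_num <- as.numeric(x_chr)
x_num
#[1] 3e+09
```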
R: simple multiplication causes integer overflow
Hopefully a graphic representation of what is happening:

2614 * 1456000
#[1] 3805984000

## Numeric literals are doubles by default, so no overflow occurs
class( 2614 * 1456000 )
#[1] "numeric"

## Force the numbers to be integers
2614L * 1456000L
#[1] NA
#Warning message:
#In 2614L * 1456000L : NAs produced by integer overflow

## The result is still of class integer, with an overflow warning
class( 2614L * 1456000L )
#[1] "integer"
#Warning message:
#In 2614L * 1456000L : NAs produced by integer overflow
2614 * 1456000 is numeric because both operands are of class numeric. The overflow occurs because both nrow and length return integers, and hence the result is an integer, but it exceeds the maximum value representable by the integer class (about +/-2*10^9). A numeric, or double, can hold values with magnitudes from roughly 2e-308 up to 2e+308. So to solve your problem, just use as.numeric(length(A)) or as.double(length(A)).
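A small illustration of why the cast helps (the sizes here are made up): length() returns an integer, and integer * integer stays integer.

```r
A <- numeric(50000)

## integer * integer overflows past about 2.1e9
length(A) * 50000L
#[1] NA
#Warning message:
#In length(A) * 50000L : NAs produced by integer overflow

## Casting one operand to double keeps the full product
as.numeric(length(A)) * 50000L
#[1] 2.5e+09
```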
What is an integer overflow error?
Integer overflow occurs when you try to express a number that is larger than the largest number the integer type can handle.
If you try to express the number 300 in one byte, you have an integer overflow (maximum is 255). 100,000 in two bytes is also an integer overflow (65,535 is the maximum).
You need to care about it because mathematical operations won't behave as you expect. A + B doesn't actually equal the sum of A and B if you have an integer overflow.
You avoid it by not creating the condition in the first place (usually either by choosing your integer type to be large enough that you won't overflow, or by limiting user input so that an overflow doesn't occur).
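In R terms, the largest representable integer is .Machine$integer.max, and "choosing a larger type" means making at least one operand a double; a quick sketch:

```r
.Machine$integer.max
#[1] 2147483647

## Pure integer arithmetic overflows at the boundary
.Machine$integer.max + 1L
#[1] NA
#Warning message:
#In .Machine$integer.max + 1L : NAs produced by integer overflow

## A double operand avoids the condition entirely
.Machine$integer.max + 1
#[1] 2147483648
```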
Error "integer overflow - use sum(as.numeric(.))" returning NA in the gapminder dataset
Dividing by 1 coerces it to numeric.
library(gapminder)
class(sum(gapminder$pop))
[1] "integer"
Warning message:
In sum(gapminder$pop) : integer overflow - use sum(as.numeric(.))
class(sum(gapminder$pop/1))
[1] "numeric"
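The same behavior can be reproduced without the gapminder package; a minimal sketch with a made-up integer vector:

```r
pop <- rep(1500000000L, 3)   # large integer values, like population counts

## Summing integers overflows and returns NA, with a warning
sum(pop)
#[1] NA

## Coercing to double first, as the warning suggests, gives the true sum
sum(as.numeric(pop))
#[1] 4.5e+09
```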
Benefits of using integer values for constants rather than numeric values (e.g. 1L vs 1) in R
These are some of the use cases in which I explicitly use the L suffix when declaring constants. Of course these are not strictly "canonical" (or the only ones), but maybe they give you an idea of the rationale behind it. I added, for each case, a "necessary" flag; you will see that the necessary ones arise only when interfacing other languages (like C).
- Logical type conversion (not necessary)
Instead of using a classic as.integer, I add 0L to a logical vector to make it integer. Of course you could just add 0, but that would require more memory (typically 8 bytes per element instead of 4) and a conversion.
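A small illustration of the trick:

```r
flags <- c(TRUE, FALSE, TRUE)

## Adding 0L keeps the result integer (4 bytes per element)
class(flags + 0L)
#[1] "integer"

## Adding plain 0 coerces to double (8 bytes per element)
class(flags + 0)
#[1] "numeric"
```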
- Manipulating the result of a function that returns integer (not necessary)
Say, for instance, that you want to retrieve the elements of the vector that follow an NA. You could use:
which(is.na(vec)) + 1L
Since which returns an integer, adding 1L preserves the type and avoids an implicit conversion. Nothing bad happens if you omit the L, since it's just a small optimization. The same applies to match, for instance: if you want to post-process the result of such a function, it's a good habit to preserve the type where possible.
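A short example (the vector is made up):

```r
vec <- c(10, NA, 30, NA, 50)

## Positions of the elements right after each NA
idx <- which(is.na(vec)) + 1L
idx
#[1] 3 5

## The type is preserved: still integer
is.integer(idx)
#[1] TRUE
```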
- Interfacing C (necessary)
From ?integer:
Integer vectors exist so that data can be passed to C or Fortran
code which expects them, and so that (small) integer data can be
represented exactly and compactly.
C is much stricter regarding data types. This implies that, if you pass a vector to a C function, you cannot rely on C to do the conversions. Say you want to replace the elements after an NA with some value, say 42. You find the positions of the NA values at the R level (as we did before with which) and then pass the original vector and the vector of indices to C. The C function will look like:
SEXP replaceAfterNA(SEXP X, SEXP IND) {
    ...
    int i, l = length(IND);
    int *ind = INTEGER(IND);
    ...
    for (i = 0; i < l; i++) {
        // make the replacement here
    }
    return X;
}
and from the R side:
...
ind <- which(is.na(x)) + 1L
.Call("replaceAfterNA", x, ind)
...
If you omit the L in the first line above, you will receive an error like:

INTEGER() cannot be applied to double vectors

since C is expecting an integer type.
- Interfacing Java (necessary)
Same as before: if you use the rJava package and want R to call your own custom Java classes and methods, you have to be sure that an integer is passed when the Java method requires an integer. I'm not adding a specific example here, but it should be clear why you may want to use the L suffix for constants in these cases.
Addendum
The previous cases were about when you may want to use L. Though I guess it is much less common, it might be useful to add a case in which you don't want the L. This may arise if there is danger of integer overflow: the *, + and - operators preserve the type if both operands are integer. For example:
# this overflows
31381938L * 3231L
#[1] NA
#Warning message:
#In 31381938L * 3231L : NAs produced by integer overflow

# this does not
31381938L * 3231
#[1] 1.01395e+11
So, if you are doing operations on an integer variable that might overflow, it's important to cast it to double to avoid any risk. Adding or subtracting a constant without the L is as good an occasion as any to make the cast.
R is duplicating large integers when reading in a data frame
Try using integer64:
library(bit64)
as.integer64(x)
NA/NaN/Inf in foreign function error with mantelhaen.test()
I was able to replicate the problem by making the data set bigger.
set.seed(101); n <- 500000
db <- data.frame(education=
factor(sample(1:3,replace=TRUE,size=n)),
score=
factor(sample(1:5,replace=TRUE,size=n)),
sex=
sample(c("M","F"),replace=TRUE,size=n))
After this, mantelhaen.test(db$education, db$score, db$sex)
gives the reported error.
Thankfully, the real problem is not within the guts of the QR decomposition code: rather, it occurs when setting up a matrix prior to the QR decomposition. There are two computations, ntot*colsums and ntot*rowsums, that overflow R's capacity for integer computation. There's a relatively easy way to work around this by creating a modified version of the function:
- copy the source code: dump("mantelhaen.test", file="my_mh.R")
- edit the source code:
  - line 1: change the name of the function to my_mantelhaen.test (to avoid confusion)
  - lines 199 and 200: change ntot to as.numeric(ntot), converting the integer to double precision before the overflow happens
- source("my_mh.R") to read in the new function
Now my_mantelhaen.test(db$education, db$score, db$sex) should work.
You should definitely test the new function against the old function for cases where it works to make sure you get the same answer.
Now posted to the R bug list, we'll see what happens ...
update 11 May 2018: this is fixed in the development version of R (to become 3.6.0).
How does Java handle integer underflows and overflows and how would you check for it?
If it overflows, it wraps around to the minimum value and continues from there. If it underflows, it wraps around to the maximum value and continues from there.
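A quick sketch of that wraparound behavior:

```java
public class WraparoundDemo {
    public static void main(String[] args) {
        // int arithmetic silently wraps around at the type's bounds
        System.out.println(Integer.MAX_VALUE + 1 == Integer.MIN_VALUE); // true
        System.out.println(Integer.MIN_VALUE - 1 == Integer.MAX_VALUE); // true
    }
}
```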
You can check that beforehand as follows:
public static boolean willAdditionOverflow(int left, int right) {
if (right < 0 && right != Integer.MIN_VALUE) {
return willSubtractionOverflow(left, -right);
} else {
return (~(left ^ right) & (left ^ (left + right))) < 0;
}
}
public static boolean willSubtractionOverflow(int left, int right) {
if (right < 0) {
return willAdditionOverflow(left, -right);
} else {
return ((left ^ right) & (left ^ (left - right))) < 0;
}
}
(you can substitute int with long to perform the same checks for long)
If you think that this may occur more than occasionally, then consider using a datatype or object which can store larger values, e.g. long or maybe java.math.BigInteger. The latter practically doesn't overflow; the available JVM memory is the limit.
If you happen to be on Java 8 already, then you can make use of the new Math#addExact() and Math#subtractExact() methods, which will throw an ArithmeticException on overflow.
public static boolean willAdditionOverflow(int left, int right) {
try {
Math.addExact(left, right);
return false;
} catch (ArithmeticException e) {
return true;
}
}
public static boolean willSubtractionOverflow(int left, int right) {
try {
Math.subtractExact(left, right);
return false;
} catch (ArithmeticException e) {
return true;
}
}
The source code of these methods can be found in the JDK's java.lang.Math class.
Of course, you could also just use them right away instead of hiding them in a boolean utility method.
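For example, calling Math.addExact() directly and handling the exception at the call site:

```java
public class ExactDemo {
    public static void main(String[] args) {
        try {
            // throws ArithmeticException instead of silently wrapping
            int sum = Math.addExact(Integer.MAX_VALUE, 1);
            System.out.println(sum);
        } catch (ArithmeticException e) {
            System.out.println("integer overflow"); // this branch runs
        }
    }
}
```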
How to deal with integer(0) vectors being used as indices in a matrix
You could use setdiff():

A[, setdiff(1:ncol(A), b)]

This method can handle b <- NA, b <- NULL and b <- integer(0), and returns the entire data A in each of those cases.
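A small illustration (the matrix is made up):

```r
A <- matrix(1:6, nrow = 2)

## Empty index vector: all columns are kept
b <- integer(0)
identical(A[, setdiff(1:ncol(A), b)], A)
#[1] TRUE

## A real index: column 2 is dropped
b <- 2L
A[, setdiff(1:ncol(A), b)]
```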