What is integer overflow in R and how can it happen?
You can answer many of your questions by reading the help page ?integer. It says:
R uses 32-bit integers for integer vectors, so the range of
representable integers is restricted to about +/-2*10^9.
Expanding to larger integers is under consideration by R Core but it's not going to happen in the near future.
If you want "bignum" capacity, then install Martin Maechler's Rmpfr package. I recommend the 'Rmpfr' package because of its author's reputation: Martin Maechler is also heavily involved in the development of the Matrix package, and is a member of R Core as well. There are alternatives, including arbitrary-precision packages such as 'gmp' and 'Brobdingnag', and the 'Ryacas' package (the latter also offers a symbolic math interface).
Next, to respond to the critical comments in the answer you linked to, and how to assess the relevance to your work, consider this: If there were the same statistical functionality available in one of those "modern" languages as there is in R, you would probably see a user migration in that direction. But I would say that migration, and certainly growth, is in the R direction at the moment. R was built by statisticians for statistics.
There was at one time a Lisp variant with a statistics package, Xlisp-Stat, but its main developer and proponent is now a member of R Core. On the other hand, one of the earliest R developers, Ross Ihaka, has suggested working toward development in a Lisp-like language. There is also a compiled language called Clojure (pronounced as English speakers would say "closure") with an experimental R interface, Rincanter.
Update:
The newer versions of R (3.0.0+) have 53-bit integers of a sort, stored in the mantissa of a numeric. When an "integer" vector element is assigned a value in excess of .Machine$integer.max, the entire vector is coerced to "numeric", a.k.a. "double". The maximum value for integers remains as it was; however, integer vectors may be coerced to doubles to preserve accuracy in cases that would formerly have generated an overflow. Unfortunately, the length of lists, matrix and array dimensions, and vectors is still capped at integer.max.
When reading in large values from files, it is probably safer to read them in as character and manipulate them afterwards. If coercion produces NA values, there will be a warning.
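A minimal sketch of that approach (the oversized value here is just an illustration): read the field as character, then decide how to convert.

```r
## A value too large for a 32-bit integer, read in as character
x_chr <- "3000000000"

## Coercing straight to integer overflows and yields NA
## (a coercion warning is also emitted)
x_int <- as.integer(x_chr)
x_int
#[1] NA

## Coercing to double preserves the value (within double precision)
x_num <- as.numeric(x_chr)
x_num
#[1] 3e+09
```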
R: simple multiplication causes integer overflow
Hopefully a graphic representation of what is happening:

2614 * 1456000
#[1] 3805984000

## Numeric literals are doubles by default, so no overflow occurs
class( 2614 * 1456000 )
#[1] "numeric"

## Force the numbers to be integers
2614L * 1456000L
#[1] NA
#Warning message:
#In 2614L * 1456000L : NAs produced by integer overflow

## The result is still of class integer, with an overflow warning
class( 2614L * 1456000L )
#[1] "integer"
#Warning message:
#In 2614L * 1456000L : NAs produced by integer overflow
2614 * 1456000 is numeric because both operands are of class numeric. The overflow occurs because both nrow and length return integers, and hence the result is an integer, but it exceeds the maximum value representable by the integer class (about +/-2*10^9). A numeric, or double, can hold values with magnitudes from roughly 2e-308 up to 2e+308. So to solve your problem, just use as.numeric(length(A)) or as.double(length(A)).
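A small illustration of why the cast helps (the sizes here are made up): length() returns an integer, and integer * integer stays integer.

```r
A <- numeric(50000)

## integer * integer overflows past about 2.1e9
length(A) * 50000L
#[1] NA
#Warning message:
#In length(A) * 50000L : NAs produced by integer overflow

## Casting one operand to double keeps the full product
as.numeric(length(A)) * 50000L
#[1] 2.5e+09
```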
What is an integer overflow error?
Integer overflow occurs when you try to express a number that is larger than the largest number the integer type can handle.
If you try to express the number 300 in one byte, you have an integer overflow (maximum is 255). 100,000 in two bytes is also an integer overflow (65,535 is the maximum).
You need to care about it because mathematical operations won't behave as you expect. A + B doesn't actually equal the sum of A and B if you have an integer overflow.
You avoid it by not creating the condition in the first place (usually either by choosing your integer type to be large enough that you won't overflow, or by limiting user input so that an overflow doesn't occur).
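In R terms, the largest representable integer is .Machine$integer.max, and "choosing a larger type" means making at least one operand a double; a quick sketch:

```r
.Machine$integer.max
#[1] 2147483647

## Pure integer arithmetic overflows at the boundary
.Machine$integer.max + 1L
#[1] NA
#Warning message:
#In .Machine$integer.max + 1L : NAs produced by integer overflow

## A double operand avoids the condition entirely
.Machine$integer.max + 1
#[1] 2147483648
```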
Error "integer overflow - use sum(as.numeric(.))" returning NA in the gapminder dataset
Dividing by 1 coerces it to numeric.
library(gapminder)
class(sum(gapminder$pop))
[1] "integer"
Warning message:
In sum(gapminder$pop) : integer overflow - use sum(as.numeric(.))
class(sum(gapminder$pop/1))
[1] "numeric"
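The same behavior can be reproduced without the gapminder package; a minimal sketch with a made-up integer vector:

```r
pop <- rep(1500000000L, 3)   # large integer values, like population counts

## Summing integers overflows and returns NA, with a warning
sum(pop)
#[1] NA

## Coercing to double first, as the warning suggests, gives the true sum
sum(as.numeric(pop))
#[1] 4.5e+09
```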
Benefits of using integer values for constants rather than numeric values (e.g. 1L vs 1) in R
These are some of the use cases in which I explicitly use the L suffix when declaring constants. Of course these are not strictly "canonical" (or the only ones), but maybe they give you an idea of the rationale behind it. I added, for each case, a "necessary" flag; you will see that the necessary ones arise only when interfacing other languages (like C).
- Logical type conversion (not necessary)
Instead of using a classic as.integer, I add 0L to a logical vector to make it integer. Of course you could just add 0, but that would require more memory (typically 8 bytes per element instead of 4) and a conversion.
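A small illustration of the trick:

```r
flags <- c(TRUE, FALSE, TRUE)

## Adding 0L keeps the result integer (4 bytes per element)
class(flags + 0L)
#[1] "integer"

## Adding plain 0 coerces to double (8 bytes per element)
class(flags + 0)
#[1] "numeric"
```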
- Manipulating the result of a function that returns integer (not necessary)
Say, for instance, that you want to retrieve the elements of the vector that follow an NA. You could use:
which(is.na(vec)) + 1L
Since which returns an integer, adding 1L preserves the type and avoids an implicit conversion. Nothing bad happens if you omit the L, since it's just a small optimization. The same applies to match, for instance: if you want to post-process the result of such a function, it's a good habit to preserve the type where possible.
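A short example (the vector is made up):

```r
vec <- c(10, NA, 30, NA, 50)

## Positions of the elements right after each NA
idx <- which(is.na(vec)) + 1L
idx
#[1] 3 5

## The type is preserved: still integer
is.integer(idx)
#[1] TRUE
```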
- Interfacing C (necessary)
From ?integer:
Integer vectors exist so that data can be passed to C or Fortran
code which expects them, and so that (small) integer data can be
represented exactly and compactly.
C is much stricter regarding data types. This implies that, if you pass a vector to a C function, you cannot rely on C to do the conversions. Say you want to replace the elements after an NA with some value, say 42. You find the positions of the NA values at the R level (as we did before with which) and then pass the original vector and the vector of indices to C. The C function will look like:
SEXP replaceAfterNA(SEXP X, SEXP IND) {
    ...
    int i, l = length(IND);
    int *ind = INTEGER(IND);
    ...
    for (i = 0; i < l; i++) {
        // make the replacement here
    }
    return X;
}
and from the R side:
...
ind <- which(is.na(x)) + 1L
.Call("replaceAfterNA", x, ind)
...
If you omit the L in the first line above, you will receive an error like:

INTEGER() cannot be applied to double vectors

since C is expecting an integer type.
- Interfacing Java (necessary)
Same as before: if you use the rJava package and want R to call your own custom Java classes and methods, you have to be sure that an integer is passed when the Java method requires an integer. I'm not adding a specific example here, but it should be clear why you may want to use the L suffix for constants in these cases.
Addendum
The previous cases were about when you may want to use L. Though I guess it is much less common, it might be useful to add a case in which you don't want the L. This may arise if there is danger of integer overflow: the *, + and - operators preserve the type if both operands are integer. For example:
# this overflows
31381938L * 3231L
#[1] NA
#Warning message:
#In 31381938L * 3231L : NAs produced by integer overflow

# this does not
31381938L * 3231
#[1] 1.01395e+11
So, if you are doing operations on an integer variable that might overflow, it's important to cast it to double to avoid any risk. Adding or subtracting a constant without the L is as good an occasion as any to make the cast.
R is duplicating large integers when reading in a data frame
Try using integer64:
library(bit64)
as.integer64(x)
NA/NaN/Inf in foreign function error with mantelhaen.test()
I was able to replicate the problem by making the data set bigger.
set.seed(101); n <- 500000
db <- data.frame(education=
factor(sample(1:3,replace=TRUE,size=n)),
score=
factor(sample(1:5,replace=TRUE,size=n)),
sex=
sample(c("M","F"),replace=TRUE,size=n))
After this, mantelhaen.test(db$education, db$score, db$sex)
gives the reported error.
Thankfully, the real problem is not within the guts of the QR decomposition code: rather, it occurs when setting up a matrix prior to the QR decomposition. There are two computations, ntot*colsums and ntot*rowsums, that overflow R's capacity for integer computation. There's a relatively easy way to work around this by creating a modified version of the function:
- copy the source code: dump("mantelhaen.test", file="my_mh.R")
- edit the source code:
  - line 1: change the name of the function to my_mantelhaen.test (to avoid confusion)
  - lines 199 and 200: change ntot to as.numeric(ntot), converting the integer to double precision before the overflow happens
- source("my_mh.R") to read in the new function
Now my_mantelhaen.test(db$education, db$score, db$sex) should work.
You should definitely test the new function against the old function for cases where it works to make sure you get the same answer.
Now posted to the R bug list, we'll see what happens ...
update 11 May 2018: this is fixed in the development version of R (to become 3.6.0).
How does Java handle integer underflows and overflows and how would you check for it?
If it overflows, it wraps around to the minimum value and continues from there. If it underflows, it wraps around to the maximum value and continues from there.
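A quick sketch of that wraparound behavior:

```java
public class WraparoundDemo {
    public static void main(String[] args) {
        // int arithmetic silently wraps around at the type's bounds
        System.out.println(Integer.MAX_VALUE + 1 == Integer.MIN_VALUE); // true
        System.out.println(Integer.MIN_VALUE - 1 == Integer.MAX_VALUE); // true
    }
}
```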
You can check that beforehand as follows:
public static boolean willAdditionOverflow(int left, int right) {
if (right < 0 && right != Integer.MIN_VALUE) {
return willSubtractionOverflow(left, -right);
} else {
return (~(left ^ right) & (left ^ (left + right))) < 0;
}
}
public static boolean willSubtractionOverflow(int left, int right) {
if (right < 0) {
return willAdditionOverflow(left, -right);
} else {
return ((left ^ right) & (left ^ (left - right))) < 0;
}
}
(you can substitute int with long to perform the same checks for long)
If you think that this may occur more than occasionally, then consider using a datatype or object which can store larger values, e.g. long or maybe java.math.BigInteger. The latter practically doesn't overflow; the available JVM memory is the limit.
If you happen to be on Java 8 already, then you can make use of the new Math#addExact() and Math#subtractExact() methods, which will throw an ArithmeticException on overflow.
public static boolean willAdditionOverflow(int left, int right) {
try {
Math.addExact(left, right);
return false;
} catch (ArithmeticException e) {
return true;
}
}
public static boolean willSubtractionOverflow(int left, int right) {
try {
Math.subtractExact(left, right);
return false;
} catch (ArithmeticException e) {
return true;
}
}
The source code of these methods can be found in the JDK's java.lang.Math class.
Of course, you could also just use them right away instead of hiding them in a boolean utility method.
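For example, calling Math.addExact() directly and handling the exception at the call site:

```java
public class ExactDemo {
    public static void main(String[] args) {
        try {
            // throws ArithmeticException instead of silently wrapping
            int sum = Math.addExact(Integer.MAX_VALUE, 1);
            System.out.println(sum);
        } catch (ArithmeticException e) {
            System.out.println("integer overflow"); // this branch runs
        }
    }
}
```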
How to deal with integer(0) vectors being used as indices in a matrix
You could use setdiff():

A[, setdiff(1:ncol(A), b)]

This method can handle b <- NA, b <- NULL and b <- integer(0), and returns the entire data A in each of those cases.
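A small illustration (the matrix is made up):

```r
A <- matrix(1:6, nrow = 2)

## Empty index vector: all columns are kept
b <- integer(0)
identical(A[, setdiff(1:ncol(A), b)], A)
#[1] TRUE

## A real index: column 2 is dropped
b <- 2L
A[, setdiff(1:ncol(A), b)]
```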