Why Do Logicals (Booleans) in R Require 4 Bytes

Knowing a little something about R and S-Plus, I'd say that R most likely did it to be compatible with S-Plus, and S-Plus most likely did it because it was the easiest thing to do...

Basically, a logical vector is stored the same way as an integer vector, so sum() and other algorithms written for integers work pretty much unchanged on logical vectors.
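
For example (a quick illustration, not from the original answer), arithmetic functions treat a logical vector just like a vector of 0/1 integers:

x <- c(TRUE, FALSE, TRUE, NA)
typeof(x)             # "logical"
sum(x, na.rm = TRUE)  # 2: TRUE counts as 1, FALSE as 0
typeof(x + 0L)        # "integer"; coercion to integer is essentially free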

In 64-bit S-Plus, integers are 64-bit, and thus so are logical vectors! That's 8 bytes per logical value...

@Iterator is of course correct that a logical vector should be represented in a more compact form. Since there is already a raw vector type, which uses 1 byte per element, it would seem like a very simple change to use that for logicals too. And 2 bits per value would of course be even better: I'd probably keep them as two separate bit vectors (TRUE/FALSE and NA/valid), where the NA bit vector could be NULL if there are no NAs...
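
A rough sketch of that two-bit-vector idea, built on base R's packBits() and rawToBits() (purely illustrative; this is not how R stores logicals, and the function names here are made up):

pack_logical <- function(x) {
  n   <- length(x)
  na  <- is.na(x)
  val <- x
  val[na] <- FALSE                 # the value bit is arbitrary where NA
  pad <- (8 - n %% 8) %% 8         # packBits() needs a multiple of 8 bits
  list(n   = n,
       val = packBits(c(val, rep(FALSE, pad))),
       na  = if (any(na)) packBits(c(na, rep(FALSE, pad))) else NULL)
}

unpack_logical <- function(p) {
  out <- as.logical(as.integer(rawToBits(p$val)))[seq_len(p$n)]
  if (!is.null(p$na))
    out[as.logical(as.integer(rawToBits(p$na)))[seq_len(p$n)]] <- NA
  out
}

x <- c(TRUE, FALSE, NA, TRUE)
identical(unpack_logical(pack_logical(x)), x)
#[1] TRUE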

Anyway, that's mostly a dream since there are so many RAPI packages (packages that use the R C/FORTRAN APIs) out there that would break...

Memory size of logical variable

Compare below:

object.size(logical())
# 40 bytes

object.size(TRUE)
# 48 bytes

(Exact numbers depend on the R version and platform; recent 64-bit builds report 48 and 56 bytes. Either way, the single logical value costs one extra 8-byte chunk on top of the header.)

See Hadley Wickham's post about memory for more info:

Every length 0 vector occupies 40 bytes of memory.

Why does boolean consume more memory than char?

It is a question of memory alignment: 4-byte variables are generally faster to access than 2-byte ones, which is why you should use int rather than byte or short for counters and the like.

You should use 2-byte variables only when memory is a bigger concern than speed, and that trade-off is why char (which is Unicode in .NET) takes two bytes instead of four.

Problem with automatically casting a logical vector to integer

A logical in R takes the same number of bytes per element as an integer (4 bytes). This is different from C, where a bool has 1 byte* and an int has 4 bytes. The reason R does this is probably that, with this representation, up-casting logical to integer is instantaneous and vector arithmetic mixing logicals and integers has no conversion overhead.
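
You can see the per-element cost directly (a quick check, not from the original answer; sizes are for a 64-bit build):

object.size(logical(1e6))   # ~4 MB: 4 bytes per element
object.size(integer(1e6))   # ~4 MB as well
object.size(raw(1e6))       # ~1 MB: 1 byte per element, like a C char/bool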

What you're doing in both cases is to access the pointer to the start of the vector and set the first 4 bytes to the value that would correspond to 77.

On the R side, the variable named "x" still points to the same underlying data. But since you changed that data, x's first element now holds the bytes of an int of 77.

An int of 77 doesn't mean anything as a logical, since ordinary R operations can never produce it. So what R does when you force an impossible value in this way is essentially undefined.

A logical in R can only have three values: TRUE (corresponds to a value of 1), FALSE (corresponds to a value of 0) and NA (corresponds to a value of -2147483648).

*(Technically implementation-defined, but I've only ever seen it as 1 byte.)
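
Note (an addition for clarity, not part of the original answer) that this is different from an ordinary coercion, which is perfectly well defined:

as.logical(c(0L, 1L, 77L, NA_integer_))
#[1] FALSE  TRUE  TRUE    NA
# Coercion maps any nonzero integer to TRUE; writing 77 straight into a
# logical vector's memory, as described above, bypasses this and creates a
# state that base R never produces on its own.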

Save storage space for small integers or factors with few levels

Since you mention raw (and assuming you have fewer than 256 factor levels), you could do the requisite conversions yourself if memory is your bottleneck and CPU time isn't. For example:

f = factor(rep(1L, 1e5))
object.size(f)
# 400456 bytes

f.raw = as.raw(f)
object.size(f.raw)
#100040 bytes

# to go back:
identical(as.factor(as.integer(f.raw)), f)
#[1] TRUE

You can also save the factor levels separately and recover them if that's something you're interested in doing, but as far as grouping and all that goes you can just do it all with raw and never go back to factors (except for presentation).

If you have specific use cases where this method gives you trouble, please post them; otherwise I think this should work just fine.


Here's a starting point for your byte.factor class:

byte.factor = function(f) {
  res = as.raw(f)
  attr(res, "levels") <- levels(f)
  attr(res, "class") <- "byte.factor"
  res
}

as.factor.byte.factor = function(b) {
  factor(attributes(b)$levels[as.integer(b)], attributes(b)$levels)
}

So you can do things like:

f = factor(c('a','b'), letters)
f
#[1] a b
#Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

b = byte.factor(f)
b
#[1] 01 02
#attr(,"levels")
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#[20] "t" "u" "v" "w" "x" "y" "z"
#attr(,"class")
#[1] "byte.factor"

as.factor.byte.factor(b)
#[1] a b
#Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

Check out how data.table overrides rbind.data.frame if you want to make as.factor generic and add whatever methods you need; a sketch follows. It should all be quite straightforward.
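
Here is one possible sketch of that (hypothetical code, shadowing base::as.factor with an S3 generic so that the as.factor.byte.factor method defined above is dispatched automatically):

as.factor <- function(x, ...) UseMethod("as.factor")
as.factor.default <- function(x, ...) base::as.factor(x)

as.factor(b)   # b from above; dispatches to as.factor.byte.factor
#[1] a b
#Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z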

Does the complexity of summing booleans differ from summing integers?

If you're adding up k boolean (0/1) values, you're essentially just counting up how many 1's there are. This can be done in time O(k) by just scanning over the numbers and counting upward for each 1 you read.

Now, imagine that you're adding up k b-bit values. This will require k additions, but each addition takes time proportional to the number of digits in the numbers you're summing. Suppose you're adding some number x from the list into the running total T. Since x is b bits long, you'll spend O(b) work adding its bits into T. You may then need to ripple a carry across some of the higher bits of T, but because each successively higher bit is flipped less and less often, that extra carry work averages out, and overall the work comes to O(kb). Notice that this is asymptotically slower than O(k) for large b.

Now, most machines can perform arithmetic on words of some fixed size w in time O(1). Therefore, if you do the arithmetic on an actual machine, the true runtime will be O(kb / w); if the number of bits b matches the computer's word size w, that is O(k), asymptotically the same as counting the number of 1's in the boolean case. However, if you're dealing with arbitrary-precision numbers, summing booleans is definitely more efficient.
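
As a rough sanity check in R (an illustration added here; both sums run on machine words, so both scale linearly, and absolute timings will vary by machine):

x_bool <- sample(c(TRUE, FALSE), 1e7, replace = TRUE)
x_int  <- sample.int(100L, 1e7, replace = TRUE)
system.time(for (i in 1:50) sum(x_bool))
system.time(for (i in 1:50) sum(x_int))
# Both take time proportional to the vector length; doubling the length
# roughly doubles the time for either.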

Ex: Computing a vector of logicals that is true when x[i] > 0

Long but hopefully easy to understand.

y <- logical(length(x))   # pre-allocate the result vector

for (i in seq_along(x)) {
  if (x[i] > 0) {
    y[i] <- TRUE
  } else {
    y[i] <- FALSE
  }
}
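
For comparison (not part of the original answer), the idiomatic vectorized version does the same job in a single step:

y <- x > 0   # TRUE where x[i] > 0, FALSE otherwise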

Sizes of integer vectors in R

Even if you try N <- 10000, all sizes occur exactly twice, except for vectors of length:

  • 5 to 8 (56 bytes)
  • 9 to 12 (72 bytes)
  • 13 to 16 (88 bytes)
  • 17 to 32 (152 bytes)

The fact that each number of bytes occurs twice comes simply from memory being allocated in pieces of 8 bytes (referred to as Vcells in ?gc), while integers take only 4 bytes.

Next to that, the internal structure of objects in R distinguishes between small and large vectors when allocating memory. Small vectors are allocated in bigger blocks of about 2 Kb, whereas larger vectors are allocated individually. The 'small' vectors come in 6 defined size classes, based on length, able to store vector data of up to 8, 16, 32, 48, 64 and 128 bytes. As an integer takes only 4 bytes, that gives room for 2, 4, 8, 12, 16 and 32 integers in these 6 classes. This explains the pattern you see.
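
You can see those size classes directly (an illustration added here; the exact byte counts depend on the R version and platform, but the grouping by length is the same):

sizes <- sapply(0:20, function(n) object.size(integer(n)))
data.frame(length = 0:20, bytes = sizes)
# Adjacent lengths share a size until the next size class (or Vcell) is needed.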

The extra number of bytes is for the header (which forms the Ncells in ?gc). If you're really interested in all this, read through the R Internals manual.

And, as you guessed, the 24 extra bytes are from the headers (or Ncells). It's in fact a bit more complicated than that, but the exact details can be found in the R Internals manual.


