Should I Use a Data.Frame or a Matrix?

Should I use a data.frame or a matrix?

Part of the answer is already contained in your question: you use data frames if the columns (variables) can be expected to be of different types (numeric/character/logical, etc.); matrices are for data of a single type.

Consequently, the choice between matrix and data.frame only arises if all of your data are of the same type.
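
A quick illustration of the type difference (a minimal sketch; the column names are made up): a data frame keeps each column's own type, while coercing the same data to a matrix forces everything into one common type.

dat <- data.frame(id = 1:3, name = c("a", "b", "c"), flag = c(TRUE, FALSE, TRUE),
                  stringsAsFactors = FALSE)  # keeps name as character on older R versions too
sapply(dat, class)
#        id        name      flag
# "integer" "character" "logical"

mat <- as.matrix(dat)    # mixed types collapse to the common type: character
class(mat[, "id"])
# [1] "character"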

The answer also depends on what you are going to do with the data in the data.frame/matrix. If it is going to be passed to other functions, then the expected argument types of those functions determine the choice.
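
For example (a small sketch using the built-in mtcars data set): lm() takes a data frame via its data argument, while matrix algebra needs an actual numeric matrix.

fit <- lm(mpg ~ wt, data = mtcars)       # modelling functions expect a data frame

X <- as.matrix(mtcars[, c("wt", "hp")])  # matrix algebra expects a matrix
crossprod(X)                             # t(X) %*% X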

Also:

Matrices are more memory efficient:

m = matrix(1:4, 2, 2)
d = as.data.frame(m)
object.size(m)
# 216 bytes
object.size(d)
# 792 bytes

Matrices are a necessity if you plan to do any linear algebra-type operations.
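
For instance, reusing m and d from the snippet above (a minimal sketch):

m %*% m                         # matrix product
t(m) %*% m
solve(m)                        # matrix inverse

# d %*% d fails with "requires numeric/complex matrix/vector arguments"
as.matrix(d) %*% as.matrix(d)   # convert the data frame first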

Data frames are more convenient if you frequently refer to their columns by name (via the compact $ operator).

Data frames are also, IMHO, better for reporting (printing) tabular information, as you can apply formatting to each column separately.
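
A small sketch of both points (the columns are invented for illustration):

dd <- data.frame(city = c("Oslo", "Bergen"), temp = c(21.457, 18.2))

dd$temp                               # compact access to a column by name
# [1] 21.457 18.200

dd$temp <- sprintf("%.1f", dd$temp)   # format just this column for printing
dd
#     city temp
# 1   Oslo 21.5
# 2 Bergen 18.2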

What are the differences between data.frame, tibble and matrix?

Because they serve different purposes.

Short summary:

  • Data frame is a list of equal-length vectors. This means that adding a column is as easy as adding a vector to a list. It also means that while each column has its own data type, the columns can be of different types, which makes data frames useful for data storage (see the sketch after this list).

  • Matrix is a special case of an atomic vector that has two dimensions. This means that the whole matrix has to have a single data type, which makes matrices useful for algebraic operations. It can also make numeric operations faster in some cases, since no per-column type checks are needed. However, if you are careful with your data frames, the difference will not be big.

  • Tibble is a modernized version of the data frame used in the tidyverse. Tibbles use several techniques to make them 'smarter' - for example lazy loading.
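
A minimal sketch of the first two points:

d <- data.frame(x = 1:3, y = c("a", "b", "c"))
is.list(d)                     # a data frame really is a list of columns
# [1] TRUE
d$z <- c(TRUE, FALSE, TRUE)    # adding a column = adding an element to the list

m <- matrix(1:6, nrow = 2)
is.atomic(m)                   # a matrix is a single atomic vector ...
# [1] TRUE
attributes(m)                  # ... whose extra attribute is its dim
# $dim
# [1] 2 3

# tibble::tibble(x = 1:3) would build the tidyverse variant (needs the tibble package)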

For a longer description of matrices, data frames and the other data structures used in R, see the R documentation.

So to sum up: matrix and data frame are both 2D data structures. Each of them serves a different purpose and thus behaves differently. Tibble is an attempt to modernize the data frame and is used throughout the widely adopted tidyverse.

If I try to rephrase it from a less technical perspective:
Each data structure makes tradeoffs.

  • A data frame trades a little of its efficiency for convenience and clarity.
  • A matrix is efficient, but harder to wield, since it enforces restrictions on its data.
  • A tibble trades even more efficiency for even more convenience, while also trying to mask that tradeoff with techniques that postpone the computation to a point where it no longer appears to be the tibble's fault.

Extract column from data.frame faster than from matrix - why?


data.frame

Consider the built-in data frame BOD. Data frames are stored as a list of columns, and the inspect output below shows the address of each of the two columns of BOD. We then assign its second column to BOD2. Note that the address of BOD2 is the same memory location as the second column shown in the inspect output for BOD; that is, all R did to create BOD2 was have it point to memory within BOD. There was no data movement at all. Another way to see this is to compare the sizes of BOD, BOD2 and both together: both together take up the same amount of memory as BOD alone, so there can have been no copying. (Continued after the code.)

library(pryr)

BOD2 <- BOD[[2]]    # list-style extraction of the second column
inspect(BOD)
## <VECSXP 0x507c278>
## <REALSXP 0x4f81f48>
## <REALSXP 0x4f81ed8> <--- compare this address to the address shown below
## ...snip...

BOD2 <- BOD[, 2]    # equivalent extraction of the second column
address(BOD2)
## [1] "0x4f81ed8"  <--- same address, so no data were copied

object_size(BOD)
## 1.18 kB
object_size(BOD2)
## 96 B
object_size(BOD, BOD2) # same as object_size(BOD) above
## 1.18 kB

matrix

Matrices are stored as one long vector with a dimension attribute, rather than as a list of columns, so the strategy for extracting a column is different. If we look at the memory used by a matrix m, an extracted column m2 and both together, we see below that both together use the sum of the memories of the individual objects, showing that the data were copied.

set.seed(123)

n <- 10000L
m <- matrix(rnorm(2*n), n, 2)
m2 <- m[, 2]

object_size(m)
## 160 kB
object_size(m2)
## 80 kB
object_size(m, m2)
## 240 kB <-- unlike for data.frames this equals sum of above

what to do

If your program uses column extraction only up to a certain point, you could use a data frame for that portion, then do a one-time conversion to a matrix and process the data as a matrix for the rest, as in the sketch below.
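
A sketch of that pattern (the columns here are made up):

dat <- data.frame(id = 1:5, x = rnorm(5), y = rnorm(5))

## phase 1: data frame work, cheap named-column access
summary(dat$x)

## phase 2: one-time conversion, then matrix-style processing
num <- as.matrix(dat[, c("x", "y")])
colMeans(num)
num %*% t(num)     # e.g. linear algebra on the numeric block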

Performing a for loop on a matrix instead of a data frame

Use a sparse matrix for the dummy encoding:

library(Matrix)

m <- as.matrix(df)   # df is the data frame from the question (response, group_1, group_2, exclude)

# all distinct group codes across the group_* columns
groups <- unique(as.vector(m[, grep("group", colnames(m))]))

# for each group: the rows in which it appears and is not excluded
tmp <- lapply(groups, function(x, m)
    which((m[, "group_1"] == x | m[, "group_2"] == x) & m[, "exclude"] != x),
    m = m)

# row (i) and column (j) indices of the non-zero dummy entries
j <- rep(seq_along(tmp), lengths(tmp))
i <- unlist(tmp)

dummies <- sparseMatrix(i, j, dims = c(nrow(m), length(groups)))
colnames(dummies) <- groups

M <- Matrix(as.matrix(df))
cbind(M, dummies)
# 9 x 7 Matrix of class "dgeMatrix"
#      response group_1 group_2 exclude 10001 10003 10002
# [1,]        5   10001   10002   10001     0     0     1
# [2,]        1   10001   10002   10001     0     0     1
# [3,]        2   10001   10002   10001     0     0     1
# [4,]        0   10003   10001   10003     1     0     0
# [5,]        4   10003   10001   10003     1     0     0
# [6,]        8   10003   10001   10003     1     0     0
# [7,]        7   10002   10003   10002     0     1     0
# [8,]        6   10002   10003   10002     0     1     0
# [9,]        0   10002   10003   10002     0     1     0

Why is running unique faster on a data frame than a matrix in R?


  1. In this implementation, unique.matrix is the same as unique.array

    > identical(unique.array, unique.matrix)

    [1] TRUE

  2. unique.array has to handle multi-dimensional arrays, which requires additional processing to ‘collapse’ the extra dimensions (those extra calls to paste()) that is not needed in the 2-dimensional case. The key section of code is:

    collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)

    temp <- if (collapse)
        apply(x, MARGIN, function(x) paste(x, collapse = "\r"))

  3. unique.data.frame is optimised for the 2D case; unique.matrix is not. It could be, as you suggest; it just isn't in the current implementation. (A quick timing sketch follows this list.)
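
A rough timing sketch of the difference the question asks about (the exact numbers depend on your machine and R version):

x <- matrix(sample(2L, 1e6, replace = TRUE), ncol = 10)
d <- as.data.frame(x)

system.time(unique(x))   # matrix method: rows are pasted into strings first
system.time(unique(d))   # data frame method: typically faster here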

Note that in all cases (unique.{array,matrix,data.table}) where there is more than one dimension, it is the string representation that is compared for uniqueness. For floating-point numbers this means 15 decimal digits, so

NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 2), nrow = 2)))

is 1 while

NROW(unique(a <- matrix(rep(c(1, 1+5e-15), 2), nrow = 2)))

and

NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 1), nrow = 2)))

are both 2. Are you sure unique is what you want?

How to deal with the Cronbach's alpha "data must either be a data frame or a matrix" error?

Here is one workaround:

library(psych)   # alpha() comes from the psych package

df <- data.frame(df$column1, df$column2, df$column3, df$column4)
alpha(df)

See ?alpha for details.

The input that alpha expects is a data.frame or matrix of data, or a covariance or correlation matrix. You can create a data frame out of just the desired columns and pass that to alpha, together with additional arguments if you want.
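
Equivalently, you can subset the original df by column name and pass the result straight to alpha (a sketch; the column names are just placeholders from the question):

library(psych)

items <- df[, c("column1", "column2", "column3", "column4")]
alpha(items)

# if some items are reverse-keyed:
# alpha(items, check.keys = TRUE)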


