Difference between a List and a Pairlist in R

What is the difference between a list and a pairlist in R?

Pairlists in day-to-day R

There are two places where pairlists commonly show up in day-to-day R. One is as function formals:

str(formals(var))

The other is as language objects. For example:

quote(1 + 1)

produces a pairlist of type language (LANGSXP internally). The main reason to be aware of this is that operations such as length(<language object>) or language_object[[x]] can be slow because of how pairlists are stored internally (though long pairlist language objects are somewhat rare; note that expressions are not pairlists).
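Both cases can be checked from the console with typeof(). The sketch below uses a throwaway function f (a hypothetical name, not from the original answer) so that the result does not depend on the signature of var:

```r
# f is a hypothetical function used only for illustration
f <- function(a, b = 2) NULL

fm <- formals(f)
typeof(fm)        # "pairlist"
is.pairlist(fm)   # TRUE
names(fm)         # "a" "b"

# A quoted call is a language object, not a plain list
cl <- quote(1 + 1)
typeof(cl)        # "language"
length(cl)        # 3: the function `+` and its two arguments

# Converting to a regular list makes the pieces easy to inspect
parts <- as.list(cl)
typeof(parts)     # "list"
```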

Note that empty elements are just zero-length symbols, and you can actually store them in lists if you cheat a bit (though you probably shouldn't do this):

list(x=substitute(x, alist(x=)))  # hack alert

All that said, for the most part, OP is correct that you don't need to worry about pairlists too much unless you are writing C code for use in R.

Internal differences between lists and pairlists

Pairlists and lists differ principally in their storage structure. Pairlists are stored as a chain of nodes, where each node points to the location of the next node in addition to the node's contents and the node's "name" (see the CAR/CDR wiki article for a generic discussion). Among other things, this means you can't know how many elements a pairlist has unless you know which element is the first one and then traverse the entire list.

Pairlists are used extensively in the R internals, and do exist in normal R use, but most of the time are disguised by the print or access methods and/or coerced to lists when accessed.

Lists are also a list of addresses, but unlike pairlists, all the addresses are stored in one contiguous memory location and the total length is tracked. This makes it easy to access any arbitrary member of the list by location since you can just look up the address in the memory table. With a pairlist, you would have to jump from node to node until you eventually got to the desired node. Names are also stored as attributes of the list proper, instead of being attached to each node of a pairlist.
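The distinction is visible from R itself. A minimal sketch using only base functions, with made-up contents: pairlist() builds the linked structure directly, and as.list() copies it into a generic vector.

```r
pl <- pairlist(a = 1, b = "two")   # chain of nodes (LISTSXP)
l  <- list(a = 1, b = "two")       # generic vector (VECSXP)

typeof(pl)        # "pairlist"
typeof(l)         # "list"

# is.list() is TRUE for both, so use is.pairlist() to tell them apart
is.list(pl)       # TRUE
is.pairlist(pl)   # TRUE
is.pairlist(l)    # FALSE

# Conversion in either direction keeps the names
names(as.list(pl))        # "a" "b"
typeof(as.pairlist(l))    # "pairlist"
```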

Benefits of pairlists

One (generally small) benefit of pairlists is that you can add to them with minimal overhead, since you only need to modify at most two nodes (the node ahead of the new node, and the new node itself), whereas with a list you may need to re-allocate the entire address table at a larger size (this is typically not much of an issue, since the address table is usually very small compared to the size of the data the table points to). There are also many algorithms that specialize in pairlist manipulation (e.g. sorting, indexing, etc.), but those can be ported to normal lists as well.

Less relevant for day-to-day use, since you can only do this in internals: it is very easy to modify a list from a programming perspective by changing what any arbitrary element points to.

Loosely related to the above, pairlists are likely to be more efficient when you have highly nested objects. Lists can easily replicate this structure, but each list and nested list will be saddled with the extra memory address table. This is likely the reason pairlists are used for language objects, which very likely have a high nesting / element ratio.

For more details see R Internals (look for LISTSXP and VECSXP, pairlists and lists respectively, in the linked location).

edit: interestingly an experiment to compare the memory footprint of a list to a pairlist shows the pairlist to be larger, so the storage efficiency argument may be incorrect (not sure if object.size can be trusted here):

> plist_to_list <- function(x) {
+   if(is.call(x)) x <- as.list(x)
+   if(length(x) > 1) for(i in 2:length(x)) x[[i]] <- Recall(x[[i]])
+   x
+ }
> add_quote <- function(x, y) call("+", x, y)
> x <- Reduce(add_quote, lapply(letters, as.name))
> object.size(x)
7056 bytes
> y <- plist_to_list(x)
> object.size(y)
4656 bytes

Pairwise Difference Between Pairs of Observations within an R data frame

Reshape to long so that you can merge on PairElement as a single column, do the merge, order it in reverse, get the difference within each PairID:

tmp <- merge(
  reshape(one, idvar="PairID", sep="", varying=-1, direction="long"),
  two,
  by = "PairElement"
)
tmp <- tmp[order(tmp$PairID, -tmp$time),]
aggregate(cbind(var1,var2,var3,var4) ~ PairID, data = tmp, FUN=diff)

#   PairID var1 var2 var3 var4
# 1      1    0   -4   -4    2
# 2      2    9   -1   -3   -9
# 3      3    5    4   -2   -2
# 4      4   -2    3    3    7
# 5      5    7    6    4   -4
# 6      6   -7    2   -3    5
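The original one and two data frames are not shown, so here is the same base R pipeline on a small made-up pair of inputs (hypothetical values, chosen only so the result is easy to verify by hand), with a direct match() lookup as a cross-check:

```r
# Hypothetical toy inputs; the original `one` and `two` are not shown
one <- data.frame(PairID = 1:2,
                  PairElement1 = c("a", "c"),
                  PairElement2 = c("b", "d"),
                  stringsAsFactors = FALSE)
two <- data.frame(PairElement = c("a", "b", "c", "d"),
                  var1 = c(10, 4, 7, 9),
                  stringsAsFactors = FALSE)

# Long format: one row per (PairID, time), with a single PairElement column
long <- reshape(one, idvar = "PairID", sep = "", varying = -1, direction = "long")

tmp <- merge(long, two, by = "PairElement")
tmp <- tmp[order(tmp$PairID, -tmp$time), ]

# diff() on rows ordered (element 2, element 1) gives element 1 minus element 2
res <- aggregate(var1 ~ PairID, data = tmp, FUN = diff)

# Cross-check with a direct lookup of each pair member
direct <- two$var1[match(one$PairElement1, two$PairElement)] -
          two$var1[match(one$PairElement2, two$PairElement)]
all.equal(res$var1, direct)
```

For pair 1 this is 10 - 4 = 6 and for pair 2 it is 7 - 9 = -2, which shows the sign convention: the first pair element minus the second.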

In dplyr/tidyr speak, something like:

library(dplyr)
library(tidyr)

one %>%
  pivot_longer(-PairID, values_to="PairElement") %>%
  right_join(two, by="PairElement") %>%
  group_by(PairID) %>%
  arrange(desc(name)) %>%
  select(-name, -PairElement) %>%
  summarise_all(diff)

What kind of object is `...`?

What an interesting question!

Dot-dot-dot ... is an object (John Chambers is right!) and it's a type of pairlist. I searched the documentation, and here is what I found:

The R Language Definition says:

The ‘...’ object type is stored as a type of pairlist. The components of ‘...’ can be accessed in the usual pairlist manner from C code, but is not easily accessed as an object in interpreted code. The object can be captured as a list.

Another chapter defines pairlists in detail:

Pairlist objects are similar to Lisp’s dotted-pair lists.

Pairlists are handled in the R language in exactly the same way as generic vectors (“lists”).

Help on Generic and Dotted Pairs says:

Almost all lists in R internally are Generic Vectors, whereas traditional dotted pair lists (as in LISP) remain available but rarely seen by users (except as formals of functions).

And a nice summary is here at Stack Overflow!
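In interpreted code, the usual way to work with ... is to capture it with list(...), as the quoted passage suggests. A minimal sketch (f is a hypothetical function used only for illustration; ...length() needs R 3.5.0 or later):

```r
# f is a hypothetical function used only for illustration
f <- function(...) {
  args <- list(...)            # capture the dots as an ordinary list
  list(type   = typeof(args),
       n      = ...length(),   # number of arguments passed through ...
       names  = names(args),
       second = ..2)           # positional access to the second argument
}

res <- f(a = 1, b = "x")
str(res)
```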

When to use pairlists in R?

To answer your second question, I don't think so. Section 2.1.11 of the R Language Definition states this:

Pairlists are handled in the R language in exactly the same way as generic vectors (“lists”). In particular, elements are accessed using the same [[]] syntax. The use of pairlists is deprecated since generic vectors are usually more efficient to use. When an internal pairlist is accessed from R it is generally (including when subsetted) converted to a generic vector.
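As a small illustration of the "same [[]] syntax" point, assuming a pairlist built by hand with pairlist() and made-up contents:

```r
pl <- pairlist(a = 1, b = "two")

pl[["b"]]    # "two"
pl[[1]]      # 1
pl$a         # 1
length(pl)   # 2
```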

Pair all columns from dataframe into a list in R

Use ?combn with a function to be called for each combination of the data.frame's names.

df1 <- mtcars[1:4]

combn(names(df1), 2, function(x){
  d <- df1[x]
  names(d) <- x
  d
}, simplify = FALSE)
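If it helps to look pairs up by name afterwards, the resulting list can be labelled with a second combn() call. This is an optional extra rather than part of the answer above, and col_pairs is a hypothetical name:

```r
df1 <- mtcars[1:4]

col_pairs <- combn(names(df1), 2, function(x) df1[x], simplify = FALSE)

# Label each element with the two column names it holds
names(col_pairs) <- combn(names(df1), 2, paste, collapse = "_")

length(col_pairs)          # 6 = choose(4, 2)
names(col_pairs)[1]        # "mpg_cyl"
names(col_pairs$mpg_cyl)   # "mpg" "cyl"
```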

How to correctly use lists?

Just to address the last part of your question, since that really points out the difference between a list and vector in R:

Why do these two expressions not return the same result?

x = list(1, 2, 3, 4); x2 = list(1:4)

Each element of a list can be an object of any class. So you can have a list where the first element is a character vector, the second is a data frame, etc. In this case, you have created two different lists. x has four vectors, each of length 1. x2 has 1 vector of length 4:

> length(x[[1]])
[1] 1
> length(x2[[1]])
[1] 4

So these are completely different lists.
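The same point can be seen from the lengths of the lists themselves; only the nesting differs, not the values:

```r
x  <- list(1, 2, 3, 4)
x2 <- list(1:4)

length(x)    # 4 elements
length(x2)   # 1 element

# Flattening both gives the same values
all.equal(unlist(x), unlist(x2))   # TRUE
```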

R lists are very much like a hash map data structure in that each index value can be associated with any object. Here's a simple example of a list that contains 3 different classes (including a function):

> complicated.list <- list("a"=1:4, "b"=1:3, "c"=matrix(1:4, nrow=2), "d"=search)
> lapply(complicated.list, class)
$a
[1] "integer"
$b
[1] "integer"
$c
[1] "matrix" "array"   # R >= 4.0.0; earlier versions print only "matrix"
$d
[1] "function"

Given that the last element is the search function, I can call it like so:

> complicated.list[["d"]]()
[1] ".GlobalEnv" ...

As a final comment on this: it should be noted that a data.frame is really a list (from the data.frame documentation):

A data frame is a list of variables of the same number of rows with unique row names, given class ‘"data.frame"’

That's why columns in a data.frame can have different data types, while columns in a matrix cannot. As an example, here I try to create a matrix with numbers and characters:

> a <- 1:4
> class(a)
[1] "integer"
> b <- c("a","b","c","d")
> d <- cbind(a, b)
> d
     a   b
[1,] "1" "a"
[2,] "2" "b"
[3,] "3" "c"
[4,] "4" "d"
> class(d[,1])
[1] "character"

Note how I cannot change the data type in the first column to numeric because the second column has characters:

> d[,1] <- as.numeric(d[,1])
> class(d[,1])
[1] "character"
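For contrast, a short sketch of the same two vectors in a data frame, where each column keeps its own type (dat is a hypothetical name):

```r
a <- 1:4
b <- c("a", "b", "c", "d")

# stringsAsFactors = FALSE is spelled out so the result is the same
# before and after R 4.0.0, where the default changed
dat <- data.frame(a, b, stringsAsFactors = FALSE)

is.list(dat)         # TRUE
sapply(dat, class)   # a: "integer", b: "character"

# Changing one column's type does not affect the other
dat$a <- as.numeric(dat$a)
class(dat$a)         # "numeric"
class(dat$b)         # "character"
```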

Mean difference between pairs of valid cases

Something like this?

library(dplyr)

func <- function(...) {
  # drop missing values, then average the absolute differences around the cycle
  dots <- na.omit(c(...))
  sum(abs(diff(c(dots, dots[1]))), na.rm = TRUE) / length(dots)
}
df %>%
  mutate(meandiff = mapply(func, var1, var2, var3))
# # A tibble: 5 x 6
#   id     var1  var2  var3  var4 meandiff
#   <chr> <dbl> <dbl> <dbl> <dbl>    <dbl>
# 1 1         2     1     5    11     2.67
# 2 2        NA     2     8    22     6
# 3 3         3    NA     6    33     3
# 4 4         1    NA    NA    44     0
# 5 5         2     2    NA    55     0

(This calculates var3 - var1 for the third mid-sum value instead of your var1 - var3, but since you use abs it should not matter.)
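To see where the numbers come from, func can be called directly on single rows. The calls below reuse the values from rows 1, 2 and 4 of the output above:

```r
func <- function(...) {
  dots <- na.omit(c(...))
  sum(abs(diff(c(dots, dots[1]))), na.rm = TRUE) / length(dots)
}

func(2, 1, 5)     # (|1-2| + |5-1| + |2-5|) / 3 = 8/3
func(NA, 2, 8)    # (|8-2| + |2-8|) / 2 = 6
func(1, NA, NA)   # a single valid value, so 0
```

Appending dots[1] closes the cycle, which is why the last difference compares the final valid value with the first one.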

Maximizing unique pairs including all elements from both lists

Here is an approach inspired by the minimum spanning tree problem:

library(igraph)
library(data.table)  # needed for data.table() below

g <- graph_from_data_frame(df, FALSE)
MST <- mst(g)

# get leaf nodes
leaves <- which(degree(MST, v=V(MST))==1L, useNames=TRUE)

# get the neighbour of each leaf node in a greedy manner, starting with the leaf nodes that have the fewest neighbours
ans <- sapply(adjacent_vertices(MST, sort(leaves)), function(gph) names(gph))

data.table(col1=ifelse(ans %in% df$col1, ans, names(ans)),
           col2=ifelse(ans %in% df$col2, ans, names(ans)))[order(col1)]

This approach assumes there are enough connections that every node has at least one.

To OP, I will be interested to find out the edge cases where this will fail. Please keep me posted.

output:

   col1 col2
1:    A    Z
2:    B    Y
3:    C    X

