Tidyr Spread Function Generates Sparse Matrix When Compact Vector Expected

The key here is that spread doesn't aggregate the data.

Hence, if you hadn't already used xtabs to aggregate first, you would be doing this:

library(dplyr)
library(tidyr)
a <- data.frame(P = c(FALSE, TRUE, FALSE, TRUE, FALSE), A = c(FALSE, FALSE, TRUE, TRUE, TRUE), Freq = 1) %>% 
    unite(S, A, P)
a
##             S Freq
## 1 FALSE_FALSE    1
## 2  FALSE_TRUE    1
## 3  TRUE_FALSE    1
## 4   TRUE_TRUE    1
## 5  TRUE_FALSE    1

a %>% spread(S, Freq)
##   FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
## 1           1         NA         NA        NA
## 2          NA          1         NA        NA
## 3          NA         NA          1        NA
## 4          NA         NA         NA         1
## 5          NA         NA          1        NA

Without aggregation, no other result would make sense: each of the five rows is a separate observation, so each one gets its own row in the output.
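To get the compact one-row result instead, the counts can be summed per key before spreading. This is a minimal sketch on the same toy data as above; the name a_wide is introduced here only for illustration.

```r
library(dplyr)
library(tidyr)

# Same toy data as above: five observations, one key (TRUE_FALSE) occurs twice.
a <- data.frame(P = c(FALSE, TRUE, FALSE, TRUE, FALSE),
                A = c(FALSE, FALSE, TRUE, TRUE, TRUE),
                Freq = 1) %>%
    unite(S, A, P)

# Sum Freq within each key first, so every key appears exactly once.
# With no other columns left, spread then returns a single row.
a_wide <- a %>%
    group_by(S) %>%
    summarize(Freq = sum(Freq)) %>%
    spread(S, Freq)
a_wide
```

The aggregation step is what collapses the duplicate key; spread itself only reshapes.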

This is predictable based on the help file for the fill parameter:

If there isn't a value for every combination of the other variables
and the key column, this value will be substituted.

In your case, there aren't any other variables to combine with the key column. Had there been, then...

b <- data.frame(P = c(FALSE, TRUE, FALSE, TRUE, FALSE), A = c(FALSE, FALSE, TRUE, TRUE, TRUE), Freq = 1,
                h = rep(c("foo", "bar"), length.out = 5)) %>% 
    unite(S, A, P)
b
##             S Freq   h
## 1 FALSE_FALSE    1 foo
## 2  FALSE_TRUE    1 bar
## 3  TRUE_FALSE    1 foo
## 4   TRUE_TRUE    1 bar
## 5  TRUE_FALSE    1 foo

b %>% spread(S, Freq)
## Error: Duplicate identifiers for rows (3, 5)

...it would fail, because it can't aggregate rows 3 and 5 (because it isn't designed to).

The tidyr/dplyr way to do it is group_by and summarize instead of xtabs. summarize preserves the grouping column, so spread can tell which observations belong in the same row:

b %>%
    group_by(h, S) %>%
    summarize(Freq = sum(Freq))
## Source: local data frame [4 x 3]
## Groups: h
## 
##     h           S Freq
## 1 bar  FALSE_TRUE    1
## 2 bar   TRUE_TRUE    1
## 3 foo FALSE_FALSE    1
## 4 foo  TRUE_FALSE    2

b %>%
    group_by(h, S) %>%
    summarize(Freq = sum(Freq)) %>%
    spread(S, Freq)
## Source: local data frame [2 x 5]
## 
##     h FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
## 1 bar          NA          1         NA         1
## 2 foo           1         NA          2        NA
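In more recent tidyr releases, spread is superseded by pivot_wider, which can aggregate duplicate rows itself through its values_fn argument. The following is a sketch that assumes tidyr 1.1.0 or later; b_wide is a name made up for this example, and values_fill = 0 is an optional choice to show zeros instead of NA.

```r
library(dplyr)
library(tidyr)

# Same toy data as above, with the extra identifying column h.
b <- data.frame(P = c(FALSE, TRUE, FALSE, TRUE, FALSE),
                A = c(FALSE, FALSE, TRUE, TRUE, TRUE),
                Freq = 1,
                h = rep(c("foo", "bar"), length.out = 5)) %>%
    unite(S, A, P)

# values_fn = sum adds up Freq wherever an (h, S) pair occurs more than once;
# values_fill = 0 fills combinations that never occur.
b_wide <- b %>%
    pivot_wider(names_from = S, values_from = Freq,
                values_fn = sum, values_fill = 0)
b_wide
```

This does the aggregation and the reshaping in one call, at the cost of hiding the summing step inside an argument.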

Why am I getting repeat rows with NAs using tidyr's spread function?

You could do price and cost separately and then merge (join) them (or cbind them, depending on the specifics of your data):

x <- read.table(text = "Date    State    Price.Name    Cost.Name   Price    Cost
Jan       AZ    firm1.price   firm1.cost    100       50
Jan       AZ    firm2.price   firm2.cost    200      100", header = TRUE, sep = "")
x %>% select(-Cost, -Cost.Name) %>% spread(Price.Name, Price)
##   Date State firm1.price firm2.price
## 1  Jan    AZ         100         200
x %>% select(-Price, -Price.Name) %>% spread(Cost.Name, Cost)
##   Date State firm1.cost firm2.cost
## 1  Jan    AZ         50        100
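The join step might look like the sketch below. It assumes Date and State together identify a row in both pieces; the names prices, costs and combined are introduced here for illustration only.

```r
library(dplyr)
library(tidyr)

# Same two-row example as above, built directly as a data frame.
x <- data.frame(Date = c("Jan", "Jan"), State = c("AZ", "AZ"),
                Price.Name = c("firm1.price", "firm2.price"),
                Cost.Name = c("firm1.cost", "firm2.cost"),
                Price = c(100, 200), Cost = c(50, 100),
                stringsAsFactors = FALSE)

# Spread prices and costs separately, dropping the other pair of columns first.
prices <- x %>% select(-Cost, -Cost.Name) %>% spread(Price.Name, Price)
costs  <- x %>% select(-Price, -Price.Name) %>% spread(Cost.Name, Cost)

# Join the two wide tables on the shared identifying columns.
combined <- inner_join(prices, costs, by = c("Date", "State"))
combined
```

A join is the safer choice when the two pieces might not have the same rows in the same order; cbind only works when they do.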

tidyr::spread() without creating separate rows?

I think you want something like this:

library(dplyr)
library(tidyr)
library(babynames)  # provides the babynames data frame
answer <- 
  babynames %>%
  filter(name == "Kerry") %>%
  group_by(year, sex) %>%
  summarize(n = sum(n)) %>%
  spread(sex, n, fill = 0)
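If the babynames package is not installed, the same pattern can be tried on a small made-up table. The data below is invented purely for illustration (toy and toy_wide are hypothetical names), and the ungroup() call is an extra precaution rather than part of the answer above.

```r
library(dplyr)
library(tidyr)

# Invented data: year 2000 has two F rows and one M row, 2001 has only F.
toy <- data.frame(year = c(2000, 2000, 2000, 2001),
                  sex  = c("F", "F", "M", "F"),
                  n    = c(3, 2, 4, 5))

# Collapse to one row per (year, sex), then spread sex into columns.
# fill = 0 replaces the missing 2001/M combination with 0 instead of NA.
toy_wide <- toy %>%
  group_by(year, sex) %>%
  summarize(n = sum(n)) %>%
  ungroup() %>%
  spread(sex, n, fill = 0)
toy_wide
```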

Replace the first few observations of a sparse matrix

Here is a tidy solution.

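library(dplyr)
library(tidyr)
library(tibble)   # column_to_rownames()
library(Matrix)   # Matrix()

# Count each (col1, col2) pair, widen to one column per col2 value,
# then convert to a sparse matrix with col1 as row names.
dat_sparse <- dat %>% 
  as_tibble() %>%
  count(col1, col2) %>%
  spread(col2, n, fill = 0) %>%
  column_to_rownames("col1") %>%
  as.matrix() %>%
  Matrix(., sparse = TRUE)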

dat_sparse

Output:

group 1 . . . 1 . 1 . . 1 . . . . . . 1 1 . . 1 . . . . . . . . .
group 2 . 1 . . . . . . . 1 1 . . . 1 . . 1 1 . . . . 1 . . . 1 .
group 3 1 . 1 . . . 1 1 . . . 1 1 1 . . . . . . . 1 . . 1 1 . . 1
group 4 . . . . 1 . . . . . . . . . . . . . . . 1 . . . . . 1 . .
group 5 . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . .
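Since dat itself is not shown here, the following is a small self-contained sketch of the same pipeline on invented data. The contents of dat below are an assumption made only so the code runs; only the column names col1 and col2 come from the answer above.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(Matrix)

# Hypothetical stand-in for dat: one row per (group, item) occurrence.
dat <- data.frame(col1 = c("group 1", "group 1", "group 1", "group 2", "group 2"),
                  col2 = c("a", "a", "b", "b", "c"))

dat_sparse <- dat %>%
  as_tibble() %>%
  count(col1, col2) %>%              # one row per pair, with its count in n
  spread(col2, n, fill = 0) %>%      # one column per col2 value, 0 where absent
  column_to_rownames("col1") %>%     # groups become row names
  as.matrix() %>%
  Matrix(., sparse = TRUE)           # store only the non-zero entries

dat_sparse
```

Note that the wide table is dense before the final conversion, so for very large inputs this intermediate step may be the memory bottleneck.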

Loop through each column and row, do stuff

How about this:

df.new = as.data.frame(lapply(df, function(x) ifelse(is.na(x), 0, 1)))

lapply applies a function to each column of the data frame df. In this case, the function does the 0/1 replacement. lapply returns a list. Wrapping it in as.data.frame converts the list to a data frame (which is a special type of list).

In R you can often replace a loop with one of the *apply family of functions. In this case, lapply "loops" over the columns of the data frame. Also, many R functions are "vectorized", meaning the function operates on every value in a vector at once. In this case, ifelse does the replacement on an entire column of the data frame in a single call.
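Here is a small worked example of that idea. The data frame df is made up for illustration, and df.vec is a hypothetical name for a variant that skips lapply by applying ifelse to the whole logical matrix returned by is.na.

```r
# Invented data frame with some missing values in each column.
df <- data.frame(a = c(1, NA, 3),
                 b = c(NA, "x", "y"),
                 stringsAsFactors = FALSE)

# Column by column: lapply hands each column to the function,
# and ifelse handles the whole column in one vectorized call.
df.new <- as.data.frame(lapply(df, function(x) ifelse(is.na(x), 0, 1)))

# Whole table at once: is.na(df) is a logical matrix, and ifelse
# works element-wise on it, keeping the column names.
df.vec <- as.data.frame(ifelse(is.na(df), 0, 1))

df.new
```

Both versions give the same 0/1 values; the lapply form keeps each column's handling explicit, which is convenient if different columns later need different rules.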
