Reshape Three Column Data Frame to Matrix ("Long" to "Wide" Format)

Reshape three column data frame to matrix (long to wide format)

There are many ways to do this. This answer starts with what is quickly becoming the standard method, but also includes older methods and various other methods from answers to similar questions scattered around this site.

tmp <- data.frame(x=gl(2,3, labels=letters[24:25]),
y=gl(3,1,6, labels=letters[1:3]),
z=c(1,2,3,3,3,2))

Using the tidyverse:

The new cool new way to do this is with pivot_wider from tidyr 1.0.0. It returns a data frame, which is probably what most readers of this answer will want. For a heatmap, though, you would need to convert this to a true matrix.

library(tidyr)
pivot_wider(tmp, names_from = y, values_from = z)
## # A tibble: 2 x 4
## x a b c
## <fct> <dbl> <dbl> <dbl>
## 1 x 1 2 3
## 2 y 3 3 2

The old cool new way to do this is with spread from tidyr. It similarly returns a data frame.

library(tidyr)
spread(tmp, y, z)
## x a b c
## 1 x 1 2 3
## 2 y 3 3 2

Using reshape2:

One of the first steps toward the tidyverse was the reshape2 package.

To get a matrix use acast:

library(reshape2)
acast(tmp, x~y, value.var="z")
## a b c
## x 1 2 3
## y 3 3 2

Or to get a data frame, use dcast, as here: Reshape data for values in one column.

dcast(tmp, x~y, value.var="z")
## x a b c
## 1 x 1 2 3
## 2 y 3 3 2

Using plyr:

In between reshape2 and the tidyverse came plyr, with the daply function, as shown here: https://stackoverflow.com/a/7020101/210673

library(plyr)
daply(tmp, .(x, y), function(x) x$z)
## y
## x a b c
## x 1 2 3
## y 3 3 2

Using matrix indexing:

This is kinda old school but is a nice demonstration of matrix indexing, which can be really useful in certain situations.

with(tmp, {
out <- matrix(nrow=nlevels(x), ncol=nlevels(y),
dimnames=list(levels(x), levels(y)))
out[cbind(x, y)] <- z
out
})

Using xtabs:

xtabs(z~x+y, data=tmp)

Using a sparse matrix:

There's also sparseMatrix within the Matrix package, as seen here: R - convert BIG table into matrix by column names

with(tmp, sparseMatrix(i = as.numeric(x), j=as.numeric(y), x=z,
dimnames=list(levels(x), levels(y))))
## 2 x 3 sparse Matrix of class "dgCMatrix"
## a b c
## x 1 2 3
## y 3 3 2

Using reshape:

You can also use the base R function reshape, as suggested here: Convert table into matrix by column names, though you have to do a little manipulation afterwards to remove an extra columns and get the names right (not shown).

reshape(tmp, idvar="x", timevar="y", direction="wide")
## x z.a z.b z.c
## 1 x 1 2 3
## 4 y 3 3 2

creating matrix from three column data frame in R

You may define the fun.aggregate=.

library(reshape2)
acast(df1, state~season, value.var = 'val_1', fun.aggregate=sum)
# spring summer winter
# BOS 26 19 14
# NY 10 24 3
# WASH 66 42 99

How to reshape data from long to wide format

Using reshape function:

reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")

Convert a three column dataframe into matrix

You could do

xtabs(freq~., df1)
# Rptname
#Gene Cha1 Cha2 Cha3 Cha4 Cha5 KI2 TH1
# Gene1 1 1 1 1 0 0 0
# Gene2 1 0 0 2 4 2 2

Or

library(reshape2)
acast(df1, Gene~Rptname, value.var='freq', fill=0)
# Cha1 Cha2 Cha3 Cha4 Cha5 KI2 TH1
#Gene1 1 1 1 1 0 0 0
#Gene2 1 0 0 2 4 2 2

Or use spread to get a 'data.frame' output

library(tidyr)
spread(df1, Rptname, freq, fill=0)

If your original dataset is 'dat', this could be also done using table

table(dat[c('Gene', 'Rptname')])

Convert R dataframe from long to wide format, but with unequal group sizes, for use with qcc

You can create a sequence column ('.id') using getanID from splitstackshape and use dcast from data.table to convert the long format to wide format. The output of splitstackshape is a data.table. When we load splitstackshape, data.table will also be loaded. So, if you already have the devel version of data.table, then the dcast from data.table can be used as well.

library(splitstackshape)
dcast(getanID(df1, 'time'), time~.id, value.var='measure')
# time 1 2 3 4 5
#1: 2001 Q1 0.1468068 0.53593193 0.5609797 NA NA
#2: 2001 Q2 -1.4810269 0.18150972 NA NA NA
#3: 2001 Q3 1.7201815 -0.08480855 -2.2320888 -1.152691 0.5797502

Update

As @snoram mentioned in the comments, function rowid from data.table makes it easier to use just data.table alone

library(data.table)
dcast(setDT(df1), time ~ rowid(time), value.var = "measure")

Reshape 3 Column Data with ID

I'll take a stab at this. To start, we'll make a reproducible dataset from the example data you posted:

df <- structure(list(SOURCE_SUBREDDIT = c("rddtgaming", "xboxone", 
"ps4", "fitnesscirclejerk", "fitnesscirclejerk", "fitnesscirclejerk",
"cancer", "jleague", "bestoftldr", "quityourbullshit"), TARGET_SUBREDDIT = c("rddtrust",
"battlefield_4", "battlefield_4", "leangains", "lifeprotips",
"leangains", "fuckcancer", "soccer", "tifu", "pics"), LINK_SENTIMENT = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), row.names = c(NA, 10L), class = "data.frame")

Note that fitnesscirclejerk is associated with leangains twice, which is a feature you mentioned occurs in your data:

df

SOURCE_SUBREDDIT TARGET_SUBREDDIT LINK_SENTIMENT
1 rddtgaming rddtrust 1
2 xboxone battlefield_4 1
3 ps4 battlefield_4 1
4 fitnesscirclejerk leangains 1
5 fitnesscirclejerk lifeprotips 1
6 fitnesscirclejerk leangains 1
7 cancer fuckcancer 1
8 jleague soccer 1
9 bestoftldr tifu 1
10 quityourbullshit pics 1

Now, the goal is to spread this from long-format to wide-format, as in the example image you posted. As you already determined, the identical rows (rows 4 and 6) pose a problem when trying to spread:

tidyr::spread(df, key = TARGET_SUBREDDIT, value = LINK_SENTIMENT, fill = 0)

Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 2 rows:
* 4, 6
Do you need to create unique ID with tibble::rowid_to_column()?

Since you want to keep the same number of rows when spreading, we can get around this by adding a unique ID to each row, so each row is unique. You do that with splitstackshape::getanID, but we can also do that with tidyverse packages:

df2 <- dplyr::mutate(df, rowid = dplyr::row_number())
df2 <- tibble::rowid_to_column(df)

Both of these give us this data.frame, which I am assuming is similar to your mydata_id:

df2

rowid SOURCE_SUBREDDIT TARGET_SUBREDDIT LINK_SENTIMENT
1 1 rddtgaming rddtrust 1
2 2 xboxone battlefield_4 1
3 3 ps4 battlefield_4 1
4 4 fitnesscirclejerk leangains 1
5 5 fitnesscirclejerk lifeprotips 1
6 6 fitnesscirclejerk leangains 1
7 7 cancer fuckcancer 1
8 8 jleague soccer 1
9 9 bestoftldr tifu 1
10 10 quityourbullshit pics 1

Now, when we spread, the existence of the unique ID column keeps R from combining (or trying to combine) the rows with identical subreddit pairs:

df3 <- tidyr::spread(df2, key = TARGET_SUBREDDIT, value = LINK_SENTIMENT, fill = 0)
df3

rowid SOURCE_SUBREDDIT battlefield_4 fuckcancer leangains lifeprotips pics rddtrust soccer tifu
1 1 rddtgaming 0 0 0 0 0 1 0 0
2 2 xboxone 1 0 0 0 0 0 0 0
3 3 ps4 1 0 0 0 0 0 0 0
4 4 fitnesscirclejerk 0 0 1 0 0 0 0 0
5 5 fitnesscirclejerk 0 0 0 1 0 0 0 0
6 6 fitnesscirclejerk 0 0 1 0 0 0 0 0
7 7 cancer 0 1 0 0 0 0 0 0
8 8 jleague 0 0 0 0 0 0 1 0
9 9 bestoftldr 0 0 0 0 0 0 0 1
10 10 quityourbullshit 0 0 0 0 1 0 0 0

As you can see, the output of this mirrors the format of your desired output image and preserves both the order of the relationship and duplicate rows.

Reshape from long to wide according to the number of occurrence of one variable

You can try something like this:

df1 %>% 
group_by(person, visitID) %>%
summarise(across(matches("v[0-9]+"), list)) %>%
group_by(person) %>%
mutate(visit = seq_len(n()) %>% str_c("visit.", .)) %>%
ungroup() %>%
pivot_wider(
id_cols = person,
names_from = visit,
values_from = c("visitID", matches("v[0-9]+"))
)

replace list with ~str_c(.x, collapse = ",") if you want to have it in character style.

Convert dataframe from wide format to long format with key stored in row R

Here is one implementation. We can split the tibble into a list of tibble based on the last row, loop through the list with imap, rename the colums to same column names ('a', 'b'), mutate to create the column 'c' with the list name and bind the rows

library(tidyverse)
df %>%
slice(-n()) %>%
split.default(df %>%
slice(n()) %>%
flatten_chr) %>%
imap_dfr(~ .x %>%
rename_all(~ c('a', 'b')) %>%
mutate(c = .y))
# A tibble: 20 x 3
# a b c
# <chr> <chr> <chr>
# 1 5 5 G1
# 2 3 3 G1
# 3 5 5 G1
# 4 6 6 G1
# 5 2 2 G1
# 6 8 8 G2
# 7 2 2 G2
# 8 6 6 G2
# 9 4 4 G2
#10 1 1 G2
#11 9 9 G3
#12 3 3 G3
#13 7 7 G3
#14 8 8 G3
#15 4 4 G3
#16 6 6 G4
#17 5 5 G4
#18 2 2 G4
#19 1 1 G4
#20 8 8 G4


Related Topics



Leave a reply



Submit