How to Use Cast or Another Function to Create a Binary Table in R


Original data:

x <- data.frame(id=c(1,1,2,3,3), region=factor(c(2,3,2,1,1)))

> x
  id region
1  1      2
2  1      3
3  2      2
4  3      1
5  3      1

Group up the data:

aggregate(model.matrix(~ region - 1, data=x), x["id"], max)

Result:

  id region1 region2 region3
1  1       0       1       1
2  2       0       1       0
3  3       1       0       0
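
For comparison, the same indicator table can be sketched in base R with table() instead of model.matrix(). This is only an alternative, assuming the goal is one row per id and a 0/1 flag per region level; the names counts, binary and result are illustrative and not from the original answer.

```r
# Same sample data as above
x <- data.frame(id = c(1, 1, 2, 3, 3), region = factor(c(2, 3, 2, 1, 1)))

# Cross-tabulate id against region, then collapse counts to 0/1
counts <- table(x$id, x$region)
binary <- as.data.frame.matrix((counts > 0) * 1L)

# Prefix the level names so the columns read region1, region2, region3
names(binary) <- paste0("region", names(binary))

# Recover id from the row names and put it first
result <- cbind(id = as.numeric(rownames(binary)), binary)
rownames(result) <- NULL
result
```

Comparing the counts with 0 matters here because id 3 appears twice with region 1, so the raw count would be 2 rather than 1.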

How to programmatically create binary columns based on a categorical variable in data.table?

data.table has its own dcast implementation using data.table's internals and should be fast. Give this a try:

dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L)
#    id a b c d e
# 1:  1 0 1 1 1 1
# 2:  2 1 0 1 0 1
# 3:  3 1 0 1 1 1

Just thought of another way to handle this by preallocating and updating by reference (perhaps dcast's logic should be done like this to avoid intermediates).

ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]

All that's left is to fill existing combinations with 1L.

dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
ans
#    id b d c e a
# 1:  1 1 1 1 1 0
# 2:  2 0 0 1 1 1
# 3:  3 0 1 1 1 1
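
The snippets above assume an existing dt. Here is a minimal, self-contained sketch of both approaches side by side. The toy dt is hypothetical, reconstructed so that it is consistent with the output shown, and value.var = "y" is added only so dcast does not have to guess the value column.

```r
library(data.table)

# Hypothetical toy data: one row per observed (id, y) pair
dt <- data.table(
  id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  y  = c("b", "d", "c", "e", "c", "e", "a", "d", "c", "e", "a")
)

# Approach 1: dcast, marking every observed combination with 1L
wide <- dcast(dt, id ~ y, fun.aggregate = function(x) 1L,
              fill = 0L, value.var = "y")

# Approach 2: preallocate zeros, then set observed cells to 1L by reference
cols <- unique(dt$y)
ans <- data.table(id = unique(dt$id))[, (cols) := 0L][]
invisible(dt[, {set(ans, i = .GRP, j = unique(y), value = 1L); NULL}, by = id])

# Align the column order before comparing the two results
setcolorder(ans, names(wide))
all.equal(as.data.frame(ans), as.data.frame(wide))
```

The preallocated version relies on .GRP following the order in which ids first appear, which is the same order unique(dt$id) produces, so row i of ans corresponds to group i.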

Okay, I've gone ahead and benchmarked on the OP's data dimensions, with ~10 million rows and 10 columns.

require(data.table)
set.seed(45L)
y = apply(matrix(sample(letters, 10L*20L, TRUE), ncol=20L), 1L, paste, collapse="")
dt = data.table(id=sample(1e5,1e7,TRUE), y=sample(y,1e7,TRUE))

system.time(ans1 <- AnsFunction()) # 2.3s
system.time(ans2 <- dcastFunction()) # 2.2s
system.time(ans3 <- TableFunction()) # 6.2s

setcolorder(ans1, names(ans2))
setcolorder(ans3, names(ans2))
setorder(ans1, id)
setkey(ans2, NULL)
setorder(ans3, id)

identical(ans1, ans2) # TRUE
identical(ans1, ans3) # TRUE

where,

AnsFunction <- function() {
  ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
  dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
  ans
  # reorder columns outside
}

dcastFunction <- function() {
  # no need to load reshape2. data.table has its own dcast as well
  # no need for setDT
  df <- dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L, value.var = "y")
}

TableFunction <- function() {
  # need to return integer results for identical results
  # fixed 1 -> 1L; as.numeric -> as.integer
  df <- as.data.frame.matrix(table(dt$id, dt$y))
  df[df > 1L] <- 1L
  df <- cbind(id = as.integer(row.names(df)), df)
  setDT(df)
}

R, change a character string in dataframe to binary values

Just use ifelse

#your data
data = data.frame(Landtype = c("Rural", "Urban", "Rural", "Urban"))
#ifelse condition
data$Landtype = ifelse(data$Landtype == "Rural", 1,0)
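
Since the comparison already yields TRUE/FALSE, converting it directly is an equivalent sketch of the same idea. The column name is_rural is hypothetical; it is used here so the original labels are kept instead of being overwritten.

```r
# Same sample data as above
data <- data.frame(Landtype = c("Rural", "Urban", "Rural", "Urban"))

# TRUE/FALSE -> 1/0 without ifelse(); stored in a new, illustrative column
data$is_rural <- as.integer(data$Landtype == "Rural")
data
```

One small difference: ifelse(cond, 1, 0) returns doubles, while as.integer() returns integers.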

Selecting columns using a binary table in R

You can use apply to iterate over the columns of a binary matrix, bin, using each column to subset the columns of a data frame, dat:

# create test data
set.seed(1)
dat <- as.data.frame(matrix(rnorm(18), nrow=2))
colnames(dat) <- paste0('c', 1:9)

dat
# c1 c2 c3 c4 c5 c6 c7 c8
# 1 -0.6264538 -0.8356286 0.3295078 0.4874291 0.5757814 1.5117812 -0.6212406 1.12493092
# 2 0.1836433 1.5952808 -0.8204684 0.7383247 -0.3053884 0.3898432 -2.2146999 -0.04493361
# c9
# 1 -0.01619026
# 2 0.94383621

bin <- matrix(sample(0:1, 27, replace = TRUE), nrow = 9)

bin
# [,1] [,2] [,3]
# [1,] 1 1 0
# [2,] 0 0 0
# [3,] 1 0 0
# [4,] 0 1 1
# [5,] 1 1 1
# [6,] 1 0 0
# [7,] 1 1 1
# [8,] 1 0 0
# [9,] 1 0 0

# subset columns of dat, using binary vector columns defined in bin;
# drop = FALSE is included to prevent any columns with only a single "1" from
# being cast to a vector
apply(bin, 2, function(x) { dat[, as.logical(x), drop = FALSE] })
# [[1]]
# c1 c3 c5 c6 c7 c8 c9
# 1 -0.6264538 0.3295078 0.5757814 1.5117812 -0.6212406 1.12493092 -0.01619026
# 2 0.1836433 -0.8204684 -0.3053884 0.3898432 -2.2146999 -0.04493361 0.94383621
#
# [[2]]
# c1 c4 c5 c7
# 1 -0.6264538 0.4874291 0.5757814 -0.6212406
# 2 0.1836433 0.7383247 -0.3053884 -2.2146999
#
# [[3]]
# c4 c5 c7
# 1 0.4874291 0.5757814 -0.6212406
# 2 0.7383247 -0.3053884 -2.2146999
#
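
To make the drop = FALSE remark concrete, here is a tiny sketch with a mask that selects exactly one column. The names dat_small, mask, collapsed and kept are illustrative only.

```r
# Small illustrative data frame and a mask that selects exactly one column
dat_small <- data.frame(c1 = 1:2, c2 = 3:4, c3 = 5:6)
mask <- c(0, 1, 0)

# Without drop = FALSE the single selected column collapses to a vector
collapsed <- dat_small[, as.logical(mask)]

# With drop = FALSE the result stays a one-column data frame
kept <- dat_small[, as.logical(mask), drop = FALSE]

is.data.frame(collapsed)
is.data.frame(kept)
```

Keeping the data frame structure means every element of the resulting list has the same type, whatever the number of 1s in a mask column.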

R - Function to make a binary variable

You can use:

df[] <- +(df == 4 | df == 5)
df
#   var1 var2 var3
# 1    0    0   NA
# 2    1    0    1
# 3    0    1    1
# 4    0    1    0

The comparison df == 4 | df == 5 returns logical values (TRUE/FALSE); the unary + then converts them to integers (1/0). NA values stay NA.
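
To see the conversion in isolation, here is a small sketch on a plain vector (v is an illustrative name). It also contrasts == with %in%, which is a swapped-in alternative and not part of the answer above; the two differ in how they treat NA.

```r
# Illustrative vector with a missing value
v <- c(1L, 4L, NA, 5L)

# == propagates NA, so the flag is NA where the input is NA
flag_or <- +(v == 4 | v == 5)

# %in% never returns NA, so a missing input is flagged as 0
flag_in <- +(v %in% c(4, 5))

flag_or
flag_in
```

Which behaviour is preferable depends on whether a missing value should stay missing or count as "not 4 or 5".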

If you want to apply this for selected columns you can subset the columns by position or by name.

cols <- 1:3 #Position
#cols <- grep('var', names(df)) #Name
df[cols] <- +(df[cols] == 4 | df[cols] == 5)

As far as your function is concerned, you can do:

making_binary <- function(var) {
  # as.integer() on the comparison is a faster version of
  # var <- ifelse(var >= 4, 1, 0)
  var <- as.integer(var >= 4)
  return(var)
}

df[] <- lapply(df, making_binary)

data

df <- structure(list(var1 = c(1L, 4L, 3L, 2L), var2 = c(1L, 3L, 4L, 
5L), var3 = c(NA, 4L, 5L, 3L)), class = "data.frame", row.names = c(NA, -4L))

Reshape data in R, cast function arguments

The OP asked for help with the arguments to the cast() function of the reshape package. However, the reshape package was superseded by the reshape2 package from the same package author. According to the package description, the reshape2 package is

A Reboot of the Reshape Package

Using reshape2, the desired result can be produced with

reshape2::dcast(wc, PARENT_MOL_CHEMBL_ID ~ TARGET_TYPE, fun.aggregate = length,
                value.var = "TARGET_TYPE")
#   PARENT_MOL_CHEMBL_ID ABL EGFR TP53
# 1                  C10   1    1    0
# 2                 C939   0    0    1

BTW: The data.table package has implemented (and enhanced) dcast() as well. So, the same result can be produced with

data.table::dcast(wc, PARENT_MOL_CHEMBL_ID ~ TARGET_TYPE, fun.aggregate = length, 
value.var = "TARGET_TYPE")
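
The question's wc is not shown at this point, so the following sketch builds a hypothetical three-row version that is consistent with the result above. It is created as a data.table so that data.table's own dcast method is the one that gets dispatched.

```r
library(data.table)

# Hypothetical sample data consistent with the result shown above
wc <- data.table(
  PARENT_MOL_CHEMBL_ID = c("C10", "C10", "C939"),
  TARGET_TYPE = c("ABL", "EGFR", "TP53")
)

# Count occurrences of each TARGET_TYPE per id
wide <- dcast(wc, PARENT_MOL_CHEMBL_ID ~ TARGET_TYPE,
              fun.aggregate = length, value.var = "TARGET_TYPE")
wide
```

Note that length returns counts, so if an id could list the same target more than once, the cells would exceed 1 and would need to be capped to get a strictly binary table.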


Additional columns

The OP mentioned other columns in the data frame which should be shown together with the spread or wide data. Unfortunately, the OP hasn't supplied particular sample data, so we have to consider two use cases.

Case 1: Additional columns go along with the id column

The data could look like

wc
#   PARENT_MOL_CHEMBL_ID TARGET_TYPE extra_col1
# 1                  C10         ABL          a
# 2                  C10        EGFR          a
# 3                 C939        TP53          b

Note that the values in extra_col1 are in line with PARENT_MOL_CHEMBL_ID.

This is an easy case, because the formula in dcast() accepts ... which represents all other variables not used in the formula:

reshape2::dcast(wc, ... ~ TARGET_TYPE, fun.aggregate = length,
                value.var = "TARGET_TYPE")
#   PARENT_MOL_CHEMBL_ID extra_col1 ABL EGFR TP53
# 1                  C10          a   1    1    0
# 2                 C939          b   0    0    1

The resulting data.frame does contain all other columns.

Case 2: Additional columns don't go along with the id column

Now, another column is added:

wc
#   PARENT_MOL_CHEMBL_ID TARGET_TYPE extra_col1 extra_col2
# 1                  C10         ABL          a          1
# 2                  C10        EGFR          a          2
# 3                 C939        TP53          b          3

Note that extra_col2 has two different values for C10. This makes the simple approach fail: with ... in the formula, every distinct combination of the remaining columns becomes its own row, so C10 would no longer collapse into a single row. A two-step approach is needed instead: reshape first, then join the result with the original data frame. The data.table package is used for both steps:

library(data.table)
# reshape from long to wide, result has only one row per id column
wide <- dcast(setDT(wc), PARENT_MOL_CHEMBL_ID ~ TARGET_TYPE, fun.aggregate = length,
              value.var = "TARGET_TYPE")
# right join, i.e., all rows of wc are included
wide[wc, on = "PARENT_MOL_CHEMBL_ID"]
#    PARENT_MOL_CHEMBL_ID ABL EGFR TP53 TARGET_TYPE extra_col1 extra_col2
# 1:                  C10   1    1    0         ABL          a          1
# 2:                  C10   1    1    0        EGFR          a          2
# 3:                 C939   0    0    1        TP53          b          3

The result shows the aggregated values in wide format together with any other columns.

How to convert two character columns to a binary matrix?

You can use:

library(tidyverse)
df %>%
  # id_cols is passed by name rather than by position
  pivot_wider(id_cols = y,
              names_from = x,
              values_from = x,
              values_fn = list(x = length),
              values_fill = list(x = 0))

  y         A     B     C
  <chr> <int> <int> <int>
1 m         1     0     0
2 n         1     0     0
3 o         0     1     0
4 p         0     0     1
5 q         0     0     1
6 r         0     0     1
6 r 0 0 1

