# How to Use Cast or Another Function to Create a Binary Table in R

## How to use cast or another function to create a binary table in R

Original data:

```r
x <- data.frame(id = c(1, 1, 2, 3, 3), region = factor(c(2, 3, 2, 1, 1)))
x
#   id region
# 1  1      2
# 2  1      3
# 3  2      2
# 4  3      1
# 5  3      1
```

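Build one indicator column per region with `model.matrix()`, then aggregate by `id`, taking the maximum of each column: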

``aggregate(model.matrix(~ region - 1, data=x), x["id"], max)``

Result:

```r
#   id region1 region2 region3
# 1  1       0       1       1
# 2  2       0       1       0
# 3  3       1       0       0
```
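To see what each step contributes, the one-liner can be split in two. This is only a restatement of the call above on the same sample data; the intermediate names `ind` and `res` are introduced here for illustration.

```r
x <- data.frame(id = c(1, 1, 2, 3, 3), region = factor(c(2, 3, 2, 1, 1)))

# Step 1: one indicator column per factor level; "- 1" drops the intercept
ind <- model.matrix(~ region - 1, data = x)
colnames(ind)

# Step 2: within each id, max() is 1 if any row had that region, else 0
res <- aggregate(ind, x["id"], max)
res
```

Using `max` rather than `sum` is what keeps the result binary when an id has the same region more than once (as id 3 does here).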

## How to programmatically create binary columns based on a categorical variable in data.table?

data.table has its own `dcast` implementation using data.table's internals and should be fast. Give this a try:

```r
dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill = 0L)
#    id a b c d e
# 1:  1 0 1 1 1 1
# 2:  2 1 0 1 0 1
# 3:  3 1 0 1 1 1
```

I just thought of another way to handle this: preallocate the result and update it by reference (perhaps `dcast`'s logic should be done like this to avoid intermediates).

``ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]``

All that's left is to fill existing combinations with `1L`.

```r
dt[, {set(ans, i = .GRP, j = unique(y), value = 1L); NULL}, by = id]
ans
#    id b d c e a
# 1:  1 1 1 1 1 0
# 2:  2 0 0 1 1 1
# 3:  3 0 1 1 1 1
```
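The preallocate-then-fill idea does not depend on data.table. As a rough base-R analogue (not the answer's `set()` approach), the sketch below fills a zero matrix with a single indexed assignment. The input `long` is hypothetical: the source never shows `dt`, so the rows were chosen to be consistent with the output printed above.

```r
# Hypothetical long-format input, consistent with the output shown above
long <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  y  = c("b", "c", "d", "e", "a", "c", "e", "a", "c", "d", "e")
)

ids  <- sort(unique(long$id))
cols <- sort(unique(long$y))

# Preallocate an all-zero integer matrix: one row per id, one column per level
ans <- matrix(0L, nrow = length(ids), ncol = length(cols),
              dimnames = list(as.character(ids), cols))

# Fill every existing (id, y) combination with 1L using (row, column) index pairs
ans[cbind(match(long$id, ids), match(long$y, cols))] <- 1L
ans
```

Duplicate (id, y) pairs are harmless here, because assigning `1L` to the same cell twice leaves it at `1L`.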

Okay, I've gone ahead and benchmarked this on data with the OP's dimensions: ~10 million rows and 10 columns.

```r
require(data.table)
set.seed(45L)
y = apply(matrix(sample(letters, 10L*20L, TRUE), ncol=20L), 1L, paste, collapse="")
dt = data.table(id=sample(1e5,1e7,TRUE), y=sample(y,1e7,TRUE))

system.time(ans1 <- AnsFunction())   # 2.3s
system.time(ans2 <- dcastFunction()) # 2.2s
system.time(ans3 <- TableFunction()) # 6.2s

# bring all three results into the same column and row order before comparing
setcolorder(ans1, names(ans2))
setcolorder(ans3, names(ans2))
setorder(ans1, id)
setkey(ans2, NULL)
setorder(ans3, id)

identical(ans1, ans2) # TRUE
identical(ans1, ans3) # TRUE
```

where,

```r
AnsFunction <- function() {
    ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
    dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
    ans
    # reorder columns outside
}

dcastFunction <- function() {
    # no need to load reshape2. data.table has its own dcast as well
    # no need for setDT
    df <- dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L, value.var = "y")
}

TableFunction <- function() {
    # need to return integer results for identical results
    # fixed 1 -> 1L; as.numeric -> as.integer
    df <- as.data.frame.matrix(table(dt$id, dt$y))
    df[df > 1L] <- 1L
    df <- cbind(id = as.integer(row.names(df)), df)
    setDT(df)
}
```

## R, change a character string in dataframe to binary values

Just use `ifelse`:

```r
# your data
data = data.frame(Landtype = c("Rural", "Urban", "Rural", "Urban"))

# ifelse condition
data$Landtype = ifelse(data$Landtype == "Rural", 1, 0)
```
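For a plain two-way recode like this, converting the logical comparison directly is a common alternative to `ifelse`. The sketch below is not from the original answer; it uses the same sample column and returns integers rather than doubles.

```r
data <- data.frame(Landtype = c("Rural", "Urban", "Rural", "Urban"))

# The comparison yields TRUE/FALSE; as.integer() maps TRUE -> 1L and FALSE -> 0L
data$Landtype <- as.integer(data$Landtype == "Rural")
data$Landtype
```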

## selecting columns using a binary table in R

You can use `apply` to iterate over the columns of a binary matrix, `bin`, and use each column to subset the columns of a data frame, `dat`:

```r
# create test data
set.seed(1)
dat <- as.data.frame(matrix(rnorm(18), nrow = 2))
colnames(dat) <- paste0('c', 1:9)
dat
#           c1         c2         c3        c4         c5        c6         c7          c8
# 1 -0.6264538 -0.8356286  0.3295078 0.4874291  0.5757814 1.5117812 -0.6212406  1.12493092
# 2  0.1836433  1.5952808 -0.8204684 0.7383247 -0.3053884 0.3898432 -2.2146999 -0.04493361
#            c9
# 1 -0.01619026
# 2  0.94383621

bin <- matrix(sample(0:1, 27, replace = TRUE), nrow = 9)
bin
#       [,1] [,2] [,3]
#  [1,]    1    1    0
#  [2,]    0    0    0
#  [3,]    1    0    0
#  [4,]    0    1    1
#  [5,]    1    1    1
#  [6,]    1    0    0
#  [7,]    1    1    1
#  [8,]    1    0    0
#  [9,]    1    0    0

# subset columns of dat, using binary vector columns defined in bin;
# drop = FALSE is included to prevent any columns with only a single "1" from
# being cast to a vector
apply(bin, 2, function(x) { dat[, as.logical(x), drop = FALSE] })
# [[1]]
#           c1         c3         c5        c6         c7          c8          c9
# 1 -0.6264538  0.3295078  0.5757814 1.5117812 -0.6212406  1.12493092 -0.01619026
# 2  0.1836433 -0.8204684 -0.3053884 0.3898432 -2.2146999 -0.04493361  0.94383621
#
# [[2]]
#           c1        c4         c5         c7
# 1 -0.6264538 0.4874291  0.5757814 -0.6212406
# 2  0.1836433 0.7383247 -0.3053884 -2.2146999
#
# [[3]]
#          c4         c5         c7
# 1 0.4874291  0.5757814 -0.6212406
# 2 0.7383247 -0.3053884 -2.2146999
```
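An equivalent formulation loops over column indices with `lapply()`, which always returns a list and makes it easy to name each subset. The sketch below uses small, deterministic stand-ins for `dat` and `bin`; the values and the column names `s1` and `s2` are hypothetical.

```r
# Small deterministic stand-ins for `dat` and `bin` (hypothetical values)
dat <- data.frame(c1 = 1:2, c2 = 3:4, c3 = 5:6)
bin <- cbind(s1 = c(1, 0, 1), s2 = c(0, 1, 0))

# One data frame per column of `bin`; drop = FALSE keeps single-column
# selections as data frames instead of collapsing them to vectors
subsets <- lapply(seq_len(ncol(bin)), function(j) dat[, bin[, j] == 1, drop = FALSE])
names(subsets) <- colnames(bin)

names(subsets$s1)
names(subsets$s2)
```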

## R - Function to make a binary variable

You can use:

```r
df[] <- +(df == 4 | df == 5)
df
#  var1 var2 var3
#1    0    0   NA
#2    1    0    1
#3    0    1    1
#4    0    1    0
```

The comparison `df == 4 | df == 5` returns logical values (`TRUE`/`FALSE`); the unary `+` then converts those logical values to integers (`1`/`0`).
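A short sketch of how this behaves on a single vector, including the missing value. The vector `v` mirrors `var3` from the sample data; the `%in%` variant is an alternative added here for comparison, not part of the original answer.

```r
# Unary plus coerces logical to integer and keeps NA as NA
+c(TRUE, FALSE, NA)

v <- c(NA, 4L, 5L, 3L)

# Comparison with ==: the NA stays NA
with_eq <- +(v == 4 | v == 5)

# Comparison with %in%: more compact, but NA is treated as "not a match" (0)
with_in <- +(v %in% c(4, 5))

with_eq
with_in
```

Which variant is appropriate depends on whether a missing input should stay missing or count as 0.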

If you want to apply this for selected columns you can subset the columns by position or by name.

```r
cols <- 1:3 # Position
# cols <- grep('var', names(df)) # Name
df[cols] <- +(df[cols] == 4 | df[cols] == 5)
```

As far as your function is concerned, you can do:

```r
making_binary <- function (var){
  var <- as.integer(var >= 4)
  # which is a faster version of
  # var <- ifelse(var >= 4, 1, 0)
  return(var)
}

df[] <- lapply(df, making_binary)
```
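The two forms mentioned in the function's comments give the same values but different storage types, which can matter when comparing results with `identical()`. A minimal check on a vector that mirrors `var3` from the sample data:

```r
var <- c(NA, 4L, 5L, 3L)

a <- as.integer(var >= 4)      # integer result; NA is preserved
b <- ifelse(var >= 4, 1, 0)    # double result; NA is preserved

isTRUE(all.equal(a, b))        # values agree
is.integer(a)
is.integer(b)
```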

Data:

``df <- structure(list(var1 = c(1L, 4L, 3L, 2L), var2 = c(1L, 3L, 4L, 5L), var3 = c(NA, 4L, 5L, 3L)), class = "data.frame", row.names = c(NA, -4L))``

## Reshape data in R, cast function arguments

The OP asked for help with the arguments to the `cast()` function of the `reshape` package. However, the `reshape` package was superseded by the `reshape2` package from the same package author. According to the package description, the `reshape2` package is "A Reboot of the Reshape Package".

Using `reshape2`, the desired result can be produced with

```r
reshape2::dcast(wc, PARENT_MOL_CHEMBL_ID ~ TARGET_TYPE, fun.aggregate = length,
                value.var = "TARGET_TYPE")
#  PARENT_MOL_CHEMBL_ID ABL EGFR TP53
#1                  C10   1    1    0
#2                 C939   0    0    1
```

BTW: The `data.table` package has implemented (and enhanced) `dcast()` as well. So, the same result can be produced with

```r
data.table::dcast(wc, PARENT_MOL_CHEMBL_ID ~ TARGET_TYPE, fun.aggregate = length,
                  value.var = "TARGET_TYPE")
```

The OP mentioned other columns in the data frame which should be shown together with the spread or wide data. Unfortunately, the OP hasn't supplied particular sample data, so we have to consider two use cases.

### Case 1: Additional columns go along with the id column

The data could look like

```r
wc
#  PARENT_MOL_CHEMBL_ID TARGET_TYPE extra_col1
#1                  C10         ABL          a
#2                  C10        EGFR          a
#3                 C939        TP53          b
```

Note that the values in `extra_col1` are in line with `PARENT_MOL_CHEMBL_ID`.

This is an easy case, because the formula in `dcast()` accepts `...` which represents all other variables not used in the formula:

```r
reshape2::dcast(wc, ... ~ TARGET_TYPE, fun.aggregate = length,
                value.var = "TARGET_TYPE")
#  PARENT_MOL_CHEMBL_ID extra_col1 ABL EGFR TP53
#1                  C10          a   1    1    0
#2                 C939          b   0    0    1
```

The resulting data.frame does contain all other columns.

### Case 2: Additional columns don't go along with the id column

```r
wc
#  PARENT_MOL_CHEMBL_ID TARGET_TYPE extra_col1 extra_col2
#1                  C10         ABL          a          1
#2                  C10        EGFR          a          2
#3                 C939        TP53          b          3
```

Note that `extra_col2` has two different values for `C10`. This will cause the simple approach to fail. So, a two-step approach has to be implemented: reshaping first and then joining the result with the original data frame. The `data.table` package is used for both steps here:

```r
library(data.table)

# reshape from long to wide, result has only one row per id column
wide <- dcast(setDT(wc), PARENT_MOL_CHEMBL_ID ~ TARGET_TYPE, fun.aggregate = length,
              value.var = "TARGET_TYPE")

# right join, i.e., all rows of wc are included
wide[wc, on = "PARENT_MOL_CHEMBL_ID"]
#   PARENT_MOL_CHEMBL_ID ABL EGFR TP53 TARGET_TYPE extra_col1 extra_col2
#1:                  C10   1    1    0         ABL          a          1
#2:                  C10   1    1    0        EGFR          a          2
#3:                 C939   0    0    1        TP53          b          3
```

The result shows the aggregated values in wide format together with any other columns.
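The same reshape-then-join pattern can be sketched in base R, which may help when neither package is available. This is an illustrative alternative, not the answer's method: `wc` is rebuilt from the Case 2 sample rows, and `counts` and `merged` are names introduced here.

```r
# Sample data rebuilt from the Case 2 rows shown above
wc <- data.frame(
  PARENT_MOL_CHEMBL_ID = c("C10", "C10", "C939"),
  TARGET_TYPE          = c("ABL", "EGFR", "TP53"),
  extra_col1           = c("a", "a", "b"),
  extra_col2           = c(1, 2, 3)
)

# Step 1: reshape to wide counts, one row per id
counts <- as.data.frame.matrix(table(wc$PARENT_MOL_CHEMBL_ID, wc$TARGET_TYPE))
counts$PARENT_MOL_CHEMBL_ID <- rownames(counts)

# Step 2: join the wide counts back onto every original row
merged <- merge(wc, counts, by = "PARENT_MOL_CHEMBL_ID")
merged
```

Unlike the data.table join, `merge()` puts the original columns first and the count columns last, and it sorts the result by the key column.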

## How to convert two character columns to a binary matrix?

You can use:

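```r
library(tidyverse)

df %>%
  # id_cols is named explicitly; recent tidyr versions deprecate passing it by position
  pivot_wider(id_cols = y,
              names_from = x,
              values_from = x,
              values_fn = list(x = length),
              values_fill = list(x = 0))
#   y         A     B     C
#   <chr> <int> <int> <int>
# 1 m         1     0     0
# 2 n         1     0     0
# 3 o         0     1     0
# 4 p         0     0     1
# 5 q         0     0     1
# 6 r         0     0     1
```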
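If an actual matrix is wanted rather than a tibble, base R's `table()` is one possible route. The input below is hypothetical: the source does not show `df`, so its rows were reconstructed to match the output above.

```r
# Hypothetical input, reconstructed to match the output shown above
df <- data.frame(
  x = c("A", "A", "B", "C", "C", "C"),
  y = c("m", "n", "o", "p", "q", "r")
)

# Rows are the y values, columns are the x values; counts are reduced to 0/1
m <- +(table(df$y, df$x) > 0)
m
```

Comparing against zero before applying `+` keeps the cells binary even if a (y, x) pair occurs more than once.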