Combine Multiple Columns into Tidy Data


Almost every data tidying problem can be solved in three steps:

  1. Gather all non-variable columns
  2. Separate "colname" column into multiple variables
  3. Re-spread the data

(Often you'll need only one or two of these, but I think they almost always come in this order.)

For your data:

  1. The only column that's already a variable is unique.id
  2. You need to split current column names into variable and number
  3. Then you need to put the "variable" variable back into columns

This looks like:

library(tidyr)
library(dplyr)

df3 %>%
  gather(col, value, -unique.id, -intervention) %>%
  separate(col, c("variable", "number")) %>%
  spread(variable, value, convert = TRUE) %>%
  mutate(start = as.Date(start, origin = "1970-01-01"),
         stop = as.Date(stop, origin = "1970-01-01"))

Your case is a bit more complicated because you have two types of variables, so you need to restore the types at the end.
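
In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The sketch below shows how the same three steps might collapse into one call. The toy data frame and its column names (start_1, score_1, ...) are assumptions made up for illustration; they are not the df3 from the question.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide data: two rounds of a date and a numeric score per id.
toy <- tibble(
  unique.id = c(1, 2),
  start_1 = as.Date(c("2020-01-01", "2020-02-01")),
  score_1 = c(10, 20),
  start_2 = as.Date(c("2020-03-01", "2020-04-01")),
  score_2 = c(11, 21)
)

# ".value" sends the first part of each column name back into columns,
# so the gather + separate + spread steps happen in a single call.
tidy <- toy %>%
  pivot_longer(
    cols = -unique.id,
    names_to = c(".value", "number"),
    names_sep = "_"
  )

tidy
```

Because the values never pass through a single mixed-type column, the dates stay dates and no type restoration is needed at the end.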

How to combine multiple columns in R

If you don't mind using an external package, you can use separate_rows() from the tidyr package.

library(tidyverse)

df %>% separate_rows(-name, sep = "/")

# A tibble: 3 × 4
  name  age   height weight
  <chr> <chr> <chr>  <chr>
1 Jack  1     12     30
2 Jack  2     15     40
3 Jack  3     18     37
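
The input data frame is not shown in the answer. The sketch below assumes it held "/"-delimited strings, which is a guess reconstructed from the output. It also shows the convert argument of separate_rows(), which may be useful if the split values should become numbers rather than stay as character.

```r
library(tidyr)

# Assumed input: one row per person, values packed with "/" (reconstructed guess).
df <- data.frame(
  name = "Jack",
  age = "1/2/3",
  height = "12/15/18",
  weight = "30/40/37"
)

# Split every column except `name`; results stay character by default.
long <- separate_rows(df, -name, sep = "/")

# convert = TRUE runs type conversion on the split columns.
long_typed <- separate_rows(df, -name, sep = "/", convert = TRUE)

str(long_typed)
```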

Combining columns into one in tidyverse

Tidyr's unite will do that for you.

library(tidyr)

iris %>% unite(New_Column, Sepal.Length, Species, Sepal.Width)

Output (first 10 rows):

       New_Column Petal.Length Petal.Width
1  5.1_setosa_3.5          1.4         0.2
2    4.9_setosa_3          1.4         0.2
3  4.7_setosa_3.2          1.3         0.2
4  4.6_setosa_3.1          1.5         0.2
5    5_setosa_3.6          1.4         0.2
6  5.4_setosa_3.9          1.7         0.4
7  4.6_setosa_3.4          1.4         0.3
8    5_setosa_3.4          1.5         0.2
9  4.4_setosa_2.9          1.4         0.2
10 4.9_setosa_3.1          1.5         0.1
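
As a small variation, unite() also takes sep and remove arguments. The sketch below uses a different separator and keeps the source columns. The choice of "-" and of only two input columns is purely illustrative.

```r
library(tidyr)

# Join two columns with "-" and keep the originals alongside the new column.
united <- unite(head(iris, 3), New_Column, Sepal.Length, Species,
                sep = "-", remove = FALSE)

united$New_Column
```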

Combine multiple columns in R into a new vector column (preferably a tidyr solution)

Here, we can use rowwise:

library(dplyr)
df %>%
  rowwise %>%
  mutate(C = list(c(A, B))) %>%
  ungroup
# A tibble: 3 x 3
#      A     B C
#  <dbl> <dbl> <list>
#1     1     2 <dbl [2]>
#2     3     4 <dbl [2]>
#3     5     6 <dbl [2]>

Or with map2, which by default returns a list. Here, we loop over the corresponding elements of 'A' and 'B' and concatenate them with c:

library(dplyr)
library(purrr)
df %>%
  mutate(C = map2(A, B, c))
# A tibble: 3 x 3
#      A     B C
#  <dbl> <dbl> <list>
#1     1     2 <dbl [2]>
#2     3     4 <dbl [2]>
#3     5     6 <dbl [2]>

Update

Based on the OP's comments, if we want to create a list column from only the columns that have the suffix _id:

names(df) <- paste0(names(df), "_id")
df %>%
  rowwise %>%
  mutate(C = list(c_across(ends_with("_id")))) %>%
  ungroup

-output

# A tibble: 3 x 3
#   A_id  B_id C
#  <dbl> <dbl> <list>
#1     1     2 <dbl [2]>
#2     3     4 <dbl [2]>
#3     5     6 <dbl [2]>

If the substring "_id" is at the beginning of the names instead, change ends_with to starts_with, or use matches("^_id").

Or with pmap

df %>%
  mutate(C = pmap(select(., ends_with("_id")), ~ c(...)))

-output

# A tibble: 3 x 3
#   A_id  B_id C
#  <dbl> <dbl> <list>
#1     1     2 <dbl [2]>
#2     3     4 <dbl [2]>
#3     5     6 <dbl [2]>

Or using Map from base R

# "_id$" anchors the match to the end of the name, so only suffixed columns are used
df$C <- do.call(Map, c(f = c, df[grep("_id$", names(df))]))

Gather multiple sets of columns

This approach seems pretty natural to me:

df %>%
  gather(key, value, -id, -time) %>%
  extract(key, c("question", "loop_number"), "(Q.\\..)\\.(.)") %>%
  spread(question, value)

First gather all question columns, use extract() to separate into question and loop_number, then spread() question back into the columns.

#>   id       time loop_number         Q3.2        Q3.3
#> 1  1 2009-01-01           1  0.142259203 -0.35842736
#> 2  1 2009-01-01           2  0.061034802  0.79354061
#> 3  1 2009-01-01           3 -0.525686204 -0.67456611
#> 4  2 2009-01-02           1 -1.044461185 -1.19662936
#> 5  2 2009-01-02           2  0.393808163  0.42384717
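
With tidyr 1.0.0 or later, the same reshaping can probably be done in one step with pivot_longer(), reusing the regex above as names_pattern. The toy data frame below is an assumption: its column names (Q3.2.1, Q3.3.1, ...) are guessed from the regex, and the values are made up.

```r
library(tidyr)

# Hypothetical data: two questions (Q3.2, Q3.3), each asked in two loops.
toy <- data.frame(
  id = c(1, 2),
  time = as.Date(c("2009-01-01", "2009-01-02")),
  Q3.2.1 = c(0.1, 0.2),
  Q3.3.1 = c(0.3, 0.4),
  Q3.2.2 = c(0.5, 0.6),
  Q3.3.2 = c(0.7, 0.8)
)

# First regex group becomes the output column name (".value"),
# second group becomes the loop_number column.
long <- pivot_longer(
  toy,
  cols = -c(id, time),
  names_to = c(".value", "loop_number"),
  names_pattern = "(Q.\\..)\\.(.)"
)

long
```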

R: How to tidy data contained in a single column into separate columns?

We can use separate/spread from tidyr. separate splits the 'information' column into two columns, and spread then reshapes the result to 'wide' format. Before spreading, we convert 'unit' to a factor (in case the order of the columns is important).

library(dplyr)
library(tidyr)
separate(df1, information, into = c("value", "unit")) %>%
  mutate(unit = factor(unit, levels = unique(unit))) %>%
  spread(unit, value)
#  name USD kg cm
#1    A 300 70  2
#2    B 400 90  5
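
For comparison, here is a sketch of the same reshape with pivot_wider(), which is offered as an alternative and is not part of the original answer. pivot_wider() creates the new columns in order of first appearance, so the factor step should not be needed. df1 is rebuilt inline from the data given in this answer so the block runs on its own; sep = " " and convert = TRUE are choices made here.

```r
library(tidyr)
library(dplyr)

# Same data as in the answer, rebuilt so this block is self-contained.
df1 <- data.frame(
  name = c("A", "A", "A", "B", "B", "B"),
  information = c("300 USD", "70 kg", "2 cm", "400 USD", "90 kg", "5 cm")
)

wide <- df1 %>%
  # Split "300 USD" into value = 300 and unit = "USD"; convert makes value numeric.
  separate(information, into = c("value", "unit"), sep = " ", convert = TRUE) %>%
  # One column per unit, in the order the units first appear.
  pivot_wider(names_from = unit, values_from = value)

wide
```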

data

df1 <- structure(list(name = c("A", "A", "A", "B", "B", "B"), information = c("300 USD", 
"70 kg", "2 cm", "400 USD", "90 kg", "5 cm")), .Names = c("name",
"information"), class = "data.frame", row.names = c(NA, -6L))

How do I gather 2 sets of columns in tidyr

If we are using gather, we can do this in two steps. First, we reshape from 'wide' to 'long' format for the columns whose names start with 'category'; in the next step, we do the same for the numeric column names, selecting them with matches. matches accepts regex patterns, so ^[0-9]+$ matches names consisting of one or more digits ([0-9]+) from the start (^) to the end ($) of the string. We can then remove the columns that are not needed with select.

library(tidyr)
library(dplyr)
gather(df, key, category, starts_with('category_')) %>%
  gather(key2, year, matches('^[0-9]+$')) %>%
  select(-starts_with('key'))

Alternatively, data.table (v1.9.5 or later, the development version when this was written) makes this much easier, because melt can take multiple patterns for the measure columns. We convert the 'data.frame' to a 'data.table' with setDT(df), call melt, and specify the patterns in the measure argument. The value.name argument sets the names of the resulting value columns. The 'variable' column is set to NULL because it is not needed in the expected output.

library(data.table) # v1.9.5+
melt(setDT(df), measure = patterns(c('^category', '^[0-9]+$')),
     value.name = c('category', 'year'))[, variable := NULL][]

How to specify multiple columns with gather() function to tidy data

In the gather function, value specifies the name of the value column in the result. To specify which columns to gather, you can use the start_column:end_column syntax, which gathers every column from start_column through end_column. In your case, that is X0tot4:X20tot24:

df %>% gather(key = 'Age.group', value = 'Value.name', X0tot4:X20tot24)
#                                 V         V
#                                 V         V
#                                 V         V
#   Country Country.Code Year Age.group Value.name
#1 Viet Nam          704 1955    X0tot4       4606
#2 Viet Nam          704 1960    X0tot4       5842
#3 Viet Nam          704 1965    X0tot4       6571
#4 Viet Nam          704 1970    X0tot4       7065
#5 Viet Nam          704 1975    X0tot4       7658
#6 Viet Nam          704 1980    X0tot4       7991
#7 Viet Nam          704 1985    X0tot4       8630
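
The same column-range selection also works in pivot_longer(), which supersedes gather() in tidyr 1.0.0 and later. In the sketch below, the toy data frame is an assumption: the X0tot4 values are copied from the output above, while the X5tot9 column and its values are made up so that there is a range to select.

```r
library(tidyr)

# Hypothetical slice of the data; X5tot9 values are invented for illustration.
toy <- data.frame(
  Country = "Viet Nam",
  Year = c(1955, 1960),
  X0tot4 = c(4606, 5842),
  X5tot9 = c(3000, 3500)
)

long <- pivot_longer(
  toy,
  cols = X0tot4:X5tot9,        # start_column:end_column, as with gather()
  names_to = "Age.group",
  values_to = "Value.name"
)

long
```

One difference to keep in mind: pivot_longer() orders the result row by row of the input, whereas gather() stacks one gathered column after another.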

Using tidyr to combine multiple columns

We can use melt from data.table for this purpose, as it can take multiple measure patterns:

library(data.table)
melt(setDT(df1), measure = patterns("^Chg", "^Ctot"),
     value.name = c("Chg", "Ctot"))[, variable := NULL][]
#   Custno. Size Name Chg Ctot
#1:      61    2   XA   A    2
#2:      61    2   XA   B    4
#3:      61    2   XA   C    5
#4:      61    2   XA   D    6

Combine dfs by common column importing selected columns in R

Using SQL-like joins, does this work?

library(dplyr)
df %>% inner_join(df2 %>% select(names, 'PA_df2' = PA)) %>%
  inner_join(df3 %>% select(names, 'PA_df3' = PA)) %>%
  inner_join(df4 %>% select(names, 'PA_df4' = PA))
Joining, by = "names"
Joining, by = "names"
Joining, by = "names"
  names S1 S2  S3 S4 PA_df2 PA_df3 PA_df4
1  Obs1  1  2   0  0      2      3     30
2  Obs2  2 50 100 10      4      5     50
3  Obs3  2 40 135 17      5      7     70
4  Obs4  0 30 256 73      6      8     80
5  Obs5  1 22 303 74      7      7     70

