Combine Multiple Columns Into Tidy Data
Almost every data tidying problem can be solved in three steps:
- Gather all non-variable columns
- Separate "colname" column into multiple variables
- Re-spread the data
(often you'll only need one or two of these, but I think they're almost always in this order).
For your data:
- The only column that's already a variable is unique.id
- You need to split current column names into variable and number
- Then you need to put the "variable" variable back into columns
This looks like:
library(tidyr)
library(dplyr)
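df3 %>%
  # 1. gather every column except the identifiers into key/value pairs
  gather(col, value, -unique.id, -intervention) %>%
  # 2. split the old column name (separate() splits on non-alphanumeric characters by default)
  separate(col, c("variable", "number")) %>%
  # 3. put the "variable" variable back into columns
  spread(variable, value, convert = TRUE) %>%
  # restore the date type
  mutate(start = as.Date(start, origin = "1970-01-01"),
         stop = as.Date(stop, origin = "1970-01-01"))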
Your case is a bit more complicated because you have two types of variables, so you need to restore the types at the end.
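As a rough illustration of the three steps on toy data: the sketch below assumes column names of the form start_1, stop_1, compound_1 (the real names in df3 are not shown here, so this data frame is hypothetical) and uses pivot_longer() from tidyr 1.0.0 or later. With the special ".value" name it gathers, separates and re-spreads in one call. It is an alternative to the gather()/separate()/spread() pipeline, not the same code; one difference is that column types are kept, so no date conversion is needed at the end.

```r
library(tidyr)

# Hypothetical stand-in for df3; names and values are invented for illustration
df3 <- data.frame(
  unique.id    = c("a", "b"),
  intervention = c(1, 2),
  start_1      = as.Date(c("2020-01-01", "2020-02-01")),
  stop_1       = as.Date(c("2020-01-10", "2020-02-10")),
  compound_1   = c(10, 20),
  start_2      = as.Date(c("2020-03-01", "2020-04-01")),
  stop_2       = as.Date(c("2020-03-10", "2020-04-10")),
  compound_2   = c(30, 40)
)

tidy <- pivot_longer(
  df3,
  cols      = -c(unique.id, intervention),  # everything that is not an identifier
  names_to  = c(".value", "number"),        # first name part becomes a column, second a value
  names_sep = "_"
)
print(tidy)
```

Each row of the result is one (unique.id, number) pair with its own start, stop and compound columns.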
How to combine multiple columns in R
If you don't mind using an external package, you can use separate_rows() from the tidyr package.
library(tidyverse)
df %>% separate_rows(-name, sep = "/")
# A tibble: 3 × 4
name age height weight
<chr> <chr> <chr> <chr>
1 Jack 1 12 30
2 Jack 2 15 40
3 Jack 3 18 37
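For a self-contained version, the sketch below builds an input that would produce output like this. The input layout ("/"-delimited strings in every column except name) is an assumption inferred from the call, and convert = TRUE is an optional extra that turns the split strings into numbers instead of leaving them as character.

```r
library(tidyr)

# Assumed input: one row per person, values packed into "/"-delimited strings
df <- data.frame(
  name   = "Jack",
  age    = "1/2/3",
  height = "12/15/18",
  weight = "30/40/37"
)

# Split every column except name; convert = TRUE parses the pieces as numbers
long <- separate_rows(df, -name, sep = "/", convert = TRUE)
print(long)
```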
combining columns into one in tidyverse
Tidyr's unite will do that for you.
library(tidyr)
iris %>% unite(New_Column, Sepal.Length,Species,Sepal.Width)
Output:
> iris %>% unite(New_Column, Sepal.Length,Species,Sepal.Width)
New_Column Petal.Length Petal.Width
1 5.1_setosa_3.5 1.4 0.2
2 4.9_setosa_3 1.4 0.2
3 4.7_setosa_3.2 1.3 0.2
4 4.6_setosa_3.1 1.5 0.2
5 5_setosa_3.6 1.4 0.2
6 5.4_setosa_3.9 1.7 0.4
7 4.6_setosa_3.4 1.4 0.3
8 5_setosa_3.4 1.5 0.2
9 4.4_setosa_2.9 1.4 0.2
10 4.9_setosa_3.1 1.5 0.1
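If the default behaviour is not quite what you want, unite() also has sep and remove arguments. A small sketch on the first rows of iris, with "-" as an arbitrary separator chosen for illustration:

```r
library(tidyr)

combined <- unite(
  head(iris, 3),
  New_Column, Sepal.Length, Species, Sepal.Width,
  sep    = "-",    # separator placed between the pasted values
  remove = FALSE   # keep the original input columns as well
)
print(combined)
```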
combine multiple columns in R into a new vector column (preferably a tidyr solution)
Here, we can use either rowwise:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(C = list(c(A, B))) %>%
  ungroup()
# A tibble: 3 x 3
# A B C
# <dbl> <dbl> <list>
#1 1 2 <dbl [2]>
#2 3 4 <dbl [2]>
#3 5 6 <dbl [2]>
Or with map2, which by default returns a list. Here, we loop over the corresponding elements of 'A' and 'B' and concatenate them with c:
library(dplyr)
library(purrr)
df %>%
  mutate(C = map2(A, B, c))
# A tibble: 3 x 3
# A B C
# <dbl> <dbl> <list>
#1 1 2 <dbl [2]>
#2 3 4 <dbl [2]>
#3 5 6 <dbl [2]>
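To see what the list column actually holds, here is a small runnable sketch. The input data frame is an assumption reconstructed from the printed output.

```r
library(dplyr)
library(purrr)

# Assumed example data matching the printed output
df <- tibble(A = c(1, 3, 5), B = c(2, 4, 6))

res <- mutate(df, C = map2(A, B, c))

# Each element of C is the numeric vector built from that row's A and B
print(res$C[[1]])
print(lengths(res$C))
```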
Update
Based on the OP's comments, if we want to create a list column from only the columns that have the suffix _id:
names(df) <- paste0(names(df), "_id")
df %>%
  rowwise() %>%
  mutate(C = list(c_across(ends_with("_id")))) %>%
  ungroup()
Output:
# A tibble: 3 x 3
# A_id B_id C
# <dbl> <dbl> <list>
#1 1 2 <dbl [2]>
#2 3 4 <dbl [2]>
#3 5 6 <dbl [2]>
If the substring "_id" is at the beginning of the column names, change ends_with to starts_with, or use matches("^_id").
Or with pmap:
df %>%
  mutate(C = pmap(select(., ends_with("_id")), ~ c(...)))
Output:
# A tibble: 3 x 3
# A_id B_id C
# <dbl> <dbl> <list>
#1 1 2 <dbl [2]>
#2 3 4 <dbl [2]>
#3 5 6 <dbl [2]>
Or using Map from base R:
df$C <- do.call(Map, c(f = c, df[grep("_id", names(df))]))
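A runnable sketch of the base R variant, with one change that is my own suggestion rather than part of the answer: the pattern is anchored as "_id$", because a bare "_id" would also match a column with "_id" in the middle of its name. The data frame, including the extra non-id column, is hypothetical.

```r
# Hypothetical data: two *_id columns plus one column that should be ignored
df <- data.frame(
  A_id  = c(1, 3, 5),
  B_id  = c(2, 4, 6),
  other = c("x", "y", "z")
)

# Keep only the columns whose names end in "_id"
id_cols <- df[grep("_id$", names(df))]

# Map() walks the selected columns in parallel and c() joins each row's values
df$C <- do.call(Map, c(f = c, id_cols))

print(df$C[[2]])
```

The vectors produced this way carry the column names (A_id, B_id) as element names; wrap the result in unname() if that is unwanted.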
Gather multiple sets of columns
This approach seems pretty natural to me:
df %>%
  gather(key, value, -id, -time) %>%
  extract(key, c("question", "loop_number"), "(Q.\\..)\\.(.)") %>%
  spread(question, value)
First gather all question columns, use extract() to separate the key into question and loop_number, then spread() question back into the columns.
#> id time loop_number Q3.2 Q3.3
#> 1 1 2009-01-01 1 0.142259203 -0.35842736
#> 2 1 2009-01-01 2 0.061034802 0.79354061
#> 3 1 2009-01-01 3 -0.525686204 -0.67456611
#> 4 2 2009-01-02 1 -1.044461185 -1.19662936
#> 5 2 2009-01-02 2 0.393808163 0.42384717
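The regular expression does most of the work here, so it may help to try it on its own. The sketch below applies the same pattern to two made-up keys; the form "Q3.2.1." is an assumption about what the gathered column names look like.

```r
library(tidyr)

# Hypothetical keys in the assumed "question.loop." format
keys <- data.frame(key = c("Q3.2.1.", "Q3.3.2."))

# Group 1 captures the question ("Q", any character, a literal dot, any character);
# group 2 captures the single character after the next literal dot
parts <- extract(keys, key, c("question", "loop_number"), "(Q.\\..)\\.(.)")
print(parts)
```

Because each group uses "." for exactly one character, the pattern as written only handles single-character question parts and loop numbers.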
R: How to tidy data contained in a single column into separate columns?
We can use separate and spread from tidyr. separate splits the 'information' column into two columns, and spread then reshapes the result to 'wide' format after 'unit' has been changed to a factor (in case the order of the columns is important).
library(dplyr)
library(tidyr)
separate(df1, information, into = c("value", "unit")) %>%
  mutate(unit = factor(unit, levels = unique(unit))) %>%
  spread(unit, value)
# name USD kg cm
#1 A 300 70 2
#2 B 400 90 5
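A possible alternative, sketched under the assumption that tidyr 1.0.0 or later is available: pivot_wider() creates the new columns in order of first appearance, so the factor step can be skipped. Here convert = TRUE is an optional addition that makes the values numeric. The data frame repeats the example data so the block runs on its own.

```r
library(tidyr)

df1 <- data.frame(
  name        = c("A", "A", "A", "B", "B", "B"),
  information = c("300 USD", "70 kg", "2 cm", "400 USD", "90 kg", "5 cm")
)

# Split "300 USD" into value = 300 and unit = "USD"
long <- separate(df1, information, into = c("value", "unit"),
                 sep = " ", convert = TRUE)

# One column per unit, in the order the units first appear
wide <- pivot_wider(long, names_from = unit, values_from = value)
print(wide)
```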
data
df1 <- structure(list(name = c("A", "A", "A", "B", "B", "B"), information = c("300 USD",
"70 kg", "2 cm", "400 USD", "90 kg", "5 cm")), .Names = c("name",
"information"), class = "data.frame", row.names = c(NA, -6L))
how do I gather 2 sets of columns in tidyr
If we are using gather, we can do this in two steps. First, we reshape from 'wide' to 'long' format the columns whose names start with 'category'; in the next step, we do the same with the numeric column names, selecting them with matches. matches takes regex patterns, so the pattern ^[0-9]+$ matches one or more digits ([0-9]+) from the start (^) to the end ($) of the string. We can remove the columns that are not needed with select.
library(tidyr)
library(dplyr)
gather(df, key, category, starts_with('category_')) %>%
  gather(key2, year, matches('^[0-9]+$')) %>%
  select(-starts_with('key'))
Or, using the devel version of data.table, this is much easier, as melt can take multiple patterns for the measure columns. We convert the 'data.frame' to a 'data.table' (setDT(df)), use melt, and specify the patterns within the measure argument. We can also change the names of the 'value' columns. The 'variable' column is set to NULL as it is not needed in the expected output.
library(data.table) # v1.9.5+
melt(setDT(df), measure = patterns('^category', '^[0-9]+$'),
     value.name = c('category', 'year'))[, variable := NULL][]
How to specify multiple columns with gather() function to tidy data
In the gather function, value specifies the name of the value column in the result. To specify which columns to gather, you can use the start_column:end_column syntax, which gathers all columns from start_column to end_column. In your case, it would be X0tot4:X20tot24:
df %>% gather(key = 'Age.group', value = 'Value.name', X0tot4:X20tot24)
# (the last two columns, Age.group and Value.name, are the ones gather() creates)
# Country Country.Code Year Age.group Value.name
#1 Viet Nam 704 1955 X0tot4 4606
#2 Viet Nam 704 1960 X0tot4 5842
#3 Viet Nam 704 1965 X0tot4 6571
#4 Viet Nam 704 1970 X0tot4 7065
#5 Viet Nam 704 1975 X0tot4 7658
#6 Viet Nam 704 1980 X0tot4 7991
#7 Viet Nam 704 1985 X0tot4 8630
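A cut-down sketch of the range selection. Only the X0tot4 values come from the output shown; the second age-group column, X5tot9, and its values are invented so that the range spans more than one column.

```r
library(tidyr)

# Partly hypothetical data: X5tot9 and its values are made up for illustration
df <- data.frame(
  Country      = "Viet Nam",
  Country.Code = 704,
  Year         = c(1955, 1960),
  X0tot4       = c(4606, 5842),
  X5tot9       = c(1000, 2000)
)

# first_column:last_column selects every column positioned between the two
long <- gather(df, key = "Age.group", value = "Value.name", X0tot4:X5tot9)
print(long)
```

The identifier columns (Country, Country.Code, Year) are repeated once for each gathered column.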
Using tidyr to combine multiple columns
We can use melt from data.table for this purpose, as it can take multiple measure patterns:
library(data.table)
melt(setDT(df1), measure = patterns("^Chg", "^Ctot"),
     value.name = c("Chg", "Ctot"))[, variable := NULL][]
# Custno. Size Name Chg Ctot
#1: 61 2 XA A 2
#2: 61 2 XA B 4
#3: 61 2 XA C 5
#4: 61 2 XA D 6
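Since the input is not shown, here is a minimal sketch with a hypothetical wide table. The column names Chg1, Chg2, Ctot1 and Ctot2 are assumptions chosen to fit the two patterns, and the identifier column is called Custno rather than Custno. to keep the example simple.

```r
library(data.table)

# Hypothetical wide table: two Chg columns and two Ctot columns per customer
dt <- data.table(
  Custno = 61, Size = 2, Name = "XA",
  Chg1 = "A", Chg2 = "B",
  Ctot1 = 2,  Ctot2 = 4
)

# Each pattern defines one group of columns that is stacked into one output column
long <- melt(
  dt,
  measure.vars = patterns("^Chg", "^Ctot"),
  value.name   = c("Chg", "Ctot")
)
long[, variable := NULL]  # drop the index column that melt() adds
print(long)
```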
Combine dfs by common column importing selected columns in R
Using SQL-like joins, does this work?
library(dplyr)
df %>%
  inner_join(df2 %>% select(names, 'PA_df2' = PA)) %>%
  inner_join(df3 %>% select(names, 'PA_df3' = PA)) %>%
  inner_join(df4 %>% select(names, 'PA_df4' = PA))
Joining, by = "names"
Joining, by = "names"
Joining, by = "names"
names S1 S2 S3 S4 PA_df2 PA_df3 PA_df4
1 Obs1 1 2 0 0 2 3 30
2 Obs2 2 50 100 10 4 5 50
3 Obs3 2 40 135 17 5 7 70
4 Obs4 0 30 256 73 6 8 80
5 Obs5 1 22 303 74 7 7 70
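A reduced sketch of the same idea with two hypothetical data frames. Passing by = "names" explicitly is optional, but it avoids the "Joining, by" message and makes the join key visible in the code.

```r
library(dplyr)

# Hypothetical inputs; only the names and PA columns of df2 are wanted
df  <- data.frame(names = c("Obs1", "Obs2"), S1 = c(1, 2))
df2 <- data.frame(names = c("Obs1", "Obs2"), PA = c(2, 4), extra = c("x", "y"))

out <- inner_join(
  df,
  select(df2, names, PA_df2 = PA),  # select and rename before joining
  by = "names"
)
print(out)
```

Note that inner_join() keeps only the rows whose key appears in both tables; left_join() would keep every row of df instead.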