Gathering wide columns into multiple long columns using pivot_longer
I have found the answer to my question:
pivot_longer transforms the wide columns starting with 'hf', 'ac', 'cs' and 'se' to long format, with one new column per prefix.
names_to parameters:
- .value = the part of each original column name that becomes the name of a new column holding the cell values
- these values are pivoted to long format and added in new columns "hf", "ac", "cs" and "se"
- column "level" has the original column endings (e.g. the numbers 1-6) pivoted to long format
- names_pattern = regex argument with two capture groups separated by the character "_", which is where the column names are broken up
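library(dplyr)  # provides %>%
df3 <- df %>%
  tidyr::pivot_longer(
    cols = c(
      starts_with("hf"),
      starts_with("ac"),
      starts_with("cs"),
      starts_with("se")
    ),
    # ".value": first capture group names the new columns;
    # "level": second capture group is stored as a key column
    names_to = c(".value", "level"),
    names_pattern = "(.*)_(.*)"
  )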
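A minimal sketch of the same call on made-up data may make the role of .value easier to see. The data frame below (columns id, hf_1, hf_2, ac_1, ac_2) is hypothetical and only stands in for the original df, which is not shown.

```r
library(tidyr)

# Hypothetical wide data: two measures (hf, ac) at two levels (1, 2).
df <- tibble::tibble(
  id   = c("a", "b"),
  hf_1 = c(1, 2),
  hf_2 = c(3, 4),
  ac_1 = c(10, 20),
  ac_2 = c(30, 40)
)

long <- pivot_longer(
  df,
  cols = c(starts_with("hf"), starts_with("ac")),
  names_to = c(".value", "level"),  # prefix -> new column, suffix -> "level"
  names_pattern = "(.*)_(.*)"
)

# One row per id and level, with separate hf and ac columns.
long
```

Each original row yields one output row per level, and the prefixes hf and ac become columns instead of being stacked into a single value column.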
Pivot_longer for multiple columns of repeated measurements data
This probably adds nothing new to the already posted solutions; the only difference is the regex used for the names_pattern argument.
- Notice that some of your column names are separated by one _ whereas others are separated by two _.
- \\w+ captures any run of word characters. If I specify that a number follows it with \\d+, as in time3 in time3_age, we tell pivot_longer to store this part of the column name, time3, in the time column. The rest of the column name is used for the name of the variable being measured, like age, systolicBP and med_hypt.
- It should be noted that if we use \\w+\\d+ instead of \\w+ for the first group, only the rest is captured as the column name, whether it is med_hypt with an underscore or systolicBP without one. If we use only \\w+, the first group can also capture med (\\w matches underscores too), and the resulting column will be hypt instead of med_hypt.
- In the end, since I defined two capture groups, I have to define either names_pattern or names_sep in a way that specifies how each of them is defined and separated.
library(dplyr)
library(tidyr)  # pivot_longer() lives in tidyr
wide_data %>%
  pivot_longer(!c(id, sex), names_to = c("time", ".value"),
               names_pattern = "(\\w+\\d+)_(\\w+)")
# A tibble: 30 x 6
id sex time age systolicBP med_hypt
<dbl> <fct> <chr> <dbl> <dbl> <dbl>
1 12002 women time1 71.2 102 0
2 12002 women time2 74.2 NA 0
3 12002 women time3 78 NA 0
4 17001 men time1 67.9 152 0
5 17001 men time2 69.2 146 0
6 17001 men time3 74.2 160. 0
7 17002 women time1 66.5 NA 0
8 17002 women time2 67.8 NA 0
9 17002 women time3 72.8 NA 0
10 42001 men time1 57.7 170 0
# ... with 20 more rows
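The difference between the two patterns discussed above can be checked on a single column name with base R. This is only an illustrative sketch: it uses sub() with perl = TRUE rather than the regex engine that pivot_longer uses internally, on the assumption that both match greedily in the same way for these simple patterns.

```r
nm <- "time1_med_hypt"

# Pattern from the answer: the first group must end in digits.
strict_time  <- sub("(\\w+\\d+)_(\\w+)", "\\1", nm, perl = TRUE)
strict_value <- sub("(\\w+\\d+)_(\\w+)", "\\2", nm, perl = TRUE)

# Looser pattern: \\w also matches "_", so the greedy first group
# runs up to the last underscore.
loose_time  <- sub("(\\w+)_(\\w+)", "\\1", nm, perl = TRUE)
loose_value <- sub("(\\w+)_(\\w+)", "\\2", nm, perl = TRUE)

c(strict_time, strict_value)  # "time1"     "med_hypt"
c(loose_time, loose_value)    # "time1_med" "hypt"
```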
Pivot_longer: Rotating multiple columns of data with same data types
You were on the right path. Renaming is needed since the name columns are the only ones without a suffix to identify them. .value identifies the part of the original column name that becomes the new column names. If you remove everything up to the last underscore, the part that remains gives the new column names, which you can specify using regex in names_pattern.
library(dplyr)
library(tidyr)
df %>%
  rename(contact_1_name = contact_1,
         contact_2_name = contact_2) %>%
  pivot_longer(cols = everything(),
               names_to = '.value',
               names_pattern = '.*_(\\w+)')
# prefix name loc
# <chr> <chr> <chr>
#1 Mr. Bob Johnson Earth
#2 Dr. Tommy Two Tones London
#3 Mrs. Robert Johnson New York
#4 Mr. Tommy Three Tones Geneva
#5 Dr. Bobby Johnson Los Angeles
#6 Mrs. Tommy No Tones Paris
Using pivot_longer with multiple paired columns in the wide dataset
You want to use .value in the names_to argument:
input %>%
  pivot_longer(
    -event,
    names_to = c(".value", "item"),
    names_sep = "_"
  ) %>%
  select(-item)
# A tibble: 4 x 3
event url name
<int> <fct> <fct>
1 1 g1 dc
2 1 g2 sf
3 2 g3 nyc
4 2 g4 la
From this article on pivoting:
Note the special name .value: this tells pivot_longer() that that part of the column name specifies the “value” being measured (which will become a variable in the output).
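As a small variation, the helper column can probably be dropped inside pivot_longer itself: the tidyr documentation describes NA in names_to as discarding that component of the column name. The sketch below assumes that behaviour and uses a hypothetical input whose column names (url_1, name_1, url_2, name_2) are inferred from the output above, not taken from the question.

```r
library(tidyr)

# Hypothetical input; column names inferred from the printed output above.
input <- tibble::tibble(
  event  = c(1L, 2L),
  url_1  = c("g1", "g3"),
  name_1 = c("dc", "nyc"),
  url_2  = c("g2", "g4"),
  name_2 = c("sf", "la")
)

# NA in names_to discards the numeric suffix, so no select(-item) is needed.
long <- pivot_longer(
  input,
  -event,
  names_to = c(".value", NA),
  names_sep = "_"
)

long
```

If the installed tidyr version does not accept NA here, the original names_to = c(".value", "item") followed by select(-item) gives the same result.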
Using pivot_longer to separate columns into long format
The column names to be used in 'long' format don't all follow the same pattern. Therefore, the steps are:
- rename the columns that don't have ... or _ in their names by adding those with paste/str_c
- reshape to long format with pivot_longer, taking into account the pattern in the names with either names_sep or names_pattern; specify names_to as a vector c(".value", "trait"), in the same order in which we want the column values and the suffix value to be stored as separate columns
- once reshaped, create a grouping column based on the values in 'trait' (some of them are numbers, so create a logical vector and take the cumulative sum), along with the other grouping columns 'geno_name' and 'observation_id' (which on their own do not identify rows uniquely)
- now summarise the other columns by taking the first element after ordering on NA, i.e. if there is any non-NA value the first one is returned, otherwise the result is NA
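The last two steps rely on two small idioms that can be looked at in isolation. The sketch below uses base R (grepl in place of str_detect) and a hypothetical trait vector shaped like the one the reshape produces; the helper name first_non_na is introduced here for illustration only.

```r
# Hypothetical 'trait' values after reshaping: a trait name starts each
# block and the auto-generated numeric suffixes follow it.
trait <- c("lipids", "3", "4", "density", "6", "7")

# TRUE where the value contains a non-digit (a real trait name);
# the cumulative sum then gives one group id per block.
grp <- cumsum(grepl("\\D", trait))
grp  # 1 1 1 2 2 2

# First non-NA element, or NA when everything is missing:
# order(is.na(v)) moves non-missing positions to the front.
first_non_na <- function(v) v[order(is.na(v))][1]

first_non_na(c(NA, "g cm^-3", NA))  # "g cm^-3"
first_non_na(c(NA, NA))             # NA
```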
library(dplyr)
library(stringr)
library(tidyr)
x %>%
  rename_at(vars(names(.)[!str_detect(names(.), "[_.]+")]),
            ~ str_c("value...", .)) %>%
  pivot_longer(cols = 3:ncol(.),
               names_to = c(".value", "trait"), names_sep = "\\.+") %>%
  group_by(geno_name, observation_id,
           grp = cumsum(str_detect(trait, "\\D+"))) %>%
  summarise(across(everything(), ~ .[order(is.na(.))][1]),
            .groups = 'drop') %>%
  select(-grp)
Output
# A tibble: 2 x 6
# geno_name observation_id trait value unit method
# <chr> <dbl> <chr> <dbl> <chr> <chr>
#1 MB mixed 10 lipids NA <NA> <NA>
#2 MB mixed 10 density 1.12 g cm^-3 3D scanning
data
x <- structure(list(geno_name = "MB mixed", observation_id = 10, lipids = NA,
unit...3 = NA, method...4 = NA, density = 1.125, unit...6 = "g cm^-3",
method...7 = "3D scanning"), class = "data.frame", row.names = c(NA,
-1L))
Gather or pivot_longer on multiple columns?
Consider this approach
df %>%
pivot_longer(matches("\\d$"), names_to = c("name", "year"), names_pattern = "([^\\d]+)(\\d+)$") %>%
pivot_wider()
First, transform the dataframe into one with only three columns, id, nameyear and value; concurrently, separate the second column nameyear into name and year. Then pivot the two columns name and value wider.
Output
# A tibble: 14 x 4
id year emp marstat
<int> <chr> <chr> <chr>
1 1 1 ft married
2 1 2 ft divorced
3 2 1 ft married
4 2 2 ft married
5 3 1 pt divorced
6 3 2 ft divorced
7 4 1 pt single
8 4 2 ft single
9 5 1 ft single
10 5 2 no single
11 6 1 no single
12 6 2 pt married
13 7 1 no single
14 7 2 ft single
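For comparison, the same result can usually be reached in one step by putting .value in names_to, as in the other answers on this page. The sketch below runs both variants on a hypothetical two-row data frame (the original df is not shown) and uses \\D in place of [^\\d], which should be equivalent here.

```r
library(tidyr)

# Hypothetical stand-in for the original data.
df <- tibble::tibble(
  id       = 1:2,
  emp1     = c("ft", "pt"),
  emp2     = c("ft", "no"),
  marstat1 = c("married", "single"),
  marstat2 = c("divorced", "single")
)

# Two steps, as in the answer: fully long, then widen name/value again.
two_step <- pivot_wider(
  pivot_longer(df, matches("\\d$"),
               names_to = c("name", "year"),
               names_pattern = "(\\D+)(\\d+)$")
)

# One step: the first capture group names the output columns directly.
one_step <- pivot_longer(df, matches("\\d$"),
                         names_to = c(".value", "year"),
                         names_pattern = "(\\D+)(\\d+)$")

one_step
```

The two-step version needs all pivoted columns to share a compatible type, because they pass through a single value column; the .value version keeps each column's own type.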
How do we transform a dataset in R using pivot_longer with multiple columns
Probably not the most elegant solution, but I was able to solve my own problem using the steps below:
library(dplyr)
library(tidyr)  # replace_na()
# One data frame per visit type, then stack them.
a <- df %>%
  select(person, initial_event_date, type_initial) %>%
  mutate(visit_type = 'initial')
b <- df %>%
  filter(visit_prior == 'Y') %>%
  select(person, initial_event_date, prior_visit_type, day_cnt_prior) %>%
  mutate(visit_type = 'visit_prior',
         day_cnt_prior = as.integer(day_cnt_prior))
c <- df %>%
  filter(visit_after == 'Y') %>%
  select(person, initial_event_date, visit_after_type, day_cnt_after) %>%
  mutate(visit_type = 'visit_after',
         day_cnt_after = as.integer(day_cnt_after))
bind_rows(a, b, c) %>%
  arrange(person) %>%
  # Collapse the per-type columns into one reason and one day count.
  mutate(visit_reason = dplyr::coalesce(type_initial, prior_visit_type, visit_after_type),
         day_cnt = dplyr::coalesce(day_cnt_after, day_cnt_prior)) %>%
  select(person, initial_event_date, visit_type, visit_reason, day_cnt) %>%
  replace_na(list(day_cnt = 0))
Using pivot_longer to restructure wide data, with multiple columns, from a spreadsheet
If we are interested in returning the 'FullName' and the (duplicated) 'SOCW' columns with each course in a single column, select the columns of interest, then use pivot_longer with names_to set to ".value" and a names_pattern that captures the part of the column name before the . ([^.]+), dropping the . and the digits that follow it.
library(dplyr)
library(tidyr)
my_data %>%
  select(FullName, starts_with("SOCW")) %>%
  pivot_longer(cols = starts_with("SOCW"), names_to = ".value",
               names_pattern = '^(SOCW[^.]+)')
# A tibble: 6 x 6
FullName SOCW725 SOCW748 SOCW799 SOCW752 SOCW782
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Beavis B 3.5 3.22 2.56 3.33 4.2
2 Beavis B 2.33 3.23 NA NA NA
3 Beavis B 3.33 NA NA NA NA
4 El Guapo 3.25 3.02 2.75 4.33 4.15
5 El Guapo 3.33 3.42 NA 4 NA
6 El Guapo 2.67 NA NA NA NA
data.frame doesn't by default allow duplicate column names. It uses make.unique to modify the column names by appending .1, .2, etc. for each duplicate.
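The renaming can be reproduced directly with base R. This short sketch, using an illustrative vector of names, shows what make.unique does to repeated names and how a regex on the trailing .digits recovers the original name.

```r
# Illustrative duplicated names, as they might appear in a spreadsheet header.
nms <- c("SOCW725", "SOCW748", "SOCW725", "SOCW725")

# Repeats get .1, .2, ... appended; the first occurrence is left alone.
unique_nms <- make.unique(nms)
unique_nms  # "SOCW725" "SOCW748" "SOCW725.1" "SOCW725.2"

# Removing a trailing ".<digits>" restores the original names.
stripped <- sub("\\.\\d+$", "", unique_nms)
stripped  # "SOCW725" "SOCW748" "SOCW725" "SOCW725"
```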
If we need only three columns:
library(stringr)
my_data %>%
  select(FullName, starts_with("SOCW")) %>%
  pivot_longer(cols = starts_with("SOCW")) %>%
  mutate(name = str_remove(name, "\\.\\d+$"))
# A tibble: 18 x 3
FullName name value
<chr> <chr> <dbl>
1 Beavis B SOCW725 3.5
2 Beavis B SOCW748 3.22
3 Beavis B SOCW799 2.56
4 Beavis B SOCW725 2.33
5 Beavis B SOCW752 3.33
6 Beavis B SOCW782 4.2
7 Beavis B SOCW725 3.33
8 Beavis B SOCW748 3.23
9 Beavis B SOCW752 NA
10 El Guapo SOCW725 3.25
11 El Guapo SOCW748 3.02
12 El Guapo SOCW799 2.75
13 El Guapo SOCW725 3.33
14 El Guapo SOCW752 4.33
15 El Guapo SOCW782 4.15
16 El Guapo SOCW725 2.67
17 El Guapo SOCW748 3.42
18 El Guapo SOCW752 4
data
my_data <- structure(list(FullName = c("Beavis B", "El Guapo"), SOCW725 = c(3.5,
3.25), SOCW748 = c(3.22, 3.02), SOCW799 = c(2.56, 2.75), Average = c(3.07,
3.18), SOCW725.1 = c(2.33, 3.33), SOCW752 = c(3.33, 4.33), SOCW782 = c(4.2,
4.15), Average.1 = c(3.5, 2.25), SOCW725.2 = c(3.33, 2.67), SOCW748.1 = c(3.23,
3.42), SOCW752.1 = c(NA, 4L), Average.2 = c(3, 2.44)),
class = "data.frame", row.names = c(NA,
-2L))