Gathering Wide Columns into Multiple Long Columns Using Pivot_Longer

I have found the answer to my question:

pivot_longer transforms the wide-format columns whose names start with 'hf', 'ac', 'cs' and 'se' into long format, with one new column per prefix.

names_to and names_pattern:

  • .value = tells pivot_longer that this part of each original column name is the name of the output column the cell values belong to
  • the values are pivoted to long format and stored in the new columns "hf", "ac", "cs" and "se"
  • column "level" holds the original column endings (e.g. the numbers 1-6) pivoted to long format
  • names_pattern = regex with two capture groups around the character "_", which is where the column names are broken up
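
library(dplyr)
library(tidyr)

df3 <- df %>%
  tidyr::pivot_longer(
    cols = c(
      starts_with("hf"),
      starts_with("ac"),
      starts_with("cs"),
      starts_with("se")
    ),
    # text before the last "_" names the output column; the rest goes to "level"
    names_to = c(".value", "level"),
    names_pattern = "(.*)_(.*)"
  )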
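
As a minimal, self-contained sketch of the same call: the tibble toy below and its values are made up for illustration, and only the hf and ac prefixes are included. It shows how .value turns the prefixes into columns while the suffix lands in level.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two measures (hf, ac) recorded at two levels.
toy <- tibble(
  id   = 1:2,
  hf_1 = c(1, 2),
  hf_2 = c(3, 4),
  ac_1 = c(5, 6),
  ac_2 = c(7, 8)
)

toy_long <- toy %>%
  pivot_longer(
    cols = c(starts_with("hf"), starts_with("ac")),
    names_to = c(".value", "level"),
    names_pattern = "(.*)_(.*)"
  )

# One row per id/level pair, with hf and ac as separate value columns.
toy_long
```

Because the first group is greedy, the pattern splits at the last underscore, so a prefix that itself contains an underscore would still end up whole in .value.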

Pivot_longer for multiple columns of repeated measurements data

This probably adds nothing new to the solutions already posted; the only difference is the regex used for the names_pattern argument.

  • Notice that some of your column names contain one _ whereas others contain two. \\w+ matches one or more word characters. If I follow it with \\d+ to require a number, as in the time3 part of time3_age, pivot_longer stores that part of the column name in the time column. The rest of the column name supplies the names of the variables being measured, like age, systolicBP and med_hypt.
  • Note that with \\w+\\d+ in the first group, everything after the following underscore is captured as the variable name, whether it contains an underscore (med_hypt) or not (systolicBP). With \\w+ alone, the first group could also swallow med, because \\w matches underscores too, and the resulting column would be hypt instead of med_hypt.
  • Finally, since names_to has two parts, I have to supply either names_pattern or names_sep to specify how the two parts are defined and separated.

library(dplyr)
library(tidyr)  # pivot_longer() lives in tidyr

wide_data %>%
  pivot_longer(!c(id, sex), names_to = c("time", ".value"),
               names_pattern = "(\\w+\\d+)_(\\w+)")

# A tibble: 30 x 6
      id sex   time    age systolicBP med_hypt
   <dbl> <fct> <chr> <dbl>      <dbl>    <dbl>
 1 12002 women time1  71.2        102        0
 2 12002 women time2  74.2         NA        0
 3 12002 women time3  78           NA        0
 4 17001 men   time1  67.9        152        0
 5 17001 men   time2  69.2        146        0
 6 17001 men   time3  74.2        160.       0
 7 17002 women time1  66.5         NA        0
 8 17002 women time2  67.8         NA        0
 9 17002 women time3  72.8         NA        0
10 42001 men   time1  57.7        170        0
# ... with 20 more rows
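
The difference between the two regexes can be checked outside pivot_longer. The sketch below uses base R only; capture_groups is a hypothetical helper written for this illustration, and the column names are the time3 ones implied by the output above.

```r
# Column names as implied by the example above.
cols <- c("time3_age", "time3_systolicBP", "time3_med_hypt")

# Hypothetical helper: returns the two capture groups for each name.
capture_groups <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Drop the first element of each match (the full match), keep the groups.
  t(vapply(m, function(g) g[-1], character(2)))
}

with_digits    <- capture_groups("(\\w+\\d+)_(\\w+)", cols)
without_digits <- capture_groups("(\\w+)_(\\w+)", cols)

with_digits[3, ]     # groups for "time3_med_hypt" with \\w+\\d+
without_digits[3, ]  # groups for "time3_med_hypt" with \\w+ only
```

With \\w+ alone, the greedy first group runs up to the last underscore, so time3_med_hypt is split into time3_med and hypt; requiring \\d+ before the underscore pins the split right after time3.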

Pivot_longer: Rotating multiple columns of data with same data types

You were on the right path. Renaming is needed because the name columns (contact_1, contact_2) are the only ones without a suffix to identify them. .value marks the part of the original column name that should become the names of the new columns. If you remove everything up to the last underscore, the part that remains is the new column name, which you can capture with a regex in names_pattern.

library(dplyr)
library(tidyr)

df %>%
  rename(contact_1_name = contact_1,
         contact_2_name = contact_2) %>%
  pivot_longer(cols = everything(),
               names_to = '.value',
               names_pattern = '.*_(\\w+)')

#  prefix name              loc
#  <chr>  <chr>             <chr>
#1 Mr.    Bob Johnson       Earth
#2 Dr.    Tommy Two Tones   London
#3 Mrs.   Robert Johnson    New York
#4 Mr.    Tommy Three Tones Geneva
#5 Dr.    Bobby Johnson     Los Angeles
#6 Mrs.   Tommy No Tones    Paris
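
To try this without the original data, here is a small sketch. The one-row tibble contacts is hypothetical, with column names assumed from the rename call and values borrowed from the first two output rows above.

```r
library(dplyr)
library(tidyr)

# Hypothetical one-row input with two contacts per row.
contacts <- tibble(
  contact_1_prefix = "Mr.",
  contact_1        = "Bob Johnson",
  contact_1_loc    = "Earth",
  contact_2_prefix = "Dr.",
  contact_2        = "Tommy Two Tones",
  contact_2_loc    = "London"
)

contacts_long <- contacts %>%
  # Give the bare name columns a suffix so every column ends in _<field>.
  rename(contact_1_name = contact_1,
         contact_2_name = contact_2) %>%
  # Keep only the text after the last underscore as the new column name.
  pivot_longer(cols = everything(),
               names_to = ".value",
               names_pattern = ".*_(\\w+)")

contacts_long
```

Since names_to contains only .value, columns that share a suffix are stacked into the same output column, giving one row per contact.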

Using pivot_longer with multiple paired columns in the wide dataset

You want to use .value in the names_to argument:

input %>%
  pivot_longer(
    -event,
    names_to = c(".value", "item"),
    names_sep = "_"
  ) %>%
  select(-item)

# A tibble: 4 x 3
  event url   name
  <int> <fct> <fct>
1     1 g1    dc
2     1 g2    sf
3     2 g3    nyc
4     2 g4    la

From this article on pivoting:

Note the special name .value: this tells pivot_longer() that that part of the column name specifies the “value” being measured (which will become a variable in the output).
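
A runnable sketch of the names_sep variant follows. The tibble input_toy is hypothetical: its column names (url_1, name_1, url_2, name_2) are assumed from the call above, and it uses character columns where the original output shows factors. The item column is kept here to show what the suffix turns into before it is dropped.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide input with paired url_/name_ columns.
input_toy <- tibble(
  event  = 1:2,
  url_1  = c("g1", "g3"),
  name_1 = c("dc", "nyc"),
  url_2  = c("g2", "g4"),
  name_2 = c("sf", "la")
)

paired <- input_toy %>%
  pivot_longer(
    -event,
    names_to = c(".value", "item"),  # prefix -> column name, suffix -> item
    names_sep = "_"
  )

paired
```

names_sep is enough here because every pivoted column name contains exactly one underscore; with a variable number of underscores, names_pattern is the safer choice.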

Using pivot_longer to separate columns into long format

The columns to be reshaped to 'long' format do not all share the same pattern in their names. Therefore, the steps are:

  • rename the columns that don't have ... or _ in their names by adding a prefix with paste/str_c

  • reshape to long format with pivot_longer, using either names_sep or names_pattern to match the pattern in the names; specify names_to as the vector c(".value", "trait"), in the same order in which the column values and the suffix should be stored as separate columns

  • once reshaped, create a grouping column from the values in 'trait' (some of them are numbers, so build a logical vector and take its cumulative sum) and group by it along with 'geno_name' and 'observation_id' (which on their own do not identify the rows uniquely)

  • now summarise the other columns by taking the first element after ordering on NA status, i.e. the first non-NA value if there is one, otherwise NA

library(dplyr)
library(stringr)
library(tidyr)
x %>%
  rename_at(vars(names(.)[!str_detect(names(.), "[_.]+")]),
            ~ str_c("value...", .)) %>%
  pivot_longer(cols = 3:ncol(.),
               names_to = c(".value", "trait"), names_sep = "\\.+") %>%
  group_by(geno_name, observation_id,
           grp = cumsum(str_detect(trait, "\\D+"))) %>%
  summarise(across(everything(), ~ .[order(is.na(.))][1]),
            .groups = 'drop') %>%
  select(-grp)

Output

# A tibble: 2 x 6
#  geno_name observation_id trait   value unit    method
#  <chr>              <dbl> <chr>   <dbl> <chr>   <chr>
#1 MB mixed              10 lipids  NA    <NA>    <NA>
#2 MB mixed              10 density  1.12 g cm^-3 3D scanning
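
The two less obvious steps, the cumulative-sum grouping and the "first non-NA" summary, can be looked at in isolation. This is a base-R sketch; first_non_na is a hypothetical helper name, and the trait vector is what the pivot would be expected to produce for the sample data.

```r
# Values of 'trait' after pivoting: trait names mixed with position numbers.
trait <- c("lipids", "3", "4", "density", "6", "7")

# TRUE where the value contains a non-digit, i.e. where a new trait starts;
# the cumulative sum then labels each trait's block of rows.
grp <- cumsum(grepl("\\D", trait, perl = TRUE))
grp

# Hypothetical helper: order() puts non-NA elements first (FALSE < TRUE),
# so the first element is the first non-NA value, or NA if all are NA.
first_non_na <- function(v) v[order(is.na(v))][1]

first_non_na(c(NA, "g cm^-3", NA))
first_non_na(c(NA_character_, NA_character_))
```

Within each group the first row carries the trait name and the later rows carry unit and method, which is why collapsing each column to its first non-NA value yields one complete row per trait.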

data

x <- structure(list(geno_name = "MB mixed", observation_id = 10, lipids = NA, 
unit...3 = NA, method...4 = NA, density = 1.125, unit...6 = "g cm^-3",
method...7 = "3D scanning"), class = "data.frame", row.names = c(NA,
-1L))

Gather or pivot_longer on multiple columns?

Consider this approach

df %>%
  pivot_longer(matches("\\d$"),
               names_to = c("name", "year"),
               names_pattern = "([^\\d]+)(\\d+)$") %>%
  pivot_wider()

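First, pivot the data frame into a long one with the columns id, name, year and value; the combined name-and-year column headers are split into name and year in the same step. Then pivot name and value wider, so that each name gets its own column (pivot_wider() takes names from name and values from value by default).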

Output

# A tibble: 14 x 4
      id year  emp   marstat
   <int> <chr> <chr> <chr>
 1     1 1     ft    married
 2     1 2     ft    divorced
 3     2 1     ft    married
 4     2 2     ft    married
 5     3 1     pt    divorced
 6     3 2     ft    divorced
 7     4 1     pt    single
 8     4 2     ft    single
 9     5 1     ft    single
10     5 2     no    single
11     6 1     no    single
12     6 2     pt    married
13     7 1     no    single
14     7 2     ft    single
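
For comparison, the same reshaping can probably be done in one step with .value, as in the other answers on this page. The sketch below assumes column names of the form emp1, emp2, marstat1, marstat2 (inferred from the pattern and the output, not confirmed by the question) and uses a hypothetical two-row tibble emp_wide.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: two variables measured in two years.
emp_wide <- tibble(
  id       = 1:2,
  emp1     = c("ft", "ft"),
  emp2     = c("ft", "ft"),
  marstat1 = c("married", "married"),
  marstat2 = c("divorced", "married")
)

# Two-step version from the answer: fully long, then wide again.
two_step <- emp_wide %>%
  pivot_longer(matches("\\d$"),
               names_to = c("name", "year"),
               names_pattern = "([^\\d]+)(\\d+)$") %>%
  pivot_wider()

# One-step version: the non-digit prefix becomes the column name directly.
one_step <- emp_wide %>%
  pivot_longer(matches("\\d$"),
               names_to = c(".value", "year"),
               names_pattern = "([^\\d]+)(\\d+)$")

# Compare the two results as plain data frames.
all.equal(as.data.frame(two_step), as.data.frame(one_step))
```

The two-step route requires all pivoted columns to share one type in the intermediate long table, whereas the .value route keeps each variable in its own column throughout.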

How do we transform a dataset in R using pivot_longer with multiple columns

Probably not the most elegant solution, but I was able to solve my own problem using the steps below:

library(dplyr)
library(tidyr)  # replace_na()

# One data frame per visit type, each tagged with visit_type
a <- df %>%
  select(person, initial_event_date, type_initial) %>%
  mutate(visit_type = 'initial')
b <- df %>%
  filter(visit_prior == 'Y') %>%
  select(person, initial_event_date, prior_visit_type, day_cnt_prior) %>%
  mutate(visit_type = 'visit_prior',
         day_cnt_prior = as.integer(day_cnt_prior))
c <- df %>%
  filter(visit_after == 'Y') %>%
  select(person, initial_event_date, visit_after_type, day_cnt_after) %>%
  mutate(visit_type = 'visit_after',
         day_cnt_after = as.integer(day_cnt_after))

# Stack the pieces, then merge the type-specific columns into shared ones
bind_rows(a, b, c) %>%
  arrange(person) %>%
  mutate(visit_reason = dplyr::coalesce(type_initial, prior_visit_type, visit_after_type),
         day_cnt = dplyr::coalesce(day_cnt_after, day_cnt_prior)) %>%
  select(person, initial_event_date, visit_type, visit_reason, day_cnt) %>%
  replace_na(list(day_cnt = 0))
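
The approach relies on bind_rows filling columns that are missing from one piece with NA, and on coalesce picking the first non-NA value across columns. A tiny sketch with made-up values ("ER" and "clinic" are hypothetical visit reasons, and initial and prior are hypothetical names):

```r
library(dplyr)

# Hypothetical pieces with different, type-specific columns.
initial <- tibble(person = 1, type_initial = "ER")
prior   <- tibble(person = 1, prior_visit_type = "clinic", day_cnt_prior = 3L)

stacked <- bind_rows(initial, prior) %>%
  mutate(
    # Each row has exactly one non-NA reason column; take that one.
    visit_reason = coalesce(type_initial, prior_visit_type),
    # Rows without a day count fall back to 0.
    day_cnt = coalesce(day_cnt_prior, 0L)
  )

stacked
```

Passing 0L as the last argument to coalesce does the same job as the separate replace_na step in the answer.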

Using pivot_longer to restructure wide data, with multiple columns, from a spreadsheet

If the goal is to return 'FullName' and the duplicated 'SOCW' columns with each set of duplicates stacked into a single column, select the columns of interest, then use pivot_longer with names_to = ".value" and a names_pattern that captures the part of the column name before the . (SOCW followed by [^.]+), leaving out the . and the digits after it.

library(dplyr)
library(tidyr)
my_data %>%
  select(FullName, starts_with("SOCW")) %>%
  pivot_longer(cols = starts_with("SOCW"), names_to = ".value",
               names_pattern = '^(SOCW[^.]+)')
# A tibble: 6 x 6
  FullName SOCW725 SOCW748 SOCW799 SOCW752 SOCW782
  <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 Beavis B    3.5     3.22    2.56    3.33    4.2
2 Beavis B    2.33    3.23   NA      NA      NA
3 Beavis B    3.33   NA      NA      NA      NA
4 El Guapo    3.25    3.02    2.75    4.33    4.15
5 El Guapo    3.33    3.42   NA       4      NA
6 El Guapo    2.67   NA      NA      NA      NA

data.frame doesn't allow duplicate column names by default. It makes them unique (as make.unique does) by appending .1, .2, etc. to each duplicate.
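
This renaming is easy to reproduce in base R. The sketch below uses variable names of my own choosing (dup_names, unique_names, df_names, stripped) and shows both the suffixing and how a regex can strip the suffix again.

```r
# Duplicate names as they might appear in a spreadsheet header.
dup_names <- c("SOCW725", "SOCW748", "SOCW725", "SOCW725")

# make.unique() appends .1, .2, ... to the second and later duplicates.
unique_names <- make.unique(dup_names)
unique_names

# data.frame() does the same to its column names (check.names = TRUE).
df_names <- names(data.frame(a = 1, a = 2))
df_names

# Removing a trailing ".<digits>" recovers the original names.
stripped <- sub("\\.\\d+$", "", unique_names, perl = TRUE)
stripped
```

This is why the column names in my_data carry .1 and .2 suffixes, and why both solutions here match on the part of the name before the dot.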


If we need only three columns:

library(stringr)
my_data %>%
  select(FullName, starts_with("SOCW")) %>%
  pivot_longer(cols = starts_with("SOCW")) %>%
  mutate(name = str_remove(name, "\\.\\d+$"))
# A tibble: 18 x 3
   FullName name    value
   <chr>    <chr>   <dbl>
 1 Beavis B SOCW725  3.5
 2 Beavis B SOCW748  3.22
 3 Beavis B SOCW799  2.56
 4 Beavis B SOCW725  2.33
 5 Beavis B SOCW752  3.33
 6 Beavis B SOCW782  4.2
 7 Beavis B SOCW725  3.33
 8 Beavis B SOCW748  3.23
 9 Beavis B SOCW752 NA
10 El Guapo SOCW725  3.25
11 El Guapo SOCW748  3.02
12 El Guapo SOCW799  2.75
13 El Guapo SOCW725  3.33
14 El Guapo SOCW752  4.33
15 El Guapo SOCW782  4.15
16 El Guapo SOCW725  2.67
17 El Guapo SOCW748  3.42
18 El Guapo SOCW752  4

data

my_data <- structure(list(FullName = c("Beavis B", "El Guapo"), SOCW725 = c(3.5, 
3.25), SOCW748 = c(3.22, 3.02), SOCW799 = c(2.56, 2.75), Average = c(3.07,
3.18), SOCW725.1 = c(2.33, 3.33), SOCW752 = c(3.33, 4.33), SOCW782 = c(4.2,
4.15), Average.1 = c(3.5, 2.25), SOCW725.2 = c(3.33, 2.67), SOCW748.1 = c(3.23,
3.42), SOCW752.1 = c(NA, 4L), Average.2 = c(3, 2.44)),
class = "data.frame", row.names = c(NA,
-2L))

