Split Concatenated Column to Corresponding Column Positions

Split concatenated column to corresponding column positions

Simply do:

splt <- strsplit(as.character(df$FOO),"\\|")
all_val <- sort(unique(unlist(splt)))
t(sapply(splt,function(x){all_val[!(all_val %in% x)]<-NA;all_val}))

# [,1] [,2] [,3]
#[1,] "A" "B" "C"
#[2,] "A" "B" NA
#[3,] NA "B" "C"
#[4,] "A" NA NA
#[5,] NA NA "C"

data:

df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))

Please note:

My version is base:: (no libraries needed) and general:

It would also work with:

df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F'))

string split values in two columns, and then concatenate them into a new column

You can use your function and the tidyverse

Use mutate() with map2(.f = my.function) to create a nested ID column containing a list column with all IDs per row(some have 1 ID, some have two in the example data). Then you can unnest_wider() to create several different ID columns, which you can latter collapse using tidyr::unite()

library(tidyr)
library(dplyr)
library(stringr)
library(purrr)

df %>% mutate(ID=map2(Proteins, Positions.within.proteins, my.function))%>%
unnest_wider(ID, names_sep = '.')%>%
unite(contains('ID'), col='ID', remove = TRUE, sep=";", na.rm=TRUE)

# A tibble: 10 x 3
Proteins Positions.within.proteins ID
<chr> <chr> <chr>
1 Q99755;A2A3N6 276;223 Q99755_276;A2A3N6_223
2 O00329 708 O00329_708
3 O00444 41 O00444_41
4 O14965 162 O14965_162
5 O14976 175 O14976_175
6 Q6A1A2;O15530 84;111 Q6A1A2_84;O15530_111
7 O43318 63 O43318_63
8 O43526 628 O43526_628
9 O43930;P51817 78;78 O43930_78;P51817_78
10 O60331 270 O60331_270

In R, can you separate text to columns so that values align?

You can pivot your data a few times to align these values:

library(dplyr)

df %>%
tibble::rowid_to_column("id") %>%
tidyr::separate_rows(Col1, sep = "; ") %>%
tidyr::pivot_wider(id_cols = id,
names_from = Col1,
values_from = Col1) %>%
dplyr::select(-id) %>%
magrittr::set_colnames(paste0("Col", 1:ncol(.)))

Output

 Col1  Col2  Col3    Col4  Col5 
<chr> <chr> <chr> <chr> <chr>
1 camel cow giraffe panda zebra
2 camel NA giraffe NA zebra
3 NA NA NA panda zebra

How it works

  1. rowid_to_column keeps track of the row numbers so when the data are pivoted to a longer format, we don't lose track of which values belong in which rows.
  2. separate_rows will separate Col1 and pivot the data to a longer format. Suggested by @Adam as an improvement.
  3. pivot_wider aligns everything into the columns you specified.

Data

structure(list(Col1 = c("camel; cow; giraffe; panda; zebra", 
"camel; giraffe; zebra", "panda; zebra")), class = "data.frame", row.names = c(NA,
-3L))

How to split a column of a variable number of concatenated tags into one column per tag?

The separate_rows function from tidyr may help you get where you want. This splits the strings within tags into separate rows instead of separate columns, which sets you up to use spread.

To get the TRUE/FALSE result I created a new column of all TRUE to use as the value column, and then filled the missing with FALSE in spread. In the end,spread kept the blank cell as a column name, which I removed via select. There may be a better way to do this (maybe convert to NA?).

library(tidyr)
library(dplyr)

data %>%
separate_rows(tags) %>%
mutate(tagslog = TRUE) %>%
spread(tags, tagslog, fill = FALSE) %>%
select(-one_of(""))

key A B C D E
* <chr> <lgl> <lgl> <lgl> <lgl> <lgl>
1 a TRUE TRUE FALSE FALSE FALSE
2 b FALSE TRUE FALSE FALSE FALSE
3 c TRUE FALSE FALSE FALSE TRUE
4 d FALSE FALSE TRUE TRUE FALSE
5 e FALSE FALSE FALSE FALSE FALSE

You can almost get where you want with just separate_rows and table, but I still had that extra blank column that would need to be removed.

data %>%
separate_rows(tags) %>%
with(., table(key, tags) == 1)

tags
key A B C D E
a FALSE TRUE TRUE FALSE FALSE FALSE
b FALSE FALSE TRUE FALSE FALSE FALSE
c FALSE TRUE FALSE FALSE FALSE TRUE
d FALSE FALSE FALSE TRUE TRUE FALSE
e TRUE FALSE FALSE FALSE FALSE FALSE

Splitting a column into multiple columns in R, when there is no separator

separate can accept column positions in the sep argument. This acts as if there were separators after columns 1, 2, ..., 7.

library(tidyr)

separate(x, y, into = paste0("y", 1:8), sep = 1:7)

giving:

  y1 y2 y3 y4 y5 y6 y7 y8
1 1 2 3 4 5 6 7 8
2 8 7 6 5 4 3 2 1

Split word in column in R

With base R:

df$size <- substr(df$age,1,1)
df$age <- substr(df$age,2,2)

And to get the result in the column order you specified:

df[,c("fas","value","age","size","colony")]
fas value age size colony
1 C12:0 0.002221915 O L 7_13
2 C13:0 0.000770179 O L 7_13
3 C14:0 0.004525352 O L 7_13
4 C15:0 0.000738928 O L 7_13
5 C16:1a 0.002964627 O L 7_13

How to go from wide to long data, when each column need to be split into 3+ columns

library(dplyr)
library(reshape2)
library(tidyr)

reshape2::melt(messy,id.vars="id") %>%
tidyr::separate("variable",into = c("drop","cue","direction")) %>%
select(-drop)
   id     cue direction value
1 P1 neutral up 1.2
2 P2 neutral up 1.3
3 P3 neutral up 1.2
4 P1 neutral down 2.1
5 P2 neutral down 3.1
6 P3 neutral down 2.1
7 P1 valid up 1.2
8 P2 valid up 1.3
9 P3 valid up 1.2
10 P1 valid down 2.1
11 P2 valid down 3.1
12 P3 valid down 2.1
13 P1 invalid up 1.2
14 P2 invalid up 1.3
15 P3 invalid up 1.2
16 P1 invalid down 2.1
17 P2 invalid down 3.1
18 P3 invalid down 2.1

Separating data by delimiter in R: How can I specify at which delimiter (for example the 4th in a series of 5) that characters are separated?

You could use strsplit and the following regular expression to separate the string and then do.call and rbind to create a new dataframe with each part in its own column.

CODE TO MATCH SPECIFIC # OF UNDERSCORES

df <- data.frame(x = c("TV_Banana_122_Afternoon_Pre"))

df_new <- data.frame(do.call("rbind", strsplit(sub('(^[^_]+_[^_]+_[^_]+_[^_]+)_(.*)$', '\\1 \\2', df), ' ')))

df_new

OUTPUT

                       X1  X2
1 TV_Banana_122_Afternoon Pre

Per the comment by @AnilGoyal, if you needed to match an additional underscore you would just need to add an additional _[^_]+ to the first match in sub. See example below.

CODE TO MATCH A STRING WITH AN ADDITIONAL UNDERSCORE

df2 <- data.frame(x = c("TV_Banana_122_Afternoon_Test_Pre"))

df2_new <- data.frame(do.call("rbind", strsplit(sub('(^[^_]+_[^_]+_[^_]+_[^_]+_[^_]+)_(.*)$', '\\1 \\2', df2), ' ')))

df2_new

OUTPUT

                            X1  X2
1 TV_Banana_122_Afternoon_Test Pre

Also, if you have strings with varying #s of underscores, but you always want to split at the last underscore, you could just match the whole string up to the last underscore per the regex below.

CODE TO MATCH THE LAST UNDERSCORE

df_new2 <- data.frame(do.call("rbind", strsplit(sub('(.*)_(.*)$', '\\1 \\2', df), ' ')))
df_new2

OUTPUT

                   X1  X2
1 TV_Banana_122_Afternoon Pre


Related Topics



Leave a reply



Submit