Split Concatenated Column to Corresponding Column Positions

Split concatenated column to corresponding column positions

Simply do:

splt <- strsplit(as.character(df$FOO),"\\|")
all_val <- sort(unique(unlist(splt)))
t(sapply(splt,function(x){all_val[!(all_val %in% x)]<-NA;all_val}))

#     [,1] [,2] [,3]
#[1,] "A"  "B"  "C" 
#[2,] "A"  "B"  NA  
#[3,] NA   "B"  "C" 
#[4,] "A"  NA   NA  
#[5,] NA   NA   "C"

data:

df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))

Please note:

My version is base:: (no libraries needed) and general:

It would also work with:

df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F'))

string split values in two columns, and then concatenate them into a new column

You can use your function and the tidyverse

Use mutate() with map2(.f = my.function) to create a nested ID column containing a list column with all IDs per row(some have 1 ID, some have two in the example data). Then you can unnest_wider() to create several different ID columns, which you can latter collapse using tidyr::unite()

library(tidyr)
library(dplyr)
library(stringr)
library(purrr)

df %>% mutate(ID=map2(Proteins, Positions.within.proteins, my.function))%>%
        unnest_wider(ID, names_sep = '.')%>%
        unite(contains('ID'), col='ID', remove = TRUE, sep=";", na.rm=TRUE)

# A tibble: 10 x 3
   Proteins      Positions.within.proteins ID                   
   <chr>         <chr>                     <chr>                
 1 Q99755;A2A3N6 276;223                   Q99755_276;A2A3N6_223
 2 O00329        708                       O00329_708           
 3 O00444        41                        O00444_41            
 4 O14965        162                       O14965_162           
 5 O14976        175                       O14976_175           
 6 Q6A1A2;O15530 84;111                    Q6A1A2_84;O15530_111 
 7 O43318        63                        O43318_63            
 8 O43526        628                       O43526_628           
 9 O43930;P51817 78;78                     O43930_78;P51817_78  
10 O60331        270                       O60331_270

In R, can you separate text to columns so that values align?

You can pivot your data a few times to align these values:

library(dplyr)

df %>% 
  tibble::rowid_to_column("id") %>% 
  tidyr::separate_rows(Col1, sep = "; ") %>% 
  tidyr::pivot_wider(id_cols = id,
                     names_from = Col1,
                     values_from = Col1) %>% 
  dplyr::select(-id) %>% 
  magrittr::set_colnames(paste0("Col", 1:ncol(.)))

Output

 Col1  Col2  Col3    Col4  Col5 
  <chr> <chr> <chr>   <chr> <chr>
1 camel cow   giraffe panda zebra
2 camel NA    giraffe NA    zebra
3 NA    NA    NA      panda zebra

How it works

rowid_to_column keeps track of the row numbers so when the data are pivoted to a longer format, we don't lose track of which values belong in which rows.
separate_rows will separate Col1 and pivot the data to a longer format. Suggested by @Adam as an improvement.
pivot_wider aligns everything into the columns you specified.

Data

structure(list(Col1 = c("camel; cow; giraffe; panda; zebra", 
"camel; giraffe; zebra", "panda; zebra")), class = "data.frame", row.names = c(NA, 
-3L))

How to split a column of a variable number of concatenated tags into one column per tag?

The separate_rows function from tidyr may help you get where you want. This splits the strings within tags into separate rows instead of separate columns, which sets you up to use spread.

To get the TRUE/FALSE result I created a new column of all TRUE to use as the value column, and then filled the missing with FALSE in spread. In the end,spread kept the blank cell as a column name, which I removed via select. There may be a better way to do this (maybe convert to NA?).

library(tidyr)
library(dplyr)

data %>%
    separate_rows(tags) %>%
    mutate(tagslog = TRUE) %>%
    spread(tags, tagslog, fill = FALSE) %>%
    select(-one_of(""))

    key     A     B     C     D     E
* <chr> <lgl> <lgl> <lgl> <lgl> <lgl>
1     a  TRUE  TRUE FALSE FALSE FALSE
2     b FALSE  TRUE FALSE FALSE FALSE
3     c  TRUE FALSE FALSE FALSE  TRUE
4     d FALSE FALSE  TRUE  TRUE FALSE
5     e FALSE FALSE FALSE FALSE FALSE

You can almost get where you want with just separate_rows and table, but I still had that extra blank column that would need to be removed.

data %>%
    separate_rows(tags) %>%
    with(., table(key, tags) == 1)

   tags
key           A     B     C     D     E
  a FALSE  TRUE  TRUE FALSE FALSE FALSE
  b FALSE FALSE  TRUE FALSE FALSE FALSE
  c FALSE  TRUE FALSE FALSE FALSE  TRUE
  d FALSE FALSE FALSE  TRUE  TRUE FALSE
  e  TRUE FALSE FALSE FALSE FALSE FALSE

Splitting a column into multiple columns in R, when there is no separator

separate can accept column positions in the sep argument. This acts as if there were separators after columns 1, 2, ..., 7.

library(tidyr)

separate(x, y, into = paste0("y", 1:8), sep = 1:7)

giving:

  y1 y2 y3 y4 y5 y6 y7 y8
1  1  2  3  4  5  6  7  8
2  8  7  6  5  4  3  2  1

Split word in column in R

With base R:

df$size <- substr(df$age,1,1)
df$age  <- substr(df$age,2,2)

And to get the result in the column order you specified:

df[,c("fas","value","age","size","colony")]
     fas       value age size colony
1  C12:0 0.002221915   O    L   7_13
2  C13:0 0.000770179   O    L   7_13
3  C14:0 0.004525352   O    L   7_13
4  C15:0 0.000738928   O    L   7_13
5 C16:1a 0.002964627   O    L   7_13

How to go from wide to long data, when each column need to be split into 3+ columns

library(dplyr)
library(reshape2)
library(tidyr)

reshape2::melt(messy,id.vars="id") %>%
    tidyr::separate("variable",into = c("drop","cue","direction")) %>%
    select(-drop)

   id     cue direction value
1  P1 neutral        up   1.2
2  P2 neutral        up   1.3
3  P3 neutral        up   1.2
4  P1 neutral      down   2.1
5  P2 neutral      down   3.1
6  P3 neutral      down   2.1
7  P1   valid        up   1.2
8  P2   valid        up   1.3
9  P3   valid        up   1.2
10 P1   valid      down   2.1
11 P2   valid      down   3.1
12 P3   valid      down   2.1
13 P1 invalid        up   1.2
14 P2 invalid        up   1.3
15 P3 invalid        up   1.2
16 P1 invalid      down   2.1
17 P2 invalid      down   3.1
18 P3 invalid      down   2.1

Separating data by delimiter in R: How can I specify at which delimiter (for example the 4th in a series of 5) that characters are separated?

You could use strsplit and the following regular expression to separate the string and then do.call and rbind to create a new dataframe with each part in its own column.

CODE TO MATCH SPECIFIC # OF UNDERSCORES

df <- data.frame(x = c("TV_Banana_122_Afternoon_Pre"))

df_new <- data.frame(do.call("rbind", strsplit(sub('(^[^_]+_[^_]+_[^_]+_[^_]+)_(.*)$', '\\1 \\2', df), ' ')))

df_new

OUTPUT

                       X1  X2
1 TV_Banana_122_Afternoon Pre

Per the comment by @AnilGoyal, if you needed to match an additional underscore you would just need to add an additional _[^_]+ to the first match in sub. See example below.

CODE TO MATCH A STRING WITH AN ADDITIONAL UNDERSCORE

df2 <- data.frame(x = c("TV_Banana_122_Afternoon_Test_Pre"))

df2_new <- data.frame(do.call("rbind", strsplit(sub('(^[^_]+_[^_]+_[^_]+_[^_]+_[^_]+)_(.*)$', '\\1 \\2', df2), ' ')))

df2_new

OUTPUT

                            X1  X2
1 TV_Banana_122_Afternoon_Test Pre

Also, if you have strings with varying #s of underscores, but you always want to split at the last underscore, you could just match the whole string up to the last underscore per the regex below.

CODE TO MATCH THE LAST UNDERSCORE

df_new2 <- data.frame(do.call("rbind", strsplit(sub('(.*)_(.*)$', '\\1 \\2', df), ' ')))
df_new2

OUTPUT

                   X1  X2
1 TV_Banana_122_Afternoon Pre

Split Concatenated Column to Corresponding Column Positions