Split concatenated column to corresponding column positions
Simply do:
splt <- strsplit(as.character(df$FOO),"\\|")
all_val <- sort(unique(unlist(splt)))
t(sapply(splt,function(x){all_val[!(all_val %in% x)]<-NA;all_val}))
# [,1] [,2] [,3]
#[1,] "A" "B" "C"
#[2,] "A" "B" NA
#[3,] NA "B" "C"
#[4,] "A" NA NA
#[5,] NA NA "C"
data:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))
Please note: my version uses only base R (no libraries needed) and is general. It would also work with:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F'))
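To check the generality claim, here is a minimal sketch that runs the same idea on the extended data. It swaps the in-place NA assignment for ifelse() and adds column names; both are presentation choices of this sketch, not part of the original answer.

```r
# Extended example data from above (one extra row with new values D and F)
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F'))

# fixed = TRUE treats "|" literally, so no regex escaping is needed
splt <- strsplit(as.character(df$FOO), "|", fixed = TRUE)
all_val <- sort(unique(unlist(splt)))

# For each row, keep a value in its own column position, otherwise NA
res <- t(sapply(splt, function(x) ifelse(all_val %in% x, all_val, NA)))
colnames(res) <- all_val
res
```

Each distinct value gets its own column, so the new row adds columns for D and F without any change to the code.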
string split values in two columns, and then concatenate them into a new column
You can use your function with the tidyverse. Use mutate() with map2(.f = my.function) to create a list column, ID, holding all IDs per row (some rows have one ID, some have two in the example data). Then use unnest_wider() to spread that list column into several ID columns, which you can later collapse into one with tidyr::unite().
library(tidyr)
library(dplyr)
library(stringr)
library(purrr)
df %>% mutate(ID=map2(Proteins, Positions.within.proteins, my.function))%>%
unnest_wider(ID, names_sep = '.')%>%
unite(contains('ID'), col='ID', remove = TRUE, sep=";", na.rm=TRUE)
# A tibble: 10 x 3
Proteins Positions.within.proteins ID
<chr> <chr> <chr>
1 Q99755;A2A3N6 276;223 Q99755_276;A2A3N6_223
2 O00329 708 O00329_708
3 O00444 41 O00444_41
4 O14965 162 O14965_162
5 O14976 175 O14976_175
6 Q6A1A2;O15530 84;111 Q6A1A2_84;O15530_111
7 O43318 63 O43318_63
8 O43526 628 O43526_628
9 O43930;P51817 78;78 O43930_78;P51817_78
10 O60331 270 O60331_270
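The definition of my.function is not shown here, so the following base R sketch uses a hypothetical stand-in, make_id, that is only assumed to behave similarly: it splits both columns on ";", pairs the pieces with "_", and joins the pairs back with ";". The two-row data frame is a cut-down copy of the output above.

```r
# Two rows taken from the output above, for illustration only
df <- data.frame(
  Proteins = c("Q99755;A2A3N6", "O00329"),
  Positions.within.proteins = c("276;223", "708"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for my.function (an assumption, not the asker's code)
make_id <- function(proteins, positions) {
  p <- strsplit(proteins, ";", fixed = TRUE)[[1]]
  q <- strsplit(positions, ";", fixed = TRUE)[[1]]
  # Pair protein and position element-wise, then join the pairs with ";"
  paste(paste(p, q, sep = "_"), collapse = ";")
}

# Apply row by row over the two columns
df$ID <- mapply(make_id, df$Proteins, df$Positions.within.proteins,
                USE.NAMES = FALSE)
df
```

Because the pairs are collapsed inside the function, this variant needs no unnest or unite step, at the cost of never exposing the IDs as separate columns.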
In R, can you separate text to columns so that values align?
You can pivot your data a few times to align these values:
library(dplyr)
df %>%
tibble::rowid_to_column("id") %>%
tidyr::separate_rows(Col1, sep = "; ") %>%
tidyr::pivot_wider(id_cols = id,
names_from = Col1,
values_from = Col1) %>%
dplyr::select(-id) %>%
magrittr::set_colnames(paste0("Col", 1:ncol(.)))
Output
Col1 Col2 Col3 Col4 Col5
<chr> <chr> <chr> <chr> <chr>
1 camel cow giraffe panda zebra
2 camel NA giraffe NA zebra
3 NA NA NA panda zebra
How it works
rowid_to_column keeps track of the row numbers, so when the data are pivoted to a longer format we don't lose track of which values belong in which rows.
separate_rows splits Col1 and pivots the data to a longer format. Suggested by @Adam as an improvement.
pivot_wider aligns everything into the columns you specified.
Data
df <- structure(list(Col1 = c("camel; cow; giraffe; panda; zebra",
"camel; giraffe; zebra", "panda; zebra")), class = "data.frame", row.names = c(NA,
-3L))
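To make the long-then-wide idea from "How it works" concrete, here is a rough base R sketch of the same three steps on this data. It is not the tidyverse code above, only an illustration of the mechanism; the names parts, long, cols and wide are made up for the example.

```r
df <- data.frame(
  Col1 = c("camel; cow; giraffe; panda; zebra",
           "camel; giraffe; zebra", "panda; zebra"),
  stringsAsFactors = FALSE
)

# Steps 1 and 2: remember the row id and go long (one value per row)
parts <- strsplit(df$Col1, "; ", fixed = TRUE)
long <- data.frame(
  id = rep(seq_along(parts), lengths(parts)),
  value = unlist(parts),
  stringsAsFactors = FALSE
)

# Step 3: go wide, placing each value in the column that belongs to it
cols <- sort(unique(long$value))
wide <- matrix(NA_character_, nrow = length(parts), ncol = length(cols),
               dimnames = list(NULL, paste0("Col", seq_along(cols))))
wide[cbind(long$id, match(long$value, cols))] <- long$value
wide
```

Columns here are ordered alphabetically by value, which happens to match the output shown above for this data; pivot_wider orders columns by first appearance instead.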
How to split a column of a variable number of concatenated tags into one column per tag?
The separate_rows function from tidyr may help you get where you want. It splits the strings within tags into separate rows instead of separate columns, which sets you up to use spread.
To get the TRUE/FALSE result I created a new column of all TRUE to use as the value column, and then filled the missing combinations with FALSE in spread. In the end, spread kept the blank cell as a column name, which I removed via select. There may be a better way to do this (maybe convert to NA?).
library(tidyr)
library(dplyr)
data %>%
separate_rows(tags) %>%
mutate(tagslog = TRUE) %>%
spread(tags, tagslog, fill = FALSE) %>%
select(-one_of(""))
key A B C D E
* <chr> <lgl> <lgl> <lgl> <lgl> <lgl>
1 a TRUE TRUE FALSE FALSE FALSE
2 b FALSE TRUE FALSE FALSE FALSE
3 c TRUE FALSE FALSE FALSE TRUE
4 d FALSE FALSE TRUE TRUE FALSE
5 e FALSE FALSE FALSE FALSE FALSE
You can almost get where you want with just separate_rows and table, but I still had that extra blank column that would need to be removed.
data %>%
separate_rows(tags) %>%
with(., table(key, tags) == 1)
tags
key A B C D E
a FALSE TRUE TRUE FALSE FALSE FALSE
b FALSE FALSE TRUE FALSE FALSE FALSE
c FALSE TRUE FALSE FALSE FALSE TRUE
d FALSE FALSE FALSE TRUE TRUE FALSE
e TRUE FALSE FALSE FALSE FALSE FALSE
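One possible way to avoid the blank column altogether is sketched below in base R. The input data is not shown in this answer, so the data frame here is reconstructed from the output above, and the comma separator is an assumption. The sketch relies on strsplit() returning an empty vector for an empty string, so a row without tags never contributes a tag name.

```r
# Hypothetical input, reconstructed from the output above (separator assumed)
data <- data.frame(
  key = c("a", "b", "c", "d", "e"),
  tags = c("A,B", "B", "A,E", "C,D", ""),
  stringsAsFactors = FALSE
)

# An empty string splits to character(0), so no "" tag is ever created
parts <- strsplit(data$tags, ",", fixed = TRUE)
all_tags <- sort(unique(unlist(parts)))

# One logical column per tag: TRUE where the row carries that tag
flags <- t(vapply(parts, function(x) all_tags %in% x,
                  logical(length(all_tags))))
colnames(flags) <- all_tags

result <- cbind(data["key"], flags)
result
```

The row for key e comes out as all FALSE, with no extra column to drop afterwards.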
Splitting a column into multiple columns in R, when there is no separator
separate can accept column positions in the sep argument. This acts as if there were separators after columns 1, 2, ..., 7.
library(tidyr)
separate(x, y, into = paste0("y", 1:8), sep = 1:7)
giving:
y1 y2 y3 y4 y5 y6 y7 y8
1 1 2 3 4 5 6 7 8
2 8 7 6 5 4 3 2 1
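For comparison, the same positional split can be sketched in base R with substring(). The data frame x is reconstructed from the output above, and split_at is a hypothetical helper written for this example, not a tidyr function.

```r
# Input reconstructed from the output above (an assumption)
x <- data.frame(y = c("12345678", "87654321"), stringsAsFactors = FALSE)

# Hypothetical helper: cut string s after each of the given character positions
split_at <- function(s, after) {
  substring(s, c(1, after + 1), c(after, nchar(s)))
}

# Cut after positions 1..7, giving 8 single-character pieces per row
pieces <- t(vapply(x$y, split_at, character(8), after = 1:7,
                   USE.NAMES = FALSE))
colnames(pieces) <- paste0("y", 1:8)
result <- as.data.frame(pieces, stringsAsFactors = FALSE)
result
```

The helper also handles uneven widths, for example split_at("abcdef", c(2, 4)) cuts after the second and fourth characters.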
Split word in column in R
With base R:
df$size <- substr(df$age,1,1)
df$age <- substr(df$age,2,2)
And to get the result in the column order you specified:
df[,c("fas","value","age","size","colony")]
fas value age size colony
1 C12:0 0.002221915 O L 7_13
2 C13:0 0.000770179 O L 7_13
3 C14:0 0.004525352 O L 7_13
4 C15:0 0.000738928 O L 7_13
5 C16:1a 0.002964627 O L 7_13
How to go from wide to long data, when each column need to be split into 3+ columns
library(dplyr)
library(reshape2)
library(tidyr)
reshape2::melt(messy,id.vars="id") %>%
tidyr::separate("variable",into = c("drop","cue","direction")) %>%
select(-drop)
id cue direction value
1 P1 neutral up 1.2
2 P2 neutral up 1.3
3 P3 neutral up 1.2
4 P1 neutral down 2.1
5 P2 neutral down 3.1
6 P3 neutral down 2.1
7 P1 valid up 1.2
8 P2 valid up 1.3
9 P3 valid up 1.2
10 P1 valid down 2.1
11 P2 valid down 3.1
12 P3 valid down 2.1
13 P1 invalid up 1.2
14 P2 invalid up 1.3
15 P3 invalid up 1.2
16 P1 invalid down 2.1
17 P2 invalid down 3.1
18 P3 invalid down 2.1
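The messy input is not shown here, so the sketch below assumes column names of the form prefix_cue_direction (the rt_ prefix is invented) and reproduces the melt-then-separate idea in base R on a two-column subset.

```r
# Hypothetical wide input; the column names are an assumption
messy <- data.frame(
  id = c("P1", "P2", "P3"),
  rt_neutral_up = c(1.2, 1.3, 1.2),
  rt_neutral_down = c(2.1, 3.1, 2.1),
  stringsAsFactors = FALSE
)

# "Melt": stack the value columns, recording which column each value came from
value_cols <- setdiff(names(messy), "id")
long <- data.frame(
  id = rep(messy$id, times = length(value_cols)),
  variable = rep(value_cols, each = nrow(messy)),
  value = unlist(messy[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# "Separate": split the old column name on "_" and drop the first piece
parts <- do.call(rbind, strsplit(long$variable, "_", fixed = TRUE))
long$cue <- parts[, 2]
long$direction <- parts[, 3]
long$variable <- NULL
long
```

This assumes every column name has exactly three underscore-separated pieces; names with a different number of pieces would need extra handling.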
Separating data by delimiter in R: How can I specify at which delimiter (for example the 4th in a series of 5) that characters are separated?
You could use sub with the following regular expression to mark the split point with a space, then strsplit to separate the string there, and finally do.call with rbind to create a new data frame with each part in its own column.
CODE TO MATCH SPECIFIC # OF UNDERSCORES
df <- data.frame(x = c("TV_Banana_122_Afternoon_Pre"))
df_new <- data.frame(do.call("rbind", strsplit(sub('(^[^_]+_[^_]+_[^_]+_[^_]+)_(.*)$', '\\1 \\2', df$x), ' ')))
df_new
OUTPUT
X1 X2
1 TV_Banana_122_Afternoon Pre
Per the comment by @AnilGoyal, if you needed to match an additional underscore you would just need to add an additional _[^_]+ to the first capture group in sub. See example below.
CODE TO MATCH A STRING WITH AN ADDITIONAL UNDERSCORE
df2 <- data.frame(x = c("TV_Banana_122_Afternoon_Test_Pre"))
df2_new <- data.frame(do.call("rbind", strsplit(sub('(^[^_]+_[^_]+_[^_]+_[^_]+_[^_]+)_(.*)$', '\\1 \\2', df2$x), ' ')))
df2_new
OUTPUT
X1 X2
1 TV_Banana_122_Afternoon_Test Pre
Also, if you have strings with varying numbers of underscores but always want to split at the last underscore, you can match the whole string up to the last underscore with the regex below. The first .* is greedy, so it consumes everything up to the final underscore.
CODE TO MATCH THE LAST UNDERSCORE
df_new2 <- data.frame(do.call("rbind", strsplit(sub('(.*)_(.*)$', '\\1 \\2', df$x), ' ')))
df_new2
OUTPUT
X1 X2
1 TV_Banana_122_Afternoon Pre
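If the position of the delimiter needs to be a parameter, one option is to skip the regular expression and split on every underscore, then glue the pieces back together on either side of the nth one. The helper below, split_at_nth, is a hypothetical name for this sketch and uses a different technique from the sub-based code above.

```r
# Hypothetical helper: split each string at its n-th delimiter
split_at_nth <- function(x, n, delim = "_") {
  parts <- strsplit(x, delim, fixed = TRUE)
  # Everything up to and including the n-th piece goes left
  left <- vapply(parts, function(p) paste(head(p, n), collapse = delim),
                 character(1))
  # Whatever remains goes right (empty string if nothing remains)
  right <- vapply(parts, function(p) paste(tail(p, -n), collapse = delim),
                  character(1))
  data.frame(X1 = left, X2 = right, stringsAsFactors = FALSE)
}

split_at_nth("TV_Banana_122_Afternoon_Pre", 4)
split_at_nth("TV_Banana_122_Afternoon_Test_Pre", 5)
```

Unlike the approach that inserts a space and splits on it, this does not break when the strings themselves contain spaces.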