R Split a Column into Multiple Columns by Pattern

R Split a column into multiple columns by pattern

The approach will vary significantly depending on whether this is what your strings actually look like or just an example. If they always start with two letters followed by numbers, you can take substrings by position:

df <- data.frame(col1 = c("ab 12 14 56", "xb 23 234 2342 2", "ad 23 45"))
# substr() and substring() are vectorized, so no sapply() is needed
df$col1.1 <- substr(df$col1, 1, 2)  # first two characters
df$col1.2 <- substring(df$col1, 3)  # everything from the third character on
df
              col1 col1.1         col1.2
1      ab 12 14 56     ab       12 14 56
2 xb 23 234 2342 2     xb  23 234 2342 2
3         ad 23 45     ad          23 45

If the lengths and positions of the pieces vary, a regular expression is likely a better fit. With base R, you can extract only the letters, or only the digits together with the white space that follows them:

df <- data.frame(col1 = c("ab 12 14 56", "xb 23 234 2342 2", "ad 23 45"))
df$col1.1 <- sapply(regmatches(df$col1, gregexpr("[a-zA-Z]", df$col1)), paste, collapse = "")
df$col1.2 <- sapply(regmatches(df$col1, gregexpr("[0-9]\\s*", df$col1)), paste, collapse = "")
df
              col1 col1.1        col1.2
1      ab 12 14 56     ab      12 14 56
2 xb 23 234 2342 2     xb 23 234 2342 2
3         ad 23 45     ad         23 45
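
When the goal is one column per field rather than a letters/numbers pair, splitting on white space and padding the shorter rows is one option. The following is a minimal base R sketch, assuming the fields are separated by white space; the column names part1, part2, ... are made up for illustration.

```r
df <- data.frame(col1 = c("ab 12 14 56", "xb 23 234 2342 2", "ad 23 45"),
                 stringsAsFactors = FALSE)

# Split each string on runs of white space
pieces <- strsplit(df$col1, "\\s+")

# Pad shorter rows with NA so every row has the same number of fields
width <- max(lengths(pieces))
padded <- t(sapply(pieces, function(p) c(p, rep(NA_character_, width - length(p)))))

# Hypothetical column names; bind the new columns to the original data
colnames(padded) <- paste0("part", seq_len(width))
out <- cbind(df, as.data.frame(padded, stringsAsFactors = FALSE))
out
```

Rows with fewer fields than the longest one end with NA in the trailing columns.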

How to split a string into multiple columns by a given pattern?

If the strings are always in that same format, the following regular expression should work well:

library(stringr)
x <- "\r\n \r\n How to get a confirm ticket?\r\n \r\n I want to get a tatkal ticket confirm ..."
str_split(x, "(\r\n\\s*)+", simplify = TRUE)[, -1, drop = FALSE]
     [,1]                           [,2]
[1,] "How to get a confirm ticket?" "I want to get a tatkal ticket confirm ..."

If your data actually comes from a table in a text file or from a web page, there are probably more convenient options.

How to split one column into multiple columns with pattern in R

We can create a grouping variable with rep and then split on it

split(df1$x,  rep(1:3, c(6, 9, 6)))
#$`1`
#[1] 1 2 3 4 5 6

#$`2`
#[1] 7 8 9 10 11 12 13 14 15

#$`3`
#[1] 16 17 18 19 20 21

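This can be wrapped in a function that takes the data, 'n', and further values through ... Each group size is n times the corresponding value, and the number of values passed through ... must equal n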

f1 <- function(dat, n, ...) {
  rgrp <- n * c(...)
  split(dat[[1]][seq_len(sum(rgrp))], rep(seq_len(n), rgrp))
}

f1(df1, 2, 3, 4)
#$`1`
#[1] 1 2 3 4 5 6

#$`2`
#[1] 7 8 9 10 11 12 13 14

f1(df1, 3, 2, 3, 2)
#$`1`
#[1] 1 2 3 4 5 6

#$`2`
#[1] 7 8 9 10 11 12 13 14 15

#$`3`
#[1] 16 17 18 19 20 21

If the user supplies a vector and n is not given, take n from the length of the vector

f1 <- function(dat, vec) {
  n <- length(vec)
  rgrp <- n * vec
  split(dat[[1]][seq_len(sum(rgrp))], rep(seq_len(n), rgrp))
}

f1(df1, 3:4)

If the user passes the values as separate arguments ('n1', 'n2', and so on), we can collect them with ...

f1 <- function(dat, ...) {
  vec <- c(...)
  n <- length(vec)
  rgrp <- n * vec
  split(dat[[1]][seq_len(sum(rgrp))], rep(seq_len(n), rgrp))
}

f1(df1, 3, 4)

data

df1 <- structure(list(x = 1:21), class = "data.frame", row.names = c(NA, 
-21L))

Split data frame string column into multiple columns

Use stringr::str_split_fixed

library(stringr)
str_split_fixed(before$type, "_and_", 2)
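
As an illustration, here is a small sketch on made-up data. The contents of before and the column names type_1 and type_2 are hypothetical, since the original data frame is not shown here. str_split_fixed returns a character matrix with exactly the requested number of columns, so anything after the first "_and_" stays in the second piece.

```r
library(stringr)

# Hypothetical stand-in for the `before` data frame
before <- data.frame(attr = c(1, 30, 4, 6),
                     type = c("foo_and_bar", "foo_and_bar_2",
                              "foo_and_bar", "foo_and_bar_2"),
                     stringsAsFactors = FALSE)

# Split each string at the first "_and_" into exactly two pieces
parts <- str_split_fixed(before$type, "_and_", 2)

# Attach the pieces as new columns
after <- data.frame(attr = before$attr,
                    type_1 = parts[, 1],
                    type_2 = parts[, 2],
                    stringsAsFactors = FALSE)
after
```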

How to split a character column into multiple columns in R

You can get what you want with gsub:

gsub("^.* +- +([A-Za-z ]+) \\(.*$", "\\1", df$District)
[1] "North West" "North West" "North West" "North West" "North West" "North West"

The first argument to gsub ("^.* +- +([A-Za-z ]+) \\(.*$") is a regular expression. It can be interpreted as follows:

From the beginning of the string "^", match any characters ".*" followed by at least one space, a hyphen, and at least one space " +- +". Then capture "()" the text that follows, which must consist of one or more letters and spaces "[A-Za-z ]+". Stop capturing at a space followed by an opening parenthesis " \\(", then match everything up to the end of the string ".*$".

The second argument of gsub, "\\1", says to replace the matched text with the text that was captured by the parentheses.

To assign it to a variable:

df$name <- gsub("^.* +- +([A-Za-z ]+) \\(.*$", "\\1", df$District)
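
To see the pattern at work, here is a small sketch on made-up input. The strings in district are hypothetical, because the actual df$District values are not shown here; they only assume the "prefix - Name (suffix)" layout that the pattern describes.

```r
# Hypothetical input following the "prefix - Name (suffix)" layout
district <- c("12 - North West (NW)", "Ward 7 - South East (SE)")

pattern <- "^.* +- +([A-Za-z ]+) \\(.*$"

# Replace the whole string with the captured name
name <- gsub(pattern, "\\1", district)
name

# A string that does not match the pattern is returned unchanged
unmatched <- gsub(pattern, "\\1", "no separator here")
unmatched
```

Since unmatched strings pass through untouched, it is worth checking the result for values that still look like the raw input.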

Split a column of strings (with different patterns) based on two different conditions

Here's a tidyr solution:

library(tidyr)
data %>%
extract(string,
into = c("1","2"), # choose your own column labels
"(.*?)_([^_]+)$")
                                1        2
1                    HFUFN-087836      661
2 207465-125 - IK_6 Mar 2009.docx 37484956

How the regex works:

The regex partitions the strings into two "capture groups" plus an underscore in-between:

  • (.*?): first capture group, matching any character (.) zero or more times (*) non-greedily (?)
  • _: a literal underscore
  • ([^_]+)$: the second capture group, matching any character that is not an underscore ([^_]) one or more times (+) at the very end of the string ($)
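
The same two capture groups can also be pulled out without tidyr. This is a base R sketch using regexec and regmatches; perl = TRUE is passed so that the non-greedy quantifier is handled by the PCRE engine, and the column names prefix and suffix are hypothetical.

```r
strings <- c("HFUFN-087836_661", "207465-125 - IK_6 Mar 2009.docx_37484956")

# regexec() records the full match and each capture group;
# regmatches() turns those positions into text, one vector per string
m <- regmatches(strings, regexec("(.*?)_([^_]+)$", strings, perl = TRUE))

# Drop the full match (first element) and keep the two groups.
# This assumes every string matches; a non-matching string yields an
# empty vector and would need separate handling.
parts <- do.call(rbind, lapply(m, function(p) p[-1]))

result <- data.frame(prefix = parts[, 1], suffix = parts[, 2],
                     stringsAsFactors = FALSE)
result
```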

Data:

data = data.frame(string=c("HFUFN-087836_661", "207465-125 - IK_6 Mar 2009.docx_37484956"))

split column data frame into multiple columns

The spaces inside the quoted names make the column difficult to split, but the file itself is easy to read in. See my comments above and use read.table(file="sprint.m.df.txt", sep=" "), or, if you really have to work with your df, try read_delim or scan.

df8 <- readr::read_delim(df[,1], delim=" ", col_names =FALSE)
# OR
df8 <- data.frame(matrix(scan(text=df[,1], what=" "), ncol=8, byrow=TRUE))
colnames(df8) <- c("rank", "Time", "wind", "name", "country", "birthdate", "city", "date")
df8
  rank Time wind        name country birthdate     city       date
1    1 9.58  0.9  Usain Bolt     JAM  21.08.86   Berlin 16.08.2009
2    2 9.63  1.5  Usain Bolt     JAM  21.08.86   London 05.08.2012
3    3 9.69    0  Usain Bolt     JAM  21.08.86  Beijing 16.08.2008
4    3 9.69    2   Tyson Gay     USA  09.08.82 Shanghai 20.09.2009
5    3 9.69 -0.1 Yohan Blake     JAM  26.12.89 Lausanne 23.08.2012
6    6 9.71  0.9   Tyson Gay     USA  09.08.82   Berlin 16.08.2009

Split a column into multiple columns based on string pattern (before delimiter)

The following uses the reshape2 package to get the result you're looking for. Note that because the values are cast into one column per name, cells with no matching value are filled with NA. (Your question shows blanks where columns have two vs. three elements, but a true blank isn't possible in a data.frame: every cell must hold something, in this case NA.) The approach is as follows:
(1) use strsplit to split each string at "_" and collect the name and value in a data frame
(2) use dcast to spread the values into one column per name

library(reshape2)

df <- data.frame(V1 = c("FOO1_Yu", "FOO1_uN", "FOO2_Yo", "FOO2_yA", "FOO10_nO", "FOO10_Yes", "FOO1_NoY"), stringsAsFactors = FALSE)
head(df$V1)

# Build one (name, value) row per input string; skip strings without "_"
splits <- lapply(df$V1, function(x) {
  if (!grepl("_", x)) {
    print(paste("Skipping bad input=", x))
    return(NULL)
  }
  pair <- unlist(strsplit(x, split = "_"))
  # name is the part before the delimiter; value keeps the full string
  data.frame(name = pair[1], value = x)
})

splits <- do.call("rbind", splits)

# One column per name; cells with no matching string become NA
df <- dcast(splits, value ~ name)

The output results as follows:

      value     FOO1    FOO2     FOO10
1   FOO1_Yu  FOO1_Yu    <NA>      <NA>
2   FOO1_uN  FOO1_uN    <NA>      <NA>
3   FOO2_Yo     <NA> FOO2_Yo      <NA>
4   FOO2_yA     <NA> FOO2_yA      <NA>
5  FOO10_nO     <NA>    <NA>  FOO10_nO
6 FOO10_Yes     <NA>    <NA> FOO10_Yes
7  FOO1_NoY FOO1_NoY    <NA>      <NA>

Split delimited strings in multiple columns and separate them into rows

We can do this more easily if we first make the delimiters the same

library(dplyr)
library(tidyr)
library(stringr)
to_expand %>%
mutate(first = str_replace(first, "~", "|")) %>%
separate_rows(first, second, sep = "\\|")
# A tibble: 2 x 2
  first second
  <chr> <chr>
1 a     1~2~3
2 b     4~5~6

