Split Character Columns and Get Names of Field in String

Using regex and the stringi package:

library(data.table)
library(stringi)

setDT(myDT) # convert the data.frame built with structure() to a data.table by reference

fields <- unique(unlist(stri_extract_all(regex = "[a-z]+(?==)", myDT$info)))
patterns <- sprintf("(?<=%s=)[^;]+", fields)
myDT[, (fields) := lapply(patterns, function(x) stri_extract(regex = x, info))]
myDT[, !"info"]

    chr  pos type end
1: chr1 <NA>    3   4
2: chr2 <NA> <NA>   6
3: chr4 TRUE    2   5
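Since the original data is not shown here, the following is a minimal base R sketch of the same idea: collect the field names, then build one lookbehind pattern per field. The `info` strings and the helper `extract_field()` are hypothetical, made up for illustration, and base `regmatches()` stands in for the stringi calls.

```r
# Hypothetical input; the original myDT is not reproduced in this answer
df <- data.frame(
  chr  = c("chr1", "chr2", "chr4"),
  info = c("type=3;end=4", "end=6", "pos=TRUE;type=2;end=5"),
  stringsAsFactors = FALSE
)

# Field names: lowercase runs immediately followed by "="
fields <- unique(unlist(
  regmatches(df$info, gregexpr("[a-z]+(?==)", df$info, perl = TRUE))
))

# Hypothetical helper: value after "field=" up to the next ";" (NA if absent)
extract_field <- function(field, x) {
  m <- regexpr(sprintf("(?<=%s=)[^;]+", field), x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

# One new column per field, then drop the raw string column
df[fields] <- lapply(fields, extract_field, x = df$info)
df$info <- NULL
df
```

Rows that lack a field get NA in that column, which mirrors the <NA> entries in the output above.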

Edit: To get the correct column types, it seems (?) type.convert() can be used:

myDT[, (fields) := lapply(patterns, function(x) type.convert(stri_extract(regex = x, info), as.is = TRUE))]
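To see what type.convert() does on its own, here is a small sketch on plain character vectors. The values are hypothetical stand-ins for what a regex extraction would return.

```r
# Hypothetical extracted values: everything comes back as character
end_chr <- c("4", "6", "5")
pos_chr <- c(NA, NA, "TRUE")

# as.is = TRUE keeps strings as character instead of turning them into factors
end_val <- type.convert(end_chr, as.is = TRUE)  # whole numbers become integer
pos_val <- type.convert(pos_chr, as.is = TRUE)  # TRUE/FALSE strings become logical

class(end_val)
class(pos_val)
```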

R split column names with different occurrences of delimiter into strings and assign unique strings/string counts to a new dataframe

I think splitting at "underscore, digit, underscore" gives what you describe above. Note that this drops the digit and whatever information it carries. Does this matter?

names <- c("strainA_1_batch1", "strainA_2_batch2", "strainB_1_batch1", "strainC_1_batch2", "strainC_2_batch2", 
"strainD_a_1_batch1", "strainD_b_1_batch1")

# split at the underscore, digit and underscore
splitList <- strsplit(names, "_\\d_")

# convert to data.frame
df <- data.frame(t(as.data.frame.list(splitList)))

# clean up data.frame
rownames(df) <- NULL
names(df) <- c("Strain", "Batch")
df

#report
table(df$Strain)
table(df$Batch)

Another option is to replace the underscore on either side of the digit with a " " (or other character) and then split on the space.

names<-gsub("_(\\d)_", " \\1 ", names)
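A minimal sketch of how that second option could be finished, so the digit is kept as its own column. The vector is a shortened copy of the names above, and the column name `Replicate` is my own label, not something from the question.

```r
# Two of the names from above, redefined so this block runs on its own
sample_names <- c("strainA_1_batch1", "strainD_a_1_batch1")

# Replace the underscores around the single digit with spaces, keeping the digit
spaced <- gsub("_(\\d)_", " \\1 ", sample_names)

# Split on the space: strain, digit, batch (assumes no other spaces in the names)
parts <- do.call(rbind, strsplit(spaced, " ", fixed = TRUE))
df2 <- data.frame(Strain = parts[, 1],
                  Replicate = parts[, 2],  # hypothetical column name
                  Batch = parts[, 3],
                  stringsAsFactors = FALSE)
df2
```

Underscores inside the strain name (as in "strainD_a") are left alone, because only the pair around the digit is replaced.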

How to split a character column into multiple columns in R

You can get what you want with gsub:

gsub("^.* +- +([A-Za-z ]+) \\(.*$", "\\1", df$District)
[1] "North West" "North West" "North West" "North West" "North West" "North West"

The first argument to gsub ("^.* +- +([A-Za-z ]+) \\(.*$") is a regular expression. It can be interpreted as follows:

From the beginning of the string "^", match any characters ".*" followed by at least one space, a hyphen, and at least one space " +- +". Then capture "()" the following text, which consists of one or more letters and spaces "[A-Za-z ]+". Stop capturing at a space followed by an opening parenthesis " \\(", then match everything up to the end of the string ".*$".

The second argument of gsub, "\\1", says to replace the whole match with the text that was captured by the parentheses.

To assign it to a variable:

df$name <- gsub("^.* +- +([A-Za-z ]+) \\(.*$", "\\1", df$District)
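To try the pattern without the original data, here is a small sketch. The `district` values are hypothetical, since the real column is not shown, and the last one is there to show what happens when the pattern does not match.

```r
# Hypothetical District values; the real column is not shown in this answer
district <- c("D01 - North West (Region 3)",
              "D02 - Free State (Region 1)",
              "no hyphen here")

# Same pattern as above: keep only the captured name
name <- gsub("^.* +- +([A-Za-z ]+) \\(.*$", "\\1", district)
name
```

Strings that do not match the pattern are returned unchanged by gsub, so it can be worth checking for those before trusting the new column.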

Split data frame string column into multiple columns

Use stringr::str_split_fixed

library(stringr)
str_split_fixed(before$type, "_and_", 2)
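If stringr is not available, a rough base R equivalent might look like this. The data frame is hypothetical (only the names `before` and `type` come from the snippet above), and the sketch assumes every value contains "_and_" exactly once.

```r
# Hypothetical data; only the names `before` and `type` come from the snippet above
before <- data.frame(attr = c(1, 30),
                     type = c("foo_and_bar", "foo_and_bar_2"),
                     stringsAsFactors = FALSE)

# Split on the literal "_and_" and bind the pieces into a two-column matrix
pieces <- do.call(rbind, strsplit(before$type, "_and_", fixed = TRUE))

# Attach the pieces as new columns (type_1 and type_2 are illustrative names)
after <- cbind(before["attr"], type_1 = pieces[, 1], type_2 = pieces[, 2])
after
```

Unlike str_split_fixed, this does not pad or truncate: if some rows split into a different number of pieces, rbind recycles values, so check the piece counts first.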

How to make a row the column names and split up a string into multiple rows

Something like this should probably work:

df = df[, c(1, 5)]

## Split on comma and add to dataframe
tmp = strsplit(df$molecules, ",")
df = cbind(df[, -2], do.call(rbind, tmp))

## Transpose the dataframe
df = t(df)
rownames(df) = NULL
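Here is a self-contained sketch of the same steps that also turns the identifiers into column names. The input is hypothetical (an `id` column plus a comma-separated `molecules` column), and it assumes every row splits into the same number of pieces.

```r
# Hypothetical input: an id column and a comma-separated "molecules" column
df <- data.frame(id = c("s1", "s2"),
                 molecules = c("a,b,c", "d,e,f"),
                 stringsAsFactors = FALSE)

# One list element per input row, bound into one matrix row per input row
tmp <- strsplit(df$molecules, ",", fixed = TRUE)
mat <- do.call(rbind, tmp)

# Transpose so each id gets its own column, then use the ids as column names
out <- as.data.frame(t(mat), stringsAsFactors = FALSE)
names(out) <- df$id
out
```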

How to split a column into multiple (non equal) columns in R

We could use cSplit from splitstackshape

library(splitstackshape)
cSplit(DF, "Col1",",")

Output:

   Col1_1 Col1_2 Col1_3 Col1_4
1:      a      b      c   <NA>
2:      a      b   <NA>   <NA>
3:      a      b      c      d
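For comparison, a base R sketch that produces the same shape without splitstackshape. `DF` is a hypothetical input reconstructed from the output above, and the padding relies on the fact that indexing past the end of a vector yields NA.

```r
# Hypothetical input reconstructed from the output shown above
DF <- data.frame(Col1 = c("a,b,c", "a,b", "a,b,c,d"), stringsAsFactors = FALSE)

pieces <- strsplit(DF$Col1, ",", fixed = TRUE)
width  <- max(lengths(pieces))

# Pad every row to the widest one; p[seq_len(width)] fills the gap with NA
padded <- t(vapply(pieces, function(p) p[seq_len(width)], character(width)))
colnames(padded) <- paste0("Col1_", seq_len(width))

wide <- as.data.frame(padded, stringsAsFactors = FALSE)
wide
```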

Split a string by number of characters in a column of a data frame to create multiple columns in R?

We can use separate

library(tidyr)
separate(df, ID, into = c("Spl_1", "Spl_2"), sep = 4, remove = FALSE)
#           ID Spl_1  Spl_2 Var1 Var2
#1  0334KLM001  0334 KLM001   aa   xx
#2  1334HDM002  1334 HDM002  zvv   rr
#3  2334WEM003  2334 WEM003 qetr  qwe
#4  3334OKT004  3334 OKT004   ff  sdf
#5  4334WER005  4334 WER005   ee  sdf
#6  5334BBC006  5334 BBC006  qly  ssg
#7  6334QQQ007  6334 QQQ007   kk  htj
#8  7334AAA008  7334 AAA008   uu  yjy
#9  8334CBU009  8334 CBU009   ww wttt
#10 9334MLO010  9334 MLO010   aa   dg

If we want 3 columns, we can pass a vector in sep

separate(df, ID, into = c("Spl_1", "Spl_2", "Spl_3"), sep = c(4,8), remove = FALSE)
#           ID Spl_1 Spl_2 Spl_3 Var1 Var2
#1  0334KLM001  0334  KLM0    01   aa   xx
#2  1334HDM002  1334  HDM0    02  zvv   rr
#3  2334WEM003  2334  WEM0    03 qetr  qwe
#4  3334OKT004  3334  OKT0    04   ff  sdf
#5  4334WER005  4334  WER0    05   ee  sdf
#6  5334BBC006  5334  BBC0    06  qly  ssg
#7  6334QQQ007  6334  QQQ0    07   kk  htj
#8  7334AAA008  7334  AAA0    08   uu  yjy
#9  8334CBU009  8334  CBU0    09   ww wttt
#10 9334MLO010  9334  MLO0    10   aa   dg

If the numbers at the beginning are not of fixed length, use extract

extract(df, ID, into = c("Spl_1", "Spl_2"), "^([0-9]+)(.*)", remove = FALSE)

and for 3 columns,

extract(df, ID, into = c("Spl_1", "Spl_2", "Spl_3"), "(.{4})(.{4})(.*)", remove = FALSE)
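If tidyr is not an option, the two cases can be sketched in base R as well: substr()/substring() for fixed positions and regexec() with capture groups for the variable-length prefix. The IDs are taken from the output above; the object names `fixed` and `variable` are only for illustration.

```r
# Two IDs taken from the output above
id <- c("0334KLM001", "1334HDM002")

# Fixed positions: characters 1-4, 5-8, and 9 to the end (like sep = c(4, 8))
fixed <- data.frame(Spl_1 = substr(id, 1, 4),
                    Spl_2 = substr(id, 5, 8),
                    Spl_3 = substring(id, 9),
                    stringsAsFactors = FALSE)

# Variable-length numeric prefix: each list element is c(full match, group 1, group 2)
m <- regmatches(id, regexec("^([0-9]+)(.*)$", id))
variable <- data.frame(Spl_1 = vapply(m, `[`, character(1), 2),
                       Spl_2 = vapply(m, `[`, character(1), 3),
                       stringsAsFactors = FALSE)

fixed
variable
```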

How to split a string into multiple columns by a given pattern?

If the strings are always in that same format, the following regular expression should work well:

library(stringr)
x <- "\r\n \r\n How to get a confirm ticket?\r\n \r\n I want to get a tatkal ticket confirm ..."
str_split(x, "(\r\n\\s*)+", simplify = TRUE)[, -1, drop = FALSE]
     [,1]                           [,2]
[1,] "How to get a confirm ticket?" "I want to get a tatkal ticket confirm ..."
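A base R variant of the same split, as a sketch: trimming the string first removes the leading run of line breaks, so there is no empty first piece to drop afterwards. It uses the same example string and the same pattern as above.

```r
x <- "\r\n \r\n How to get a confirm ticket?\r\n \r\n I want to get a tatkal ticket confirm ..."

# trimws() strips the leading "\r\n \r\n " so the split yields no empty first element
parts <- strsplit(trimws(x), "(\r\n\\s*)+", perl = TRUE)[[1]]

# One row, one column per piece
matrix(parts, nrow = 1)
```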

If your data actually comes from a table in a text file or from a web page, there are probably more convenient options.
