R Split Column by Fixed Width

R split column by fixed width

One option would be to use extract from tidyr

library(tidyr)
extract(extract(df, DDMMYY, c("DD","MM", "YY"), "(..)(..)(..)",
convert=TRUE), HHMM, c("HH", "MM"), "(..)(..)", convert=TRUE)
# ID DD MM YY HH MM
#1 1 10 3 10 22 5
#2 2 11 3 10 10 45
#3 3 12 3 10 11 10
#4 4 13 3 10 22 50

Or you could use strsplit from base R

 df[,c("DD", "MM", "YY", "HH", "MM")] <- do.call(data.frame,lapply(df[,-1],
function(x) do.call(rbind,lapply(strsplit(as.character(x),
'(?<=..)(?=..)', perl=TRUE), as.numeric))))

df[,-(2:3)]
# ID DD MM YY HH MM.1
#1 1 10 3 10 22 5
#2 2 11 3 10 10 45
#3 3 12 3 10 11 10
#4 4 13 3 10 22 50

Text to columns by fixed width in R

I regularly use these two functions:

substrRight <- function(x, n){
substr(x, nchar(x)-n+1, nchar(x))
}

and

substrLeft <- function(x, n){
substr(x, 1,n)
}

Which cutoff n characters left or right of the string

Column split by fixed length in R

1) read.fwf This can be done using read.fwf as mentioned in the comments. This is straight-forward although it requires tedious determination of the field widths.

w <- c(7, 11, 10, 5, 22, 10, 9, 9, 9, 9, 12)
read.fwf("myfile.dat", w)

The remaining methods assume there is at least one space between each field. That is at least the case in the sample input.

2) read.pattern() This requires the development version of gsubfn package. This does require a long regular expression although its a reasonably straight forward one matching spaces (" +"), non-spaces ("\\S+") and optional non-spaces ("\\S*").

library(gsubfn)
library(devtools) # source_url supports https whereas source only supports http
source_url("https://gsubfn.googlecode.com/svn/trunk/R/read.pattern.R") # from dev repo

# the third capture group is "\\S*" whereas the others are "\\S+"
pat_ <- rep(c("(\\S+) +", "(\\S*) +", "(\\S+) +", "(\\S+)"), c(2, 1, 7, 1))
pat <- paste0( c( "^ *", pat_, " *$"), collapse = "")
read.pattern("myfile.dat", pattern = pat, as.is = TRUE)

3) gsubfn() With slightly more work we can use gsubfn() in the same package in which case we can use the CRAN version of the package. Use pat defined above:

Lines <- readLines("myfile.dat")
tmp <- gsubfn(pat, ... ~ paste(..., sep = ","), Lines)
read.table(text = tmp, sep = ",", as.is = TRUE)

4) count.fields We can explicitly count fields in each line to get another solution. This requires no addon packages. It avoids manual field width counting but does require more code.

Lines <- readLines("myfile.dat")

k <- count.fields("myfile.dat")
Lines2 <- gsub("^ *| *$", "", Lines) # trim whitespace from beginning and end
Lines3 <- ifelse(k == 10, sub("^(\\S+ *\\S+)", "\\1,", Lines2), Lines2) # insert extra ,

Lines4 <- gsub(" +", ",", Lines3) # replace each string of spaces with a ,
read.table(text = Lines4, sep = ",", as.is = TRUE)

5) sub Slightly shorter and still without addon packages is this one. We use sub to insert an extra comma into the short lines and then replace each string of spaces with a comma and read:

Lines <- readLines("myfile.dat")
pat2 <- "^( *(\\S+ +){2})((\\S+ +){7}\\S+) *$"
g <- gsub(" +", ",", sub(pat2, "\\1,\\3", Lines))

read.table(text = g, sep = ",", as.is = TRUE)

REVISED Additional solutions.

How can I create a DataFrame with separate columns from a fixed width character vector input in R?

You can use textConnection to read file as text in read.fwf and supply the widths.

data <- read.fwf(textConnection(text), 
widths = c(12, 14, 20), strip.white = TRUE, skip = 3)
data
# V1 V2 V3
#1 AA A134 abcd
#2 AB A123 def
#3 AC A345 ghikl
#4 BA B134 jklmmm
#5 AD A987 mn

data

text <- c("           Report", "Group        ID           Name", "Number", 
"AA A134 abcd", "AB A123 def",
"AC A345 ghikl", "BA B134 jklmmm",
"AD A987 mn")

Split dataframe based on one column in r, with a non-fixed width column

You can use tidyr::separate_rows:

library(tidyr)
separate_rows(data, treatments)

# id treatments
#1 1 1
#2 1 2
#3 1 3
#4 2 2
#5 2 3
#6 3 8
#7 3 9
#8 3 1
#9 3 2
#10 3 4

Separating a string according to fixed lengths in R, to create columns

Since you mention fixed length delimiting, maybe give read.fwf a try.

read.fwf(textConnection(text), widths = c(3, 21, 13), 
col.names = c("Ticker", "Company", "Date"))
# Ticker Company Date
# 1 ABC ABC BROWNIES COMPANY 1/31/2009
# 2 BCD BCD BROWNIES COMPANY 1/31/2009

You can mess around with the middle 21 value to get it to work on all the data.

Another possibility is to split on three or more spaces.

data.frame(do.call(rbind, strsplit(text, " {3,}")))
# X1 X2 X3
# 1 ABC ABC BROWNIES COMPANY 1/31/2009
# 2 BCD BCD BROWNIES COMPANY 1/31/2009

Using R to separate txt file based on fixed width

We can read the file with readLines and then use read.table

lines <- readLines('yourfile.csv')
read.table(text = gsub(" ([A-Z]\\w+\\s+?[A-Za-z]+?) ", " '\\1' ", lines),
sep = " ", header = FALSE,
colClasses = rep("character", 7), stringsAsFactors = FALSE)
# V1 V2 V3 V4 V5 V6 V7
#1 2016SVIC NN 01407 MOUNT ALEXANDER 7-12 COLLEGE 3031


Related Topics



Leave a reply



Submit