Read Observations in Fixed Width Files Spanning Multiple Lines in R

Read observations in fixed width files spanning multiple lines in R

You actually need read.fwf for this.

Set up some sample data

    txt <- 'Acura         Integra        Small   12.9 15.9 18.8 25 31 0 1 4 1.8 140 6300
2890 1 13.2 5 177 102 68 37 26.5 11 2705 0
Acura         Legend         Midsize 29.2 33.9 38.7 18 25 2 1 6 3.2 200 5500
2335 1 18.0 5 195 115 71 38 30.0 15 3560 0
Audi          90             Compact 25.9 29.1 32.3 20 26 1 1 6 2.8 172 5500
2280 1 16.9 5 180 102 67 37 28.0 14 3375 0'

Read the data with read.fwf, paying attention to the widths argument. Because each observation spans two lines, widths should be a list of two integer vectors, one per line, giving the field widths on that line.

DF <- read.fwf(textConnection(txt),
               widths = list(
                 c(14, 15, 8, 5, 5, 5, 3, 3, 2, 2, 2, 4, 4, 4),
                 c(5, 2, 5, 2, 4, 4, 3, 3, 5, 3, 5, 1)
               ),
               header = FALSE)
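
A smaller example may make the list form of widths easier to see. The toy records and the column names below are made up for illustration. Each record occupies two lines, and read.fwf appears to join the lines of a record before cutting out the fields, so it seems worth checking that every line is exactly as long as its widths add up to.

```r
# Hypothetical toy data: each record spans two physical lines.
# Line 1: name (6 chars) + age (2 chars); line 2: city (6 chars).
toy <- c("Alice 30", "Paris ",
         "Bob   41", "Berlin")

widths <- list(c(6, 2), 6)

# Sanity check: line k of a record should be as long as sum(widths[[k]]),
# otherwise later fields can shift.
expected <- rep(sapply(widths, sum), length.out = length(toy))
stopifnot(nchar(toy) == expected)

con <- textConnection(toy)
toy_df <- read.fwf(con, widths = widths,
                   col.names = c("name", "age", "city"),
                   strip.white = TRUE, stringsAsFactors = FALSE)
close(con)

str(toy_df)
```

Here strip.white = TRUE is passed through to read.table to trim the padding from character fields, and col.names replaces the default V1, V2, ... names.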

Use the pander package to pretty-print the table, since it has so many columns.

library(pander)
pandoc.table(DF)
##
## ---------------------------------------------------
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## ----- ------- ------- ---- ---- ---- ---- ---- ----
## Acura Integra Small 12.9 15.9 18.8 25 31 0
##
## Acura Legend Midsize 29.2 33.9 38.7 18 25 2
##
## Audi 90 Compact 25.9 29.1 32.3 20 26 1
## ---------------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------
## V10 V11 V12 V13 V14 V15 V16 V17
## ----- ----- ----- ----- ----- ----- ----- -----
## 1 4 1.8 140 6300 2890 1 13.2
##
## 1 6 3.2 200 5500 2335 1 18.0
##
## 1 6 2.8 172 5500 2280 1 16.9
## -----------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------
## V18 V19 V20 V21 V22 V23 V24 V25
## ----- ----- ----- ----- ----- ----- ----- -----
## 5 177 102 68 37 26.5 11 2705
##
## 5 195 115 71 38 30.0 15 3560
##
## 5 180 102 67 37 28.0 14 3375
## -----------------------------------------------
##
## Table: Table continues below
##
##
## -----
## V26
## -----
## 0
##
## 0
##
## 0
## -----
##

How to tidy a fixed width file with headers every n (varies) rows?

One other possible solution (no tidyverse) is to read the file line by line, find the header rows, and paste each header onto the end of the data rows that follow it. The pasted lines are then split on whitespace and put into a data.frame.

lines <- readLines("asd.dat")

# header rows start with "4 "; the last index + 1 is a sentinel for the iteration
headers <- c(which(grepl("^4 ", lines)), length(lines) + 1)

pastedLines <- c()
for (i in seq_len(length(headers) - 1)) {
  # paste the header onto the end of each data row that follows it
  pastedLines <- c(pastedLines,
                   paste(lines[(headers[i] + 1):(headers[i + 1] - 1)], lines[headers[i]]))
}

# split on whitespace and fill the matrix row by row
DF <- as.data.frame(matrix(unlist(strsplit(pastedLines, "\\s+")), nrow = length(pastedLines), byrow = TRUE))

Output:

           V1 V2 V3     V4 V5                  V6       V7
1 5416001130 1 F 492273 4 64001416230519844TP blahblah
2 5416001140 3 F 492274 4 64001416230519844TP blahblah
3 5416001145 1 F 492275 4 64001416230519844TP blahblah
4 5416001150 19 F 492276 4 64001416230519844TP blahblah
5 5416001155 21 F 492277 4 64001416230519844TP blahblah
6 5416001160 21 F 492278 4 64001416230519844TP blahblah
7 5416001165 13 F 492279 4 64001416230519844TP blahblah
8 5416001170 3 F 492280 4 64001416230519844TP blahblah
9 5416001180 1 F 492281 4 64001416230519844TP blahblah
10 5544001125 1 F 492291 4 64001544250619844RA blahblah
11 5544001130 3 F 492292 4 64001544250619844RA blahblah
12 5544001135 4 F 492293 4 64001544250619844RA blahblah
13 5544001140 11 F 492294 4 64001544250619844RA blahblah
14 5544001145 13 F 492295 4 64001544250619844RA blahblah
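
The same idea can be sketched without a loop by numbering the header blocks with cumsum. The input lines below are invented stand-ins for the file, and the sketch assumes that the very first line is a header and that no data row starts with "4 ".

```r
# Hypothetical stand-in for readLines("asd.dat")
lines <- c("4 HDR1 foo",
           "100 1 F",
           "101 3 F",
           "4 HDR2 bar",
           "200 1 F")

is_header <- grepl("^4 ", lines)

# Running count of headers seen so far = index of the block each line is in
block <- cumsum(is_header)

# For every line, the header of its block (assumes the first line is a header)
header_for_line <- lines[is_header][block]

# Append the matching header to each data row
pasted <- paste(lines[!is_header], header_for_line[!is_header])

# Split on whitespace; every pasted row must yield the same number of fields
fields <- do.call(rbind, strsplit(pasted, "\\s+"))
DF2 <- as.data.frame(fields, stringsAsFactors = FALSE)
DF2
```

If rows can contain a varying number of whitespace-separated fields, the rbind step would need extra care, just as the matrix call in the loop version does.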

How can I create a DataFrame with separate columns from a fixed width character vector input in R?

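You can wrap the character vector in textConnection so that read.fwf reads it like a file, then supply the column widths. skip = 3 drops the three header lines, and strip.white = TRUE trims the padding from each field.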

data <- read.fwf(textConnection(text), 
widths = c(12, 14, 20), strip.white = TRUE, skip = 3)
data
# V1 V2 V3
#1 AA A134 abcd
#2 AB A123 def
#3 AC A345 ghikl
#4 BA B134 jklmmm
#5 AD A987 mn
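
As a possible variation, the columns can be named while reading through the col.names argument. The sketch below builds a shortened, hypothetical input with sprintf so that the padding is explicit; the names Group, ID and Name are taken from the report header, but using them as column names is an assumption about what is wanted.

```r
# Hypothetical input: three header lines, then left-justified, padded fields
# (12 characters for the group, 14 for the ID, the name fills the rest).
report <- c(
  "           Report",
  "Group        ID           Name",
  "Number",
  sprintf("%-12s%-14s%s", c("AA", "AB"), c("A134", "A123"), c("abcd", "def"))
)

con <- textConnection(report)
named <- read.fwf(con, widths = c(12, 14, 20), skip = 3,
                  col.names = c("Group", "ID", "Name"),
                  strip.white = TRUE, stringsAsFactors = FALSE)
close(con)

named
```

The last width (20) is only an upper bound here: lines shorter than the total width simply give a shorter final field.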

Data (the text vector used above):

text <- c("           Report", "Group        ID           Name", "Number",
          "AA          A134          abcd", "AB          A123          def",
          "AC          A345          ghikl", "BA          B134          jklmmm",
          "AD          A987          mn")

How to read multiple files to save and count number of variables in each using a map_* function from purrr?

Fake setup:

library(dplyr)
library(purrr)

set.seed(42)
df <- tibble(file = sprintf("file%i.xlsx", 1:3)) %>%
mutate(data = map(file, ~ mtcars[,sample(11,size=7)]))
df
# # A tibble: 3 x 2
# file data
# <chr> <list>
# 1 file1.xlsx <df[,7] [32 x 7]>
# 2 file2.xlsx <df[,7] [32 x 7]>
# 3 file3.xlsx <df[,7] [32 x 7]>

The work:

df %>%
  mutate(
    var.list = map(data, colnames),
    var.n = map_int(var.list, ~ length(unique(.)))
  ) %>%
  # and just to show the differences
  mutate(
    var.names = map_chr(var.list, toString)
  )
# # A tibble: 3 x 5
# file data var.list var.n var.names
# <chr> <list> <list> <int> <chr>
# 1 file1.xlsx <df[,7] [32 x 7]> <chr [7]> 7 mpg, drat, carb, am, cyl, hp, qsec
# 2 file2.xlsx <df[,7] [32 x 7]> <chr [7]> 7 gear, mpg, vs, qsec, hp, carb, drat
# 3 file3.xlsx <df[,7] [32 x 7]> <chr [7]> 7 hp, gear, cyl, qsec, disp, mpg, wt
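
With real files, the fake setup would be replaced by a reading step. A minimal sketch, assuming CSV files rather than xlsx so that it runs with base R readers: the folder and file names below are placeholders written to a temporary directory only for the demonstration.

```r
library(purrr)

# Hypothetical stand-in files: write three small CSVs to a temporary folder.
tmp <- file.path(tempdir(), "purrr_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(mtcars[, 1:3], file.path(tmp, "file1.csv"), row.names = FALSE)
write.csv(mtcars[, 1:5], file.path(tmp, "file2.csv"), row.names = FALSE)
write.csv(iris,          file.path(tmp, "file3.csv"), row.names = FALSE)

# Read every CSV into a named list of data frames
paths <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
data  <- map(set_names(paths, basename(paths)), read.csv)

# Count the variables (columns) in each file
var_n <- map_int(data, ncol)
var_n
```

Naming the paths with set_names before mapping keeps the file name attached to each count, which plays the role of the file column in the tibble version.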

