Read observations in fixed width files spanning multiple lines in R
You actually need read.fwf for this.
Set up some sample data
txt <- 'Acura Integra Small 12.9 15.9 18.8 25 31 0 1 4 1.8 140 6300
2890 1 13.2 5 177 102 68 37 26.5 11 2705 0
Acura Legend Midsize 29.2 33.9 38.7 18 25 2 1 6 3.2 200 5500
2335 1 18.0 5 195 115 71 38 30.0 15 3560 0
Audi 90 Compact 25.9 29.1 32.3 20 26 1 1 6 2.8 172 5500
2280 1 16.9 5 180 102 67 37 28.0 14 3375 0'
Read the data with read.fwf, paying attention to the widths argument. For records that span two lines, widths should be a list of two integer vectors, one per line, each giving the field widths on that line.
DF <- read.fwf(textConnection(txt),
               widths = list(
                 c(14, 15, 8, 5, 5, 5, 3, 3, 2, 2, 2, 4, 4, 4),
                 c(5, 2, 5, 2, 4, 4, 3, 3, 5, 3, 5, 1)
               ),
               header = FALSE)
Use the pander package to pretty-print the table, since it has so many columns.
require(pander)
pandoc.table(DF)
##
## ---------------------------------------------------
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## ----- ------- ------- ---- ---- ---- ---- ---- ----
## Acura Integra Small 12.9 15.9 18.8 25 31 0
##
## Acura Legend Midsize 29.2 33.9 38.7 18 25 2
##
## Audi 90 Compact 25.9 29.1 32.3 20 26 1
## ---------------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------
## V10 V11 V12 V13 V14 V15 V16 V17
## ----- ----- ----- ----- ----- ----- ----- -----
## 1 4 1.8 140 6300 2890 1 13.2
##
## 1 6 3.2 200 5500 2335 1 18.0
##
## 1 6 2.8 172 5500 2280 1 16.9
## -----------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------
## V18 V19 V20 V21 V22 V23 V24 V25
## ----- ----- ----- ----- ----- ----- ----- -----
## 5 177 102 68 37 26.5 11 2705
##
## 5 195 115 71 38 30.0 15 3560
##
## 5 180 102 67 37 28.0 14 3375
## -----------------------------------------------
##
## Table: Table continues below
##
##
## -----
## V26
## -----
## 0
##
## 0
##
## 0
## -----
##
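One detail worth knowing when building the widths list: per the read.fwf documentation, multi-line records are concatenated into a single line before the widths are applied, so each physical line needs to be padded to the full width its vector describes. A minimal sketch with made-up, explicitly padded data (the values, widths, and the names demo_txt and demo_df are illustrative, not taken from the file above):

```r
# Illustrative data: every record spans two physical lines.
# Line 1 is exactly 6 + 6 + 4 = 16 characters; line 2 is 5 + 2 = 7.
demo_txt <- c("Acura Small 12.9",
              " 2890 1",
              "Audi  Large 25.9",
              " 2280 0")

demo_df <- read.fwf(textConnection(demo_txt),
                    widths = list(c(6, 6, 4),   # fields on line 1
                                  c(5, 2)),     # fields on line 2
                    strip.white = TRUE,         # passed on to read.table
                    stringsAsFactors = FALSE)

str(demo_df)
```

If the first line of a record were shorter than 16 characters here, the fields from the second line would shift left, so it is worth checking nchar() on the raw lines when the result looks misaligned.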
How to tidy a fixed width file with headers every n (varies) rows?
One other possible solution (no tidyverse) is to read the file line by line, find the header rows, and paste each header onto the end of the data rows that follow it. The pasted lines are then split on whitespace and put into a data.frame.
lines <- readLines("asd.dat")
# Header rows start with "4 "; append length(lines) + 1 as a sentinel
# so that the last block also has an end index
headers <- c(grep("^4 ", lines), length(lines) + 1)
pastedLines <- c()
for (i in seq_len(length(headers) - 1)) {
  # Paste the header onto the end of every data row in its block
  pastedLines <- c(pastedLines,
                   paste(lines[(headers[i] + 1):(headers[i + 1] - 1)], lines[headers[i]]))
}
DF <- as.data.frame(matrix(unlist(strsplit(pastedLines, "\\s+")), nrow = length(pastedLines), byrow = TRUE))
Output:
V1 V2 V3 V4 V5 V6 V7
1 5416001130 1 F 492273 4 64001416230519844TP blahblah
2 5416001140 3 F 492274 4 64001416230519844TP blahblah
3 5416001145 1 F 492275 4 64001416230519844TP blahblah
4 5416001150 19 F 492276 4 64001416230519844TP blahblah
5 5416001155 21 F 492277 4 64001416230519844TP blahblah
6 5416001160 21 F 492278 4 64001416230519844TP blahblah
7 5416001165 13 F 492279 4 64001416230519844TP blahblah
8 5416001170 3 F 492280 4 64001416230519844TP blahblah
9 5416001180 1 F 492281 4 64001416230519844TP blahblah
10 5544001125 1 F 492291 4 64001544250619844RA blahblah
11 5544001130 3 F 492292 4 64001544250619844RA blahblah
12 5544001135 4 F 492293 4 64001544250619844RA blahblah
13 5544001140 11 F 492294 4 64001544250619844RA blahblah
14 5544001145 13 F 492295 4 64001544250619844RA blahblah
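The same header-pasting idea can also be written without a loop, by labelling every line with the running count of headers seen so far. This is a minimal sketch on made-up lines, not the original file: the sample values and the names demo_lines, is_header, block, pasted and out are hypothetical, while the "^4 " header pattern is the one used above. It assumes the input starts with a header row.

```r
# Hypothetical input: header rows start with "4 ", data rows follow them.
demo_lines <- c("4 HDR1 blah",
                "541 1 F",
                "542 3 F",
                "4 HDR2 blah",
                "554 1 F")

is_header <- grepl("^4 ", demo_lines)

# Running count of headers: each line gets the index of the header above it.
block <- cumsum(is_header)

# Paste the matching header onto the end of every data row.
pasted <- paste(demo_lines[!is_header],
                demo_lines[is_header][block[!is_header]])

# Split on whitespace and bind the pieces into a data frame.
out <- as.data.frame(do.call(rbind, strsplit(pasted, "\\s+")),
                     stringsAsFactors = FALSE)
out
```

Because the header lookup is a single vectorised index, a header with no data rows under it simply contributes nothing, rather than producing a backwards index range.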
How can I create a DataFrame with separate columns from a fixed width character vector input in R?
You can wrap the character vector in textConnection so that read.fwf reads it as if it were a file, and supply the column widths.
data <- read.fwf(textConnection(text),
                 widths = c(12, 14, 20), strip.white = TRUE, skip = 3)
data
# V1 V2 V3
#1 AA A134 abcd
#2 AB A123 def
#3 AC A345 ghikl
#4 BA B134 jklmmm
#5 AD A987 mn
Sample data (text must be defined before the read.fwf call above is run):
text <- c(" Report", "Group ID Name", "Number",
"AA A134 abcd", "AB A123 def",
"AC A345 ghikl", "BA B134 jklmmm",
"AD A987 mn")
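For a version that can be pasted and run as-is, here is a small sketch with explicitly padded lines. The report contents, the widths, and the names report and parsed are made up for illustration; skip and col.names are standard read.fwf arguments. The title and column-heading lines are skipped, and the column names are supplied by hand instead.

```r
# Illustrative input: one title line, one heading line, then padded data rows.
# Columns are 6, 6 and up to 8 characters wide.
report <- c("Report",
            "Group ID    Name",
            "AA    A134  abcd",
            "AB    A123  def")

parsed <- read.fwf(textConnection(report),
                   widths = c(6, 6, 8),
                   skip = 2,                              # drop title and heading
                   strip.white = TRUE,                    # trim the padding
                   col.names = c("Group", "ID", "Name"),
                   stringsAsFactors = FALSE)
parsed
```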
How to read multiple files to save and count number of variables in each using a map_* function from purrr?
Fake setup:
library(dplyr)
library(purrr)
set.seed(42)
df <- tibble(file = sprintf("file%i.xlsx", 1:3)) %>%
  mutate(data = map(file, ~ mtcars[, sample(11, size = 7)]))
df
# # A tibble: 3 x 2
# file data
# <chr> <list>
# 1 file1.xlsx <df[,7] [32 x 7]>
# 2 file2.xlsx <df[,7] [32 x 7]>
# 3 file3.xlsx <df[,7] [32 x 7]>
The work:
df %>%
  mutate(
    var.list = map(data, colnames),
    var.n = map_int(var.list, ~ length(unique(.)))
  ) %>%
  # and just to show the differences
  mutate(
    var.names = map_chr(var.list, toString)
  )
# # A tibble: 3 x 5
# file data var.list var.n var.names
# <chr> <list> <list> <int> <chr>
# 1 file1.xlsx <df[,7] [32 x 7]> <chr [7]> 7 mpg, drat, carb, am, cyl, hp, qsec
# 2 file2.xlsx <df[,7] [32 x 7]> <chr [7]> 7 gear, mpg, vs, qsec, hp, carb, drat
# 3 file3.xlsx <df[,7] [32 x 7]> <chr [7]> 7 hp, gear, cyl, qsec, disp, mpg, wt
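The setup above fakes the file contents. A rough sketch of the same pattern with real reads might look like the following; it uses small CSV files written to a temporary directory as stand-ins (the file names, demo_dir, and the choice of read.csv are assumptions for illustration, so swap in whatever reader matches the actual files).

```r
library(purrr)

# Hypothetical stand-in files: two small CSVs in a temporary directory.
demo_dir <- tempfile("demo_")
dir.create(demo_dir)
write.csv(mtcars[, 1:3], file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(mtcars[, 1:5], file.path(demo_dir, "b.csv"), row.names = FALSE)

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a named list, then count the columns in each.
tables <- map(set_names(files, basename(files)), read.csv)
n_vars <- map_int(tables, ncol)
n_vars
```

Naming the list by file name keeps the counts traceable back to their source files.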