Reading big data with fixed width
Without enough details about your data, it's hard to give a concrete answer, but here are some ideas to get you started:
First, if you're on a Unix system, you can get some information about your file by using the wc command. For example, wc -l TS_MATRICULA_RS.txt will tell you how many lines there are in your file, and wc -L TS_MATRICULA_RS.txt will report the length of the longest line in your file. This might be useful to know. Similarly, head and tail would let you inspect the first and last 10 lines of your text file.
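If you want the same quick checks from a script, a few lines of Python cover them (inspect_file is just an illustrative helper, not part of any tool discussed here):

```python
# Rough Python equivalents of `wc -l`, `wc -L`, `head`, and `tail`,
# useful for sizing up an unfamiliar fixed-width file.
def inspect_file(path, n=10):
    with open(path) as fh:
        lines = fh.read().splitlines()
    return {
        "n_lines": len(lines),            # wc -l
        "longest": max(map(len, lines)),  # wc -L
        "head": lines[:n],                # head
        "tail": lines[-n:],               # tail
    }
```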
Second, some suggestions. Since it appears that you know the widths of each field, I would recommend one of two approaches.
Option 1: csvkit + your favorite method to quickly read large data
csvkit is a set of Python tools for working with CSV files. One of those tools is in2csv, which takes a fixed-width-format file combined with a "schema" file to create a proper CSV that can be used with other programs.
The schema file is itself a CSV file with three columns: (1) variable name, (2) start position, and (3) width. An example (from the in2csv man page) is:
column,start,length
name,0,30
birthday,30,10
age,40,3
Once you have created that file, you should be able to use something like:
in2csv -f fixed -s path/to/schemafile.csv path/to/TS_MATRICULA_RS.txt > TS_MATRICULA_RS.csv
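For intuition, here is a minimal pure-Python sketch of what in2csv -f fixed -s does with such a schema (fixed_to_csv is a hypothetical name; the real in2csv handles encodings, type inference, and edge cases this sketch ignores):

```python
import csv
import io

def fixed_to_csv(schema_rows, fixed_text):
    """Convert fixed-width text to CSV using an in2csv-style schema.

    schema_rows: iterable of dicts with 'column', 'start', 'length'
    keys (start is 0-based, as in the man-page example above).
    """
    fields = [(r["column"], int(r["start"]), int(r["length"]))
              for r in schema_rows]
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([name for name, _, _ in fields])
    for line in fixed_text.splitlines():
        # Slice each field out of the line and trim the padding.
        writer.writerow([line[s:s + w].strip() for _, s, w in fields])
    return out.getvalue()
```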
From there, I would suggest looking into reading the data with fread from "data.table" or using sqldf.
Option 2: sqldf using substr
Using sqldf on a large-ish data file like yours should actually be pretty quick, and you get the benefit of being able to specify exactly what you want to read in using substr.
Again, this will expect that you have a schema file available, like the one described above. Once you have your schema file, you can do the following:
temp <- read.csv("mySchemaFile.csv")
## Construct your "substr" command
GetMe <- paste("select",
paste("substr(V1, ", temp$start, ", ",
temp$length, ") `", temp$column, "`",
sep = "", collapse = ", "),
"from fixed", sep = " ")
## Load "sqldf"
library(sqldf)
## Connect to your file
fixed <- file("TS_MATRICULA_RS.txt")
myDF <- sqldf(GetMe, file.format = list(sep = "_"))
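If it helps to see the query-building step outside of R's paste(), here is the same string construction sketched in Python (build_substr_query is a hypothetical helper; like the R code, it assumes SQL's 1-based substr):

```python
def build_substr_query(columns, starts, lengths, table="fixed"):
    # Mirror of the R paste() logic above: one substr() per field,
    # each aliased with backticks, all joined by commas.
    parts = [
        "substr(V1, {}, {}) `{}`".format(s, w, c)
        for c, s, w in zip(columns, starts, lengths)
    ]
    return "select " + ", ".join(parts) + " from " + table
```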
Since you know the widths, you might be able to skip the generation of the schema file. From the widths, it's just a little bit of work with cumsum. Here's a basic example, building on the first example from read.fwf:
ff <- tempfile()
cat(file = ff, "123456", "987654", sep = "\n")
read.fwf(ff, widths = c(1, 2, 3))
widths <- c(1, 2, 3)
length <- cumsum(widths)
start <- length - widths + 1
column <- paste("V", seq_along(length), sep = "")
GetMe <- paste("select",
paste("substr(V1, ", start, ", ",
widths, ") `", column, "`",
sep = "", collapse = ", "),
"from fixed", sep = " ")
library(sqldf)
## Connect to your file
fixed <- file(ff)
myDF <- sqldf(GetMe, file.format = list(sep = "_"))
myDF
unlink(ff)
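The cumsum arithmetic is easy to check in isolation. Here is a small Python sketch of the same start = cumsum(widths) - widths + 1 logic, applied to the "123456" example above (starts_from_widths and split_fixed are illustrative names):

```python
from itertools import accumulate

def starts_from_widths(widths):
    # start_i = cumsum(widths)_i - widths_i + 1  (1-based, like SQL substr)
    ends = list(accumulate(widths))
    return [e - w + 1 for e, w in zip(ends, widths)]

def split_fixed(line, widths):
    # Slice a fixed-width line into its fields using those start positions.
    starts = starts_from_widths(widths)
    return [line[s - 1:s - 1 + w] for s, w in zip(starts, widths)]
```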
Faster way to read fixed-width files
Now that there are (between this and the other major question about effective reading of fixed-width files) a fair number of options on offer for reading such files, I think some benchmarking is appropriate.
I'll use the following on-the-large-side (400 MB) file for comparison. It's just a bunch of random characters with randomly defined fields and widths:
set.seed(21394)
wwidth = 400L
rrows = 1000000
#creating the contents at random
write.table(
  replicate(
    rrows,
    paste0(sample(letters, wwidth, replace = TRUE), collapse = "")
  ),
  file = "testfwf.txt",
  quote = FALSE, row.names = FALSE, col.names = FALSE
)
#defining the fields & writing a dictionary
n_fields = 40L
endpoints = unique(
c(1L, sort(sample(wwidth, n_fields - 1L)), wwidth + 1L)
)
cols = list(
beg = endpoints[-(n_fields + 1L)],
end = endpoints[-1L] - 1L
)
dict = data.frame(
  column = paste0("V", seq_len(length(endpoints) - 1L)),
  start = endpoints[-length(endpoints)] - 1,
  length = diff(endpoints)
)
write.csv(dict, file = "testdic.csv", quote = FALSE, row.names = FALSE)
I'll compare five methods mentioned between these two threads (I'll add some others if the authors would like): the base version (read.fwf), piping the result of in2csv to fread (@AnandaMahto's suggestion), Hadley's new readr (read_fwf), the LaF/ffbase approach (@jwijffls' suggestion), and an improved (streamlined) version of that suggested by the question author (@MarkDanese), combining fread with stri_sub from stringi.
Here is the benchmarking code:
library(data.table)
library(stringi)
library(readr)
library(LaF)
library(ffbase)
library(microbenchmark)
microbenchmark(
times = 5L,
utils = read.fwf("testfwf.txt", diff(endpoints), header = FALSE),
in2csv = fread(cmd = sprintf(
"in2csv -f fixed -s %s %s",
"testdic.csv", "testfwf.txt"
)),
readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
LaF = {
my.data.laf = laf_open_fwf(
'testfwf.txt',
column_widths = diff(endpoints),
column_types = rep("character", length(endpoints) - 1L)
)
my.data = laf_to_ffdf(my.data.laf, nrows = rrows)
as.data.frame(my.data)
},
fread = {
DT = fread("testfwf.txt", header = FALSE, sep = "\n")
DT[ , lapply(seq_len(length(cols$beg)), function(ii) {
stri_sub(V1, cols$beg[ii], cols$end[ii])
})]
}
)
And the output:
# Unit: seconds
# expr min lq mean median uq max neval cld
# utils 423.76786 465.39212 499.00109 501.87568 543.12382 560.84598 5 c
# in2csv 67.74065 68.56549 69.60069 70.11774 70.18746 71.39210 5 a
# readr 10.57945 11.32205 15.70224 14.89057 19.54617 22.17298 5 a
# LaF 207.56267 236.39389 239.45985 237.96155 238.28316 277.09798 5 b
# fread 14.42617 15.44693 26.09877 15.76016 20.45481 64.40581 5 a
So it seems readr and fread + stri_sub are pretty competitive as the fastest; built-in read.fwf is the clear loser.
Note that the real advantage of readr here is that you can pre-specify column types; with fread you'll have to convert types afterwards.
EDIT: Adding some alternatives
At @AnandaMahto's suggestion, I am including some more options, among them one that appears to be a new winner! To save time, I excluded the slowest options above from the new comparison. Here's the new code:
library(iotools)
microbenchmark(
times = 5L,
readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
fread = {
DT = fread("testfwf.txt", header = FALSE, sep = "\n")
DT[ , lapply(seq_len(length(cols$beg)), function(ii) {
stri_sub(V1, cols$beg[ii], cols$end[ii])
})]
},
iotools = input.file(
"testfwf.txt", formatter = dstrfw,
col_types = rep("character", length(endpoints) - 1L),
widths = diff(endpoints)
),
awk = fread(header = FALSE, cmd = sprintf(
"awk -v FIELDWIDTHS='%s' -v OFS=', ' '{$1=$1 \"\"; print}' < testfwf.txt",
paste(diff(endpoints), collapse = " ")
))
)
And the new output:
# Unit: seconds
# expr min lq mean median uq max neval cld
# readr 7.892527 8.016857 10.293371 9.527409 9.807145 16.222916 5 a
# fread 9.652377 9.696135 9.796438 9.712686 9.807830 10.113160 5 a
# iotools 5.900362 7.591847 7.438049 7.799729 7.845727 8.052579 5 a
# awk 14.440489 14.457329 14.637879 14.472836 14.666587 15.152156 5 b
So it appears iotools is both very fast and very consistent.
Reading fixed width format data into R with entries exceeding column width
Edit: Based on the updated information, the files are not fixed width for some records. In this situation, I think readr::read_table is more useful than read_fwf. The following example is a tidyverse approach to importing and processing one of the source files (tb3u2016.txt). A base approach might involve using something like readLines.
Step 1: Read the file in and assign the split records a common record id
library(tidyverse)
df <- read_table("tb3u2016.txt", col_names = FALSE, skip = 11) %>%
rownames_to_column() %>%
mutate(record = if_else(lag(is.na(X2) & rowname > 1), lag(rowname), rowname))
df[37:40, ]
#> # A tibble: 4 x 8
#> rowname X1 X2
#> <chr> <chr> <int>
#> 1 37 999 13900 Bismarck, ND 856 629
#> 2 38 999 13980 Blacksburg-Christiansburg-Radford, NA
#> 3 39 VA 543 455
#> 4 40 145 14010 Bloomington, IL 342 214
#> # ... with 5 more variables: X3 <int>, X4 <int>, X5 <int>, X6 <int>,
#> # record <chr>
Step 2: Combine the split record text, then put the contents into separate variables using tidyr::extract. Trim whitespace and remove the redundant records.
df <- df %>%
mutate(new_X1 = if_else(rowname != record, paste0(lag(X1), X1), X1)) %>%
extract(new_X1, c("CSA", "CBSA", "Name", "Total"), "([0-9]+) ([0-9]+) (.+) ([0-9]+)") %>%
mutate(Name = trimws(Name)) %>%
filter((lead(record) != record) | rowname == 1) %>%
select(CSA, CBSA, Name, Total, X2, X3, X4, X5, X6)
df[37:39, ]
#> # A tibble: 3 x 9
#> CSA CBSA Name Total X2 X3 X4
#> <chr> <chr> <chr> <chr> <int> <int> <int>
#> 1 999 13900 Bismarck, ND 856 629 16 6
#> 2 999 13980 Blacksburg-Christiansburg-Radford,VA 543 455 0 4
#> 3 145 14010 Bloomington, IL 342 214 4 0
#> # ... with 2 more variables: X5 <int>, X6 <int>
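The capture pattern used by tidyr::extract() above works the same way in Python's re module, which makes it easy to test on a single combined record. A small sketch (split_record is a hypothetical helper):

```python
import re

# Same pattern as the tidyr::extract() call: leading digit runs for
# CSA and CBSA, a greedy middle group for Name, and a trailing digit
# run for Total. Backtracking from the right finds the last
# space-delimited number, so commas and spaces inside Name are safe.
pattern = re.compile(r"([0-9]+) ([0-9]+) (.+) ([0-9]+)")

def split_record(text):
    m = pattern.fullmatch(text)
    if m is None:
        return None
    csa, cbsa, name, total = m.groups()
    return csa, cbsa, name.strip(), total
```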
Below is a condensed version of the solution provided to an earlier version of the question, using readr::read_fwf.
Example data
library(readr)
# example data
txt <- " Num of
Struc-
tures
With
3 and 4 5 Units 5 Units
CSA CBSA Name Total 1 Unit 2 Units Units or more or more
999 10180 Abilene, TX 306 298 8 0 0 0
184 10420 Akron, OH 909 905 0 4 0 0"
write_file(txt, "example.txt")
Solution
col_widths <- c(3, 1, 5, 1, 36, 8, 8, 8, 8, 8, NA)
col_names <- c("CSA", "blank_1", "CBSA", "blank_2", "Name", "Total", "units_1", "units_2",
"units_3_and_4", "units_5_or_more", "num_struc_5_or_more")
df <- read_fwf("example.txt", fwf_widths(col_widths, col_names), skip = 7)
df
#> # A tibble: 2 x 11
#> CSA blank_1 CBSA blank_2 Name Total units_1 units_2
#> <int> <chr> <int> <chr> <chr> <int> <int> <int>
#> 1 999 <NA> 10180 <NA> Abilene, TX 306 298 8
#> 2 184 <NA> 10420 <NA> Akron, OH 909 905 0
#> # ... with 3 more variables: units_3_and_4 <int>, units_5_or_more <int>,
#> # num_struc_5_or_more <int>
How to read in a specific column from a large fixed-width file in R
You can use a connection and process the file in blocks:
Replicate your data:
dat <-"10010100100002000000
00010010000001000000
10010000001002000000"
Process in blocks using a connection:
# Define a connection
con = textConnection(dat)
# Do the block update
linesPerUpdate <- 2
result <- character()
repeat {
line <- readLines(con, linesPerUpdate)
result <- c(result, substr(line, start=14, stop=14))
if (length(line) < linesPerUpdate) break
}
# Close the connection
close(con)
The result:
result
[1] "2" "1" "2"
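The same block-wise idea carries over directly to Python, if that's more convenient: read a few lines at a time and slice out just the column of interest (column_from_file is an illustrative helper; positions are 1-based to match R's substr):

```python
def column_from_file(fh, start, stop, lines_per_block=2):
    # Read `lines_per_block` lines per iteration and pull the
    # characters at 1-based positions start..stop from each line,
    # mirroring the readLines() + substr() loop in the R version.
    out = []
    while True:
        block = [fh.readline() for _ in range(lines_per_block)]
        block = [ln for ln in block if ln]  # drop EOF empties
        out.extend(ln[start - 1:stop] for ln in block)
        if len(block) < lines_per_block:
            break
    return out
```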
Read in txt file with fixed width columns
Use read_fwf instead of read_csv.
[read_fwf reads] a table of fixed-width formatted lines into DataFrame.
https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html
import pandas as pd
colspecs = (
(0, 44),
(46, 47),
(48, 49),
(50, 51),
(52, 53),
(54, 55),
(56, 57),
(58, 59),
(60, 66),
(67, 73),
(74, 77),
(78, 80),
(81, 84),
(85, 87),
(88, 90),
(91, 95),
(96, 99),
(100, 103),
(104, 106),
)
data_url = "http://jse.amstat.org/datasets/04cars.dat.txt"
df = pd.read_fwf(data_url, colspecs=colspecs)
df.columns = (
"Vehicle Name",
"Is Sports Car",
"Is SUV",
"Is Wagon",
"Is Minivan",
"Is Pickup",
"Is All-Wheel Drive",
"Is Rear-Wheel Drive",
"Suggested Retail Price",
"Dealer Cost",
"Engine Size (litres)",
"Number of Cylinders",
"Horsepower",
"City Miles Per Gallon",
"Highway Miles Per Gallon",
"Weight (pounds)",
"Wheel Base (inches)",
"Length (inches)",
"Width (inches)",
)
And the output for print(df) would be:
Vehicle Name ... Width (inches)
0 Chevrolet Aveo LS 4dr hatch ... 66
1 Chevrolet Cavalier 2dr ... 69
2 Chevrolet Cavalier 4dr ... 68
3 Chevrolet Cavalier LS 2dr ... 69
4 Dodge Neon SE 4dr ... 67
.. ... ... ...
422 Nissan Titan King Cab XE ... *
423 Subaru Baja ... *
424 Toyota Tacoma ... *
425 Toyota Tundra Regular Cab V6 ... *
426 Toyota Tundra Access Cab V6 SR5 ... *
[427 rows x 19 columns]
Column names and specifications retrieved from here:
- http://jse.amstat.org/datasets/04cars.txt
Note: Don't forget to specify where each column starts and ends. Without colspecs, pandas makes an assumption based on the first row, which leads to data errors; a unified diff of the csv files generated with and without specs makes the differences obvious.
Reading very large fixed(ish) width format txt files from SQL Server Export into R data.tables or likewise
For this particular file:
form <- read.table("SEARCHBASIS_format.txt", as.is = TRUE, skip = 2)
x <- read.table("SEARCHBASIS.txt", col.names = form$V7, as.is = TRUE)
If you sometimes have strings including spaces you'll almost certainly need to process the file externally first.
If you're planning to read really large files I'd suggest (presuming you have awk on your path):
x <- setNames(data.table::fread("awk '{$1=$1}1' SEARCHBASIS.txt"), form$V7)
If you want to use fixed widths you could use:
x <- setNames(fread("gawk 'BEGIN {OFS = \"\t\"; FIELDWIDTHS = \"12 12 12 12 12\"} {for (i = 1; i<= NF; i++) {gsub(/ +$/, \"\", $i);}}1' SEARCHBASIS.txt"), form$V7)
You can also pull the widths from the format file:
x <- setNames(fread(paste0("gawk 'BEGIN {OFS = \"\t\"; FIELDWIDTHS = \"", paste(form$V4, collapse = " "), "\"} {for (i = 1; i<= NF; i++) {gsub(/ +$/, \"\", $i);}}1' SEARCHBASIS.txt")), form$V7)
Note that $1=$1 forces awk to re-evaluate the fields, and the 1 at the end is effectively shorthand for print. I've also assumed you want to strip trailing spaces from each field.
On Windows you'll need to use single quotes in R and replace the single quotes within the command with " and the nested double quotes with "". So the last one above becomes:
x <- setNames(fread(paste0('gawk \"BEGIN {OFS = ""\t""; FIELDWIDTHS = ""', paste(form$V4, collapse = " "), '""} {for (i = 1; i<= NF; i++) {gsub(/ +$/, """", $i);}}1" SEARCHBASIS.txt')), form$V7)
For a cross-platform solution, you'll need to put your awk script in an external file:
stripSpace.awk
BEGIN {OFS="\t"} {for (i = 1; i<= NF; i++) {gsub(/ +$/, "", $i);}}1
R code
x <- setNames(fread(paste0('gawk -v FIELDWIDTHS="', paste(form$V4, collapse = " "), '" -f stripSpace.awk SEARCHBASIS.txt')), form$V7)
Tested on Scientific Linux 6 and Windows 8.1
Read Data from a file with Fixed Width not at same places using Pandas
You can try the default guess and parse the data:
df = pd.read_fwf('a.csv', names=["Time", "Header", "Src->Dest", "C1"])
(df.drop('C1', axis=1)
.join(df['C1'].str.extractall('(.{2})')[0].unstack('match'))
)
Output:
Time Header Src->Dest 0 1 2 3 4 5 6 7 8 9 10 11
0 20:03:42.769 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
1 20:03:42.769 0x11 0x00->0x00 B2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 20:03:42.815 0xAA 0x98->0x99 52 03 00 1C 0C 1C 0C 0E 0E NaN NaN NaN
3 20:03:42.831 0x11 0x00->0x00 EE NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 20:03:50.726 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
5 20:03:50.819 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
6 20:03:50.895 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
7 20:03:54.841 0xAA 0x98->0x99 12 02 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 20:03:54.857 0x11 0x00->0x00 E0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 20:03:54.935 0xAA 0x99->0x98 12 03 00 41 01 02 00 0C 07 0A 00 NaN
10 20:03:54.935 0x11 0x00->0x00 AB NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 20:03:55.013 0xAA 0x99->0x98 12 03 01 05 00 00 05 0C 00 00 03 NaN
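For reference, the .str.extractall('(.{2})') step is just "split the string into consecutive two-character chunks"; in plain Python the equivalent is:

```python
import re

# Split a payload string into consecutive two-character chunks,
# as .str.extractall('(.{2})') does per cell above. A trailing odd
# character is simply dropped, just as with the regex.
def two_char_chunks(payload):
    return re.findall("..", payload)
```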