Faster Way to Read Fixed-Width Files


Now that there are (between this question and the other major question about efficient reading of fixed-width files) a fair number of options on offer for reading such files, some benchmarking is in order.

I'll use the following on-the-large-side (400 MB) file for comparison. It's just a bunch of random characters with randomly defined fields and widths:

set.seed(21394)
wwidth = 400L
rrows = 1000000

# create the contents at random and write them to file
write.table(
  replicate(
    rrows,
    paste0(sample(letters, wwidth, replace = TRUE), collapse = "")
  ),
  file = "testfwf.txt",
  quote = FALSE, row.names = FALSE, col.names = FALSE
)

# define the fields & write a dictionary
n_fields = 40L
endpoints = unique(
  c(1L, sort(sample(wwidth, n_fields - 1L)), wwidth + 1L)
)
cols = list(
  beg = endpoints[-(n_fields + 1L)],
  end = endpoints[-1L] - 1L
)

dict = data.frame(
  column = paste0("V", seq_len(length(endpoints) - 1L)),
  start = endpoints[-length(endpoints)] - 1,
  length = diff(endpoints)
)

write.csv(dict, file = "testdic.csv", quote = FALSE, row.names = FALSE)

I'll compare the five methods mentioned between these two threads (I'll add others if their authors would like): the base version (read.fwf), piping the result of in2csv to fread (@AnandaMahto's suggestion), Hadley's new readr (read_fwf), the LaF/ffbase approach (@jwijffls' suggestion), and a streamlined version of the approach suggested by the question author (@MarkDanese), combining fread with stri_sub from stringi.

Here is the benchmarking code:

library(data.table)
library(stringi)
library(readr)
library(LaF)
library(ffbase)
library(microbenchmark)

microbenchmark(
  times = 5L,
  utils = read.fwf("testfwf.txt", diff(endpoints), header = FALSE),
  in2csv = fread(cmd = sprintf(
    "in2csv -f fixed -s %s %s",
    "testdic.csv", "testfwf.txt"
  )),
  readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
  LaF = {
    my.data.laf = laf_open_fwf(
      'testfwf.txt',
      column_widths = diff(endpoints),
      column_types = rep("character", length(endpoints) - 1L)
    )
    my.data = laf_to_ffdf(my.data.laf, nrows = rrows)
    as.data.frame(my.data)
  },
  fread = {
    DT = fread("testfwf.txt", header = FALSE, sep = "\n")
    DT[ , lapply(seq_len(length(cols$beg)), function(ii) {
      stri_sub(V1, cols$beg[ii], cols$end[ii])
    })]
  }
)

And the output:

# Unit: seconds
#    expr       min        lq      mean    median        uq       max neval cld
#   utils 423.76786 465.39212 499.00109 501.87568 543.12382 560.84598     5   c
#  in2csv  67.74065  68.56549  69.60069  70.11774  70.18746  71.39210     5  a
#   readr  10.57945  11.32205  15.70224  14.89057  19.54617  22.17298     5  a
#     LaF 207.56267 236.39389 239.45985 237.96155 238.28316 277.09798     5   b
#   fread  14.42617  15.44693  26.09877  15.76016  20.45481  64.40581     5  a

So it seems readr and fread + stri_sub are pretty competitive as the fastest; built-in read.fwf is the clear loser.

Note that the real advantage of readr here is that you can pre-specify the column types; with fread you have to convert the types afterwards.
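As a rough illustration (not part of the benchmark, and all fields in this test file are character anyway, so type.convert() is effectively a no-op here; with real data it would promote numeric fields), the two styles look something like this:

# readr: declare the types up front ("c" = character, one letter per column)
x_readr = read_fwf(
  "testfwf.txt",
  fwf_widths(diff(endpoints)),
  col_types = strrep("c", length(cols$beg))
)

# fread + stri_sub: extract as character, then convert afterwards if needed
DT = fread("testfwf.txt", header = FALSE, sep = "\n")
out = DT[ , lapply(seq_len(length(cols$beg)), function(ii) {
  stri_sub(V1, cols$beg[ii], cols$end[ii])
})]
out[ , (names(out)) := lapply(.SD, type.convert, as.is = TRUE)]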

EDIT: Adding some alternatives

At @AnandaMahto's suggestion I am including some more options, including one that appears to be a new winner! To save time I excluded the slowest options above in the new comparison. Here's the new code:

library(iotools)

microbenchmark(
  times = 5L,
  readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
  fread = {
    DT = fread("testfwf.txt", header = FALSE, sep = "\n")
    DT[ , lapply(seq_len(length(cols$beg)), function(ii) {
      stri_sub(V1, cols$beg[ii], cols$end[ii])
    })]
  },
  iotools = input.file(
    "testfwf.txt", formatter = dstrfw,
    col_types = rep("character", length(endpoints) - 1L),
    widths = diff(endpoints)
  ),
  awk = fread(header = FALSE, cmd = sprintf(
    "awk -v FIELDWIDTHS='%s' -v OFS=', ' '{$1=$1 \"\"; print}' < testfwf.txt",
    paste(diff(endpoints), collapse = " ")
  ))
)

And the new output:

# Unit: seconds
#     expr       min        lq      mean    median        uq       max neval cld
#    readr  7.892527  8.016857 10.293371  9.527409  9.807145 16.222916     5  a
#    fread  9.652377  9.696135  9.796438  9.712686  9.807830 10.113160     5  a
#  iotools  5.900362  7.591847  7.438049  7.799729  7.845727  8.052579     5  a
#      awk 14.440489 14.457329 14.637879 14.472836 14.666587 15.152156     5   b

So it appears iotools is both very fast and very consistent.
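One more note: because iotools parses raw chunks, the same dstrfw formatter can in principle be applied chunk-wise when the file is too large to read in one go. A hedged sketch only; the chunk.apply() arguments here follow my reading of the iotools documentation, so verify them before relying on this:

# chunk-wise parsing of the fixed-width file (assumed iotools usage)
res = chunk.apply(
  "testfwf.txt",
  function(chunk) dstrfw(
    chunk,
    col_types = rep("character", length(endpoints) - 1L),
    widths = diff(endpoints)
  ),
  CH.MERGE = rbind  # stitch the per-chunk data frames back together
)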

Read fixed width text file

This is a fixed width file. Use read.fwf() to read it:

x <- read.fwf(
  file = url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
  skip = 4,
  widths = c(12, 7, 4, 9, 4, 9, 4, 9, 4))

head(x)

          V1   V2   V3   V4   V5   V6   V7   V8  V9
1  03JAN1990 23.4 -0.4 25.1 -0.3 26.6  0.0 28.6 0.3
2  10JAN1990 23.4 -0.8 25.2 -0.3 26.6  0.1 28.6 0.3
3  17JAN1990 24.2 -0.3 25.3 -0.3 26.5 -0.1 28.6 0.3
4  24JAN1990 24.4 -0.5 25.5 -0.4 26.5 -0.1 28.4 0.2
5  31JAN1990 25.1 -0.2 25.8 -0.2 26.7  0.1 28.4 0.2
6  07FEB1990 25.8  0.2 26.1 -0.1 26.8  0.1 28.4 0.3

Update

The readr package (released in April 2015) provides a simple and fast alternative.

library(readr)

x <- read_fwf(
  file = "http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",
  skip = 4,
  col_positions = fwf_widths(c(12, 7, 4, 9, 4, 9, 4, 9, 4)))

Speed comparison: readr::read_fwf() was roughly 2x faster than utils::read.fwf().
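A minimal sketch of how such a comparison can be run (timings depend on your machine and connection, so download the file once first rather than timing the download):

library(microbenchmark)

tf <- "wksst8110.for"
download.file("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for", tf)
widths <- c(12, 7, 4, 9, 4, 9, 4, 9, 4)

microbenchmark(
  times = 10L,
  base  = read.fwf(tf, widths = widths, skip = 4),
  readr = read_fwf(tf, fwf_widths(widths), skip = 4)
)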

How to efficiently parse fixed width files?

Using the Python standard library's struct module would be fairly easy as well as fairly fast, since it's written in C. The code below shows how to use it. It also allows columns of characters to be skipped by specifying negative values for the number of characters in a field.

import struct

fieldwidths = (2, -10, 24)
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's') for fw in fieldwidths)

# Convert Unicode input to bytes and the result back to Unicode string.
unpack = struct.Struct(fmtstring).unpack_from # Alias.
parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))

print('fmtstring: {!r}, record size: {} chars'.format(fmtstring, struct.calcsize(fmtstring)))

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields: {}'.format(fields))

Output:

fmtstring: '2s 10x 24s', record size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

Here's a way to do it with string slices, as you were considering but were concerned might get too ugly. It is somewhat complicated, and speed-wise it's about the same as the version based on the struct module, although I have an idea about how it could be sped up (which might make the extra complexity worthwhile). See the update below on that topic.

from itertools import zip_longest
from itertools import accumulate

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)  # bool values for padding fields
    flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
    # Optional informational function attributes.
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                               for fw in fieldwidths)
    return parse

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24) # negative widths represent ignored padding fields
parse = make_parser(fieldwidths)
fields = parse(line)
print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
print('fields: {}'.format(fields))

Output:

format: '2s 10x 24s', rec size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

Update

As I suspected, there is a way of making the string-slicing version of the code faster: in Python 2.7 it becomes about the same speed as the version using struct, and in Python 3.x it becomes about 233% faster than both the struct version and its own un-optimized form (the two of which run at roughly the same speed).

What the version presented above does is define a lambda function whose body is a generator expression that computes the slice boundaries at runtime.

parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)

This is equivalent, for given values of i and j in the for loop, to a statement like the following:

parse = lambda line: (line[0:2], line[12:36], line[36:51], ...)

However the latter executes more than twice as fast since the slice boundaries are all constants.

Fortunately, it's relatively easy to convert and "compile" the former into the latter using the built-in eval() function:

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)  # bool flags for padding fields
    flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    slcs = ', '.join('line[{}:{}]'.format(i, j) for pad, i, j in flds if not pad)
    parse = eval('lambda line: ({})\n'.format(slcs))  # Create and compile source code.
    # Optional informational function attributes.
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                               for fw in fieldwidths)
    return parse

How to read a fixed width file knowing column names but not the widths?

To solve this problem I came across the readr function read_fwf(), which takes the file name as its first argument and a column specification as its second; fwf_empty() can produce that specification by guessing the field widths from the data.

Say my file is fixed_width_file.csv and it has a million rows. I can read it using just the column names:

library(readr)
read_fwf("fixed_width_file.csv",
         fwf_empty("fixed_width_file.csv",
                   col_names = c("Name", "Income")),
         skip = 1)

Check that the columns are aligned by looking at the head of the resulting data frame.
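You can also inspect the positions fwf_empty() guessed before committing to them; a small sketch (file name as above):

spec <- fwf_empty("fixed_width_file.csv", col_names = c("Name", "Income"))
str(spec)  # the begin/end positions readr inferred from the blank columns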

I will update the answer as I know more.

Reading big data with fixed width

Without enough details about your data, it's hard to give a concrete answer, but here are some ideas to get you started:

First, if you're on a Unix system, you can get some information about your file by using the wc command. For example wc -l TS_MATRICULA_RS.txt will tell you how many lines there are in your file and wc -L TS_MATRICULA_RS.txt will report the length of the longest line in your file. This might be useful to know. Similarly, head and tail would let you inspect the first and last 10 lines of your text file.
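If you would rather stay inside R, a rough equivalent of those shell checks is sketched below (note that readLines() reads the whole file into memory, so wc will be much faster on a multi-gigabyte file):

peek <- readLines("TS_MATRICULA_RS.txt", n = 10L)   # like head
length(readLines("TS_MATRICULA_RS.txt"))            # like wc -l
max(nchar(readLines("TS_MATRICULA_RS.txt")))        # like wc -L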

Second, some suggestions: Since it appears that you know the widths of each field, I would recommend one of two approaches.

Option 1: csvkit + your favorite method to quickly read large data

csvkit is a set of Python tools for working with CSV files. One of the tools is in2csv, which takes a fixed-width-format file combined with a "schema" file to create a proper CSV that can be used with other programs.

The schema file is, itself, a CSV file with three columns: (1) variable name, (2) start position, and (3) width. An example (from the in2csv man page) is:

column,start,length
name,0,30
birthday,30,10
age,40,3

Once you have created that file, you should be able to use something like:

in2csv -f fixed -s path/to/schemafile.csv path/to/TS_MATRICULA_RS.txt > TS_MATRICULA_RS.csv

From there, I would suggest looking into reading the data with fread from "data.table" or using sqldf.
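For instance, a minimal sketch once the CSV exists (or piping in2csv straight into fread, as in the benchmark near the top of this page):

library(data.table)
DT <- fread("TS_MATRICULA_RS.csv")
# or, skipping the intermediate file:
DT <- fread(cmd = "in2csv -f fixed -s path/to/schemafile.csv path/to/TS_MATRICULA_RS.txt")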

Option 2: sqldf using substr

Using sqldf on a large-ish data file like yours should actually be pretty quick, and you get the benefit of being able to specify exactly what you want to read in using substr.

Again, this will expect that you have a schema file available, like the one described above. Once you have your schema file, you can do the following:

temp <- read.csv("mySchemaFile.csv")

## Construct your "substr" command
GetMe <- paste("select",
               paste("substr(V1, ", temp$start, ", ",
                     temp$length, ") `", temp$column, "`",
                     sep = "", collapse = ", "),
               "from fixed", sep = " ")

## Load "sqldf"
library(sqldf)

## Connect to your file
fixed <- file("TS_MATRICULA_RS.txt")
myDF <- sqldf(GetMe, file.format = list(sep = "_"))

Since you know the widths, you might be able to skip the generation of the schema file. From the widths, it's just a little bit of work with cumsum. Here's a basic example, building on the first example from read.fwf:

ff <- tempfile()
cat(file = ff, "123456", "987654", sep = "\n")
read.fwf(ff, widths = c(1, 2, 3))

widths <- c(1, 2, 3)
length <- cumsum(widths)
start <- length - widths + 1
column <- paste("V", seq_along(length), sep = "")

GetMe <- paste("select",
               paste("substr(V1, ", start, ", ",
                     widths, ") `", column, "`",
                     sep = "", collapse = ", "),
               "from fixed", sep = " ")

library(sqldf)

## Connect to your file
fixed <- file(ff)
myDF <- sqldf(GetMe, file.format = list(sep = "_"))
myDF
unlink(ff)
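The same cumsum arithmetic can also generate the csvkit schema file from Option 1, so you never have to write it by hand. A small sketch (note the 0-based starts, matching the in2csv schema example above):

widths <- c(1, 2, 3)
schema <- data.frame(
  column = paste0("V", seq_along(widths)),
  start  = cumsum(widths) - widths,   # 0-based start positions
  length = widths
)
write.csv(schema, "mySchemaFile.csv", quote = FALSE, row.names = FALSE)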

Read in txt file with fixed width columns

Use read_fwf instead of read_csv.

[read_fwf reads] a table of fixed-width formatted lines into DataFrame.

https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html

import pandas as pd

colspecs = (
    (0, 44),
    (46, 47),
    (48, 49),
    (50, 51),
    (52, 53),
    (54, 55),
    (56, 57),
    (58, 59),
    (60, 66),
    (67, 73),
    (74, 77),
    (78, 80),
    (81, 84),
    (85, 87),
    (88, 90),
    (91, 95),
    (96, 99),
    (100, 103),
    (104, 106),
)
data_url = "http://jse.amstat.org/datasets/04cars.dat.txt"

df = pd.read_fwf(data_url, colspecs=colspecs)
df.columns = (
    "Vehicle Name",
    "Is Sports Car",
    "Is SUV",
    "Is Wagon",
    "Is Minivan",
    "Is Pickup",
    "Is All-Wheel Drive",
    "Is Rear-Wheel Drive",
    "Suggested Retail Price",
    "Dealer Cost",
    "Engine Size (litres)",
    "Number of Cylinders",
    "Horsepower",
    "City Miles Per Gallon",
    "Highway Miles Per Gallon",
    "Weight (pounds)",
    "Wheel Base (inches)",
    "Length (inches)",
    "Width (inches)",
)

And the output for print(df) would be:

                        Vehicle Name  ...  Width (inches)
0        Chevrolet Aveo LS 4dr hatch  ...              66
1             Chevrolet Cavalier 2dr  ...              69
2             Chevrolet Cavalier 4dr  ...              68
3          Chevrolet Cavalier LS 2dr  ...              69
4                  Dodge Neon SE 4dr  ...              67
..                               ...  ...             ...
422         Nissan Titan King Cab XE  ...               *
423                      Subaru Baja  ...               *
424                    Toyota Tacoma  ...               *
425     Toyota Tundra Regular Cab V6  ...               *
426  Toyota Tundra Access Cab V6 SR5  ...               *

[427 rows x 19 columns]

Column names and specifications retrieved from here:

  • http://jse.amstat.org/datasets/04cars.txt

Note: don't forget to specify where each column starts and ends. Without colspecs, pandas guesses the column boundaries from the first rows, which leads to data errors; a unified diff of the CSVs generated with and without colspecs makes the differences obvious.


Read list of files with inconsistent delimiter/fixed width

See one approach below. The whole pipeline might be intimidating at first glance. You can insert a head (or tail) call after each step (%>%) to display the current stage of data transformation. There's a bit of cleanup with regular expressions going on in the gsubs: modify as desired.

library(dplyr)    # rowwise, mutate, filter, group_by, %>%
library(tidyr)    # separate_rows, fill, separate
library(readr)    # read_file
library(stringr)  # str_split (used in Variant B below)

intermediate_result <-
  data.frame(file_name = c('test1.txt', 'test2.txt')) %>%
  rowwise %>%
  ## read file content into a raw string:
  mutate(raw = read_file(file_name)) %>%
  ## separate raw file contents into rows
  ## using newline and carriage return as row delimiters:
  separate_rows(raw, sep = '[\\n\\r]') %>%
  ## provide a compound column for later grouping
  ## by extracting the 'Compound' string from column raw
  ## or setting the compound column to NA otherwise:
  mutate(compound = ifelse(grepl('^Compound', raw),
                           gsub('.*(Compound .*):.*', '\\1', raw),
                           NA)
  ) %>%
  ## remove rows with empty raw text:
  filter(raw != '') %>%
  ## fill missing compound values (NAs) with the last non-NA compound string:
  fill(compound, .direction = 'down') %>%
  ## keep only rows with a tab-separated raw string
  ## indicating tabular data:
  filter(grepl('\\t', raw)) %>%
  ## insert a column header 'Index' because the
  ## original format has four data columns but only three header cols:
  mutate(raw = gsub(' *\\tName', 'Index\tName', raw))

The steps above result in a dataframe with a column 'raw' containing the cleaned-up data as tab-delimited, linefeed-separated strings ready for conversion into tabular form.
From there on, we can either keep the future individual tables inside the parent table as a so-called list column (Variant A) or split column 'raw' and map over it (Variant B, credits to @Dorton).

Variant A produces a column of dataframes inside the dataframe:

intermediate_result %>%
  group_by(compound) %>%
  ## the nifty piece: you can store dataframes inside a dataframe:
  mutate(
    tables = list(read.table(text = raw, header = TRUE, sep = '\t'))
  )
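If you assign the Variant A result, the individual tables can be pulled back out of the list column; a small sketch (the compound label here is hypothetical):

nested <- intermediate_result %>%
  group_by(compound) %>%
  mutate(tables = list(read.table(text = raw, header = TRUE, sep = '\t')))

nested$tables[[1]]  # the first stored data.frame
# nested %>% filter(compound == "Compound 1") %>% pull(tables)  # by compound name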

Variant B produces a list of dataframes named with the corresponding compound:

intermediate_result %>%
  split(f = as.factor(.$compound)) %>%
  lapply(function(x) x %>%
           separate(raw,
                    into = unlist(
                      str_split(x$raw[1], pattern = "\t"))
           )
  )


