Reading Big Data with Fixed Width

Reading big data with fixed width

Without enough details about your data, it's hard to give a concrete answer, but here are some ideas to get you started:

First, if you're on a Unix system, you can get some information about your file by using the wc command. For example wc -l TS_MATRICULA_RS.txt will tell you how many lines there are in your file and wc -L TS_MATRICULA_RS.txt will report the length of the longest line in your file. This might be useful to know. Similarly, head and tail would let you inspect the first and last 10 lines of your text file.
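
If you would rather run those checks without leaving R, here is a minimal sketch using system2 (it assumes a Unix-like system and simply reuses the file name from the question):

## Quick structural checks on the raw file, run from within R
system2("wc",   c("-l", "TS_MATRICULA_RS.txt"), stdout = TRUE)        # number of lines
system2("wc",   c("-L", "TS_MATRICULA_RS.txt"), stdout = TRUE)        # length of the longest line
system2("head", c("-n", "10", "TS_MATRICULA_RS.txt"), stdout = TRUE)  # first 10 lines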

Second, some suggestions: Since it appears that you know the widths of each field, I would recommend one of two approaches.

Option 1: csvkit + your favorite method to quickly read large data

csvkit is a set of Python tools for working with CSV files. One of the tools is in2csv, which takes a fixed-width-format file combined with a "schema" file to create a proper CSV that can be used with other programs.

The schema file is, itself, a CSV file with three columns: (1) variable name, (2) start position, and (3) width. An example (from the in2csv man page) is:

column,start,length
name,0,30
birthday,30,10
age,40,3

Once you have created that file, you should be able to use something like:

in2csv -f fixed -s path/to/schemafile.csv path/to/TS_MATRICULA_RS.txt > TS_MATRICULA_RS.csv

From there, I would suggest looking into reading the data with fread from "data.table" or using sqldf.
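
As a minimal sketch of that last step (the output file name is simply the one produced by the in2csv command above):

## Read the converted CSV quickly with data.table
library(data.table)
myDF <- fread("TS_MATRICULA_RS.csv")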

Option 2: sqldf using substr

Using sqldf on a large-ish data file like yours should actually be pretty quick, and you get the benefit of being able to specify exactly what you want to read in using substr.

Again, this will expect that you have a schema file available, like the one described above. Once you have your schema file, you can do the following:

temp <- read.csv("mySchemaFile.csv")

## Construct your "substr" command
## (note: SQLite's substr() is 1-based, so if your schema file's start
##  positions are 0-based, as in the in2csv example above, add 1 to temp$start)
GetMe <- paste("select",
               paste("substr(V1, ", temp$start, ", ",
                     temp$length, ") `", temp$column, "`",
                     sep = "", collapse = ", "),
               "from fixed", sep = " ")

## Load "sqldf"
library(sqldf)

## Connect to your file
fixed <- file("TS_MATRICULA_RS.txt")
myDF <- sqldf(GetMe, file.format = list(sep = "_"))

Since you know the widths, you might be able to skip the generation of the schema file. From the widths, it's just a little bit of work with cumsum. Here's a basic example, building on the first example from the read.fwf help page:

ff <- tempfile()
cat(file = ff, "123456", "987654", sep = "\n")
read.fwf(ff, widths = c(1, 2, 3))

widths <- c(1, 2, 3)
length <- cumsum(widths)
start <- length - widths + 1
column <- paste("V", seq_along(length), sep = "")

GetMe <- paste("select",
               paste("substr(V1, ", start, ", ",
                     widths, ") `", column, "`",
                     sep = "", collapse = ", "),
               "from fixed", sep = " ")

library(sqldf)

## Connect to your file
fixed <- file(ff)
myDF <- sqldf(GetMe, file.format = list(sep = "_"))
myDF
unlink(ff)

Faster way to read fixed-width files

Now that there are (between this and the other major question about effective reading of fixed-width files) a fair number of options on offer for reading such files, I think some benchmarking is appropriate.

I'll use the following on-the-large-side (400 MB) file for comparison. It's just a bunch of random characters with randomly defined fields and widths:

set.seed(21394)
wwidth = 400L
rrows = 1000000

# creating the contents at random
write.table(
  replicate(
    rrows,
    paste0(sample(letters, wwidth, replace = TRUE), collapse = "")
  ),
  file = "testfwf.txt",
  quote = FALSE, row.names = FALSE, col.names = FALSE
)

# defining the fields & writing a dictionary
n_fields = 40L
endpoints = unique(
  c(1L, sort(sample(wwidth, n_fields - 1L)), wwidth + 1L)
)
cols = list(
  beg = endpoints[-length(endpoints)],
  end = endpoints[-1L] - 1L
)

dict = data.frame(
  column = paste0("V", seq_len(length(endpoints) - 1L)),
  start = endpoints[-length(endpoints)] - 1,
  length = diff(endpoints)
)

write.csv(dict, file = "testdic.csv", quote = FALSE, row.names = FALSE)

I'll compare the five methods mentioned between these two threads (I'll add some others if the authors would like): the base version (read.fwf), piping the result of in2csv to fread (@AnandaMahto's suggestion), Hadley's new readr (read_fwf), the approach using LaF/ffbase (@jwijffls' suggestion), and a streamlined version of the one suggested by the question author (@MarkDanese), combining fread with stri_sub from stringi.

Here is the benchmarking code:

library(data.table)
library(stringi)
library(readr)
library(LaF)
library(ffbase)
library(microbenchmark)

microbenchmark(
  times = 5L,
  utils = read.fwf("testfwf.txt", diff(endpoints), header = FALSE),
  in2csv = fread(cmd = sprintf(
    "in2csv -f fixed -s %s %s",
    "testdic.csv", "testfwf.txt"
  )),
  readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
  LaF = {
    my.data.laf = laf_open_fwf(
      'testfwf.txt',
      column_widths = diff(endpoints),
      column_types = rep("character", length(endpoints) - 1L)
    )
    my.data = laf_to_ffdf(my.data.laf, nrows = rrows)
    as.data.frame(my.data)
  },
  fread = {
    DT = fread("testfwf.txt", header = FALSE, sep = "\n")
    DT[ , lapply(seq_len(length(cols$beg)), function(ii) {
      stri_sub(V1, cols$beg[ii], cols$end[ii])
    })]
  }
)

And the output:

# Unit: seconds
#    expr       min        lq      mean    median        uq       max neval cld
#   utils 423.76786 465.39212 499.00109 501.87568 543.12382 560.84598     5   c
#  in2csv  67.74065  68.56549  69.60069  70.11774  70.18746  71.39210     5 a
#   readr  10.57945  11.32205  15.70224  14.89057  19.54617  22.17298     5 a
#     LaF 207.56267 236.39389 239.45985 237.96155 238.28316 277.09798     5  b
#   fread  14.42617  15.44693  26.09877  15.76016  20.45481  64.40581     5 a

So it seems readr and fread + stri_sub are pretty competitive as the fastest; built-in read.fwf is the clear loser.

Note that the real advantage of readr here is that you can pre-specify column types; with fread you'll have to type convert afterwards.
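
As a rough sketch of what that difference looks like in practice (the column specification below is an assumption for the random test file, where every field is character; int_cols is purely hypothetical):

## readr: declare the column types up front
res_readr <- read_fwf(
  "testfwf.txt",
  fwf_widths(diff(endpoints)),
  col_types = cols(.default = col_character())
)

## fread + stri_sub: everything comes back as character, so convert afterwards,
## e.g. if some fields were really integers:
## DT[ , (int_cols) := lapply(.SD, as.integer), .SDcols = int_cols]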

EDIT: Adding some alternatives

At @AnandaMahto's suggestion I am including some more options, including one that appears to be a new winner! To save time I excluded the slowest options above in the new comparison. Here's the new code:

library(iotools)

microbenchmark(
  times = 5L,
  readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
  fread = {
    DT = fread("testfwf.txt", header = FALSE, sep = "\n")
    DT[ , lapply(seq_len(length(cols$beg)), function(ii) {
      stri_sub(V1, cols$beg[ii], cols$end[ii])
    })]
  },
  iotools = input.file(
    "testfwf.txt", formatter = dstrfw,
    col_types = rep("character", length(endpoints) - 1L),
    widths = diff(endpoints)
  ),
  awk = fread(header = FALSE, cmd = sprintf(
    "awk -v FIELDWIDTHS='%s' -v OFS=', ' '{$1=$1 \"\"; print}' < testfwf.txt",
    paste(diff(endpoints), collapse = " ")
  ))
)

And the new output:

# Unit: seconds
#    expr       min        lq      mean    median        uq       max neval cld
#   readr  7.892527  8.016857 10.293371  9.527409  9.807145 16.222916     5 a
#   fread  9.652377  9.696135  9.796438  9.712686  9.807830 10.113160     5 a
# iotools  5.900362  7.591847  7.438049  7.799729  7.845727  8.052579     5 a
#     awk 14.440489 14.457329 14.637879 14.472836 14.666587 15.152156     5  b

So it appears iotools is both very fast and very consistent.
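
For standalone use (outside microbenchmark), a minimal sketch of the same iotools call; naming the columns from the dictionary written earlier is just one way you might label the result:

library(iotools)

res <- input.file(
  "testfwf.txt", formatter = dstrfw,
  col_types = rep("character", length(endpoints) - 1L),
  widths = diff(endpoints)
)
names(res) <- dict$column   # label fields using the dictionary created above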

Reading fixed width format data into R with entries exceeding column width

Edit:

Based on the updated information, the files are not fixed width for some records. In this situation, I think readr::read_table is more useful than read_fwf. The following example is a tidyverse approach to importing and processing one of the source files (tb3u2016.txt). A base approach might involve using something like readLines.
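
As a minimal base-R sketch of that readLines idea (it assumes, as in the output below, that the first line of a split record ends in a trailing comma and that its continuation is the single following line):

lines <- readLines("tb3u2016.txt")
lines <- lines[-(1:11)]                                   # drop the report header
split_at <- grep(",[[:space:]]*$", lines)                 # first halves of split records
lines[split_at] <- paste(lines[split_at], lines[split_at + 1])
if (length(split_at) > 0) lines <- lines[-(split_at + 1)] # drop the merged continuations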

Step 1: Read the file in and assign the split records a common record id


library(tidyverse)
df <- read_table("tb3u2016.txt", col_names = FALSE, skip = 11) %>%
  rownames_to_column() %>%
  mutate(record = if_else(lag(is.na(X2) & rowname > 1), lag(rowname), rowname))

df[37:40, ]
#> # A tibble: 4 x 8
#> rowname X1 X2
#> <chr> <chr> <int>
#> 1 37 999 13900 Bismarck, ND 856 629
#> 2 38 999 13980 Blacksburg-Christiansburg-Radford, NA
#> 3 39 VA 543 455
#> 4 40 145 14010 Bloomington, IL 342 214
#> # ... with 5 more variables: X3 <int>, X4 <int>, X5 <int>, X6 <int>,
#> # record <chr>

Step 2: Combine the split record text, then put the contents into separate variables using tidyr::extract. Trim the whitespace and remove the redundant records.


df <- df %>%
  mutate(new_X1 = if_else(rowname != record, paste0(lag(X1), X1), X1)) %>%
  extract(new_X1, c("CSA", "CBSA", "Name", "Total"), "([0-9]+) ([0-9]+) (.+) ([0-9]+)") %>%
  mutate(Name = trimws(Name)) %>%
  filter((lead(record) != record) | rowname == 1) %>%
  select(CSA, CBSA, Name, Total, X2, X3, X4, X5, X6)

df[37:39, ]
#> # A tibble: 3 x 9
#> CSA CBSA Name Total X2 X3 X4
#> <chr> <chr> <chr> <chr> <int> <int> <int>
#> 1 999 13900 Bismarck, ND 856 629 16 6
#> 2 999 13980 Blacksburg-Christiansburg-Radford,VA 543 455 0 4
#> 3 145 14010 Bloomington, IL 342 214 4 0
#> # ... with 2 more variables: X5 <int>, X6 <int>

Below is a condensed version of the solution provided to an earlier version of the question using readr::read_fwf.

Example data

library(readr)

# example data
txt <- " Num of
Struc-
tures
With
3 and 4 5 Units 5 Units
CSA CBSA Name Total 1 Unit 2 Units Units or more or more

999 10180 Abilene, TX 306 298 8 0 0 0
184 10420 Akron, OH 909 905 0 4 0 0"

write_file(txt, "example.txt")

Solution

col_widths <- c(3, 1, 5, 1, 36, 8, 8, 8, 8, 8, NA)
col_names <- c("CSA", "blank_1", "CBSA", "blank_2", "Name", "Total", "units_1", "units_2",
"units_3_and_4", "units_5_or_more", "num_struc_5_or_more")
df <- read_fwf("example.txt", fwf_widths(col_widths, col_names), skip = 7)
df
#> # A tibble: 2 x 11
#> CSA blank_1 CBSA blank_2 Name Total units_1 units_2
#> <int> <chr> <int> <chr> <chr> <int> <int> <int>
#> 1 999 <NA> 10180 <NA> Abilene, TX 306 298 8
#> 2 184 <NA> 10420 <NA> Akron, OH 909 905 0
#> # ... with 3 more variables: units_3_and_4 <int>, units_5_or_more <int>,
#> # num_struc_5_or_more <int>

How to read in a specific column from a large fixed-width file in R

You can use a connection and process the file in blocks:

Replicate your data:

dat <-"10010100100002000000
00010010000001000000
10010000001002000000"

Process in blocks using a connection:

# Define a connection
con = textConnection(dat)

# Do the block update
linesPerUpdate <- 2
result <- character()
repeat {
  line <- readLines(con, linesPerUpdate)
  result <- c(result, substr(line, start = 14, stop = 14))
  if (length(line) < linesPerUpdate) break
}

# Close the connection
close(con)

The result:

result
[1] "2" "1" "2"

Read in txt file with fixed width columns

Use read_fwf instead of read_csv.

[read_fwf reads] a table of fixed-width formatted lines into DataFrame.

https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html

import pandas as pd

colspecs = (
    (0, 44),
    (46, 47),
    (48, 49),
    (50, 51),
    (52, 53),
    (54, 55),
    (56, 57),
    (58, 59),
    (60, 66),
    (67, 73),
    (74, 77),
    (78, 80),
    (81, 84),
    (85, 87),
    (88, 90),
    (91, 95),
    (96, 99),
    (100, 103),
    (104, 106),
)
data_url = "http://jse.amstat.org/datasets/04cars.dat.txt"

df = pd.read_fwf(data_url, colspecs=colspecs)
df.columns = (
    "Vehicle Name",
    "Is Sports Car",
    "Is SUV",
    "Is Wagon",
    "Is Minivan",
    "Is Pickup",
    "Is All-Wheel Drive",
    "Is Rear-Wheel Drive",
    "Suggested Retail Price",
    "Dealer Cost",
    "Engine Size (litres)",
    "Number of Cylinders",
    "Horsepower",
    "City Miles Per Gallon",
    "Highway Miles Per Gallon",
    "Weight (pounds)",
    "Wheel Base (inches)",
    "Length (inches)",
    "Width (inches)",
)

And the output for print(df) would be:

                        Vehicle Name  ...  Width (inches)
0 Chevrolet Aveo LS 4dr hatch ... 66
1 Chevrolet Cavalier 2dr ... 69
2 Chevrolet Cavalier 4dr ... 68
3 Chevrolet Cavalier LS 2dr ... 69
4 Dodge Neon SE 4dr ... 67
.. ... ... ...
422 Nissan Titan King Cab XE ... *
423 Subaru Baja ... *
424 Toyota Tacoma ... *
425 Toyota Tundra Regular Cab V6 ... *
426 Toyota Tundra Access Cab V6 SR5 ... *

[427 rows x 19 columns]

Column names and specifications retrieved from here:

  • http://jse.amstat.org/datasets/04cars.txt

Note: Don't forget to specify where each column starts and ends. Without colspecs, pandas makes an assumption based on the first rows, which leads to data errors; comparing the CSV files generated with and without colspecs (for example with a unified diff) makes the differences clear.


Reading very large fixed(ish) width format txt files from SQL Server Export into R data.tables or likewise

For this particular file:

form <- read.table("SEARCHBASIS_format.txt", as.is = TRUE, skip = 2)
x <- read.table("SEARCHBASIS.txt", col.names = form$V7, as.is = TRUE)

If you sometimes have strings including spaces you'll almost certainly need to process the file externally first.

If you're planning to read really large files I'd suggest (presuming you have awk on your path):

x <- setNames(data.table::fread("awk '{$1=$1}1' SEARCHBASIS.txt"), form$V7)

If you want to use fixed widths you could use:

x <- setNames(fread("gawk 'BEGIN {OFS = \"\t\"; FIELDWIDTHS = \"12 12 12 12 12\"} {for (i = 1; i<= NF; i++) {gsub(/ +$/, \"\", $i);}}1' SEARCHBASIS.txt"), form$V7)

You can also pull the widths from the format file:

x <- setNames(fread(paste0("gawk 'BEGIN {OFS = \"\t\"; FIELDWIDTHS = \"", paste(form$V4, collapse = " "), "\"} {for (i = 1; i<= NF; i++) {gsub(/ +$/, \"\", $i);}}1' SEARCHBASIS.txt")), form$V7)

Note $1=$1 forces awk to reevaluate the fields and the 1 at the end is effectively shorthand for print. I've also assumed you want to strip trailing spaces from each field.

On Windows you'll need to use single quotes as the R string delimiters, replace the single quotes around the awk program with double quotes, and double any double quotes nested inside the program. So the last one above becomes:

x <- setNames(fread(paste0('gawk \"BEGIN {OFS = ""\t""; FIELDWIDTHS = ""', paste(form$V4, collapse = " "), '""} {for (i = 1; i<= NF; i++) {gsub(/ +$/, """", $i);}}1" SEARCHBASIS.txt')), form$V7)

For a cross-platform solution, you'll need to put your awk script in an external file:

stripSpace.awk

BEGIN {OFS="\t"} {for (i = 1; i<= NF; i++) {gsub(/ +$/, "", $i);}}1

R code

x <- setNames(fread(paste0('gawk -v FIELDWIDTHS="', paste(form$V4, collapse = " "), '" -f stripSpace.awk SEARCHBASIS.txt')), form$V7)

Tested on Scientific Linux 6 and Windows 8.1

Read Data from a file with Fixed Width not at same places using Pandas

You can let read_fwf make its default guess at the columns and then parse the packed data column yourself:

df = pd.read_fwf('a.csv', names=["Time", "Header", "Src->Dest", "C1"])
(df.drop('C1', axis=1)
   .join(df['C1'].str.extractall('(.{2})')[0].unstack('match'))
)

Output:

            Time    Header  Src->Dest   0   1   2   3   4   5   6   7   8   9   10  11
0 20:03:42.769 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
1 20:03:42.769 0x11 0x00->0x00 B2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 20:03:42.815 0xAA 0x98->0x99 52 03 00 1C 0C 1C 0C 0E 0E NaN NaN NaN
3 20:03:42.831 0x11 0x00->0x00 EE NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 20:03:50.726 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
5 20:03:50.819 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
6 20:03:50.895 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
7 20:03:54.841 0xAA 0x98->0x99 12 02 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 20:03:54.857 0x11 0x00->0x00 E0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 20:03:54.935 0xAA 0x99->0x98 12 03 00 41 01 02 00 0C 07 0A 00 NaN
10 20:03:54.935 0x11 0x00->0x00 AB NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 20:03:55.013 0xAA 0x99->0x98 12 03 01 05 00 00 05 0C 00 00 03 NaN

