Reading Big Data with Fixed Width

Reading big data with fixed width

Without enough details about your data, it's hard to give a concrete answer, but here are some ideas to get you started:

First, if you're on a Unix system, you can get some information about your file by using the wc command. For example wc -l TS_MATRICULA_RS.txt will tell you how many lines there are in your file and wc -L TS_MATRICULA_RS.txt will report the length of the longest line in your file. This might be useful to know. Similarly, head and tail would let you inspect the first and last 10 lines of your text file.
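
If you would rather run those checks without leaving R, here is a minimal sketch using system2 (it assumes a Unix-like system and simply reuses the file name from the question):

## Quick structural checks on the raw file, run from within R
system2("wc",   c("-l", "TS_MATRICULA_RS.txt"), stdout = TRUE)        # number of lines
system2("wc",   c("-L", "TS_MATRICULA_RS.txt"), stdout = TRUE)        # length of the longest line
system2("head", c("-n", "10", "TS_MATRICULA_RS.txt"), stdout = TRUE)  # first 10 lines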

Second, some suggestions: Since it appears that you know the widths of each field, I would recommend one of two approaches.

Option 1: csvkit + your favorite method to quickly read large data

csvkit is a set of Python tools for working with CSV files. One of the tools is in2csv, which takes a fixed-width-format file combined with a "schema" file to create a proper CSV that can be used with other programs.

The schema file is, itself, a CSV file with three columns: (1) variable name, (2) start position, and (3) width. An example (from the in2csv man page) is:

column,start,length
name,0,30
birthday,30,10
age,40,3

Once you have created that file, you should be able to use something like:

in2csv -f fixed -s path/to/schemafile.csv path/to/TS_MATRICULA_RS.txt > TS_MATRICULA_RS.csv

From there, I would suggest looking into reading the data with fread from "data.table" or using sqldf.
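
As a minimal sketch of that last step (the output file name is simply the one produced by the in2csv command above):

## Read the converted CSV quickly with data.table
library(data.table)
myDF <- fread("TS_MATRICULA_RS.csv")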

Option 2: sqldf using substr

Using sqldf on a large-ish data file like yours should actually be pretty quick, and you get the benefit of being able to specify exactly what you want to read in using substr.

Again, this will expect that you have a schema file available, like the one described above. Once you have your schema file, you can do the following:

temp <- read.csv("mySchemaFile.csv")

## Construct your "substr" command
## (note: SQLite's substr() is 1-based, so if your schema file's start
##  positions are 0-based, as in the in2csv example above, add 1 to temp$start)
GetMe <- paste("select",
               paste("substr(V1, ", temp$start, ", ",
                     temp$length, ") `", temp$column, "`",
                     sep = "", collapse = ", "),
               "from fixed", sep = " ")

## Load "sqldf"
library(sqldf)

## Connect to your file
fixed <- file("TS_MATRICULA_RS.txt")
myDF <- sqldf(GetMe, file.format = list(sep = "_"))

Since you know the widths, you might be able to skip the generation of the schema file. From the widths, it's just a little bit of work with cumsum. Here's a basic example, building on the first example from the read.fwf help page:

ff <- tempfile()
cat(file = ff, "123456", "987654", sep = "\n")
read.fwf(ff, widths = c(1, 2, 3))

widths <- c(1, 2, 3)
length <- cumsum(widths)
start <- length - widths + 1
column <- paste("V", seq_along(length), sep = "")

GetMe <- paste("select",
               paste("substr(V1, ", start, ", ",
                     widths, ") `", column, "`",
                     sep = "", collapse = ", "),
               "from fixed", sep = " ")

library(sqldf)

## Connect to your file
fixed <- file(ff)
myDF <- sqldf(GetMe, file.format = list(sep = "_"))
myDF
unlink(ff)

Faster way to read fixed-width files

Now that there are (between this and the other major question about effective reading of fixed-width files) a fair number of options on offer for reading such files, I think some benchmarking is appropriate.

I'll use the following on-the-large-side (400 MB) file for comparison. It's just a bunch of random characters with randomly defined fields and widths:

set.seed(21394)
wwidth = 400L
rrows = 1000000

# creating the contents at random
write.table(
  replicate(
    rrows,
    paste0(sample(letters, wwidth, replace = TRUE), collapse = "")
  ),
  file = "testfwf.txt",
  quote = FALSE, row.names = FALSE, col.names = FALSE
)

# defining the fields & writing a dictionary
n_fields = 40L
endpoints = unique(
  c(1L, sort(sample(wwidth, n_fields - 1L)), wwidth + 1L)
)
cols = list(
  beg = endpoints[-length(endpoints)],
  end = endpoints[-1L] - 1L
)

dict = data.frame(
  column = paste0("V", seq_len(length(endpoints) - 1L)),
  start = endpoints[-length(endpoints)] - 1,
  length = diff(endpoints)
)

write.csv(dict, file = "testdic.csv", quote = FALSE, row.names = FALSE)

I'll compare the five methods mentioned between these two threads (I'll add some others if the authors would like): the base version (read.fwf), piping the result of in2csv to fread (@AnandaMahto's suggestion), Hadley's new readr (read_fwf), the approach using LaF/ffbase (@jwijffls' suggestion), and a streamlined version of the one suggested by the question author (@MarkDanese), combining fread with stri_sub from stringi.

Here is the benchmarking code:

library(data.table)
library(stringi)
library(readr)
library(LaF)
library(ffbase)
library(microbenchmark)

microbenchmark(
  times = 5L,
  utils = read.fwf("testfwf.txt", diff(endpoints), header = FALSE),
  in2csv = fread(cmd = sprintf(
    "in2csv -f fixed -s %s %s",
    "testdic.csv", "testfwf.txt"
  )),
  readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
  LaF = {
    my.data.laf = laf_open_fwf(
      'testfwf.txt',
      column_widths = diff(endpoints),
      column_types = rep("character", length(endpoints) - 1L)
    )
    my.data = laf_to_ffdf(my.data.laf, nrows = rrows)
    as.data.frame(my.data)
  },
  fread = {
    DT = fread("testfwf.txt", header = FALSE, sep = "\n")
    DT[ , lapply(seq_len(length(cols$beg)), function(ii) {
      stri_sub(V1, cols$beg[ii], cols$end[ii])
    })]
  }
)

And the output:

# Unit: seconds
#    expr       min        lq      mean    median        uq       max neval cld
#   utils 423.76786 465.39212 499.00109 501.87568 543.12382 560.84598     5   c
#  in2csv  67.74065  68.56549  69.60069  70.11774  70.18746  71.39210     5 a
#   readr  10.57945  11.32205  15.70224  14.89057  19.54617  22.17298     5 a
#     LaF 207.56267 236.39389 239.45985 237.96155 238.28316 277.09798     5  b
#   fread  14.42617  15.44693  26.09877  15.76016  20.45481  64.40581     5 a

So it seems readr and fread + stri_sub are pretty competitive as the fastest; built-in read.fwf is the clear loser.

Note that the real advantage of readr here is that you can pre-specify column types; with fread you'll have to type convert afterwards.
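
As a rough sketch of what that difference looks like in practice (the column specification below is an assumption for the random test file, where every field is character; int_cols is purely hypothetical):

## readr: declare the column types up front
res_readr <- read_fwf(
  "testfwf.txt",
  fwf_widths(diff(endpoints)),
  col_types = cols(.default = col_character())
)

## fread + stri_sub: everything comes back as character, so convert afterwards,
## e.g. if some fields were really integers:
## DT[ , (int_cols) := lapply(.SD, as.integer), .SDcols = int_cols]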

EDIT: Adding some alternatives

At @AnandaMahto's suggestion I am including some more options, including one that appears to be a new winner! To save time I excluded the slowest options above in the new comparison. Here's the new code:

library(iotools)

microbenchmark(
  times = 5L,
  readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
  fread = {
    DT = fread("testfwf.txt", header = FALSE, sep = "\n")
    DT[ , lapply(seq_len(length(cols$beg)), function(ii) {
      stri_sub(V1, cols$beg[ii], cols$end[ii])
    })]
  },
  iotools = input.file(
    "testfwf.txt", formatter = dstrfw,
    col_types = rep("character", length(endpoints) - 1L),
    widths = diff(endpoints)
  ),
  awk = fread(header = FALSE, cmd = sprintf(
    "awk -v FIELDWIDTHS='%s' -v OFS=', ' '{$1=$1 \"\"; print}' < testfwf.txt",
    paste(diff(endpoints), collapse = " ")
  ))
)

And the new output:

# Unit: seconds
#    expr       min        lq      mean    median        uq       max neval cld
#   readr  7.892527  8.016857 10.293371  9.527409  9.807145 16.222916     5 a
#   fread  9.652377  9.696135  9.796438  9.712686  9.807830 10.113160     5 a
# iotools  5.900362  7.591847  7.438049  7.799729  7.845727  8.052579     5 a
#     awk 14.440489 14.457329 14.637879 14.472836 14.666587 15.152156     5  b

So it appears iotools is both very fast and very consistent.
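
For standalone use (outside microbenchmark), a minimal sketch of the same iotools call; naming the columns from the dictionary written earlier is just one way you might label the result:

library(iotools)

res <- input.file(
  "testfwf.txt", formatter = dstrfw,
  col_types = rep("character", length(endpoints) - 1L),
  widths = diff(endpoints)
)
names(res) <- dict$column   # label fields using the dictionary created above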

Reading fixed width format data into R with entries exceeding column width

Edit:

Based on the updated information, the files are not fixed width for some records. In this situation, I think readr::read_table is more useful than read_fwf. The following example is a tidyverse approach to importing and processing one of the source files (tb3u2016.txt). A base approach might involve using something like readLines.
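
As a minimal base-R sketch of that readLines idea (it assumes, as in the output below, that the first line of a split record ends in a trailing comma and that its continuation is the single following line):

lines <- readLines("tb3u2016.txt")
lines <- lines[-(1:11)]                                   # drop the report header
split_at <- grep(",[[:space:]]*$", lines)                 # first halves of split records
lines[split_at] <- paste(lines[split_at], lines[split_at + 1])
if (length(split_at) > 0) lines <- lines[-(split_at + 1)] # drop the merged continuations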

Step 1: Read the file in and assign the split records a common record id


library(tidyverse)
df <- read_table("tb3u2016.txt", col_names = FALSE, skip = 11) %>%
  rownames_to_column() %>%
  mutate(record = if_else(lag(is.na(X2) & rowname > 1), lag(rowname), rowname))

df[37:40, ]
#> # A tibble: 4 x 8
#> rowname X1 X2
#> <chr> <chr> <int>
#> 1 37 999 13900 Bismarck, ND 856 629
#> 2 38 999 13980 Blacksburg-Christiansburg-Radford, NA
#> 3 39 VA 543 455
#> 4 40 145 14010 Bloomington, IL 342 214
#> # ... with 5 more variables: X3 <int>, X4 <int>, X5 <int>, X6 <int>,
#> # record <chr>

Step 2: Combine the split record text, then put the contents into separate variables using tidyr::extract. Trim the whitespace and remove the redundant records.


df <- df %>%
  mutate(new_X1 = if_else(rowname != record, paste0(lag(X1), X1), X1)) %>%
  extract(new_X1, c("CSA", "CBSA", "Name", "Total"), "([0-9]+) ([0-9]+) (.+) ([0-9]+)") %>%
  mutate(Name = trimws(Name)) %>%
  filter((lead(record) != record) | rowname == 1) %>%
  select(CSA, CBSA, Name, Total, X2, X3, X4, X5, X6)

df[37:39, ]
#> # A tibble: 3 x 9
#> CSA CBSA Name Total X2 X3 X4
#> <chr> <chr> <chr> <chr> <int> <int> <int>
#> 1 999 13900 Bismarck, ND 856 629 16 6
#> 2 999 13980 Blacksburg-Christiansburg-Radford,VA 543 455 0 4
#> 3 145 14010 Bloomington, IL 342 214 4 0
#> # ... with 2 more variables: X5 <int>, X6 <int>

Below is a condensed version of the solution provided to an earlier version of the question using readr::read_fwf.

Example data

library(readr)

# example data
txt <- " Num of
Struc-
tures
With
3 and 4 5 Units 5 Units
CSA CBSA Name Total 1 Unit 2 Units Units or more or more

999 10180 Abilene, TX 306 298 8 0 0 0
184 10420 Akron, OH 909 905 0 4 0 0"

write_file(txt, "example.txt")

Solution

col_widths <- c(3, 1, 5, 1, 36, 8, 8, 8, 8, 8, NA)
col_names <- c("CSA", "blank_1", "CBSA", "blank_2", "Name", "Total", "units_1", "units_2",
"units_3_and_4", "units_5_or_more", "num_struc_5_or_more")
df <- read_fwf("example.txt", fwf_widths(col_widths, col_names), skip = 7)
df
#> # A tibble: 2 x 11
#> CSA blank_1 CBSA blank_2 Name Total units_1 units_2
#> <int> <chr> <int> <chr> <chr> <int> <int> <int>
#> 1 999 <NA> 10180 <NA> Abilene, TX 306 298 8
#> 2 184 <NA> 10420 <NA> Akron, OH 909 905 0
#> # ... with 3 more variables: units_3_and_4 <int>, units_5_or_more <int>,
#> # num_struc_5_or_more <int>

How to read in a specific column from a large fixed-width file in R

You can use a connection and process the file in blocks:

Replicate your data:

dat <-"10010100100002000000
00010010000001000000
10010000001002000000"

Process in blocks using a connection:

# Define a connection
con = textConnection(dat)

# Do the block update
linesPerUpdate <- 2
result <- character()
repeat {
  line <- readLines(con, linesPerUpdate)
  result <- c(result, substr(line, start = 14, stop = 14))
  if (length(line) < linesPerUpdate) break
}

# Close the connection
close(con)

The result:

result
[1] "2" "1" "2"

Read in txt file with fixed width columns

Use read_fwf instead of read_csv.

[read_fwf reads] a table of fixed-width formatted lines into DataFrame.

https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html

import pandas as pd

colspecs = (
    (0, 44),
    (46, 47),
    (48, 49),
    (50, 51),
    (52, 53),
    (54, 55),
    (56, 57),
    (58, 59),
    (60, 66),
    (67, 73),
    (74, 77),
    (78, 80),
    (81, 84),
    (85, 87),
    (88, 90),
    (91, 95),
    (96, 99),
    (100, 103),
    (104, 106),
)
data_url = "http://jse.amstat.org/datasets/04cars.dat.txt"

df = pd.read_fwf(data_url, colspecs=colspecs)
df.columns = (
    "Vehicle Name",
    "Is Sports Car",
    "Is SUV",
    "Is Wagon",
    "Is Minivan",
    "Is Pickup",
    "Is All-Wheel Drive",
    "Is Rear-Wheel Drive",
    "Suggested Retail Price",
    "Dealer Cost",
    "Engine Size (litres)",
    "Number of Cylinders",
    "Horsepower",
    "City Miles Per Gallon",
    "Highway Miles Per Gallon",
    "Weight (pounds)",
    "Wheel Base (inches)",
    "Length (inches)",
    "Width (inches)",
)

And the output for print(df) would be:

                        Vehicle Name  ...  Width (inches)
0 Chevrolet Aveo LS 4dr hatch ... 66
1 Chevrolet Cavalier 2dr ... 69
2 Chevrolet Cavalier 4dr ... 68
3 Chevrolet Cavalier LS 2dr ... 69
4 Dodge Neon SE 4dr ... 67
.. ... ... ...
422 Nissan Titan King Cab XE ... *
423 Subaru Baja ... *
424 Toyota Tacoma ... *
425 Toyota Tundra Regular Cab V6 ... *
426 Toyota Tundra Access Cab V6 SR5 ... *

[427 rows x 19 columns]

Column names and specifications retrieved from here:

  • http://jse.amstat.org/datasets/04cars.txt

Note: Don't forget to specify where each column starts and ends. Without colspecs, pandas makes an assumption based on the first rows, which leads to data errors; comparing the CSV files generated with and without colspecs (for example with a unified diff) makes the differences clear.


Reading very large fixed(ish) width format txt files from SQL Server Export into R data.tables or likewise

For this particular file:

form <- read.table("SEARCHBASIS_format.txt", as.is = TRUE, skip = 2)
x <- read.table("SEARCHBASIS.txt", col.names = form$V7, as.is = TRUE)

If you sometimes have strings including spaces you'll almost certainly need to process the file externally first.

If you're planning to read really large files I'd suggest (presuming you have awk on your path):

x <- setNames(data.table::fread("awk '{$1=$1}1' SEARCHBASIS.txt"), form$V7)

If you want to use fixed widths you could use:

x <- setNames(fread("gawk 'BEGIN {OFS = \"\t\"; FIELDWIDTHS = \"12 12 12 12 12\"} {for (i = 1; i<= NF; i++) {gsub(/ +$/, \"\", $i);}}1' SEARCHBASIS.txt"), form$V7)

You can also pull the widths from the format file:

x <- setNames(fread(paste0("gawk 'BEGIN {OFS = \"\t\"; FIELDWIDTHS = \"", paste(form$V4, collapse = " "), "\"} {for (i = 1; i<= NF; i++) {gsub(/ +$/, \"\", $i);}}1' SEARCHBASIS.txt")), form$V7)

Note $1=$1 forces awk to reevaluate the fields and the 1 at the end is effectively shorthand for print. I've also assumed you want to strip trailing spaces from each field.

On Windows you'll need to use single quotes as the R string delimiters, replace the single quotes around the awk program with double quotes, and double any double quotes nested inside the program. So the last one above becomes:

x <- setNames(fread(paste0('gawk \"BEGIN {OFS = ""\t""; FIELDWIDTHS = ""', paste(form$V4, collapse = " "), '""} {for (i = 1; i<= NF; i++) {gsub(/ +$/, """", $i);}}1" SEARCHBASIS.txt')), form$V7)

For a cross-platform solution, you'll need to put your awk script in an external file:

stripSpace.awk

BEGIN {OFS="\t"} {for (i = 1; i<= NF; i++) {gsub(/ +$/, "", $i);}}1

R code

x <- setNames(fread(paste0('gawk -v FIELDWIDTHS="', paste(form$V4, collapse = " "), '" -f stripSpace.awk SEARCHBASIS.txt')), form$V7)

Tested on Scientific Linux 6 and Windows 8.1

Read Data from a file with Fixed Width not at same places using Pandas

You can let read_fwf make its default guess at the columns and then parse the packed data column yourself:

df = pd.read_fwf('a.csv', names=["Time", "Header", "Src->Dest", "C1"])
(df.drop('C1', axis=1)
   .join(df['C1'].str.extractall('(.{2})')[0].unstack('match'))
)

Output:

            Time    Header  Src->Dest   0   1   2   3   4   5   6   7   8   9   10  11
0 20:03:42.769 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
1 20:03:42.769 0x11 0x00->0x00 B2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 20:03:42.815 0xAA 0x98->0x99 52 03 00 1C 0C 1C 0C 0E 0E NaN NaN NaN
3 20:03:42.831 0x11 0x00->0x00 EE NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 20:03:50.726 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
5 20:03:50.819 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
6 20:03:50.895 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
7 20:03:54.841 0xAA 0x98->0x99 12 02 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 20:03:54.857 0x11 0x00->0x00 E0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 20:03:54.935 0xAA 0x99->0x98 12 03 00 41 01 02 00 0C 07 0A 00 NaN
10 20:03:54.935 0x11 0x00->0x00 AB NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 20:03:55.013 0xAA 0x99->0x98 12 03 01 05 00 00 05 0C 00 00 03 NaN

