Read fixed width text file
This is a fixed-width file. Use read.fwf() to read it:
x <- read.fwf(
  file=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
  skip=4,
  widths=c(12, 7, 4, 9, 4, 9, 4, 9, 4))
head(x)
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 03JAN1990 23.4 -0.4 25.1 -0.3 26.6 0.0 28.6 0.3
2 10JAN1990 23.4 -0.8 25.2 -0.3 26.6 0.1 28.6 0.3
3 17JAN1990 24.2 -0.3 25.3 -0.3 26.5 -0.1 28.6 0.3
4 24JAN1990 24.4 -0.5 25.5 -0.4 26.5 -0.1 28.4 0.2
5 31JAN1990 25.1 -0.2 25.8 -0.2 26.7 0.1 28.4 0.2
6 07FEB1990 25.8 0.2 26.1 -0.1 26.8 0.1 28.4 0.3
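For comparison, pandas can read the same layout with pandas.read_fwf. A minimal sketch on an inline sample that mirrors the widths above (the sample text is an approximation of the file's layout, not the real download):

```python
import io
import pandas as pd

# Inline sample mimicking the layout of the NOAA file, with the same
# widths as the R call above; the real file would be fetched from the URL.
sample = (
    "   03JAN1990   23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3\n"
    "   10JAN1990   23.4-0.8     25.2-0.3     26.6 0.1     28.6 0.3\n"
)
widths = [12, 7, 4, 9, 4, 9, 4, 9, 4]
df = pd.read_fwf(io.StringIO(sample), widths=widths, header=None)
print(df.shape)  # (2, 9)
```

Like read.fwf, the widths must cover every character in the line, including the blank runs between columns.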
Update
The package readr (released in April 2015) provides a simple and fast alternative.
library(readr)
x <- read_fwf(
  file="http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",
  skip=4,
  fwf_widths(c(12, 7, 4, 9, 4, 9, 4, 9, 4)))
Speed comparison: readr::read_fwf() was ~2x faster than utils::read.fwf().
Read in txt file with fixed width columns
Use read_fwf instead of read_csv: read_fwf reads a table of fixed-width formatted lines into a DataFrame.
https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html
import pandas as pd
colspecs = (
(0, 44),
(46, 47),
(48, 49),
(50, 51),
(52, 53),
(54, 55),
(56, 57),
(58, 59),
(60, 66),
(67, 73),
(74, 77),
(78, 80),
(81, 84),
(85, 87),
(88, 90),
(91, 95),
(96, 99),
(100, 103),
(104, 106),
)
data_url = "http://jse.amstat.org/datasets/04cars.dat.txt"
df = pd.read_fwf(data_url, colspecs=colspecs)
df.columns = (
"Vehicle Name",
"Is Sports Car",
"Is SUV",
"Is Wagon",
"Is Minivan",
"Is Pickup",
"Is All-Wheel Drive",
"Is Rear-Wheel Drive",
"Suggested Retail Price",
"Dealer Cost",
"Engine Size (litres)",
"Number of Cylinders",
"Horsepower",
"City Miles Per Gallon",
"Highway Miles Per Gallon",
"Weight (pounds)",
"Wheel Base (inches)",
"Length (inches)",
"Width (inches)",
)
And the output for print(df) would be:
Vehicle Name ... Width (inches)
0 Chevrolet Aveo LS 4dr hatch ... 66
1 Chevrolet Cavalier 2dr ... 69
2 Chevrolet Cavalier 4dr ... 68
3 Chevrolet Cavalier LS 2dr ... 69
4 Dodge Neon SE 4dr ... 67
.. ... ... ...
422 Nissan Titan King Cab XE ... *
423 Subaru Baja ... *
424 Toyota Tacoma ... *
425 Toyota Tundra Regular Cab V6 ... *
426 Toyota Tundra Access Cab V6 SR5 ... *
[427 rows x 19 columns]
Column names and specifications retrieved from here:
- http://jse.amstat.org/datasets/04cars.txt
Note: don't forget to specify where each column starts and ends. Without colspecs, pandas makes an assumption based on the first rows of data, which leads to data errors (a unified diff between the CSV files generated with and without explicit specs shows the mismatches).
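A minimal sketch of the pitfall on hypothetical inline data: when two fields touch with no intervening spaces, the default inference (colspecs='infer', based on the first infer_nrows rows) cannot find the break and yields a single column, while explicit colspecs recover both fields:

```python
import io
import pandas as pd

# Hypothetical data: two fields that touch with no space between them.
sample = "AB12\nCD34\n"

# Default inference sees no whitespace gaps, so it yields one column.
inferred = pd.read_fwf(io.StringIO(sample), header=None)

# Explicit half-open, 0-based intervals recover both fields.
explicit = pd.read_fwf(io.StringIO(sample),
                       colspecs=[(0, 2), (2, 4)], header=None)
print(inferred.shape, explicit.shape)  # (2, 1) (2, 2)
```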
pyspark parse fixed width text file
Spark's substr function can handle fixed-width columns, for example:
df = spark.read.text("/tmp/sample.txt")
df.select(
df.value.substr(1,3).alias('id'),
df.value.substr(4,8).alias('date'),
df.value.substr(12,3).alias('string'),
df.value.substr(15,4).cast('integer').alias('integer')
).show()
will result in:
+---+--------+------+-------+
| id| date|string|integer|
+---+--------+------+-------+
|001|01292017| you| 1234|
|002|01302017| me| 5678|
+---+--------+------+-------+
Once the columns are split, you can reformat them and use them as in a normal Spark DataFrame.
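Outside Spark, the same positions map to plain Python string slices, which is handy for sanity-checking the offsets. A sketch using sample lines reconstructed from the output above (note that substr(pos, len) is 1-based, while Python slices are 0-based):

```python
# substr(1, 3) corresponds to line[0:3], substr(4, 8) to line[3:11],
# substr(12, 3) to line[11:14], and substr(15, 4) to line[14:18].
lines = ["00101292017you1234", "00201302017 me5678"]

rows = [
    {
        "id": line[0:3],
        "date": line[3:11],
        "string": line[11:14].strip(),
        "integer": int(line[14:18]),
    }
    for line in lines
]
print(rows[0])  # {'id': '001', 'date': '01292017', 'string': 'you', 'integer': 1234}
```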
Reading fixed width format file in R
Here's a command that should read in the data:
dd <- read.fwf("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",
               widths=c(10, rep(c(9,4),4)), skip=4)
Note that the widths need to account for all characters in each line, so even if there are blank spaces between columns, you need to assign those to one of the columns.
Then I can't think of a super-clean way to get the headers. This works, but it's ugly and makes assumptions:
hd <- read.fwf("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",
               widths=c(10,rep(c(9,4),4)), skip=2, nrow=2, as.is=TRUE)
trim <- function(x) gsub("^\\s+|\\s+$","",x)
main <- paste0(trim(hd[1,seq(2, ncol(hd), by=2)]), trim(hd[1,seq(3, ncol(hd), by=2)]))
sub <- trim(as.vector(hd[2,]))
names(dd) <- make.names(c(sub[1],paste(rep(main, each=2), sub[-1])))
and finally, you can make a proper date value with
dd$Week <- as.Date(as.character(dd$Week), "%d%b%Y")
You shouldn't be using the sep= parameter at all. What read.fwf actually does is rewrite the fixed-width file as a delimited file, using sep as the delimiter, and then read that delimited file with the more standard read.table(). The default value of sep="\t" is usually safe, as you generally do not have tabs in your actual data.
Python conditionally read fixed width text file and create DataFrame
Here you go.
import pandas as pd
import numpy as np
def define_empty_dict():
TYPE___A = {'ST': None, 'COUNT': None}
TYPE___B = {'KEY': None}
TYPE___C = {'C_INFO1': None, 'C_INFO2': None, 'C_INFO3': None, 'C_INFO4': None}
TYPE___D = {'DOB': None, 'GENDER': None}
TYPE___E = {'E_INFO': None}
TYPE___F = {'F_INFO1': None, 'F_INFO2': None, 'F_INFO3': None, 'F_INFO4': None, 'F_INFO5': None, 'F_INFO6': None, 'F_INFO7': None, 'F_INFO8': None, 'F_INFO9': None}
TYPE___G = {'G_INFO': None}
TYPE___J = {'J_INFO': None}
TYPE___K = {'K_INFO': None}
TYPE___L = {'L_INFO': None}
return TYPE___A, TYPE___B, TYPE___C, TYPE___D, TYPE___E,TYPE___F, TYPE___G, TYPE___J, TYPE___K, TYPE___L
TYPE___A, TYPE___B, TYPE___C, TYPE___D, TYPE___E,TYPE___F, TYPE___G, TYPE___J, TYPE___K, TYPE___L = define_empty_dict()
rowDict = {**TYPE___A, **TYPE___B, **TYPE___C, **TYPE___D, **TYPE___E, **TYPE___F, **TYPE___G, **TYPE___J, **TYPE___K, **TYPE___L}
output = pd.DataFrame(columns = rowDict.keys())
##
with open("test.txt", 'r') as file:
for i, line in enumerate(file):
if line[:8] == "##TSTA##":
continue
elif line[19:20] == "A":
ID_A = line[0:15]
TYPE___A['ST'] = line[20:22]
TYPE___A['COUNT'] = line[22:26]
elif (line[19:20] == "B") & (line[0:15] == ID_A):
TYPE___B['KEY'] = line[20:80]
elif (line[19:20] == "C") & (line[0:15] == ID_A):
if "number_ref_C" not in globals():
number_ref_C = int(line[15:19])
c = 1
TYPE___C[f'C_INFO{c}'] = line[20:60]
else :
c += 1
TYPE___C[f'C_INFO{c}'] = line[20:60]
elif (line[19:20] == "D") & (line[0:15] == ID_A):
TYPE___D['DOB'] = line[20:28]
TYPE___D['GENDER'] = line[35:36]
elif (line[19:20] == "E") & (line[0:15] == ID_A):
TYPE___E['E_INFO'] = line[20:39]
elif (line[19:20] == "F") & (line[0:15] == ID_A):
if "number_ref_F" not in globals():
number_ref_F = int(line[15:19])
f = 1
TYPE___F[f'F_INFO{f}'] = line[20:60]
else :
f += 1
TYPE___F[f'F_INFO{f}'] = line[20:60]
elif (line[19:20] == "G") & (line[0:15] == ID_A):
TYPE___G['G_INFO'] = line[20:39]
elif (line[19:20] == "J") & (line[0:15] == ID_A):
TYPE___J['J_INFO'] = line[20:39]
elif (line[19:20] == "K") & (line[0:15] == ID_A):
TYPE___K['K_INFO'] = line[20:39]
elif (line[19:20] == "L") & (line[0:15] == ID_A):
TYPE___L['L_INFO'] = line[20:39]
rowDict = {**TYPE___A, **TYPE___B, **TYPE___C, **TYPE___D, **TYPE___E, **TYPE___F, **TYPE___G, **TYPE___J, **TYPE___K, **TYPE___L}
tmp = pd.DataFrame([rowDict])
output = pd.concat([output, tmp])
TYPE___A, TYPE___B, TYPE___C, TYPE___D, TYPE___E,TYPE___F, TYPE___G, TYPE___J, TYPE___K, TYPE___L = define_empty_dict()
del number_ref_C, number_ref_F
elif line[:8] == "##TEND##":
break
pd.set_option('display.max_columns', None)
output
How to efficiently parse fixed width files?
Using the Python standard library's struct module would be fairly easy as well as fairly fast, since it's written in C. The code below shows how to use it. It also allows columns of characters to be skipped by specifying negative values for the number of characters in the field.
import struct
fieldwidths = (2, -10, 24)
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's') for fw in fieldwidths)
# Convert Unicode input to bytes and the result back to Unicode string.
unpack = struct.Struct(fmtstring).unpack_from # Alias.
parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))
print('fmtstring: {!r}, record size: {} chars'.format(fmtstring, struct.calcsize(fmtstring)))
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields: {}'.format(fields))
Output:
fmtstring: '2s 10x 24s', record size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
Here's a way to do it with string slices, as you were considering but were concerned might get too ugly. It is somewhat complicated, and speed-wise it's about the same as the version based on the struct module, although I have an idea about how it could be sped up (which might make the extra complexity worthwhile). See the update below on that topic.
from itertools import zip_longest
from itertools import accumulate
def make_parser(fieldwidths):
cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
pads = tuple(fw < 0 for fw in fieldwidths) # bool values for padding fields
flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1] # ignore final one
parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
# Optional informational function attributes.
parse.size = sum(abs(fw) for fw in fieldwidths)
parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
for fw in fieldwidths)
return parse
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24) # negative widths represent ignored padding fields
parse = make_parser(fieldwidths)
fields = parse(line)
print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
print('fields: {}'.format(fields))
Output:
format: '2s 10x 24s', rec size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
Update
As I suspected, there is a way of making the string-slicing version of the code faster, which in Python 2.7 makes it about the same speed as the version using struct, but in Python 3.x makes it 233% faster than the struct-based version (and than the un-optimized version of itself, which is about the same speed as the struct version).
What the version presented above does is define a lambda function that's primarily a comprehension that generates the limits of a bunch of slices at runtime.
parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
This is equivalent, once the values of i and j from the for loop are filled in, to something looking like this:
parse = lambda line: (line[0:2], line[12:36], line[36:51], ...)
However, the latter executes more than twice as fast, since the slice boundaries are all constants.
Fortunately, it is relatively easy to convert and "compile" the former into the latter using the built-in eval() function:
def make_parser(fieldwidths):
cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
pads = tuple(fw < 0 for fw in fieldwidths) # bool flags for padding fields
flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1] # ignore final one
slcs = ', '.join('line[{}:{}]'.format(i, j) for pad, i, j in flds if not pad)
parse = eval('lambda line: ({})\n'.format(slcs)) # Create and compile source code.
# Optional informational function attributes.
parse.size = sum(abs(fw) for fw in fieldwidths)
parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
for fw in fieldwidths)
return parse
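The two variants can be checked against each other side by side. A self-contained sketch of the idea, sharing the slice bookkeeping between the comprehension-based parser and the eval-compiled one:

```python
from itertools import accumulate, zip_longest

def slice_specs(fieldwidths):
    # Cumulative cut points; negative widths mark padding fields to skip.
    cuts = tuple(accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)
    return tuple(zip_longest(pads, (0,) + cuts, cuts))[:-1]

flds = slice_specs((2, -10, 24))

# Comprehension version: slice bounds are looked up at call time.
parse_loop = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)

# Compiled version: constant slice bounds baked into the generated source.
slcs = ', '.join('line[{}:{}]'.format(i, j) for pad, i, j in flds if not pad)
parse_fast = eval('lambda line: ({})'.format(slcs))

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
assert parse_loop(line) == parse_fast(line) == ('AB', 'MNOPQRSTUVWXYZ0123456789')
```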
Read fixed-width text file with varchar in pandas
You could use read_table with a regex delimiter and a converter to read the data, followed by a little postprocessing (splitting a column), for example:
import pandas
schema = {
'name': 10,
'age': 2,
'last_visit': 8,
'other_field': 5,
'comment': None,
'fav_color': 8
}
# A converter for the variable length and following columns
def converter(x):
"""Return the comment and the fav_color values separated by ','."""
length_len = 4
comment_len = int(x[:length_len])
return x[length_len:comment_len + length_len:] + ',' + x[comment_len + length_len:]
# A regex as delimiter for the fixed length columns
delimiter = f"(.{{{schema['name']}}})(.{{{schema['age']}}})(.{{{schema['last_visit']}}}).{{{schema['other_field']}}}(.*)"
# Use the delimiter and converter (column 4 holds comment and fav_color) for reading the table
data = pandas.read_table('input.txt', header=None, sep=delimiter, converters={4: converter})
# Clean the table
data.dropna(inplace=True, axis=1)
# Split the comment and the fav_color columns
data[[5, 6]] = data[4].str.split(',', n=1, expand=True)