Read Fixed Width Record from Text File

Read fixed width record from text file

Substring sounds good to me. The only downside I can immediately think of is that it means copying the data each time, but I wouldn't worry about that until you prove it's a bottleneck. Substring is simple :)

You could use a regex to match a whole record at a time and capture the fields, but I think that would be overkill.
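For comparison, here is a minimal Python sketch of both approaches. The record layout, field names, and sample line are hypothetical and chosen only for illustration. Slicing by offset and a single regex with one capture group per field should yield the same fields:

```python
import re

# Hypothetical layout: name (10 chars), age (3 chars), city (8 chars)
FIELDS = [("name", 0, 10), ("age", 10, 13), ("city", 13, 21)]

def parse_by_slicing(line):
    """Cut each field out of the line by its start/end offsets."""
    return {name: line[start:end].strip() for name, start, end in FIELDS}

# The same layout expressed as one regex with a named group per field
RECORD_RE = re.compile(r"(?P<name>.{10})(?P<age>.{3})(?P<city>.{8})")

def parse_by_regex(line):
    """Match the whole record at once and capture the fields."""
    match = RECORD_RE.match(line)
    if match is None:
        raise ValueError(f"line does not match the record layout: {line!r}")
    return {name: value.strip() for name, value in match.groupdict().items()}

# Build the sample from padded pieces so the offsets are easy to verify
sample = "Alice".ljust(10) + "42".rjust(3) + "Berlin".ljust(8)
print(parse_by_slicing(sample))
print(parse_by_regex(sample))
```

The slicing version keeps the layout in a plain list, which is usually easier to adjust than a regex when a field width changes.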

Read fixed width text file

This is a fixed width file. Use read.fwf() to read it:

x <- read.fwf(
  file=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
  skip=4,
  widths=c(12, 7, 4, 9, 4, 9, 4, 9, 4))

head(x)

            V1   V2   V3   V4   V5   V6   V7   V8  V9
1  03JAN1990   23.4 -0.4 25.1 -0.3 26.6  0.0 28.6 0.3
2  10JAN1990   23.4 -0.8 25.2 -0.3 26.6  0.1 28.6 0.3
3  17JAN1990   24.2 -0.3 25.3 -0.3 26.5 -0.1 28.6 0.3
4  24JAN1990   24.4 -0.5 25.5 -0.4 26.5 -0.1 28.4 0.2
5  31JAN1990   25.1 -0.2 25.8 -0.2 26.7  0.1 28.4 0.2
6  07FEB1990   25.8  0.2 26.1 -0.1 26.8  0.1 28.4 0.3

Update

The readr package (released in April 2015) provides a simple and fast alternative.

library(readr)

x <- read_fwf(
  file="http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",   
  skip=4,
  fwf_widths(c(12, 7, 4, 9, 4, 9, 4, 9, 4)))

Speed comparison: readr::read_fwf() was ~2x faster than utils::read.fwf().
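The widths vector is simply a list of consecutive field lengths. As a rough illustration in Python (not part of the R solution; the helper name is made up), the running sum of the widths gives the start and end offset of each column:

```python
from itertools import accumulate

def widths_to_colspecs(widths):
    """Turn consecutive field widths into (start, end) offsets, end exclusive."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))

# Same widths as in the read.fwf() call above
widths = [12, 7, 4, 9, 4, 9, 4, 9, 4]
colspecs = widths_to_colspecs(widths)
print(colspecs[:3])  # [(0, 12), (12, 19), (19, 23)]
```

Offsets in this form are handy when a tool expects column positions rather than widths.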

Read in txt file with fixed width columns

Use read_fwf instead of read_csv.

[read_fwf reads] a table of fixed-width formatted lines into DataFrame.

https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html

import pandas as pd

colspecs = (
    (0, 44),
    (46, 47),
    (48, 49),
    (50, 51),
    (52, 53),
    (54, 55),
    (56, 57),
    (58, 59),
    (60, 66),
    (67, 73),
    (74, 77),
    (78, 80),
    (81, 84),
    (85, 87),
    (88, 90),
    (91, 95),
    (96, 99),
    (100, 103),
    (104, 106),
)

data_url = "http://jse.amstat.org/datasets/04cars.dat.txt"

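# Note: read_fwf treats the first line of the file as a header row by default;
# pass header=None if the file has no header line.
df = pd.read_fwf(data_url, colspecs=colspecs)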

df.columns = (
    "Vehicle Name",
    "Is Sports Car",
    "Is SUV",
    "Is Wagon",
    "Is Minivan",
    "Is Pickup",
    "Is All-Wheel Drive",
    "Is Rear-Wheel Drive",
    "Suggested Retail Price",
    "Dealer Cost",
    "Engine Size (litres)",
    "Number of Cylinders",
    "Horsepower",
    "City Miles Per Gallon",
    "Highway Miles Per Gallon",
    "Weight (pounds)",
    "Wheel Base (inches)",
    "Length (inches)",
    "Width (inches)",
)

And the output for print(df) would be:

                        Vehicle Name  ...  Width (inches)
0        Chevrolet Aveo LS 4dr hatch  ...              66
1             Chevrolet Cavalier 2dr  ...              69
2             Chevrolet Cavalier 4dr  ...              68
3          Chevrolet Cavalier LS 2dr  ...              69
4                  Dodge Neon SE 4dr  ...              67
..                               ...  ...             ...
422         Nissan Titan King Cab XE  ...               *
423                      Subaru Baja  ...               *
424                    Toyota Tacoma  ...               *
425     Toyota Tundra Regular Cab V6  ...               *
426  Toyota Tundra Access Cab V6 SR5  ...               *

[427 rows x 19 columns]

Column names and specifications retrieved from here:

http://jse.amstat.org/datasets/04cars.txt

Note: Don't forget to specify where each column starts and ends. Without colspecs, pandas infers the column boundaries from the first 100 rows of the file (see the infer_nrows parameter), which can lead to data errors. Comparing the CSV files generated with and without explicit specs in a unified diff reveals the differences.
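To make the risk concrete, here is a small sketch with made-up data (not the cars dataset). The name field contains a space in the same position on every row, so the inferred layout should differ from the explicit one:

```python
import io
import pandas as pd

# Two made-up records: a 9-character name field, one space, a 1-character flag
raw = "Aveo LS   1\nNeon SE   0\n"

# Inference splits wherever every sampled row has whitespace,
# so the space inside the name is mistaken for a column boundary
inferred = pd.read_fwf(io.StringIO(raw), header=None)

# Explicit half-open (start, end) offsets keep the name in one column
explicit = pd.read_fwf(io.StringIO(raw), header=None, colspecs=[(0, 9), (10, 11)])

print(inferred.shape[1], explicit.shape[1])
print(explicit)
```

With real data the effect is often subtler, for example a value shifted into a neighbouring column on only a few rows, which is why explicit offsets are the safer choice.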

F# Read Fixed Width Text File

The hardest part is probably splitting a single line according to the column format. It can be done with something like this:

let splitLine format (line : string) =
    format |> List.map (fun (index, length) -> line.Substring(index, length))

This function has the type (int * int) list -> string -> string list. In other words, format is an (int * int) list. This corresponds exactly to your format list. The line argument is a string, and the function returns a string list.

You can map a list of lines like this:

let result = lines |> List.map (splitLine format)

You can also use Seq.map or Array.map, depending on how lines is defined. Such a result will be a string list list, and you can now map over such a list to produce a MyRecord list.

You can use File.ReadLines to get a lazily evaluated sequence of strings from a file.

Please note that the above is only an outline of a possible solution. I left out boundary checks, error handling, and such. The above code may contain off-by-one errors.
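For readers who want to see the same idea with the boundary check filled in, here is a rough Python counterpart of splitLine. The format and sample lines are hypothetical, and the error handling is only one possible choice:

```python
from typing import Iterable, Iterator, List, Tuple

Format = List[Tuple[int, int]]  # (index, length) pairs, as in the F# code

def split_line(fmt: Format, line: str) -> List[str]:
    """Cut one line into fields; raise if the line is too short for the format."""
    needed = max(index + length for index, length in fmt)
    if needed > len(line):
        raise ValueError(f"expected at least {needed} characters, got {len(line)}")
    return [line[index:index + length] for index, length in fmt]

def split_lines(fmt: Format, lines: Iterable[str]) -> Iterator[List[str]]:
    """Lazily split every line, dropping the trailing newline first."""
    for line in lines:
        yield split_line(fmt, line.rstrip("\n"))

# Hypothetical format: a 5-character code followed by a 3-character quantity
fmt = [(0, 5), (5, 3)]
rows = list(split_lines(fmt, ["AB123 42\n", "CD456  7\n"]))
print(rows)
```

Because split_lines is a generator, it can be fed an open file object directly and will read it one line at a time, much like File.ReadLines.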

Python conditionally read fixed width text file and create DataFrame

Here you go.

import pandas as pd

def define_empty_dict():
    TYPE___A = {'ST': None, 'COUNT': None}

    TYPE___B = {'KEY': None}

    TYPE___C = {'C_INFO1': None, 'C_INFO2': None, 'C_INFO3': None, 'C_INFO4': None}

    TYPE___D = {'DOB': None, 'GENDER': None}

    TYPE___E = {'E_INFO': None}

    TYPE___F = {'F_INFO1': None, 'F_INFO2': None, 'F_INFO3': None, 'F_INFO4': None, 'F_INFO5': None, 'F_INFO6': None, 'F_INFO7': None, 'F_INFO8': None, 'F_INFO9': None}

    TYPE___G = {'G_INFO': None}

    TYPE___J = {'J_INFO': None}

    TYPE___K = {'K_INFO': None}

    TYPE___L = {'L_INFO': None}
    return TYPE___A, TYPE___B, TYPE___C, TYPE___D, TYPE___E,TYPE___F, TYPE___G, TYPE___J, TYPE___K, TYPE___L

TYPE___A, TYPE___B, TYPE___C, TYPE___D, TYPE___E,TYPE___F, TYPE___G, TYPE___J, TYPE___K, TYPE___L = define_empty_dict()
rowDict = {**TYPE___A, **TYPE___B, **TYPE___C, **TYPE___D, **TYPE___E, **TYPE___F, **TYPE___G, **TYPE___J, **TYPE___K, **TYPE___L}
output = pd.DataFrame(columns = rowDict.keys())

##

# Counters for the repeated C and F lines, and the ID of the record being read
c = 0
f = 0
ID_A = None

with open("test.txt", 'r') as file:

    for line in file:
        record_id = line[0:15]
        record_type = line[19:20]

        if line[:8] == "##TSTA##":
            # Header line: nothing to parse
            continue
        elif line[:8] == "##TEND##":
            # Trailer line: stop reading
            break
        elif record_type == "A":
            # An A line starts a new record
            ID_A = record_id
            TYPE___A['ST'] = line[20:22]
            TYPE___A['COUNT'] = line[22:26]
        elif record_id != ID_A:
            # Skip lines that do not belong to the current record
            continue
        elif record_type == "B":
            TYPE___B['KEY'] = line[20:80]
        elif record_type == "C":
            c += 1
            TYPE___C[f'C_INFO{c}'] = line[20:60]
        elif record_type == "D":
            TYPE___D['DOB'] = line[20:28]
            TYPE___D['GENDER'] = line[35:36]
        elif record_type == "E":
            TYPE___E['E_INFO'] = line[20:39]
        elif record_type == "F":
            f += 1
            TYPE___F[f'F_INFO{f}'] = line[20:60]
        elif record_type == "G":
            TYPE___G['G_INFO'] = line[20:39]
        elif record_type == "J":
            TYPE___J['J_INFO'] = line[20:39]
        elif record_type == "K":
            TYPE___K['K_INFO'] = line[20:39]
        elif record_type == "L":
            # An L line closes the record: emit one row and reset the state
            TYPE___L['L_INFO'] = line[20:39]
            rowDict = {**TYPE___A, **TYPE___B, **TYPE___C, **TYPE___D, **TYPE___E, **TYPE___F, **TYPE___G, **TYPE___J, **TYPE___K, **TYPE___L}
            tmp = pd.DataFrame([rowDict])
            output = pd.concat([output, tmp], ignore_index=True)
            TYPE___A, TYPE___B, TYPE___C, TYPE___D, TYPE___E,TYPE___F, TYPE___G, TYPE___J, TYPE___K, TYPE___L = define_empty_dict()
            c = 0
            f = 0

pd.set_option('display.max_columns', None)
output
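If the chain of elif branches becomes hard to maintain, the layout can be kept in a lookup table instead. The following is only a sketch under assumptions: it covers a hypothetical subset of the record types (A, B, D), starts a record at each A line instead of emitting it at the L line, and uses made-up sample lines built by a helper that is not part of the original code.

```python
# Hypothetical subset of the layout: record type -> {field: (start, end)}
LAYOUT = {
    "A": {"ST": (20, 22), "COUNT": (22, 26)},
    "B": {"KEY": (20, 80)},
    "D": {"DOB": (20, 28), "GENDER": (35, 36)},
}

def parse_records(lines):
    """Collect the fields of the lines that share the ID of the latest A line."""
    records = []
    current_id, current = None, None
    for line in lines:
        if line.startswith("##TSTA##"):
            continue  # header line
        if line.startswith("##TEND##"):
            break  # trailer line
        record_id, record_type = line[0:15], line[19:20]
        if record_type == "A":
            # An A line opens a new record
            current_id, current = record_id, {}
            records.append(current)
        elif record_id != current_id or record_type not in LAYOUT:
            continue  # unknown type, or the line belongs to no open record
        for field, (start, end) in LAYOUT[record_type].items():
            current[field] = line[start:end].strip()
    return records

def make_line(record_id, seq, record_type, payload):
    """Build a sample line: 15-char ID, 4-digit sequence, 1-char type, payload."""
    return record_id.ljust(15) + str(seq).zfill(4) + record_type + payload

sample_lines = [
    "##TSTA##",
    make_line("ID001", 1, "A", "NY0003"),
    make_line("ID001", 1, "B", "some key"),
    make_line("ID001", 1, "D", "19800101" + " " * 7 + "F"),
    make_line("ID002", 1, "A", "CA0001"),
    "##TEND##",
]
records = parse_records(sample_lines)
print(records)
```

The resulting list of dictionaries can be passed to pd.DataFrame(records) to get one row per record, with missing fields left empty.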

Read fixed-width text file with varchar in pandas

You could use read_table with a regex delimiter and a converter to read the data, followed by a little postprocessing (splitting a column), for example:

import pandas

schema = {
    'name': 10,
    'age': 2,
    'last_visit': 8,
    'other_field': 5,
    'comment': None,
    'fav_color': 8
}

# A converter for the variable length and following columns
def converter(x):
    """Return the comment and the fav_color values separated by ','."""
    length_len = 4
    comment_len = int(x[:length_len])
    return x[length_len:comment_len + length_len:] + ',' + x[comment_len + length_len:]

# A regex as delimiter for the fixed length columns
delimiter = f"(.{{{schema['name']}}})(.{{{schema['age']}}})(.{{{schema['last_visit']}}}).{{{schema['other_field']}}}(.*)"
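# Use the delimiter and converter (column 4 holds comment and fav_color) for reading the table.
# A regex separator requires the Python parsing engine, so request it explicitly.
data = pandas.read_table('input.txt', header=None, sep=delimiter, engine='python', converters={4: converter})
# Clean the table
data.dropna(inplace=True, axis=1)
# Split the comment and the fav_color columns at the first comma
# (this assumes the comment itself contains no comma)
data[[5, 6]] = data[4].str.split(',', n=1, expand=True)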
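If the comment may itself contain commas, it can be simpler to parse each line by hand. Here is a sketch in plain Python, assuming the widths from the schema above and the same 4-digit length prefix in front of the comment; the sample record is invented.

```python
def parse_record(line):
    """Parse the fixed-width fields, the length-prefixed comment, then fav_color."""
    # Fixed part, using the widths from the schema: 10 + 2 + 8 + 5 characters
    name = line[0:10].strip()
    age = int(line[10:12])
    last_visit = line[12:20]
    other_field = line[20:25]
    # Variable part: a 4-digit length, then that many characters of comment
    comment_len = int(line[25:29])
    comment_end = 29 + comment_len
    comment = line[29:comment_end]
    # The last field starts right after the comment and is 8 characters wide
    fav_color = line[comment_end:comment_end + 8].strip()
    return {
        "name": name,
        "age": age,
        "last_visit": last_visit,
        "other_field": other_field,
        "comment": comment,
        "fav_color": fav_color,
    }

# Invented sample record; the comment deliberately contains a comma
comment_text = "likes tea, not coffee"
sample = (
    "John".ljust(10) + "35" + "20200101" + "xxxxx"
    + str(len(comment_text)).zfill(4) + comment_text + "blue".ljust(8)
)
record = parse_record(sample)
print(record)
```

A list of such dictionaries, one per line, can then be turned into a DataFrame with pandas.DataFrame(list_of_records).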

Read multi-row fixed width records from textfile

Using the FileHelpers library, your example could be parsed as follows:

Declare a class to represent your objects:

[IgnoreFirst(2)]
[FixedLengthRecord(FixedMode.ExactLength)]
public sealed class Record
{
    [FieldTrim(TrimMode.Right)]
    [FieldFixedLength(6)]
    public String Header1;

    [FieldFixedLength(3)]
    public String Data1;

    [FieldInNewLine()]
    [FieldTrim(TrimMode.Right)]
    [FieldFixedLength(6)]
    public String Header2;

    [FieldFixedLength(3)]
    public String Data2;

    [FieldInNewLine()]
    [FieldTrim(TrimMode.Right)]
    [FieldFixedLength(6)]
    public String Header3;

    [FieldFixedLength(3)]
    public String Data3;
}

Load the data from the file like so:

FileHelperEngine<Record> engine = new FileHelperEngine<Record>();

engine.ErrorManager.ErrorMode = ErrorMode.SaveAndContinue;

DataTable records = engine.ReadFileAsDT(@"myTextFile.txt");

if (engine.ErrorManager.ErrorCount > 0)
    engine.ErrorManager.SaveErrors("Errors.txt");
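Without a library, the same grouping can be sketched by hand. This Python version assumes the layout implied by the class above: two lines to skip, then three lines per record, each holding a 6-character header and a 3-character data field. The sample lines are invented, and an incomplete trailing group is silently dropped.

```python
from itertools import islice

def read_multirow_records(lines, skip=2, rows_per_record=3):
    """Group consecutive lines into records of (header, data) pairs."""
    body = [line.rstrip("\n") for line in islice(lines, skip, None)]
    records = []
    # Step through the body one full record at a time
    for i in range(0, len(body) - rows_per_record + 1, rows_per_record):
        chunk = body[i:i + rows_per_record]
        # Each line holds a 6-character header (right-trimmed) and 3 characters of data
        records.append([(row[0:6].rstrip(), row[6:9]) for row in chunk])
    return records

sample_lines = [
    "title\n",
    "-----\n",
    "H1    001\n",
    "H2    002\n",
    "H3    003\n",
    "H1    004\n",
    "H2    005\n",
    "H3    006\n",
]
records = read_multirow_records(sample_lines)
print(records)
```

An open file object can be passed in place of the list, since islice works on any iterable of lines.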
