How to Efficiently Parse Fixed Width Files

How to efficiently parse fixed width files?

Using the Python standard library's struct module would be fairly easy as well as fairly fast, since it's written in C. The code below shows how to use it. It also allows columns of characters to be skipped by specifying negative values for the number of characters in the field.

import struct

fieldwidths = (2, -10, 24)
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's') for fw in fieldwidths)

# Convert Unicode input to bytes and the result back to Unicode string.
unpack = struct.Struct(fmtstring).unpack_from # Alias.
parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))

print('fmtstring: {!r}, record size: {} chars'.format(fmtstring, struct.calcsize(fmtstring)))

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields: {}'.format(fields))

Output:

fmtstring: '2s 10x 24s', record size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
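
To run the parser over a whole file, a minimal sketch (not part of the original answer) might look like the following; 'records.txt' is a hypothetical file whose lines are all at least 36 bytes wide. Note that struct counts bytes, so this only works as-is for single-byte encodings (the Unicode caveat is covered in a later answer on this page).

with open('records.txt') as f:
    for line in f:
        # unpack_from only requires each line to be at least 36 bytes long
        print(parse(line))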

Here's a way to do it with string slices, as you were considering but were concerned might get too ugly. It is kind of complicated, and speed-wise it's about the same as the version based on the struct module, although I have an idea about how it could be sped up (which might make the extra complexity worthwhile). See the update below on that topic.

from itertools import zip_longest
from itertools import accumulate

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)  # bool values for padding fields
    flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
    # Optional informational function attributes.
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                               for fw in fieldwidths)
    return parse

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24) # negative widths represent ignored padding fields
parse = make_parser(fieldwidths)
fields = parse(line)
print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
print('fields: {}'.format(fields))

Output:

format: '2s 10x 24s', rec size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

Update

As I suspected, there is a way of making the string-slicing version of the code faster, which in Python 2.7 makes it about the same speed as the version using struct, but in Python 3.x makes it 233% faster (than both the struct version and the un-optimized version of itself, which is about the same speed as the struct version).

What the version presented above does is define a lambda function that's primarily a comprehension that generates the limits of a bunch of slices at runtime.

parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)

Which is equivalent, depending on the values of i and j in the for loop, to a statement looking like this:

parse = lambda line: (line[0:2], line[12:36], line[36:51], ...)

However, the latter executes more than twice as fast, since the slice boundaries are all constants.

Fortunately, it's relatively easy to convert and "compile" the former into the latter using the built-in eval() function:

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)  # bool flags for padding fields
    flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    slcs = ', '.join('line[{}:{}]'.format(i, j) for pad, i, j in flds if not pad)
    parse = eval('lambda line: ({})\n'.format(slcs))  # Create and compile source code.
    # Optional informational function attributes.
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                               for fw in fieldwidths)
    return parse
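
For a rough sense of the difference, here is a small timeit sketch of my own (not part of the original answer); it assumes the eval-based make_parser above is in scope and reuses the sample line and field widths from earlier:

import struct
import timeit

fieldwidths = (2, -10, 24)
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'

slice_parse = make_parser(fieldwidths)  # the "compiled" slice version

# struct-based parser, built the same way as in the first snippet
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                     for fw in fieldwidths)
unpack = struct.Struct(fmtstring).unpack_from
struct_parse = lambda ln: tuple(s.decode() for s in unpack(ln.encode()))

print('slices:', timeit.timeit(lambda: slice_parse(line), number=100000))
print('struct:', timeit.timeit(lambda: struct_parse(line), number=100000))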

Parse fixed-width files

As user604939 mentions, unpack is the tool to use for fixed width fields. However, unpack needs to be passed a template to work with. Since you say your fields can change width, the solution is to build this template from the first line of your file:

my @template = map {'A'.length}          # convert each to 'A##'
               <DATA> =~ /(\S+\s*)/g;    # split first line into segments
$template[-1] = 'A*';                    # set the last segment to be slurpy

my $template = "@template";
print "template: $template\n";

my @data;
while (<DATA>) {
    push @data, [unpack $template, $_];
}

use Data::Dumper;

print Dumper \@data;

__DATA__
<c>     <c>       <c>
Dave    Thomas    123 Main
Dan     Anderson  456 Center
Wilma   Rainbow   789 Street

which prints:


template: A8 A10 A*
$VAR1 = [
          [
            'Dave',
            'Thomas',
            '123 Main'
          ],
          [
            'Dan',
            'Anderson',
            '456 Center'
          ],
          [
            'Wilma',
            'Rainbow',
            '789 Street'
          ]
        ];

Is it possible to parse a fixed width file without knowing the width of each column or the column names and converting it to a CSV?

No !!!

Only the very simplest fixed-width files can be parsed this way.
Fixed-width files can

  • have multiple record layouts
  • contain binary fields
  • be Cobol files
  • need the field definitions to be interpreted correctly; for example,
    decimal points may be assumed, i.e. 12345 could mean 123.45, 1.2345, etc.
  • have text fields that are normally left-justified

For fixed-width files you need a file description (schema).

Cobol File

A common source of fixed-width files is Cobol applications.
Cobol fixed-width files

  • never have column headings
  • generally have no space between fields
  • can have binary fields
  • have assumed decimal points (see the sketch after this list)
  • may use zoned decimal fields
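
To make the assumed-decimal-point item concrete, here is a toy sketch (my own illustration, not tied to any particular copybook); the same raw digits produce different values depending on how many implied decimal places the schema declares. Zoned-decimal sign handling is more involved and encoding-specific, so it is not shown here.

from decimal import Decimal

def assumed_decimal(raw, places):
    # Interpret a digits-only field with an implied decimal point,
    # e.g. '0012345' with 2 implied places means 123.45.
    return Decimal(raw).scaleb(-places)

raw = "0012345"
print(assumed_decimal(raw, 2))  # 123.45
print(assumed_decimal(raw, 4))  # 1.2345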

Have a look at the file in this question:

  • Is this possible to convert EBCDIC Comp-3 file to ASCII file values using java?

Software

  • Microsoft Excel / Access and most spreadsheets have fixed-width import wizards
  • RecordEditor/Recsveditor have wizards for fixed-width files and can edit fixed-width files

Faster way to read fixed-width files

Now that there are (between this and the other major question about effective reading of fixed-width files) a fair number of options on offer for reading such files, I think some benchmarking is appropriate.

I'll use the following on-the-large-side (400 MB) file for comparison. It's just a bunch of random characters with randomly defined fields and widths:

set.seed(21394)
wwidth = 400L
rrows = 1000000

# creating the contents at random
contents = write.table(
  replicate(
    rrows,
    paste0(sample(letters, wwidth, replace = TRUE), collapse = "")
  ),
  file = "testfwf.txt",
  quote = FALSE, row.names = FALSE, col.names = FALSE
)

# defining the fields & writing a dictionary
n_fields = 40L
endpoints = unique(
  c(1L, sort(sample(wwidth, n_fields - 1L)), wwidth + 1L)
)
cols = list(
  beg = endpoints[-(n_fields + 1L)],
  end = endpoints[-1L] - 1L
)

dict = data.frame(
  column = paste0("V", seq_len(length(endpoints) - 1L)),
  start = endpoints[-length(endpoints)] - 1,
  length = diff(endpoints)
)

write.csv(dict, file = "testdic.csv", quote = FALSE, row.names = FALSE)

I'll compare five methods mentioned between these two threads (I'll add some others if the authors would like): the base version (read.fwf), piping the result of in2csv to fread (@AnandaMahto's suggestion), Hadley's new readr (read_fwf), the one using LaF/ffbase (@jwijffls' suggestion), and an improved (streamlined) version of the one suggested by the question author (@MarkDanese), combining fread with stri_sub from stringi.

Here is the benchmarking code:

library(data.table)
library(stringi)
library(readr)
library(LaF)
library(ffbase)
library(microbenchmark)

microbenchmark(
  times = 5L,
  utils = read.fwf("testfwf.txt", diff(endpoints), header = FALSE),
  in2csv = fread(cmd = sprintf(
    "in2csv -f fixed -s %s %s",
    "testdic.csv", "testfwf.txt"
  )),
  readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
  LaF = {
    my.data.laf = laf_open_fwf(
      'testfwf.txt',
      column_widths = diff(endpoints),
      column_types = rep("character", length(endpoints) - 1L)
    )
    my.data = laf_to_ffdf(my.data.laf, nrows = rrows)
    as.data.frame(my.data)
  },
  fread = {
    DT = fread("testfwf.txt", header = FALSE, sep = "\n")
    DT[ , lapply(seq_len(length(cols$beg)), function(ii) {
      stri_sub(V1, cols$beg[ii], cols$end[ii])
    })]
  }
)

And the output:

# Unit: seconds
#    expr       min        lq      mean    median        uq       max neval cld
#   utils 423.76786 465.39212 499.00109 501.87568 543.12382 560.84598     5   c
#  in2csv  67.74065  68.56549  69.60069  70.11774  70.18746  71.39210     5   a
#   readr  10.57945  11.32205  15.70224  14.89057  19.54617  22.17298     5   a
#     LaF 207.56267 236.39389 239.45985 237.96155 238.28316 277.09798     5   b
#   fread  14.42617  15.44693  26.09877  15.76016  20.45481  64.40581     5   a

So it seems readr and fread + stri_sub are pretty competitive as the fastest; built-in read.fwf is the clear loser.

Note that the real advantage of readr here is that you can pre-specify column types; with fread you'll have to type convert afterwards.

EDIT: Adding some alternatives

At @AnandaMahto's suggestion I am adding some more options, including one that appears to be a new winner! To save time I excluded the slowest options above in the new comparison. Here's the new code:

library(iotools)

microbenchmark(
  times = 5L,
  readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
  fread = {
    DT = fread("testfwf.txt", header = FALSE, sep = "\n")
    DT[ , lapply(seq_len(length(cols$beg)), function(ii) {
      stri_sub(V1, cols$beg[ii], cols$end[ii])
    })]
  },
  iotools = input.file(
    "testfwf.txt", formatter = dstrfw,
    col_types = rep("character", length(endpoints) - 1L),
    widths = diff(endpoints)
  ),
  awk = fread(header = FALSE, cmd = sprintf(
    "awk -v FIELDWIDTHS='%s' -v OFS=', ' '{$1=$1 \"\"; print}' < testfwf.txt",
    paste(diff(endpoints), collapse = " ")
  ))
)

And the new output:

# Unit: seconds
#    expr       min        lq      mean    median        uq       max neval cld
#   readr  7.892527  8.016857 10.293371  9.527409  9.807145 16.222916     5   a
#   fread  9.652377  9.696135  9.796438  9.712686  9.807830 10.113160     5   a
# iotools  5.900362  7.591847  7.438049  7.799729  7.845727  8.052579     5   a
#     awk 14.440489 14.457329 14.637879 14.472836 14.666587 15.152156     5   b

So it appears iotools is both very fast and very consistent.

Parsing a fixed-width formatted file in Java

I would use a flat file parser like flatworm instead of reinventing the wheel: it has a clean API, is simple to use, has decent error handling and a simple file format descriptor. Another option is jFFP but I prefer the first one.

Parse flat-file (positional text-file) to read the wavelength

I think you are better off using a regex.

Example:

import re

text = '''FI R 83.0000m 34.960 1.1262 Fe 2 1.32055m 33.626 0.0522 N
2 5754.61A 33.290 0.0241
TI R 1800.00m 33.092 0.0153 Fe 2 1.24854m 32.645 0.0054 N
2 915.612A 31.997 0.0012
NI Ra 2.85000m 36.291 24.1132 Fe 2 7637.54A 33.077 0.0147'''

find_this = re.findall(r'(Fe 2.*?[0-9].*?)\s', text)
print(find_this)

Output:

['Fe 2      1.32055m', 'Fe 2      1.24854m', 'Fe 2      7637.54A']

[Program finished]

Or, if you only want the values:

find_this = re.findall(r'Fe 2.*?([0-9].*?)\s', text)

Output:

['1.32055m', '1.24854m', '7637.54A']

[Program finished]

ANSWER TO NEW QUESTION

Here's an example of how you could pick out values between 1.35 and 1.40 using a for loop and converting each value to a float. Then we can use a condition as in this line:

if (float_value >= 1.35) and (float_value <= 1.40):
    print(value)

If it matches, it prints the untouched string, keeping the trailing letter.

Here's the full code (I shortened the text for easier reading):

import re

text = '''Fe 2 1.405A 33.077 0.0147
Fe 2 1.305A 33.077 0.0147
Fe 2 1.345A 33.077 0.0147
Fe 2 1.35A 33.077 0.0147
Fe 2 1.35623A 33.077 0.0147
Fe 2 1.40A 33.077 0.0147
Fe 2 1.37A 33.077 0.0147
Fe 2 1.41A 33.077 0.0147'''

find_this = re.findall(r'Fe 2.*?([0-9].*?)\s', text)

for value in find_this:
    del_letters = re.sub('[A-Za-z]', '', value)
    float_value = float(del_letters)
    if (float_value >= 1.35) and (float_value <= 1.40):
        print(value)

Output:

1.35A
1.35623A
1.40A
1.37A

Dealing with non-ASCII characters when parsing fixed-width txt file

The first snippet in the related answer only works for single-byte codepages because it counts bytes, not characters. It doesn't even work for UTF-16, which usually uses 2 bytes per character, and certainly not for UTF-8, which uses a variable number of bytes. That was pointed out in the comments.
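
A quick way to see the mismatch, using the same sample string as in the example below:

s = 'Zürich, Albisgütli'
print(len(s))                  # 18 characters
print(len(s.encode('utf-8')))  # 20 bytes: each 'ü' takes two bytes in UTF-8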

The same answer shows how to handle UTF8 in the third snippet:

from itertools import accumulate, zip_longest

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)  # bool flags for padding fields
    flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    slcs = ', '.join('line[{}:{}]'.format(i, j) for pad, i, j in flds if not pad)
    parse = eval('lambda line: ({})\n'.format(slcs))  # Create and compile source code.
    # Optional informational function attributes.
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                               for fw in fieldwidths)
    return parse

Using it on a string including the question's text produces the expected results. Note that the original string has 18, not 19, characters:

>>> parser = make_parser([18,3,4])
>>> line="Zürich, Albisgütli3456789"
>>> parser(line)
('Zürich, Albisgütli', '345', '6789')

And to prove this really does work with UTF8 text files:

with open('testutf8.txt', encoding="utf-8", mode='w') as f:
    for i in range(3):
        f.write(line)
        f.write('\n')

with open('testutf8.txt', encoding="utf-8", mode='r') as f1:
    for line in f1.readlines():
        print(parser(line))
-------
('Zürich, Albisgütli', '345', '6789')
('Zürich, Albisgütli', '345', '6789')
('Zürich, Albisgütli', '345', '6789')

The question's code should be split into separate functions, one to read the data and another to generate the XML output. Both operations are already available through various modules, though. There are several modules that can read fixed-width files, several XML parsing and serialization libraries, and some, like Pandas, can read multiple data formats, process the data and export it as XML.

Using Pandas

Pandas is one of the most popular, if not the most popular, modules for analytics and data science. It's a great tool for data processing too. It's part of Python distributions created for analysis, like Anaconda.

For example, with Pandas, this code could be replaced with just 2 function calls:

import pandas as pd

namespaces = {
    "xmlns": "http://www.bar.admin.ch/xmlns/siard/2/table.xsd",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
    ...
}

df = pd.read_fwf('data.csv')
df.to_xml('data.xml', root_name='table', namespaces=namespaces)

This will be faster and use less memory than the explicit string manipulations in the question's code. String manipulations create new temporary strings each time, costing in both CPU and RAM.

By default read_fwf will try to infer the column widths based on the first 100 rows. You can increase the number of rows with the infer_nrows parameter or specify a list of (from, to) tuples with the colspecs parameter, e.g. colspecs=[(1,3),(3,5),(10,14)].
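
As a rough sketch (the file name, boundaries and column names here are made up for illustration), explicit specs might be passed like this:

import pandas as pd

# Hypothetical half-open (from, to) boundaries and column names.
colspecs = [(0, 2), (2, 12), (12, 36)]
names = ['code', 'filler', 'payload']

df = pd.read_fwf('data.txt', colspecs=colspecs, names=names)
print(df.head())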

to_xml offers several parameters for controlling the XML output, like namespaces, what element name to use for the root and rows, which columns to output as attributes and which as child elements, etc. It can even write to a compressed file.

Attribute names can be specified through the attr_cols parameter, e.g.:

df.to_xml(attr_cols=[
    'index', 'shape', 'degrees', 'sides'
])

You can also rename DataFrame columns and change their type, e.g. to parse string fields into dates or numbers:

df['Timestamp'] = pd.to_datetime(df['Col3'],infer_datetime_format=True)
df=df.rename(columns={'Col1':'Bananas',...})

Using Standard Library xml modules

Even if you can't use Pandas to solve the entire problem in a couple of lines, you can use one of Python's xml processing modules. These are part of the Python Standard Library, which means they're available in all distributions.

The Building XML Documents example of the ElementTree module shows how to create XML docs programmatically:

>>> a = ET.Element('a')
>>> b = ET.SubElement(a, 'b')
>>> c = ET.SubElement(a, 'c')
>>> d = ET.SubElement(c, 'd')
>>> ET.dump(a)
<a><b /><c><d /></c></a>
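
Putting that together with the fixed-width parser shown earlier on this page, a rough sketch of the read/generate split mentioned above might look like this (the file name, field widths and element names are made up for illustration, and make_parser is the slice-based helper from above):

import xml.etree.ElementTree as ET

def read_records(path, parser):
    # Yield one tuple of fields per line of a fixed-width file.
    with open(path, encoding='utf-8') as f:
        for line in f:
            yield parser(line.rstrip('\n'))

def records_to_xml(records, field_names, root_name='table'):
    # Build an ElementTree document with one <row> element per record.
    root = ET.Element(root_name)
    for rec in records:
        row = ET.SubElement(root, 'row')
        for name, value in zip(field_names, rec):
            ET.SubElement(row, name).text = value.strip()
    return ET.ElementTree(root)

# Hypothetical usage with made-up widths, file and element names.
parser = make_parser([18, 3, 4])
tree = records_to_xml(read_records('data.txt', parser), ['name', 'code', 'id'])
tree.write('data.xml', encoding='utf-8', xml_declaration=True)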

Corporate Policies

That's not a logical argument. It may sound like one to non-technical managers with limited security experience, and certainly limited Python (or C#, Go, JavaScript) experience. Security is one of the main reasons even "standard" libraries/modules/assemblies are distributed and updated through package managers. Nobody wants (or can afford) to wait two years for security and bug fixes any more.

The Python Standard Library contains modules that started as third-party packages. That's certainly the case with ElementTree. In many cases the docs advise using external packages to handle more complex cases.

Besides, Pandas is almost certainly used in the company already. It's one of the most common, if not the most common, libraries in analytics and data science. In fact, distributions like Anaconda already include it.

Does whoever wrote that policy understand they're telling you to keep using already disclosed security vulnerabilities and consciously deploy insecure code? Because that's what you get if you don't upgrade packages. And every Python distribution comes with a lot of packages that require updating after a while.

Finally, with your hand-written code you have to reinvent and support code that's already written, tested and vetted for several years in various packages. That's time that's not spent producing valuable, billable work for the company.

A rule of thumb is that a company needs to make 2-3x your gross salary to justify the work. How much will be the cost of hand-written text reading and XML generation over the product's lifetime, when you include bug fixes and downtime incurred due to those bugs?


