Read fixed width text file
This is a fixed-width file. Use read.fwf() to read it:
x <- read.fwf(
  file=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
  skip=4,
  widths=c(12, 7, 4, 9, 4, 9, 4, 9, 4))
head(x)
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 03JAN1990 23.4 -0.4 25.1 -0.3 26.6 0.0 28.6 0.3
2 10JAN1990 23.4 -0.8 25.2 -0.3 26.6 0.1 28.6 0.3
3 17JAN1990 24.2 -0.3 25.3 -0.3 26.5 -0.1 28.6 0.3
4 24JAN1990 24.4 -0.5 25.5 -0.4 26.5 -0.1 28.4 0.2
5 31JAN1990 25.1 -0.2 25.8 -0.2 26.7 0.1 28.4 0.2
6 07FEB1990 25.8 0.2 26.1 -0.1 26.8 0.1 28.4 0.3
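For comparison, pandas can read the same layout with pandas.read_fwf. A minimal sketch on an inline sample that mirrors the widths above (the sample text is an approximation of the file's layout, not the real download):

```python
import io
import pandas as pd

# Inline sample mimicking the layout of the NOAA file, with the same
# widths as the R call above; the real file would be fetched from the URL.
sample = (
    "   03JAN1990   23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3\n"
    "   10JAN1990   23.4-0.8     25.2-0.3     26.6 0.1     28.6 0.3\n"
)
widths = [12, 7, 4, 9, 4, 9, 4, 9, 4]
df = pd.read_fwf(io.StringIO(sample), widths=widths, header=None)
print(df.shape)  # (2, 9)
```

Like read.fwf, the widths must cover every character in the line, including the blank runs between columns.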
Update
The package readr (released in April 2015) provides a simple and fast alternative.
library(readr)
x <- read_fwf(
  file="http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",
  skip=4,
  fwf_widths(c(12, 7, 4, 9, 4, 9, 4, 9, 4)))
Speed comparison: readr::read_fwf() was ~2x faster than utils::read.fwf().
Read in txt file with fixed width columns
Use read_fwf instead of read_csv: read_fwf reads a table of fixed-width formatted lines into a DataFrame.
https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html
import pandas as pd
colspecs = (
(0, 44),
(46, 47),
(48, 49),
(50, 51),
(52, 53),
(54, 55),
(56, 57),
(58, 59),
(60, 66),
(67, 73),
(74, 77),
(78, 80),
(81, 84),
(85, 87),
(88, 90),
(91, 95),
(96, 99),
(100, 103),
(104, 106),
)
data_url = "http://jse.amstat.org/datasets/04cars.dat.txt"
df = pd.read_fwf(data_url, colspecs=colspecs)
df.columns = (
"Vehicle Name",
"Is Sports Car",
"Is SUV",
"Is Wagon",
"Is Minivan",
"Is Pickup",
"Is All-Wheel Drive",
"Is Rear-Wheel Drive",
"Suggested Retail Price",
"Dealer Cost",
"Engine Size (litres)",
"Number of Cylinders",
"Horsepower",
"City Miles Per Gallon",
"Highway Miles Per Gallon",
"Weight (pounds)",
"Wheel Base (inches)",
"Length (inches)",
"Width (inches)",
)
And the output for print(df) would be:
Vehicle Name ... Width (inches)
0 Chevrolet Aveo LS 4dr hatch ... 66
1 Chevrolet Cavalier 2dr ... 69
2 Chevrolet Cavalier 4dr ... 68
3 Chevrolet Cavalier LS 2dr ... 69
4 Dodge Neon SE 4dr ... 67
.. ... ... ...
422 Nissan Titan King Cab XE ... *
423 Subaru Baja ... *
424 Toyota Tacoma ... *
425 Toyota Tundra Regular Cab V6 ... *
426 Toyota Tundra Access Cab V6 SR5 ... *
[427 rows x 19 columns]
Column names and specifications retrieved from here:
- http://jse.amstat.org/datasets/04cars.txt
Note: don't forget to specify where each column starts and ends. Without colspecs, pandas makes an assumption based on the first rows of data, which leads to data errors (a unified diff between the CSV files generated with and without explicit specs shows the mismatches).
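A minimal sketch of the pitfall on hypothetical inline data: when two fields touch with no intervening spaces, the default inference (colspecs='infer', based on the first infer_nrows rows) cannot find the break and yields a single column, while explicit colspecs recover both fields:

```python
import io
import pandas as pd

# Hypothetical data: two fields that touch with no space between them.
sample = "AB12\nCD34\n"

# Default inference sees no whitespace gaps, so it yields one column.
inferred = pd.read_fwf(io.StringIO(sample), header=None)

# Explicit half-open, 0-based intervals recover both fields.
explicit = pd.read_fwf(io.StringIO(sample),
                       colspecs=[(0, 2), (2, 4)], header=None)
print(inferred.shape, explicit.shape)  # (2, 1) (2, 2)
```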
pyspark parse fixed width text file
Spark's substr function can handle fixed-width columns, for example:
df = spark.read.text("/tmp/sample.txt")
df.select(
df.value.substr(1,3).alias('id'),
df.value.substr(4,8).alias('date'),
df.value.substr(12,3).alias('string'),
df.value.substr(15,4).cast('integer').alias('integer')
).show()
will result in:
+---+--------+------+-------+
| id| date|string|integer|
+---+--------+------+-------+
|001|01292017| you| 1234|
|002|01302017| me| 5678|
+---+--------+------+-------+
Once the columns are split, you can reformat them and use them as in a normal Spark DataFrame.
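Outside Spark, the same positions map to plain Python string slices, which is handy for sanity-checking the offsets. A sketch using sample lines reconstructed from the output above (note that substr(pos, len) is 1-based, while Python slices are 0-based):

```python
# substr(1, 3) corresponds to line[0:3], substr(4, 8) to line[3:11],
# substr(12, 3) to line[11:14], and substr(15, 4) to line[14:18].
lines = ["00101292017you1234", "00201302017 me5678"]

rows = [
    {
        "id": line[0:3],
        "date": line[3:11],
        "string": line[11:14].strip(),
        "integer": int(line[14:18]),
    }
    for line in lines
]
print(rows[0])  # {'id': '001', 'date': '01292017', 'string': 'you', 'integer': 1234}
```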
Reading fixed width format file in R
Here's a command that should read in the data:
dd <- read.fwf("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",
               widths=c(10, rep(c(9,4),4)), skip=4)
Note that the widths need to account for all characters in each line, so even if there are blank spaces between columns, you need to assign those to one of the columns.
Then I can't think of a super-clean way to get the headers. This works, but it's ugly and makes assumptions:
hd <- read.fwf("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",
               widths=c(10,rep(c(9,4),4)), skip=2, nrow=2, as.is=TRUE)
trim <- function(x) gsub("^\\s+|\\s+$","",x)
main <- paste0(trim(hd[1,seq(2, ncol(hd), by=2)]), trim(hd[1,seq(3, ncol(hd), by=2)]))
sub <- trim(as.vector(hd[2,]))
names(dd) <- make.names(c(sub[1],paste(rep(main, each=2), sub[-1])))
and finally, you can make a proper date value with
dd$Week <- as.Date(as.character(dd$Week), "%d%b%Y")
You shouldn't be using the sep= parameter at all. What read.fwf actually does is rewrite the fixed-width file as a delimited file, using sep as the delimiter, and then read that delimited file with the more standard read.table(). The default value of sep="\t" is usually safe, as you generally do not have tabs in your actual data.
Python conditionally read fixed width text file and create DataFrame
Here you go.
import pandas as pd
import numpy as np
def define_empty_dict():
TYPE___A = {'ST': None, 'COUNT': None}
TYPE___B = {'KEY': None}
TYPE___C = {'C_INFO1': None, 'C_INFO2': None, 'C_INFO3': None, 'C_INFO4': None}
TYPE___D = {'DOB': None, 'GENDER': None}
TYPE___E = {'E_INFO': None}
TYPE___F = {'F_INFO1': None, 'F_INFO2': None, 'F_INFO3': None, 'F_INFO4': None, 'F_INFO5': None, 'F_INFO6': None, 'F_INFO7': None, 'F_INFO8': None, 'F_INFO9': None}
TYPE___G = {'G_INFO': None}
TYPE___J = {'J_INFO': None}
TYPE___K = {'K_INFO': None}
TYPE___L = {'L_INFO': None}
return TYPE___A, TYPE___B, TYPE___C, TYPE___D, TYPE___E,TYPE___F, TYPE___G, TYPE___J, TYPE___K, TYPE___L
TYPE___A, TYPE___B, TYPE___C, TYPE___D, TYPE___E,TYPE___F, TYPE___G, TYPE___J, TYPE___K, TYPE___L = define_empty_dict()
rowDict = {**TYPE___A, **TYPE___B, **TYPE___C, **TYPE___D, **TYPE___E, **TYPE___F, **TYPE___G, **TYPE___J, **TYPE___K, **TYPE___L}
output = pd.DataFrame(columns = rowDict.keys())
##
with open("test.txt", 'r') as file:
for i, line in enumerate(file):
if line[:8] == "##TSTA##":
continue
elif line[19:20] == "A":
ID_A = line[0:15]
TYPE___A['ST'] = line[20:22]
TYPE___A['COUNT'] = line[22:26]
elif (line[19:20] == "B") & (line[0:15] == ID_A):
TYPE___B['KEY'] = line[20:80]
elif (line[19:20] == "C") & (line[0:15] == ID_A):
if "number_ref_C" not in globals():
number_ref_C = int(line[15:19])
c = 1
TYPE___C[f'C_INFO{c}'] = line[20:60]
else :
c += 1
TYPE___C[f'C_INFO{c}'] = line[20:60]
elif (line[19:20] == "D") & (line[0:15] == ID_A):
TYPE___D['DOB'] = line[20:28]
TYPE___D['GENDER'] = line[35:36]
elif (line[19:20] == "E") & (line[0:15] == ID_A):
TYPE___E['E_INFO'] = line[20:39]
elif (line[19:20] == "F") & (line[0:15] == ID_A):
if "number_ref_F" not in globals():
number_ref_F = int(line[15:19])
f = 1
TYPE___F[f'F_INFO{f}'] = line[20:60]
else :
f += 1
TYPE___F[f'F_INFO{f}'] = line[20:60]
elif (line[19:20] == "G") & (line[0:15] == ID_A):
TYPE___G['G_INFO'] = line[20:39]
elif (line[19:20] == "J") & (line[0:15] == ID_A):
TYPE___J['J_INFO'] = line[20:39]
elif (line[19:20] == "K") & (line[0:15] == ID_A):
TYPE___K['K_INFO'] = line[20:39]
elif (line[19:20] == "L") & (line[0:15] == ID_A):
TYPE___L['L_INFO'] = line[20:39]
rowDict = {**TYPE___A, **TYPE___B, **TYPE___C, **TYPE___D, **TYPE___E, **TYPE___F, **TYPE___G, **TYPE___J, **TYPE___K, **TYPE___L}
tmp = pd.DataFrame([rowDict])
output = pd.concat([output, tmp])
TYPE___A, TYPE___B, TYPE___C, TYPE___D, TYPE___E,TYPE___F, TYPE___G, TYPE___J, TYPE___K, TYPE___L = define_empty_dict()
del number_ref_C, number_ref_F
elif line[:8] == "##TEND##":
break
pd.set_option('display.max_columns', None)
output
How to efficiently parse fixed width files?
Using the Python standard library's struct module would be fairly easy as well as fairly fast, since it's written in C. The code below shows how to use it. It also allows columns of characters to be skipped by specifying negative values for the number of characters in the field.
import struct
fieldwidths = (2, -10, 24)
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's') for fw in fieldwidths)
# Convert Unicode input to bytes and the result back to Unicode string.
unpack = struct.Struct(fmtstring).unpack_from # Alias.
parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))
print('fmtstring: {!r}, record size: {} chars'.format(fmtstring, struct.calcsize(fmtstring)))
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields: {}'.format(fields))
Output:
fmtstring: '2s 10x 24s', record size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
Here's a way to do it with string slices, as you were considering but were concerned might get too ugly. It is somewhat complicated, and speed-wise it's about the same as the version based on the struct module, although I have an idea about how it could be sped up (which might make the extra complexity worthwhile). See the update below on that topic.
from itertools import zip_longest
from itertools import accumulate
def make_parser(fieldwidths):
cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
pads = tuple(fw < 0 for fw in fieldwidths) # bool values for padding fields
flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1] # ignore final one
parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
# Optional informational function attributes.
parse.size = sum(abs(fw) for fw in fieldwidths)
parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
for fw in fieldwidths)
return parse
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24) # negative widths represent ignored padding fields
parse = make_parser(fieldwidths)
fields = parse(line)
print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
print('fields: {}'.format(fields))
Output:
format: '2s 10x 24s', rec size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
Update
As I suspected, there is a way of making the string-slicing version of the code faster, which in Python 2.7 makes it about the same speed as the version using struct, but in Python 3.x makes it 233% faster than the struct-based version (and than the un-optimized version of itself, which is about the same speed as the struct version).
What the version presented above does is define a lambda function that's primarily a comprehension that generates the limits of a bunch of slices at runtime.
parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
This is equivalent, once the values of i and j from the for loop are filled in, to something looking like this:
parse = lambda line: (line[0:2], line[12:36], line[36:51], ...)
However, the latter executes more than twice as fast, since the slice boundaries are all constants.
Fortunately, it is relatively easy to convert and "compile" the former into the latter using the built-in eval() function:
def make_parser(fieldwidths):
cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
pads = tuple(fw < 0 for fw in fieldwidths) # bool flags for padding fields
flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1] # ignore final one
slcs = ', '.join('line[{}:{}]'.format(i, j) for pad, i, j in flds if not pad)
parse = eval('lambda line: ({})\n'.format(slcs)) # Create and compile source code.
# Optional informational function attributes.
parse.size = sum(abs(fw) for fw in fieldwidths)
parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
for fw in fieldwidths)
return parse
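The two variants can be checked against each other side by side. A self-contained sketch of the idea, sharing the slice bookkeeping between the comprehension-based parser and the eval-compiled one:

```python
from itertools import accumulate, zip_longest

def slice_specs(fieldwidths):
    # Cumulative cut points; negative widths mark padding fields to skip.
    cuts = tuple(accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)
    return tuple(zip_longest(pads, (0,) + cuts, cuts))[:-1]

flds = slice_specs((2, -10, 24))

# Comprehension version: slice bounds are looked up at call time.
parse_loop = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)

# Compiled version: constant slice bounds baked into the generated source.
slcs = ', '.join('line[{}:{}]'.format(i, j) for pad, i, j in flds if not pad)
parse_fast = eval('lambda line: ({})'.format(slcs))

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
assert parse_loop(line) == parse_fast(line) == ('AB', 'MNOPQRSTUVWXYZ0123456789')
```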
Read fixed-width text file with varchar in pandas
You could use read_table with a regex delimiter and a converter to read the data, followed by a little postprocessing (splitting a column), for example:
import pandas
schema = {
'name': 10,
'age': 2,
'last_visit': 8,
'other_field': 5,
'comment': None,
'fav_color': 8
}
# A converter for the variable length and following columns
def converter(x):
"""Return the comment and the fav_color values separated by ','."""
length_len = 4
comment_len = int(x[:length_len])
return x[length_len:comment_len + length_len:] + ',' + x[comment_len + length_len:]
# A regex as delimiter for the fixed length columns
delimiter = f"(.{{{schema['name']}}})(.{{{schema['age']}}})(.{{{schema['last_visit']}}}).{{{schema['other_field']}}}(.*)"
# Use the delimiter and converter (column 4 holds comment and fav_color) for reading the table
data = pandas.read_table('input.txt', header=None, sep=delimiter, converters={4: converter})
# Clean the table
data.dropna(inplace=True, axis=1)
# Split the comment and the fav_color columns
data[[5, 6]] = data[4].str.split(',', n=1, expand=True)