Read fixed width record from text file
Substring sounds good to me. The only downside I can immediately think of is that it means copying the data each time, but I wouldn't worry about that until you prove it's a bottleneck. Substring is simple :)
You could use a regex to match a whole record at a time and capture the fields, but I think that would be overkill.
Read fixed width text file
This is a fixed width file. Use read.fwf()
to read it:
x <- read.fwf(
file=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
skip=4,
widths=c(12, 7, 4, 9, 4, 9, 4, 9, 4))
head(x)
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 03JAN1990 23.4 -0.4 25.1 -0.3 26.6 0.0 28.6 0.3
2 10JAN1990 23.4 -0.8 25.2 -0.3 26.6 0.1 28.6 0.3
3 17JAN1990 24.2 -0.3 25.3 -0.3 26.5 -0.1 28.6 0.3
4 24JAN1990 24.4 -0.5 25.5 -0.4 26.5 -0.1 28.4 0.2
5 31JAN1990 25.1 -0.2 25.8 -0.2 26.7 0.1 28.4 0.2
6 07FEB1990 25.8 0.2 26.1 -0.1 26.8 0.1 28.4 0.3
Update
The package readr
(released April, 2015) provides a simple and fast alternative.
library(readr)
x <- read_fwf(
file="http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",
skip=4,
fwf_widths(c(12, 7, 4, 9, 4, 9, 4, 9, 4)))
Speed comparison: readr::read_fwf()
was ~2x faster than utils::read.fwf ()
.
Read in txt file with fixed width columns
Use read_fwf
instead of read_csv
.
[
read_fwf
reads] a table of fixed-width formatted lines into DataFrame.
https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html
import pandas as pd
colspecs = (
(0, 44),
(46, 47),
(48, 49),
(50, 51),
(52, 53),
(54, 55),
(56, 57),
(58, 59),
(60, 66),
(67, 73),
(74, 77),
(78, 80),
(81, 84),
(85, 87),
(88, 90),
(91, 95),
(96, 99),
(100, 103),
(104, 106),
)
data_url = "http://jse.amstat.org/datasets/04cars.dat.txt"
df = pd.read_fwf(data_url, colspecs=colspecs)
df.columns = (
"Vehicle Name",
"Is Sports Car",
"Is SUV",
"Is Wagon",
"Is Minivan",
"Is Pickup",
"Is All-Wheel Drive",
"Is Rear-Wheel Drive",
"Suggested Retail Price",
"Dealer Cost",
"Engine Size (litres)",
"Number of Cylinders",
"Horsepower",
"City Miles Per Gallon",
"Highway Miles Per Gallon",
"Weight (pounds)",
"Wheel Base (inches)",
"Lenght (inches)",
"Width (inches)",
)
And the output for print(df)
would be:
Vehicle Name ... Width (inches)
0 Chevrolet Aveo LS 4dr hatch ... 66
1 Chevrolet Cavalier 2dr ... 69
2 Chevrolet Cavalier 4dr ... 68
3 Chevrolet Cavalier LS 2dr ... 69
4 Dodge Neon SE 4dr ... 67
.. ... ... ...
422 Nissan Titan King Cab XE ... *
423 Subaru Baja ... *
424 Toyota Tacoma ... *
425 Toyota Tundra Regular Cab V6 ... *
426 Toyota Tundra Access Cab V6 SR5 ... *
[427 rows x 19 columns]
Column names and specifications retrieved from here:
- http://jse.amstat.org/datasets/04cars.txt
Note: Don't forget to specify where each column starts and ends. Without using colspecs
, pandas
is making an assumption based on the first row which leads to data errors. Below an extract of a unified diff between generated csv
files (with specs and without):
F# Read Fixed Width Text File
The hardest part is probably to split a single line according to the column format. It can be done something like this:
let splitLine format (line : string) =
format |> List.map (fun (index, length) -> line.Substring(index, length))
This function has the type (int * int) list -> string -> string list
. In other words, format
is an (int * int) list
. This corresponds exactly to your format
list. The line
argument is a string
, and the function returns a string list
.
You can map a list of lines like this:
let result = lines |> List.map (splitLine format)
You can also use Seq.map
or Array.map
, depending on how lines
is defined. Such a result
will be a string list list
, and you can now map over such a list to produce a MyRecord list
.
You can use File.ReadLines
to get a lazily evaluated sequence of strings from a file.
Please note that the above is only an outline of a possible solution. I left out boundary checks, error handling, and such. The above code may contain off-by-one errors.
Python conditionally read fixed width text file and create DataFrame
Here you go.
import pandas as pd
import numpy as np
def define_empty_dict():
TYPE___A = {'ST': None, 'COUNT': None}
TYPE___B = {'KEY': None}
TYPE___C = {'C_INFO1': None, 'C_INFO2': None, 'C_INFO3': None, 'C_INFO4': None}
TYPE___D = {'DOB': None, 'GENDER': None}
TYPE___E = {'E_INFO': None}
TYPE___F = {'F_INFO1': None, 'F_INFO2': None, 'F_INFO3': None, 'F_INFO4': None, 'F_INFO5': None, 'F_INFO6': None, 'F_INFO7': None, 'F_INFO8': None, 'F_INFO9': None}
TYPE___G = {'G_INFO': None}
TYPE___J = {'J_INFO': None}
TYPE___K = {'K_INFO': None}
TYPE___L = {'L_INFO': None}
return TYPE___A, TYPE___B, TYPE___C, TYPE___D, TYPE___E,TYPE___F, TYPE___G, TYPE___J, TYPE___K, TYPE___L
TYPE___A, TYPE___B, TYPE___C, TYPE___D, TYPE___E,TYPE___F, TYPE___G, TYPE___J, TYPE___K, TYPE___L = define_empty_dict()
rowDict = {**TYPE___A, **TYPE___B, **TYPE___C, **TYPE___D, **TYPE___E, **TYPE___F, **TYPE___G, **TYPE___J, **TYPE___K, **TYPE___L}
output = pd.DataFrame(columns = rowDict.keys())
##
with open("test.txt", 'r') as file:
for i, line in enumerate(file):
if line[:8] == "##TSTA##":
continue
elif line[19:20] == "A":
ID_A = line[0:15]
TYPE___A['ST'] = line[20:22]
TYPE___A['COUNT'] = line[22:26]
elif (line[19:20] == "B") & (line[0:15] == ID_A):
TYPE___B['KEY'] = line[20:80]
elif (line[19:20] == "C") & (line[0:15] == ID_A):
if "number_ref_C" not in globals():
number_ref_C = int(line[15:19])
c = 1
TYPE___C[f'C_INFO{c}'] = line[20:60]
else :
c += 1
TYPE___C[f'C_INFO{c}'] = line[20:60]
elif (line[19:20] == "D") & (line[0:15] == ID_A):
TYPE___D['DOB'] = line[20:28]
TYPE___D['GENDER'] = line[35:36]
elif (line[19:20] == "E") & (line[0:15] == ID_A):
TYPE___E['E_INFO'] = line[20:39]
elif (line[19:20] == "F") & (line[0:15] == ID_A):
if "number_ref_F" not in globals():
number_ref_F = int(line[15:19])
f = 1
TYPE___F[f'F_INFO{f}'] = line[20:60]
else :
f += 1
TYPE___F[f'F_INFO{f}'] = line[20:60]
elif (line[19:20] == "G") & (line[0:15] == ID_A):
TYPE___G['G_INFO'] = line[20:39]
elif (line[19:20] == "J") & (line[0:15] == ID_A):
TYPE___J['J_INFO'] = line[20:39]
elif (line[19:20] == "K") & (line[0:15] == ID_A):
TYPE___K['K_INFO'] = line[20:39]
elif (line[19:20] == "L") & (line[0:15] == ID_A):
TYPE___L['L_INFO'] = line[20:39]
rowDict = {**TYPE___A, **TYPE___B, **TYPE___C, **TYPE___D, **TYPE___E, **TYPE___F, **TYPE___G, **TYPE___J, **TYPE___K, **TYPE___L}
tmp = pd.DataFrame([rowDict])
output = pd.concat([output, tmp])
TYPE___A, TYPE___B, TYPE___C, TYPE___D, TYPE___E,TYPE___F, TYPE___G, TYPE___J, TYPE___K, TYPE___L = define_empty_dict()
del number_ref_C, number_ref_F
elif line[:8] == "##TEND##":
break
pd.set_option('display.max_columns', None)
output
Read fixed-width text file with varchar in pandas
You could use read_table with a regex delimiter and a converter to read the data, followed by a little postprocessing (splitting a column), for example:
import pandas
schema = {
'name': 10,
'age': 2,
'last_visit': 8,
'other_field': 5,
'comment': None,
'fav_color': 8
}
# A converter for the variable length and following columns
def converter(x):
"""Return the comment and the fav_color values separated by ','."""
length_len = 4
comment_len = int(x[:length_len])
return x[length_len:comment_len + length_len:] + ',' + x[comment_len + length_len:]
# A regex as delimiter for the fixed length columns
delimiter = f"(.{{{schema['name']}}})(.{{{schema['age']}}})(.{{{schema['last_visit']}}}).{{{schema['other_field']}}}(.*)"
# Use the delimiter and converter (column 4 holds comment and fav_color) for reading the table
data = pandas.read_table('input.txt', header=None, sep=delimiter, converters={4: converter})
# Clean the table
data.dropna(inplace=True, axis=1)
# Split the comment and the fav_color columns
data[5], data[6] = data[4].str.split(',', 1).str
Read multi-row fixed width records from textfile
Using the FileHelpers library, your example could be parsed as follows:
Declare a class to represent your objects:
[IgnoreFirst(2)]
[FixedLengthRecord(FixedMode.ExactLength)]
public sealed class Record
{
[FieldTrim(TrimMode.Right)]
[FieldFixedLength(6)]
public String Header1;
[FieldFixedLength(3)]
public String Data1;
[FieldInNewLine()]
[FieldTrim(TrimMode.Right)]
[FieldFixedLength(6)]
public String Header2;
[FieldFixedLength(3)]
public String Data2;
[FieldInNewLine()]
[FieldTrim(TrimMode.Right)]
[FieldFixedLength(6)]
public String Header3;
[FieldFixedLength(3)]
public String Data3;
}
Load the data from the file like so:
FileHelperEngine<Record> engine = new FileHelperEngine<Record>();
engine.ErrorManager.ErrorMode = ErrorMode.SaveAndContinue;
DataTable records = engine.ReadFileAsDT(@"myTextFile.txt");
if (engine.ErrorManager.ErrorCount > 0)
engine.ErrorManager.SaveErrors("Errors.txt");
Related Topics
How Does Hashset Compare Elements for Equality
How to Write an Apple Push Notification Provider in C#
How to Call a R Statistics Function to Optimize C# Function
Does Java Have Something Like C#'s Ref and Out Keywords
Why Learn Perl, Python, Ruby If the Company Is Using C++, C# or Java as the Application Language
What Are Major Differences Between C# and Java
Hmac Sha256 in C# VS Hmac Sha256 in Swift Don't Match
Difference Between Namespace in C# and Package in Java
Pass C# ASP.NET Array to JavaScript Array
How to Read a .Net Guid into a Java Uuid
Invoke C# Code from JavaScript in a Document in a Webbrowser
How to Lock a Table on Read, Using Entity Framework
Dependent Dll Is Not Getting Copied to the Build Output Folder in Visual Studio