Reading a Binary File with Python

Reading a binary file with python

Read the binary file content like this:

with open(fileName, mode='rb') as file:  # b is important -> binary
    fileContent = file.read()

Then "unpack" the binary data using struct.unpack:

The start bytes: struct.unpack("iiiii", fileContent[:20])

The body: ignore the 20 header bytes and the 4 trailing bytes (24 bytes in total). The remaining part forms the body. To find the number of integers in the body, do an integer division of its length by 4. The quotient is multiplied by the string 'i' to create the correct format for the unpack method:

struct.unpack("i" * ((len(fileContent) -24) // 4), fileContent[20:-4])

The end bytes: struct.unpack("i", fileContent[-4:])
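As a minimal sketch of the three steps above, the following builds a small in-memory buffer and splits it the same way. The sample values are made up for illustration, and the layout (five header integers, a body of integers, one trailing integer) is assumed from the description. It also assumes the native "i" format is 4 bytes, which holds on common platforms.

```python
import struct

# Hypothetical sample data: 5 header ints, 3 body ints, 1 trailing int.
header_in = (1, 2, 3, 4, 5)
body_in = (10, 20, 30)
trailer_in = (99,)
fileContent = struct.pack("9i", *header_in, *body_in, *trailer_in)

int_size = struct.calcsize("i")   # 4 on common platforms
header_len = 5 * int_size         # 20 bytes
trailer_len = 1 * int_size        # 4 bytes

# Header: the first five integers.
header = struct.unpack("5i", fileContent[:header_len])

# Body: everything between the header and the trailer.
body_count = (len(fileContent) - header_len - trailer_len) // int_size
body = struct.unpack("i" * body_count, fileContent[header_len:-trailer_len])

# Trailer: the last integer.
trailer = struct.unpack("i", fileContent[-trailer_len:])

print(header, body, trailer)
```

Computing the sizes with struct.calcsize instead of hard-coding 20, 24 and 4 keeps the slices consistent if the integer width ever differs.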

Reading binary file and looping over each byte

Python >= 3.8

Thanks to the walrus operator (:=) the solution is quite short. We read bytes objects from the file and assign them to the variable byte.

with open("myfile", "rb") as f:
    while (byte := f.read(1)):
        # Do stuff with byte.
        pass

Python >= 3

In older Python 3 versions, we have to use a slightly more verbose way:

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte != b"":
        # Do stuff with byte.
        byte = f.read(1)

Or as benhoyt says, skip the not equal and take advantage of the fact that b"" evaluates to false. This makes the code compatible between 2.6 and 3.x without any changes. It would also save you from changing the condition if you go from byte mode to text or the reverse.

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte:
        # Do stuff with byte.
        byte = f.read(1)
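Calling read(1) once per byte can be slow on large files. One possible variation, sketched here with a hypothetical helper name (iter_bytes) and an arbitrary chunk size, reads larger blocks and then yields the bytes one at a time. Note that in Python 3, iterating over a bytes object yields integers rather than length-1 bytes objects.

```python
import os
import tempfile

def iter_bytes(path, chunk_size=4096):
    """Yield each byte of the file as an int (hypothetical helper)."""
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            yield from chunk

# Demo on a small temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00\x01\xff")
    demo_path = tmp.name

result = list(iter_bytes(demo_path, chunk_size=2))
os.remove(demo_path)
print(result)  # [0, 1, 255]
```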

Python >= 2.5

In Python 2, it's a bit different. Here we don't get bytes objects, but raw characters:

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte != "":
        # Do stuff with byte.
        byte = f.read(1)

Note that the with statement is not available in versions of Python below 2.5. To use it in v 2.5 you'll need to import it:

from __future__ import with_statement

In 2.6 this is not needed.

Python 2.4 and Earlier

f = open("myfile", "rb")
try:
    byte = f.read(1)
    while byte != "":
        # Do stuff with byte.
        byte = f.read(1)
finally:
    f.close()

Reading binary data file in python for analysis

Updated Answer

I played around with this some more in Numpy and you can read the file much more cleanly like this - the information below still applies and explains how it works:

import numpy as np

# Read file and reshape as "records" of 28 bytes each
n = np.fromfile('fort.99',dtype=np.uint8).reshape(-1,28)
I = n[:,4:8].copy().view(np.int32) # pick bytes 4-8, make contiguous and view as integer
A = n[:,8:16].copy().view(np.float64) # pick bytes 8-16, make contiguous and view as float64
B = n[:,16:24].copy().view(np.float64) # pick bytes 16-24, make contiguous and view as float64

Original Answer

I changed your program to have some recognisable data:

      program thing
        REAL*8, DIMENSION(128) :: A,B
        INTEGER N
        N=128

        open(unit=99,form='unformatted',status='unknown')
        do i=1,N
          A(i)=100.0 * i
          B(i)=-1000.0 * i
          write(99) (2*i),(A(i)),(B(i))
        enddo
        close(99)
      end program

Then I looked at the file size and it was 3,584 bytes, so I divided that by 128 to get the bytes per fortran WRITE as 28.

So I inspected the data with xxd as follows:

xxd -c 28 fort.99

00000000: 1400 0000 0200 0000 0000 0000 0000 5940 0000 0000 0040 8fc0 1400 0000 ..............Y@.....@......
0000001c: 1400 0000 0400 0000 0000 0000 0000 6940 0000 0000 0040 9fc0 1400 0000 ..............i@.....@......
00000038: 1400 0000 0600 0000 0000 0000 00c0 7240 0000 0000 0070 a7c0 1400 0000 ..............r@.....p......
00000054: 1400 0000 0800 0000 0000 0000 0000 7940 0000 0000 0040 afc0 1400 0000 ..............y@.....@......
00000070: 1400 0000 0a00 0000 0000 0000 0040 7f40 0000 0000 0088 b3c0 1400 0000 .............@.@............
0000008c: 1400 0000 0c00 0000 0000 0000 00c0 8240 0000 0000 0070 b7c0 1400 0000 ...............@.....p......
000000a8: 1400 0000 0e00 0000 0000 0000 00e0 8540 0000 0000 0058 bbc0 1400 0000 ...............@.....X......
000000c4: 1400 0000 1000 0000 0000 0000 0000 8940 0000 0000 0040 bfc0 1400 0000 ...............@.....@......
000000e0: 1400 0000 1200 0000 0000 0000 0020 8c40 0000 0000 0094 c1c0 1400 0000 ............. .@............
000000fc: 1400 0000 1400 0000 0000 0000 0040 8f40 0000 0000 0088 c3c0 1400 0000 .............@.@............

So there are 28 bytes per fortran WRITE and there are 4 bytes of indexing at the start and end of each record.

Then I decoded them like this:

#!/usr/bin/env python3

import struct

# Important to open the file in binary mode, 'b'
with open('fort.99','rb') as f:
    # You should really read until error in case there are more or fewer than 128 records one day - but you know there are 128
    for i in range(128):
        # Read one record, i.e. output of one fortran WRITE
        record = f.read(28)
        # Unpack 2 INTEGERs, 2 DOUBLEs and an INTEGER
        _, I, A, B, _ = struct.unpack('2i2di',record)
        print(I,A,B)

Output

2 100.0 -1000.0
4 200.0 -2000.0
6 300.0 -3000.0
8 400.0 -4000.0
10 500.0 -5000.0
12 600.0 -6000.0
14 700.0 -7000.0
16 800.0 -8000.0
18 900.0 -9000.0
20 1000.0 -10000.0
22 1100.0 -11000.0
24 1200.0 -12000.0
26 1300.0 -13000.0
28 1400.0 -14000.0
30 1500.0 -15000.0
32 1600.0 -16000.0
34 1700.0 -17000.0
36 1800.0 -18000.0
38 1900.0 -19000.0
40 2000.0 -20000.0
42 2100.0 -21000.0
44 2200.0 -22000.0
46 2300.0 -23000.0
48 2400.0 -24000.0
50 2500.0 -25000.0
52 2600.0 -26000.0
54 2700.0 -27000.0
56 2800.0 -28000.0
58 2900.0 -29000.0
60 3000.0 -30000.0
62 3100.0 -31000.0
64 3200.0 -32000.0
66 3300.0 -33000.0
68 3400.0 -34000.0
70 3500.0 -35000.0
72 3600.0 -36000.0
74 3700.0 -37000.0
76 3800.0 -38000.0
78 3900.0 -39000.0
80 4000.0 -40000.0
82 4100.0 -41000.0
84 4200.0 -42000.0
86 4300.0 -43000.0
88 4400.0 -44000.0
90 4500.0 -45000.0
92 4600.0 -46000.0
94 4700.0 -47000.0
96 4800.0 -48000.0
98 4900.0 -49000.0
100 5000.0 -50000.0
102 5100.0 -51000.0
104 5200.0 -52000.0
106 5300.0 -53000.0
108 5400.0 -54000.0
110 5500.0 -55000.0
112 5600.0 -56000.0
114 5700.0 -57000.0
116 5800.0 -58000.0
118 5900.0 -59000.0
120 6000.0 -60000.0
122 6100.0 -61000.0
124 6200.0 -62000.0
126 6300.0 -63000.0
128 6400.0 -64000.0
130 6500.0 -65000.0
132 6600.0 -66000.0
134 6700.0 -67000.0
136 6800.0 -68000.0
138 6900.0 -69000.0
140 7000.0 -70000.0
142 7100.0 -71000.0
144 7200.0 -72000.0
146 7300.0 -73000.0
148 7400.0 -74000.0
150 7500.0 -75000.0
152 7600.0 -76000.0
154 7700.0 -77000.0
156 7800.0 -78000.0
158 7900.0 -79000.0
160 8000.0 -80000.0
162 8100.0 -81000.0
164 8200.0 -82000.0
166 8300.0 -83000.0
168 8400.0 -84000.0
170 8500.0 -85000.0
172 8600.0 -86000.0
174 8700.0 -87000.0
176 8800.0 -88000.0
178 8900.0 -89000.0
180 9000.0 -90000.0
182 9100.0 -91000.0
184 9200.0 -92000.0
186 9300.0 -93000.0
188 9400.0 -94000.0
190 9500.0 -95000.0
192 9600.0 -96000.0
194 9700.0 -97000.0
196 9800.0 -98000.0
198 9900.0 -99000.0
200 10000.0 -100000.0
202 10100.0 -101000.0
204 10200.0 -102000.0
206 10300.0 -103000.0
208 10400.0 -104000.0
210 10500.0 -105000.0
212 10600.0 -106000.0
214 10700.0 -107000.0
216 10800.0 -108000.0
218 10900.0 -109000.0
220 11000.0 -110000.0
222 11100.0 -111000.0
224 11200.0 -112000.0
226 11300.0 -113000.0
228 11400.0 -114000.0
230 11500.0 -115000.0
232 11600.0 -116000.0
234 11700.0 -117000.0
236 11800.0 -118000.0
238 11900.0 -119000.0
240 12000.0 -120000.0
242 12100.0 -121000.0
244 12200.0 -122000.0
246 12300.0 -123000.0
248 12400.0 -124000.0
250 12500.0 -125000.0
252 12600.0 -126000.0
254 12700.0 -127000.0
256 12800.0 -128000.0

As you can see, there are record markers 1400 0000 on each record, meaning the records are 20 bytes long (1 Fortran INTEGER of 4 bytes and 2 Fortran DOUBLEs of 8 bytes each) with 4 bytes of record marker at each end, which I read into a variable called _ and discarded in Python. If you don't want that, you need to use DIRECT/STREAM output in Fortran, but I don't know if you can control that at the other end of your processing chain. See here for a description.
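The comment in the script suggests reading until the data runs out rather than hard-coding 128 records. A minimal sketch of that idea follows. It assumes little-endian data and 4-byte record markers that hold the payload length (20), as seen in the dump above. The function name read_records is made up for illustration, and an in-memory buffer stands in for fort.99.

```python
import io
import struct

# One record: marker, INTEGER, two DOUBLEs, marker (little-endian assumed).
RECORD = struct.Struct("<iiddi")  # 28 bytes

def read_records(f):
    """Yield (I, A, B) tuples until the file is exhausted."""
    while True:
        raw = f.read(RECORD.size)
        if not raw:
            break  # clean end of file
        if len(raw) != RECORD.size:
            raise ValueError("truncated record")
        head, i, a, b, tail = RECORD.unpack(raw)
        # Both markers should equal the payload length (28 - 2 * 4 = 20).
        if head != tail or head != RECORD.size - 8:
            raise ValueError("unexpected record markers")
        yield i, a, b

# Stand-in for fort.99: three records built in memory.
sample = b"".join(
    RECORD.pack(20, 2 * k, 100.0 * k, -1000.0 * k, 20) for k in range(1, 4)
)
records = list(read_records(io.BytesIO(sample)))
print(records)
```

Checking the markers instead of discarding them catches files whose record length differs from the expected 20 bytes.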

Keywords: fortran, Python, binary, unformatted, xxd, dump, records, view as integer, view as float, view as float64, view as double

Fastest way to read a binary file with a defined format?

There's a lot of good and helpful answers here, but I think the best solution needs more explaining. I implemented a method that reads the entire data file in one pass using the built-in read() and constructs a numpy ndarray all at the same time. This is more efficient than reading the data and constructing the array separately, but it's also a bit more finicky.

import numpy as np

line_cols = 20                # For example
line_rows = 40000             # For example
data_fmt = 15*'f8,'+5*'f4,'   # For example (15 8-byte doubles + 5 4-byte floats)
data_bsize = 15*8 + 4*5       # For example
with open(filename,'rb') as f:
    data = np.ndarray(shape=(1,line_rows),
                      dtype=np.dtype(data_fmt),
                      buffer=f.read(line_rows*data_bsize))[0].astype(line_cols*'f8,').view(dtype='f8').reshape(line_rows,line_cols)[:,:-1]

Here, we open the file as a binary file using the 'rb' option in open. Then, we construct our ndarray with the proper shape and dtype to fit our read buffer. We then reduce the ndarray into a 1D array by taking its zeroth index, where all our data is hiding. Then, we convert and reshape the array using the ndarray methods astype, view and reshape. The conversion is needed because reshape doesn't like having data with mixed dtypes, and I'm okay with having my integers expressed as doubles.
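The chained expression is easier to follow when split into steps. The sketch below uses a deliberately tiny, hypothetical layout (two doubles and one float per row, three rows) and builds the raw bytes in memory instead of reading a file. It uses np.frombuffer in place of the np.ndarray constructor, which is a swapped-in but equivalent way to wrap a buffer.

```python
import numpy as np

# Hypothetical layout for the demo: 2 doubles + 1 float per row, 3 rows.
row_dtype = np.dtype([("a", "<f8"), ("b", "<f8"), ("c", "<f4")])  # 20 bytes
rows_in = np.array(
    [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0)], dtype=row_dtype
)
raw = rows_in.tobytes()  # stands in for f.read(n_rows * row_size)

# Step 1: interpret the buffer as structured rows (no copy).
rows = np.frombuffer(raw, dtype=row_dtype)

# Step 2: convert every field to float64 so all columns share one dtype.
uniform_dtype = np.dtype([("a", "<f8"), ("b", "<f8"), ("c", "<f8")])
as_f8 = rows.astype(uniform_dtype)

# Step 3: view the now-uniform records as plain doubles and reshape.
table = as_f8.view("<f8").reshape(len(rows), 3)
print(table)
```

Step 2 is the only one that copies data. Steps 1 and 3 only reinterpret the existing bytes.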

This method is ~100x faster than looping line-for-line through the data, and could potentially be compressed down into a single line of code.

In the future, I may try to read the data in even faster using a Fortran script that essentially converts the binary file into a text file. I don't know if this will be faster, but it may be worth a try.

How to read a binary file with a known header and file format for data analysis?

Based on this description it is difficult to say how one could read the header, since this will depend on its specific structure. It should be possible though to read the rest of the file.

Start by reading the file as a byte array:

with open(testFile, 'rb') as testData:
    data = testData.read()

len(data) will give the number of bytes. Assuming that the header consists of fewer than 720^2 bytes, and that the rest of the bytes are subdivided into images of 720^2 bytes each, the remainder from the division of len(data) by 720^2 will give the length of the header:

len_header = len(data) % 720**2
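To see why the remainder works, here is a small sketch with made-up numbers: a 4x4 image size stands in for 720x720, and a hypothetical 3-byte header precedes two images. It relies on the same assumption as above, namely that the header is shorter than one image.

```python
side = 4                      # tiny stand-in for 720
image_size = side ** 2        # 16 bytes per image
header = b"HDR"               # hypothetical 3-byte header
images_in = [bytes([k]) * image_size for k in range(2)]
data = header + b"".join(images_in)

# 35 bytes in total: 35 % 16 == 3, which is the header length.
len_header = len(data) % image_size

# Slice the rest into fixed-size images.
body = data[len_header:]
images = [body[i:i + image_size] for i in range(0, len(body), image_size)]
print(len_header, len(images))
```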

You can then disregard the header and convert the remaining bytes into integers:

pixels = [b for b in data[len_header:]]

Next, you can use numpy to rearrange this list into a 2-dimensional array with 720^2 columns, so that each row consists of pixels of a single image:

import numpy as np

images = np.array(pixels).reshape(-1, 720**2)

Each image can now be accessed as images[i], where i is the index of a row. This is a 1-dimensional array, so to make it into a 2-dimensional structure representing an image, reshape again:

images[i].reshape(720, 720)  

Finally, you can use matplotlib to display the image and check if it looks correct:

import matplotlib.pyplot as plt

plt.imshow(images[i].reshape(720, 720), cmap="gray_r")
plt.show()

Reading binary files using python

You can use the struct module to convert a binary string into a float.

For example :

import struct
x = struct.unpack('f', b'A0\x00\x00')[0]

This gives 1.7310239929804465e-41 as output.

It's probably not the most elegant solution, because you have to convert 'extras' to a string.

x = struct.unpack('f', str(extras))[0]

Reading integers from binary file in Python

The read method returns a sequence of bytes as a string. To convert from a string byte-sequence to binary data, use the built-in struct module: http://docs.python.org/library/struct.html.

import struct

print(struct.unpack('i', fin.read(4)))

Note that unpack always returns a tuple, so struct.unpack('i', fin.read(4))[0] gives the integer value that you are after.

You should probably use the format string '<i' (< is a modifier that indicates little-endian byte-order and standard size and alignment - the default is to use the platform's byte ordering, size and alignment). According to the BMP format spec, the bytes should be written in Intel/little-endian byte order.
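A short sketch of why the byte-order modifier matters: the same four bytes decode to very different integers depending on the order assumed. The sample bytes are made up for illustration.

```python
import struct

raw = b"\x28\x00\x00\x00"  # hypothetical 4-byte field

little = struct.unpack("<i", raw)[0]  # least significant byte first
big = struct.unpack(">i", raw)[0]     # most significant byte first
print(little, big)  # 40 671088640
```

Spelling out "<" or ">" also makes the result independent of the machine the script runs on.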


