How do I extract a single chunk of bytes from within a file?
Try dd:
dd skip=102567 count=253 if=input.binary of=output.binary bs=1
The option bs=1 sets the block size, making dd read and write one byte at a time; the default block size is 512 bytes. The value of bs also affects the behavior of skip and count, since the numbers in skip and count are the numbers of blocks that dd will skip and read/write, respectively.
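If you want to sanity-check what a dd invocation like the one above produces, the same slice is easy to reproduce in Python; a minimal sketch (the path and offsets are placeholders, not from the original question):

```python
def extract(path, skip, count):
    """Return `count` bytes starting at byte offset `skip` -- the slice
    that `dd bs=1 skip=<skip> count=<count>` would produce."""
    with open(path, "rb") as f:
        f.seek(skip)          # with bs=1, skip is a plain byte offset
        return f.read(count)  # and count is a plain byte count
```

With bs=1 the two approaches should agree byte for byte, which makes this a handy cross-check.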
How to extract specific bytes from a file using Unix
Use dd:
dd bs=1 skip=60 count=12 if=file.bin of=output
You can write a shell loop to substitute the numbers. You could also consider using awk, Perl or Python if there are a lot of them to do or it needs to be really fast.
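If "a lot of them" means many (offset, length) pairs, a single Python process with one open file handle avoids launching dd once per chunk; a sketch, with the range list purely illustrative:

```python
def extract_ranges(path, ranges):
    """ranges: iterable of (offset, length) pairs, in bytes.
    Returns the corresponding chunks, in order."""
    chunks = []
    with open(path, "rb") as f:
        for offset, length in ranges:
            f.seek(offset)                # jump to each requested offset
            chunks.append(f.read(length)) # grab exactly that many bytes
    return chunks
```

One process and one file descriptor for all chunks is the main win over a shell loop that forks dd per chunk.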
Extract specific bytes from a binary file in Python
I would definitely try mmap():
https://docs.python.org/2/library/mmap.html
You're reading a lot of small bits, which incurs a lot of system-call overhead if you're calling seek() and read() for every int16 you are extracting.
I've written a small test to demonstrate:
#!/usr/bin/python
import mmap
import os
import struct
import sys

FILE = "/opt/tmp/random"  # dd if=/dev/random of=/tmp/random bs=1024k count=1024
SIZE = os.stat(FILE).st_size
BYTES = 2
SKIP = 10

def byfile():
    sum = 0
    with open(FILE, "rb") as fd:
        for offset in range(0, SIZE/BYTES, SKIP*BYTES):
            fd.seek(offset)
            data = fd.read(BYTES)
            sum += struct.unpack('h', data)[0]
    return sum

def bymmap():
    sum = 0
    with open(FILE, "rb") as fd:
        mm = mmap.mmap(fd.fileno(), 0, prot=mmap.PROT_READ)
        for offset in range(0, SIZE/BYTES, SKIP*BYTES):
            data = mm[offset:offset+BYTES]
            sum += struct.unpack('h', data)[0]
    return sum

if sys.argv[1] == 'mmap':
    print bymmap()
if sys.argv[1] == 'file':
    print byfile()
I ran each method twice to compensate for caching. I used time because I wanted to measure user and sys time. Here are the results:
[centos7:/tmp]$ time ./test.py file
-211990391
real 0m44.656s
user 0m35.978s
sys 0m8.697s
[centos7:/tmp]$ time ./test.py file
-211990391
real 0m43.091s
user 0m37.571s
sys 0m5.539s
[centos7:/tmp]$ time ./test.py mmap
-211990391
real 0m16.712s
user 0m15.495s
sys 0m1.227s
[centos7:/tmp]$ time ./test.py mmap
-211990391
real 0m16.942s
user 0m15.846s
sys 0m1.104s
[centos7:/tmp]$
(The sum -211990391 just validates that both versions do the same thing.)
Looking at each version's 2nd run, mmap() takes about 1/3rd of the real time. User time is about 1/2 and system time about 1/5th.
Your other options for perhaps speeding this up are:
(1) As you mentioned, load the whole file. The large I/O's instead of the small I/O's could speed things up. If you exceed system memory, though, you'll fall back to paging, which will be worse than mmap() (since you have to page out). I'm not super hopeful here because mmap is already using larger I/O's.
(2) Concurrency. Maybe reading the file in parallel through multiple threads could speed things up, but you'll have the Python GIL to deal with. Multiprocessing will work better by avoiding the GIL, and you could easily pass your data back to a top level handler. This will, however, work against the next item, locality: You might make your I/O more random.
(3) Locality. Somehow organize your data (or order your reads) so that your data is closer together. mmap() pages the file in chunks according to the system pagesize:
>>> import mmap
>>> mmap.PAGESIZE
4096
>>> mmap.ALLOCATIONGRANULARITY
4096
>>>
If your data is closer together (within the 4k chunk), it will already have been loaded into the buffer cache.
(4) Better hardware. Like an SSD.
I did run this on an SSD and it was much faster. I also profiled the Python code, wondering whether the unpack was expensive. It's not:
$ python -m cProfile test.py mmap
121679286
26843553 function calls in 8.369 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 6.204 6.204 8.357 8.357 test.py:24(bymmap)
1 0.012 0.012 8.369 8.369 test.py:3(<module>)
26843546 1.700 0.000 1.700 0.000 {_struct.unpack}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'fileno' of 'file' objects}
1 0.000 0.000 0.000 0.000 {open}
1 0.000 0.000 0.000 0.000 {posix.stat}
1 0.453 0.453 0.453 0.453 {range}
Addendum:
Curiosity got the best of me and I tried out multiprocessing. I need to look at my partitioning more closely, but the number of unpacks (53687092) is the same across trials:
$ time ./test2.py 4
[(4415068.0, 13421773), (-145566705.0, 13421773), (14296671.0, 13421773), (109804332.0, 13421773)]
(-17050634.0, 53687092)
real 0m5.629s
user 0m17.756s
sys 0m0.066s
$ time ./test2.py 1
[(264140374.0, 53687092)]
(264140374.0, 53687092)
real 0m13.246s
user 0m13.175s
sys 0m0.060s
Code:
#!/usr/bin/python
import functools
import multiprocessing
import mmap
import os
import struct
import sys

FILE = "/tmp/random"  # dd if=/dev/random of=/tmp/random bs=1024k count=1024
SIZE = os.stat(FILE).st_size
BYTES = 2
SKIP = 10

def bymmap(poolsize, n):
    partition = SIZE/poolsize
    initial = n * partition
    end = initial + partition
    sum = 0.0
    unpacks = 0
    with open(FILE, "rb") as fd:
        mm = mmap.mmap(fd.fileno(), 0, prot=mmap.PROT_READ)
        for offset in xrange(initial, end, SKIP*BYTES):
            data = mm[offset:offset+BYTES]
            sum += struct.unpack('h', data)[0]
            unpacks += 1
    return (sum, unpacks)

poolsize = int(sys.argv[1])
pool = multiprocessing.Pool(poolsize)
results = pool.map(functools.partial(bymmap, poolsize), range(0, poolsize))
print results
print reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), results)
How to grab an arbitrary chunk from a file on Unix/Linux
Yes, it's awkward to do this with dd today. We're considering adding skip_bytes and count_bytes parameters to dd in coreutils to help. The following should work, though:
#!/bin/sh
bs=100000
infile=$1
skip=$2
length=$3
(
  dd bs=1 skip=$skip count=0
  dd bs=$bs count=$(($length / $bs))
  dd bs=$(($length % $bs)) count=1
) < "$infile"
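The same idea translates to Python if dd isn't convenient: position once, read whole bs-sized blocks, then one short read for the remainder. A sketch (the default bs is an arbitrary buffer size, not anything mandated by the script above):

```python
def extract_chunk(path, skip, length, bs=100000):
    """Mirror the three-dd pipeline: seek once, then do large reads,
    plus one final short read for the length % bs remainder."""
    parts = []
    with open(path, "rb") as f:
        f.seek(skip)
        for _ in range(length // bs):   # full-size blocks
            parts.append(f.read(bs))
        if length % bs:                 # trailing partial block
            parts.append(f.read(length % bs))
    return b"".join(parts)
```

As with the shell version, the point is that only the remainder is read with an odd size; the bulk of the copy uses large, efficient reads.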
How do I extract bytes with offsets from a huge block efficiently in Python?
Make it mutable and delete the unwanted slice?
>>> tmp = bytearray(block)
>>> del tmp[3::4]
>>> bytes(tmp)
b'01245689A'
If your chunks are large and you want to remove almost all bytes, it might become faster to instead collect what you do want, similar to your approach. Yours potentially takes quadratic time, though, so better to use join:
>>> b''.join([block[i : i+3] for i in range(0, len(block), 4)])
b'01245689A'
(Btw, according to PEP 8 it should be block[i : i+3], not block[i:i + 3], and for good reason.)
That builds a lot of objects, though, which could be a memory problem. And for your stated case, it's much faster than yours but much slower than my bytearray one.
Benchmark with block = b'0123456789AB' * 100_000 (much smaller than the 1GB you mentioned in the comments below):
0.00 ms 0.00 ms 0.00 ms baseline
15267.60 ms 14724.33 ms 14712.70 ms original
2.46 ms 2.46 ms 3.45 ms Kelly_Bundy_bytearray
83.66 ms 85.27 ms 122.88 ms Kelly_Bundy_join
Benchmark code:
import timeit

def baseline(block):
    pass

def original(block):
    result = b''
    for i in range(0, len(block), 4):
        result += block[i:i + 3]
    return result

def Kelly_Bundy_bytearray(block):
    tmp = bytearray(block)
    del tmp[3::4]
    return bytes(tmp)

def Kelly_Bundy_join(block):
    return b''.join([block[i : i+3] for i in range(0, len(block), 4)])

funcs = [
    baseline,
    original,
    Kelly_Bundy_bytearray,
    Kelly_Bundy_join,
]

block = b'0123456789AB' * 100_000
args = block,
number = 10**0

expect = original(*args)
for func in funcs:
    print(func(*args) == expect, func.__name__)
print()

tss = [[] for _ in funcs]
for _ in range(3):
    for func, ts in zip(funcs, tss):
        t = min(timeit.repeat(lambda: func(*args), number=number)) / number
        ts.append(t)
        print(*('%8.2f ms ' % (1e3 * t) for t in ts), func.__name__)
    print()
Python Regular Expression Extract Chunk of Data From Binary File
You could split on \x00{5,}, i.e. 5 or more zeros. It's the delimiter you specified.
In Perl, it's something like this:
Perl test case
$strLangs = "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56";
# Remove leading zero's (5 or more)
$strLangs =~ s/^\x00{5,}//;
# Split on 5 or more 0's
@Alllangs = split /\x00{5,}/, $strLangs;
# Print each language characters
foreach $lang (@Alllangs)
{
    print "<";
    for ( split //, $lang ) {
        printf( "%x,", ord($_) );
    }
    print ">\n";
}
Output >>
<ff,fe,fe,0,0,23,41,>
<41,49,57,0,0,0,0,32,41,49,57,0,0,0,0,32,>
<56,65,0,35,56,>
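Since the question asked for Python, the same split carries over with the re module on bytes; a sketch assuming the block is already in memory (the function name is mine):

```python
import re

def split_on_nul_runs(data):
    """Drop a leading run of 5+ NULs, then split on any run of 5+ NULs,
    mirroring the Perl s/// and split above."""
    data = re.sub(rb'^\x00{5,}', b'', data)
    # Filter out empty pieces, e.g. from a trailing NUL run.
    return [part for part in re.split(rb'\x00{5,}', data) if part]
```

Note the rb'...' patterns: both pattern and subject must be bytes, since this is binary data, not text.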
How can I split a binary file into chunks with certain size with batch script without external software?
Some time ago I wrote a Batch-JScript hybrid script called BinToBat.bat with this purpose. This is its help screen:
Create an installer Batch program for data files of any type

BINTOBAT [/T:.ext1.ext2...] [/L:lineSize] [/F[:fileSize]] filename ...

  /T:.ext1.ext2  Specify the extensions of text-type files that will not be
                 encoded as hexadecimal digits, but preserved as text.
  /L:lineSize    Specify the size of output lines (default: 78).
  /F[:fileSize]  The /F switch specifies generating a Full installer file.
                 The optional fileSize specifies the maximum output file size.

BinToBat encodes the given data files as hexadecimal digits (or preserves them
as compressed text) and inserts them into the InstallFiles.bat program; when
this program runs, it regenerates the original data files.

You may rename the InstallFiles.bat program as you wish, but preserving the
"Install" prefix is suggested.

You may use wild-cards in the filename list.

If the /F switch is not given, a Partial installer is created:
- You may insert a short description for each file.
- You may insert divisions in the file listing via a dash in the parameters.
- The installer allows you to select which files will be downloaded and asks
  before overwriting existing files.

If the /F switch is given, a Full installer is created:
- The installer always downloads all files.
- You may specify commands that will be executed after the files are copied.
- You may specify the maximum size of the output file via /F:fileSize; in
  this case the output file will be divided into parts with a numeric postfix.

If you use the /F switch you can NOT rename the InstallFiles??.bat files; the
first one is the installer and the rest just contain data.
You may download BinToBat.bat program from this site.
Get a part of a binary file using gnu-coreutils, bash
If you work with binary, I advise you to use the dd command:
dd if=trunk1.gz bs=1 skip=480161397 count=9051 of=output.bin
bs is the block size and is set here to 1 byte.