How to Extract a Single Chunk of Bytes from Within a File

How do I extract a single chunk of bytes from within a file?

Try dd:

dd skip=102567 count=253 if=input.binary of=output.binary bs=1

The option bs=1 sets the block size, making dd read and write one byte at a time. The default block size is 512 bytes.

The value of bs also affects the behavior of skip and count, since both are measured in blocks of bs bytes: dd skips skip blocks of input, then copies count blocks. For example, with the default bs=512, skip=200 would skip 102400 bytes rather than 200.

How to extract specific bytes from a file using Unix

Use dd:

dd bs=1 skip=60 count=12 if=file.bin of=output

You can wrap this in a shell loop to substitute different offsets and lengths.

You could also consider using awk, Perl, or Python if there are a lot of extractions to do or they need to be really fast.
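
For instance, here is a minimal Python sketch of the same byte-range copy; the script name and argument order are just illustrative:

#!/usr/bin/python

import sys

# Usage (illustrative): extract.py infile outfile offset length
infile, outfile = sys.argv[1], sys.argv[2]
offset, length = int(sys.argv[3]), int(sys.argv[4])

with open(infile, "rb") as src, open(outfile, "wb") as dst:
    src.seek(offset)              # jump straight to the first wanted byte
    dst.write(src.read(length))   # copy exactly `length` bytes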

Extract specific bytes from a binary file in Python

I would definitely try mmap():

https://docs.python.org/2/library/mmap.html

You're reading a lot of small chunks, which incurs a lot of system-call overhead if you're calling seek() and read() for every int16 you extract.

I've written a small test to demonstrate:

#!/usr/bin/python

import mmap
import os
import struct
import sys

FILE = "/opt/tmp/random" # dd if=/dev/random of=/tmp/random bs=1024k count=1024
SIZE = os.stat(FILE).st_size
BYTES = 2
SKIP = 10

def byfile():
    sum = 0
    with open(FILE, "r") as fd:
        for offset in range(0, SIZE/BYTES, SKIP*BYTES):
            fd.seek(offset)
            data = fd.read(BYTES)
            sum += struct.unpack('h', data)[0]
    return sum

def bymmap():
    sum = 0
    with open(FILE, "r") as fd:
        mm = mmap.mmap(fd.fileno(), 0, prot=mmap.PROT_READ)
        for offset in range(0, SIZE/BYTES, SKIP*BYTES):
            data = mm[offset:offset+BYTES]
            sum += struct.unpack('h', data)[0]
    return sum

if sys.argv[1] == 'mmap':
    print bymmap()

if sys.argv[1] == 'file':
    print byfile()

I ran each method twice to compensate for caching. I used time because I wanted to measure user and sys time.

Here are the results:

[centos7:/tmp]$ time ./test.py file
-211990391

real 0m44.656s
user 0m35.978s
sys 0m8.697s
[centos7:/tmp]$ time ./test.py file
-211990391

real 0m43.091s
user 0m37.571s
sys 0m5.539s
[centos7:/tmp]$ time ./test.py mmap
-211990391

real 0m16.712s
user 0m15.495s
sys 0m1.227s
[centos7:/tmp]$ time ./test.py mmap
-211990391

real 0m16.942s
user 0m15.846s
sys 0m1.104s
[centos7:/tmp]$

(The sum -211990391 just validates that both versions do the same thing.)

Looking at each version's second run, mmap() takes about a third of the real time, about half the user time, and about a fifth of the system time.

Your other options for perhaps speeding this up are:

(1) As you mentioned, load the whole file. Large I/Os instead of small I/Os could speed things up. If you exceed system memory, though, you'll fall back to paging, which will be worse than mmap() (since you have to page out). I'm not super hopeful here because mmap is already using larger I/Os. (See the sketch after this list.)

(2) Concurrency. Maybe reading the file in parallel through multiple threads could speed things up, but you'll have the Python GIL to deal with. Multiprocessing will work better by avoiding the GIL, and you could easily pass your data back to a top level handler. This will, however, work against the next item, locality: You might make your I/O more random.

(3) Locality. Somehow organize your data (or order your reads) so that your data is closer together. mmap() pages the file in chunks according to the system pagesize:

>>> import mmap
>>> mmap.PAGESIZE
4096
>>> mmap.ALLOCATIONGRANULARITY
4096
>>>

If your data is closer together (within the 4k chunk), it will already have been loaded into the buffer cache.

(4) Better hardware. Like an SSD.
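
For illustration, here is a minimal, unbenchmarked sketch of option (1), reusing the FILE, SIZE, BYTES, and SKIP constants from the test script above (the function name bywholefile is just illustrative):

def bywholefile():
    sum = 0
    with open(FILE, "r") as fd:
        data = fd.read()    # one large read instead of many small ones
    # same loop bounds as the test above, for comparability
    for offset in range(0, SIZE/BYTES, SKIP*BYTES):
        sum += struct.unpack('h', data[offset:offset+BYTES])[0]
    return sum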

Regarding (4): I did run this on an SSD and it was much faster. I also profiled the Python code, wondering if the unpack was expensive. It's not:

$ python -m cProfile test.py mmap                                                                                                                        
121679286
26843553 function calls in 8.369 seconds

Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    6.204    6.204    8.357    8.357 test.py:24(bymmap)
        1    0.012    0.012    8.369    8.369 test.py:3(<module>)
 26843546    1.700    0.000    1.700    0.000 {_struct.unpack}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'fileno' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {open}
        1    0.000    0.000    0.000    0.000 {posix.stat}
        1    0.453    0.453    0.453    0.453 {range}

Addendum:

Curiosity got the best of me, so I tried out multiprocessing. I need to look more closely at my partitioning, but the total number of unpacks (53687092) is the same across trials:

$ time ./test2.py 4
[(4415068.0, 13421773), (-145566705.0, 13421773), (14296671.0, 13421773), (109804332.0, 13421773)]
(-17050634.0, 53687092)

real 0m5.629s
user 0m17.756s
sys 0m0.066s
$ time ./test2.py 1
[(264140374.0, 53687092)]
(264140374.0, 53687092)

real 0m13.246s
user 0m13.175s
sys 0m0.060s

Code:

#!/usr/bin/python

import functools
import multiprocessing
import mmap
import os
import struct
import sys

FILE = "/tmp/random" # dd if=/dev/random of=/tmp/random bs=1024k count=1024
SIZE = os.stat(FILE).st_size
BYTES = 2
SKIP = 10

def bymmap(poolsize, n):
    partition = SIZE/poolsize
    initial = n * partition
    end = initial + partition
    sum = 0.0
    unpacks = 0
    with open(FILE, "r") as fd:
        mm = mmap.mmap(fd.fileno(), 0, prot=mmap.PROT_READ)
        for offset in xrange(initial, end, SKIP*BYTES):
            data = mm[offset:offset+BYTES]
            sum += struct.unpack('h', data)[0]
            unpacks += 1
    return (sum, unpacks)

poolsize = int(sys.argv[1])
pool = multiprocessing.Pool(poolsize)
results = pool.map(functools.partial(bymmap, poolsize), range(0, poolsize))
print results
print reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), results)

How to grab an arbitrary chunk from a file on Unix/Linux

Yes, it's awkward to do this with dd today. We're considering adding skip_bytes and count_bytes parameters to dd in coreutils to help. The following should work, though: it positions the input with a zero-count dd, copies as many whole blocks as fit, then copies the remainder:

#!/bin/sh

bs=100000
infile=$1
skip=$2
length=$3

(
  dd bs=1 skip="$skip" count=0          # advance the shared input offset by $skip bytes
  dd bs="$bs" count=$(($length / $bs))  # copy the whole blocks
  if [ $(($length % $bs)) -ne 0 ]; then
    dd bs=$(($length % $bs)) count=1    # copy the remaining partial block (dd rejects bs=0)
  fi
) < "$infile"

How do I extract bytes with offsets from a huge block efficiently in Python?

Make it mutable and delete the unwanted slice?

>>> tmp = bytearray(block)
>>> del tmp[3::4]
>>> bytes(tmp)
b'01245689A'

If your chunks are large and you want to remove almost all of the bytes, it might become faster to instead collect what you do want, as your version does. But your version potentially takes quadratic time (repeated bytes concatenation); better to use join:

>>> b''.join([block[i : i+3] for i in range(0, len(block), 4)])
b'01245689A'

(Btw according to PEP 8 it should be block[i : i+3], not block[i:i + 3], and for good reason.)

That builds a lot of small objects, though, which could be a memory problem. For your stated case it's much faster than your version, but much slower than my bytearray one.

Benchmark with block = b'0123456789AB' * 100_000 (much smaller than the 1 GB you mentioned in the comments):

    0.00 ms     0.00 ms     0.00 ms  baseline
15267.60 ms 14724.33 ms 14712.70 ms  original
    2.46 ms     2.46 ms     3.45 ms  Kelly_Bundy_bytearray
   83.66 ms    85.27 ms   122.88 ms  Kelly_Bundy_join

Benchmark code:

import timeit

def baseline(block):
    pass

def original(block):
    result = b''
    for i in range(0, len(block), 4):
        result += block[i:i + 3]
    return result

def Kelly_Bundy_bytearray(block):
    tmp = bytearray(block)
    del tmp[3::4]
    return bytes(tmp)

def Kelly_Bundy_join(block):
    return b''.join([block[i : i+3] for i in range(0, len(block), 4)])

funcs = [
    baseline,
    original,
    Kelly_Bundy_bytearray,
    Kelly_Bundy_join,
]

block = b'0123456789AB' * 100_000
args = block,
number = 10**0

expect = original(*args)
for func in funcs:
    print(func(*args) == expect, func.__name__)
print()

tss = [[] for _ in funcs]
for _ in range(3):
    for func, ts in zip(funcs, tss):
        t = min(timeit.repeat(lambda: func(*args), number=number)) / number
        ts.append(t)
        print(*('%8.2f ms ' % (1e3 * t) for t in ts), func.__name__)
    print()

Python Regular Expression Extract Chunk of Data From Binary File

You could split on \x00{5,}, which matches five or more zero bytes. It's the delimiter you specified.

In Perl, it's something like this:

Perl test case

$strLangs =  "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56";

# Remove leading zero's (5 or more)
$strLangs =~ s/^\x00{5,}//;

# Split on 5 or more 0's
@Alllangs = split /\x00{5,}/, $strLangs;

# Print each language characters
foreach $lang (@Alllangs)
{
    print "<";
    for ( split //, $lang ) {
        printf( "%x,", ord($_) );
    }
    print ">\n";
}

Output >>

<ff,fe,fe,0,0,23,41,>
<41,49,57,0,0,0,0,32,41,49,57,0,0,0,0,32,>
<56,65,0,35,56,>
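
Since the question asks about Python, here is an equivalent sketch using re.split on the same test data; the variable names are just illustrative:

import re

str_langs = (
    b"\x00" * 10
    + b"\xff\xfe\xfe\x00\x00\x23\x41"
    + b"\x00" * 8
    + b"\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32"
    + b"\x00" * 15
    + b"\x56\x65\x00\x35\x56"
)

# Remove leading zeros (5 or more), then split on runs of 5 or more zeros.
str_langs = re.sub(rb'^\x00{5,}', b'', str_langs)
for lang in re.split(rb'\x00{5,}', str_langs):
    print('<' + ','.join('%x' % b for b in lang) + '>')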

How can I split a binary file into chunks with certain size with batch script without external software?

Some time ago I wrote a Batch-JScript hybrid script called BinToBat.bat for this purpose. This is its help screen:

Create an installer Batch program for data files of any type

BINTOBAT [/T:.ext1.ext2...] [/L:lineSize] [/F[:fileSize]] filename ...

/T:.ext1.ext2   Specify the extensions of text-type files that will not be
                encoded as hexadecimal digits, but preserved as text.
/L:lineSize     Specify the size of output lines (default: 78).
/F[:fileSize]   The /F switch specifies generating a Full installer file.
                The optional fileSize specifies the maximum output file size.

BinToBat encodes the given data files as hexadecimal digits (or preserves them
as compressed text) and inserts them into the InstallFiles.bat program; when
this program runs, it generates the original data files.

You may rename the InstallFiles.bat program as you wish, but preserving the
"Install" prefix is suggested.

You may use wild-cards in the filename list.

If the /F switch is not given, a Partial installer is created:
- You may insert a short description for each file.
- You may insert divisions in the file listing via a dash in the parameters.
- The installer allows you to select which files will be downloaded and asks
  before overwriting existing files.

If the /F switch is given, a Full installer is created:
- The installer always downloads all files.
- You may specify commands that will be executed after the files are copied.
- You may specify the maximum size of the output file via /F:fileSize; in
  this case the output file will be divided into parts with a numeric suffix.

If you use the /F switch you can NOT rename the InstallFiles??.bat files; the
first one is the installer and the rest just contain data.

You may download the BinToBat.bat program from this site.

Get a part of a binary file using gnu-coreutils, bash

If you're working with binary data, I advise using the dd command.

dd if=trunk1.gz bs=1 skip=480161397 count=9051 of=output.bin

bs is the block size; setting it to 1 byte makes skip and count operate in bytes.


