How to Read/Stream a File Without Loading the Entire File into Memory

How can I read/stream a file without loading the entire file into memory?

Here's an example of how to read a file in chunks of 1KB without loading the entire contents into memory:

const int chunkSize = 1024; // read the file in chunks of 1 KB
using (var file = File.OpenRead("foo.dat"))
{
    int bytesRead;
    var buffer = new byte[chunkSize];
    while ((bytesRead = file.Read(buffer, 0, buffer.Length)) > 0)
    {
        // TODO: Process only the first bytesRead bytes of the buffer,
        // not the entire buffer: the buffer is always 1 KB, but the
        // number of bytes actually read is stored in bytesRead.
    }
}
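For comparison, the same chunked-read pattern can be sketched in Python; the file name and the "count the bytes" processing step are placeholders for whatever work you actually need to do per chunk:

```python
CHUNK_SIZE = 1024  # read the file in 1 KB chunks

def process_in_chunks(path, chunk_size=CHUNK_SIZE):
    """Read a file chunk by chunk without loading it all into memory."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # returns at most chunk_size bytes
            if not chunk:
                break                   # empty read means end of file
            # process only len(chunk) bytes; the last chunk may be short
            total += len(chunk)
    return total
```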

Open a file without loading it into memory at once

Open a FileStream and use the Seek method to jump to the end of the file. From there, work backwards, reading one byte at a time; this reads the file in reverse order. Loop until you reach the beginning of the file:

read 1 byte
// do whatever you want with that byte...write to another file?
seek back 2 bytes

As for efficiency, you can read a buffer of, say, 1024 bytes into memory so that you don't issue a Read operation for every byte of the file. Once the buffer is filled, reverse it and you're good to go.
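A sketch of that buffered approach in Python, assuming the goal is to yield the file's bytes in reverse order (the 1024-byte buffer size is taken from the text above):

```python
def read_backwards(path, buf_size=1024):
    """Yield the bytes of a file in reverse order, reading buffered chunks
    instead of seeking back for every single byte."""
    with open(path, "rb") as f:
        f.seek(0, 2)                  # seek to the end of the file
        pos = f.tell()
        while pos > 0:
            read_size = min(buf_size, pos)
            pos -= read_size
            f.seek(pos)               # seek back one buffer's worth
            buf = f.read(read_size)
            yield from reversed(buf)  # emit this chunk's bytes in reverse
```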

Is it possible to read a file without loading it into memory?

I need the content to do a checksum, so I need the complete message

Many checksum libraries support incremental updates to the checksum. For example, GLib has g_checksum_update(). So you can read the file one block at a time with fread and update the checksum as you go.

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <stdlib.h>
#include <glib.h>

int main(void) {
    char filename[] = "test.txt";

    // Create a SHA256 checksum
    GChecksum *sum = g_checksum_new(G_CHECKSUM_SHA256);
    if( sum == NULL ) {
        fprintf(stderr, "Could not create checksum.\n");
        exit(1);
    }

    // Open the file we'll be checksumming.
    FILE *fp = fopen( filename, "rb" );
    if( fp == NULL ) {
        fprintf(stderr, "Could not open %s: %s.\n", filename, strerror(errno));
        exit(1);
    }

    // Read one buffer full at a time (BUFSIZ is from stdio.h)
    // and update the checksum.
    unsigned char buf[BUFSIZ];
    size_t size_read = 0;
    while( (size_read = fread(buf, 1, sizeof(buf), fp)) != 0 ) {
        // Update the checksum
        g_checksum_update(sum, buf, (gssize)size_read);
    }

    // Print the checksum.
    printf("%s %s\n", g_checksum_get_string(sum), filename);

    // Clean up.
    fclose(fp);
    g_checksum_free(sum);
    return 0;
}

And we can check it works by comparing the result with sha256sum.

$ ./test
0c46af5bce717d706cc44e8c60dde57dbc13ad8106a8e056122a39175e2caef8 test.txt
$ sha256sum test.txt
0c46af5bce717d706cc44e8c60dde57dbc13ad8106a8e056122a39175e2caef8 test.txt
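The same incremental pattern works in Python's hashlib, whose hash objects accept repeated update() calls, so the file never has to fit in memory; this is a sketch with an arbitrary 64 KB chunk size:

```python
import hashlib

def sha256_file(path, chunk_size=65536):
    """Compute a file's SHA-256 checksum without reading it all at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break        # end of file
            h.update(chunk)  # feed each chunk into the running checksum
    return h.hexdigest()
```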

Is there any faster way to search in huge file without loading it into memory?

I propose using a modified version of the Knuth-Morris-Pratt (KMP) algorithm.
algorithm kmp_search:
    input:
        a stream of characters, S (the text to be searched)
        an array of characters, W (the word sought)
    output:
        an integer (the zero-based position in S at which W is found)

    define variables:
        an integer, m ← 0 (the beginning of the current match in S)
        an integer, i ← 0 (the position of the current character in W)
        an array of integers, T (the table, computed elsewhere)

    while m + i < length(S) do
        if W[i] = S[m + i] then
            if i = length(W) - 1 then
                return m
            let i ← i + 1
        else
            if T[i] > -1 then
                let m ← m + i - T[i], i ← T[i]
            else
                let m ← m + 1, i ← 0

    (if we reach here, we have searched all of S unsuccessfully)
    return the length of S

The text string can be streamed in because the KMP algorithm does not backtrack in the text. (This is another improvement over the naive algorithm, which doesn't naturally support streaming.) If streaming, the amortized time to process an incoming character is Θ(1) but the worst-case time is Θ(min(m, n′)), where n′ is the number of text characters seen so far. (Source)
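That streaming behavior can be sketched in Python; this is a minimal illustration, where T is the standard KMP border table and the search reads one byte at a time, never seeking backwards in the stream:

```python
def build_table(pattern):
    """Border (failure) table T: T[i] is where to fall back on a mismatch at i."""
    t = [-1] * (len(pattern) + 1)
    j = -1
    for i in range(len(pattern)):
        while j >= 0 and pattern[i] != pattern[j]:
            j = t[j]
        j += 1
        t[i + 1] = j
    return t

def stream_search(pattern, stream):
    """Return the number of bytes consumed up to and including the first
    match of `pattern`, or -1 if the stream ends without a match.
    The stream is only ever read forwards, one byte at a time."""
    t = build_table(pattern)
    j = 0
    consumed = 0
    while True:
        b = stream.read(1)
        if not b:
            return -1          # stream exhausted, no match
        consumed += 1
        while j >= 0 and b[0] != pattern[j]:
            j = t[j]           # fall back in the pattern, not in the text
        j += 1
        if j == len(pattern):
            return consumed    # the stream now points just past the match
```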

A reference implementation in Java can be found here:

package com.twitter.elephantbird.util;

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/**
 * An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
 * For more on how the algorithm works, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
 */
public class StreamSearcher {

    protected byte[] pattern_;
    protected int[] borders_;

    // An upper bound on pattern length for searching. Results are undefined for longer patterns.
    public static final int MAX_PATTERN_LENGTH = 1024;

    public StreamSearcher(byte[] pattern) {
        setPattern(pattern);
    }

    /**
     * Sets a new pattern for this StreamSearcher to use.
     * @param pattern
     *            the pattern the StreamSearcher will look for in future calls to search(...)
     */
    public void setPattern(byte[] pattern) {
        pattern_ = Arrays.copyOf(pattern, pattern.length);
        borders_ = new int[pattern_.length + 1];
        preProcess();
    }

    /**
     * Searches for the next occurrence of the pattern in the stream, starting from the current stream
     * position. Note that the position of the stream is changed. If a match is found, the stream points
     * to the end of the match, i.e. the byte AFTER the pattern. Otherwise, the stream is entirely
     * consumed, because InputStream semantics make it difficult to implement any other reasonable
     * default, such as leaving the stream unchanged.
     *
     * @return bytes consumed if found, -1 otherwise.
     * @throws IOException
     */
    public long search(InputStream stream) throws IOException {
        long bytesRead = 0;

        int b;
        int j = 0;

        while ((b = stream.read()) != -1) {
            bytesRead++;

            while (j >= 0 && (byte)b != pattern_[j]) {
                j = borders_[j];
            }
            // Move to the next character in the pattern.
            ++j;

            // If we've matched up to the full pattern length, we found it. Return,
            // which will automatically save our position in the InputStream at the point immediately
            // following the pattern match.
            if (j == pattern_.length) {
                return bytesRead;
            }
        }

        // No match. Note that the stream is now completely consumed.
        return -1;
    }

    /**
     * Builds a table of the longest "borders" for each prefix of the pattern. This table is stored
     * internally and aids the implementation of the Knuth-Morris-Pratt string search.
     * <p>
     * For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
     */
    protected void preProcess() {
        int i = 0;
        int j = -1;
        borders_[i] = j;
        while (i < pattern_.length) {
            while (j >= 0 && pattern_[i] != pattern_[j]) {
                j = borders_[j];
            }
            borders_[++i] = ++j;
        }
    }
}

Similar question: Efficient way to search a stream for a string

Node.js: Stream a file without saving it to memory

You can use busboy at a slightly lower level and get access to its translated readstream. Here's an example from the busboy doc that can be adapted for your situation:

var http = require('http');
var path = require('path');
var os = require('os');
var fs = require('fs');
var Busboy = require('busboy');

http.createServer(function(req, res) {
    if (req.method === 'POST') {
        var busboy = new Busboy({ headers: req.headers });
        busboy.on('file', function(fieldname, file, filename, encoding, mimetype) {
            var saveTo = path.join(os.tmpdir(), path.basename(fieldname));
            file.pipe(fs.createWriteStream(saveTo));
        });
        busboy.on('finish', function() {
            res.writeHead(200, { 'Connection': 'close' });
            res.end("That's all folks!");
        });
        return req.pipe(busboy);
    }
    res.writeHead(404);
    res.end();
}).listen(8000, function() {
    console.log('Listening for requests');
});

The key part is this which I've annotated:

// create a new busboy instance on each incoming request that has files with it
var busboy = new Busboy({ headers: req.headers });

// register for the file event
busboy.on('file', function(fieldname, file, filename, encoding, mimetype) {
    // at this point the file argument is a readstream for the data of an uploaded file
    // you can do whatever you want with this readstream, such as
    // feed it directly to your anti-virus

    // this example code saves it to a tempfile;
    // you would replace this with code that sends the stream to your anti-virus
    var saveTo = path.join(os.tmpdir(), path.basename(fieldname));
    file.pipe(fs.createWriteStream(saveTo));
});

// this recognizes the end of the upload stream and sends
// whatever you want the final http response to be
busboy.on('finish', function() {
    res.writeHead(200, { 'Connection': 'close' });
    res.end("That's all folks!");
});

// this gets busboy started, feeding the incoming request to busboy
// so it can start reading and parsing it; it will eventually trigger
// one or more "file" events
return req.pipe(busboy);

When you've identified an incoming request on which you want to perform this custom busboy operation, you create a Busboy instance, pass it the headers, and register for the file event. The file event gives you a readstream of the uploaded file's data. You can then pipe that stream directly to your anti-virus without it ever touching the file system.

Is there a way to seek through a file without loading the whole thing into an array?

For this purpose you can use the each_line iterator, combined with with_index to get the line number of the current line (counting from 0):

File.open('myfile') do |file|
  file.each_line.with_index do |line, lineno|
    case lineno
    when 0
      # line 1
    when 21
      # line 22
    end
  end
end

By using open with a block, instead of new, you are guaranteed that the file is properly closed when the block finishes.


Update The with_index method accepts an optional argument to specify the starting index to use, so the code above could be better written like this:

file.each_line.with_index(1) do |line, lineno|
  case lineno
  when 1
    # line 1
  end
end
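The same lazy, line-at-a-time approach can be sketched in Python, where iterating over a file object yields one line at a time and enumerate supplies the 1-based line number (the function name and early-exit logic are just for illustration):

```python
def grab_lines(path, wanted):
    """Return {line_number: line} for the 1-based line numbers in `wanted`,
    reading the file one line at a time instead of loading it whole."""
    found = {}
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if lineno in wanted:
                found[lineno] = line.rstrip("\n")
            if len(found) == len(wanted):
                break  # stop early once every wanted line has been seen
    return found
```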

