What Is the Fastest Way to Create a Checksum for Large Files in C#

The problem here is that SHA256Managed reads 4096 bytes at a time (inherit from FileStream and override Read(byte[], int, int) to see how much it reads from the FileStream), which is too small a buffer for disk I/O.

To speed things up (2 minutes for hashing a 2 GB file on my machine with SHA256, 1 minute for MD5), wrap the FileStream in a BufferedStream and set a reasonably sized buffer (I tried with a ~1 MB buffer):

// Not sure if BufferedStream should be wrapped in using block
using (var stream = new BufferedStream(File.OpenRead(filePath), 1200000))
{
    // The rest remains the same
}
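
For completeness, the whole pattern might look like this (a minimal sketch; the 1 MB buffer mirrors the value tried above and is not a measured optimum):

using System.IO;
using System.Security.Cryptography;

static byte[] ComputeSha256(string filePath)
{
    using (var sha256 = SHA256.Create())
    using (var stream = new BufferedStream(File.OpenRead(filePath), 1024 * 1024))
    {
        // ComputeHash still reads in small chunks, but BufferedStream
        // serves those reads from its large in-memory buffer.
        return sha256.ComputeHash(stream);
    }
}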

How to get a checksum that represents a collection of files?

Hashing hashes is not optimal. However, if you don't want to hash all the files together, you can simply write your individual hashes to a memory stream and hash that.

Disregarding any other problems, conceptual or otherwise:

public static byte[] Hash(IEnumerable<byte[]> source)
{
    using var hash = SHA256.Create();
    var ms = new MemoryStream();
    foreach (var bytes in source)
        ms.Write(bytes, 0, bytes.Length);
    ms.Seek(0, SeekOrigin.Begin);
    return hash.ComputeHash(ms);
}

Note: I am not professing that this is the best solution; it's just a solution to your immediate problem.

A slightly less allocatey approach:

public static byte[] Hash(IList<byte[]> source)
{
    using var hash = SHA256.Create();
    var ms = new MemoryStream(source.Sum(x => x.Length));
    foreach (var bytes in source)
        ms.Write(bytes, 0, bytes.Length);
    ms.Seek(0, SeekOrigin.Begin);
    return hash.ComputeHash(ms);
}

For a multi-file hash (untested):

public static byte[] Hash(IEnumerable<string> source)
{
    using var hash = SHA256.Create();
    hash.Initialize();

    // Adjust to what is fastest for you; for an HDD 4k to 10k might be appropriate,
    // for an SSD larger will likely help.
    // Probably best to keep it under 80k so it doesn't end up on the LOH (up to you).
    const int bufferSize = 1024 * 50;

    var buffer = new byte[bufferSize];
    foreach (var file in source)
    {
        using var fs = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.Delete, bufferSize, FileOptions.SequentialScan);
        var bytesRead = 0;
        while ((bytesRead = fs.Read(buffer, 0, bufferSize)) != 0)
            hash.TransformBlock(buffer, 0, bytesRead, buffer, 0);
    }

    // Finalize once, after all files have been fed into the hash.
    hash.TransformFinalBlock(buffer, 0, 0);
    return hash.Hash;
}
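
Usage could be as simple as the following (the folder path is hypothetical; the order produced by EnumerateFiles determines the resulting hash):

var combined = Hash(Directory.EnumerateFiles(@"C:\SomeFolder"));
Console.WriteLine(BitConverter.ToString(combined).Replace("-", ""));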

Computing MD5SUM of large files in C#

I suggest using the alternate method:

MD5CryptoServiceProvider.ComputeHash(Stream)

and just pass in an input stream opened on your file. This method will almost certainly not read the whole file into memory in one go.

I would also note that in most implementations of MD5 it's possible to add byte[] data into the digest function a chunk at a time, and then ask for the hash at the end.
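
In .NET that incremental pattern is exposed through TransformBlock and TransformFinalBlock; a rough sketch (the method name and chunk size are illustrative):

using System;
using System.IO;
using System.Security.Cryptography;

static byte[] Md5InChunks(string path, int chunkSize = 64 * 1024)
{
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(path))
    {
        var buffer = new byte[chunkSize];
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Feed each chunk into the digest as it is read.
            md5.TransformBlock(buffer, 0, read, null, 0);
        }
        // Finalize with an empty block, then ask for the accumulated hash.
        md5.TransformFinalBlock(Array.Empty<byte>(), 0, 0);
        return md5.Hash;
    }
}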

Calculate MD5 checksum for a file

It's very simple using System.Security.Cryptography.MD5:

using (var md5 = MD5.Create())
{
    using (var stream = File.OpenRead(filename))
    {
        return md5.ComputeHash(stream);
    }
}

(I believe that actually the MD5 implementation used doesn't need to be disposed, but I'd probably still do so anyway.)

How you compare the results afterwards is up to you; you can convert the byte array to base64 for example, or compare the bytes directly. (Just be aware that arrays don't override Equals. Using base64 is simpler to get right, but slightly less efficient if you're really only interested in comparing the hashes.)
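
For instance, to compare two hashes you could use SequenceEqual (element by element) or Base64 strings; a small sketch:

using System;
using System.Linq;

static bool HashesMatch(byte[] first, byte[] second)
{
    // Arrays compare by reference with == / Equals, so compare the contents.
    return first.SequenceEqual(second);
}

// Or store/compare them as Base64 strings instead:
static string ToBase64(byte[] hash) => Convert.ToBase64String(hash);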

If you need to represent the hash as a string, you could convert it to hex using BitConverter:

static string CalculateMD5(string filename)
{
    using (var md5 = MD5.Create())
    {
        using (var stream = File.OpenRead(filename))
        {
            var hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}

Faster MD5 alternative?

I hope you're checking for an MD5 match only if the file size already matches.

Another optimization is to do a quick checksum of the first 1K (or some other arbitrary, but reasonably small number) and make sure those match before working the whole file.

Of course, all this assumes that you're just looking for a match/nomatch decision for a particular file.
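
Put together, those checks might be ordered like this (a sketch with hypothetical helper names; the 1 KB prefix size is arbitrary, as noted above):

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static bool ProbablySameFile(string pathA, string pathB)
{
    // 1. Cheapest check first: different sizes can never match.
    if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
        return false;

    // 2. Compare a small prefix before paying for the whole file.
    const int prefixSize = 1024;
    if (!ReadPrefix(pathA, prefixSize).SequenceEqual(ReadPrefix(pathB, prefixSize)))
        return false;

    // 3. Only now compute the full MD5 of both files.
    using (var md5 = MD5.Create())
    {
        byte[] hashA, hashB;
        using (var a = File.OpenRead(pathA)) hashA = md5.ComputeHash(a);
        using (var b = File.OpenRead(pathB)) hashB = md5.ComputeHash(b);
        return hashA.SequenceEqual(hashB);
    }
}

static byte[] ReadPrefix(string path, int count)
{
    using (var stream = File.OpenRead(path))
    {
        var buffer = new byte[count];
        var read = stream.Read(buffer, 0, count);
        Array.Resize(ref buffer, read);
        return buffer;
    }
}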

How do I optimize calculating the hash of thousands of files?

Create a pipeline of work. The easiest way I know to build a pipeline that combines parts that must be single-threaded with parts that can be multi-threaded is to use TPL Dataflow.

using System;
using System.IO;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class Example
{
    private class Dto
    {
        public Dto(string filePath, byte[] data)
        {
            FilePath = filePath;
            Data = data;
        }

        public string FilePath { get; }
        public byte[] Data { get; }
    }

    public static async Task ProcessFiles(string path)
    {
        var getFilesBlock = new TransformBlock<string, Dto>(filePath => new Dto(filePath, File.ReadAllBytes(filePath))); //Only lets one thread do this at a time.

        var hashFilesBlock = new TransformBlock<Dto, Dto>(dto => HashFile(dto),
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = Environment.ProcessorCount, //We can multi-thread this part.
                BoundedCapacity = 50 //Only allow 50 byte[]'s to be waiting in the queue. It will unblock getFilesBlock once there is room.
            });

        var writeToDatabaseBlock = new ActionBlock<Dto>(WriteToDatabase,
            new ExecutionDataflowBlockOptions { BoundedCapacity = 50 }); //MaxDegreeOfParallelism defaults to 1 so we don't need to specify it.

        //Link the blocks together.
        getFilesBlock.LinkTo(hashFilesBlock, new DataflowLinkOptions { PropagateCompletion = true });
        hashFilesBlock.LinkTo(writeToDatabaseBlock, new DataflowLinkOptions { PropagateCompletion = true });

        //Queue the work for the first block.
        foreach (var filePath in Directory.EnumerateFiles(path))
        {
            await getFilesBlock.SendAsync(filePath).ConfigureAwait(false);
        }

        //Tell the first block we are done adding files.
        getFilesBlock.Complete();

        //Wait for the last block to finish processing its last item.
        await writeToDatabaseBlock.Completion.ConfigureAwait(false);
    }

    private static Dto HashFile(Dto dto)
    {
        using (var md5 = System.Security.Cryptography.MD5.Create())
        {
            return new Dto(dto.FilePath, md5.ComputeHash(dto.Data));
        }
    }

    private static async Task WriteToDatabase(Dto arg)
    {
        //Write to the database here.
    }
}

This creates a pipeline with 3 segments.

One segment is single-threaded: it reads the files from the hard drive into memory and stores each one as a byte[].

A second segment can use up to Environment.ProcessorCount threads to hash the files. It only allows 50 items to sit in its inbound queue; when the first block tries to add more, it stops processing new items until the next block is ready to accept them.

A third segment is single-threaded and writes the data to the database; it also allows only 50 items in its inbound queue at a time.

Because of the two limits of 50, there will be at most about 100 byte[] arrays in memory (50 in the hashFilesBlock queue and 50 in the writeToDatabaseBlock queue; items currently being processed count toward the BoundedCapacity limit).
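
Kicking the pipeline off is then just a matter of awaiting ProcessFiles from an async context (the folder path here is hypothetical):

await Example.ProcessFiles(@"C:\FilesToHash");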


Update: for fun, I wrote a version that reports progress too; it's untested, though, and uses C# 7 features.

using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class Example
{
    private class Dto
    {
        public Dto(string filePath, byte[] data)
        {
            FilePath = filePath;
            Data = data;
        }

        public string FilePath { get; }
        public byte[] Data { get; }
    }

    public static async Task ProcessFiles(string path, IProgress<ProgressReport> progress)
    {
        int totalFilesFound = 0;
        int totalFilesRead = 0;
        int totalFilesHashed = 0;
        int totalFilesUploaded = 0;

        DateTime lastReported = DateTime.UtcNow;

        void ReportProgress()
        {
            if (DateTime.UtcNow - lastReported < TimeSpan.FromSeconds(1)) //Try to fire only once a second, but this code is not perfect so you may get a few rapid fire.
            {
                return;
            }
            lastReported = DateTime.UtcNow;
            var report = new ProgressReport(totalFilesFound, totalFilesRead, totalFilesHashed, totalFilesUploaded);
            progress.Report(report);
        }

        var getFilesBlock = new TransformBlock<string, Dto>(filePath =>
        {
            var dto = new Dto(filePath, File.ReadAllBytes(filePath));
            totalFilesRead++; //safe because single threaded.
            return dto;
        });

        var hashFilesBlock = new TransformBlock<Dto, Dto>(inDto =>
        {
            using (var md5 = System.Security.Cryptography.MD5.Create())
            {
                var outDto = new Dto(inDto.FilePath, md5.ComputeHash(inDto.Data));
                Interlocked.Increment(ref totalFilesHashed); //Need the interlocked due to multithreaded.
                ReportProgress();
                return outDto;
            }
        },
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount, BoundedCapacity = 50 });

        var writeToDatabaseBlock = new ActionBlock<Dto>(arg =>
        {
            //Write to database here.
            totalFilesUploaded++;
            ReportProgress();
        },
        new ExecutionDataflowBlockOptions { BoundedCapacity = 50 });

        getFilesBlock.LinkTo(hashFilesBlock, new DataflowLinkOptions { PropagateCompletion = true });
        hashFilesBlock.LinkTo(writeToDatabaseBlock, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var filePath in Directory.EnumerateFiles(path))
        {
            await getFilesBlock.SendAsync(filePath).ConfigureAwait(false);
            totalFilesFound++;
            ReportProgress();
        }

        getFilesBlock.Complete();

        await writeToDatabaseBlock.Completion.ConfigureAwait(false);
        ReportProgress();
    }
}

public class ProgressReport
{
    public ProgressReport(int totalFilesFound, int totalFilesRead, int totalFilesHashed, int totalFilesUploaded)
    {
        TotalFilesFound = totalFilesFound;
        TotalFilesRead = totalFilesRead;
        TotalFilesHashed = totalFilesHashed;
        TotalFilesUploaded = totalFilesUploaded;
    }

    public int TotalFilesFound { get; }
    public int TotalFilesRead { get; }
    public int TotalFilesHashed { get; }
    public int TotalFilesUploaded { get; }
}
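
This version could be wired up with the built-in Progress<T>, for example (untested, like the code above; the folder path is hypothetical):

var progress = new Progress<ProgressReport>(report =>
    Console.WriteLine($"found {report.TotalFilesFound}, read {report.TotalFilesRead}, " +
                      $"hashed {report.TotalFilesHashed}, uploaded {report.TotalFilesUploaded}"));

await Example.ProcessFiles(@"C:\FilesToHash", progress);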

Performance issues while creating file checksums

Well, the accepted answer is not valid, because there are, of course, ways to improve your code's performance. (It is valid for some of its other points, however.)

The main bottleneck here, apart from disk I/O, is memory allocation. Here are some thoughts that should improve speed:

  • Do not read the entire file into memory for the calculation; it is slow, and it produces a lot of memory pressure via LOH objects. Instead, open the file as a stream and calculate the hash in chunks.
  • The reason you see a slowdown when using the ComputeHash stream override is that internally it uses a very small buffer (4 KB), so choose an appropriate buffer size (256 KB or more; the optimal value is found by experimenting).
  • Use the TransformBlock and TransformFinalBlock methods to calculate the hash value; you can pass null for the outputBuffer parameter (see the sketch after this list).
  • Reuse that buffer for subsequent files' hash calculations, so there is no need for additional allocations.
  • Additionally, you can reuse the MD5CryptoServiceProvider instance, but the benefits are questionable.
  • Lastly, you can apply an async pattern for reading chunks from the stream, so the OS reads the next chunk from disk while you are calculating the partial hash of the previous chunk. Such code is more difficult to write, and you'll need at least two buffers (reuse them as well), but it can have a great impact on speed.
  • As a minor improvement, do not check for file existence. I believe your function is called from some enumeration, and there is very little chance that the file has been deleted in the meantime.
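
Combining the buffer-size, TransformBlock/TransformFinalBlock, and buffer-reuse points above, a sketch could look like the following; the 256 KB buffer and the ChunkedMd5 class name are illustrative choices, not measured optima:

using System;
using System.IO;
using System.Security.Cryptography;

public sealed class ChunkedMd5 : IDisposable
{
    // 256 KB buffer, allocated once and reused for every file.
    private readonly byte[] _buffer = new byte[256 * 1024];
    private readonly MD5 _md5 = MD5.Create();

    public byte[] HashFile(string path)
    {
        _md5.Initialize(); // reset any state left over from the previous file
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, _buffer.Length, FileOptions.SequentialScan))
        {
            int read;
            while ((read = stream.Read(_buffer, 0, _buffer.Length)) > 0)
                _md5.TransformBlock(_buffer, 0, read, null, 0); // null output buffer, as noted above
            _md5.TransformFinalBlock(_buffer, 0, 0);
        }
        return _md5.Hash;
    }

    public void Dispose() => _md5.Dispose();
}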

All of the above is valid for medium to large files. If, instead, you have a lot of very small files, you can speed up the calculation by processing files in parallel. Parallelization can also help with large files, but that is something to be measured.

And finally, if collisions don't bother you too much, you can choose a less expensive hash algorithm, CRC, for example.


