Swift Calculate MD5 Checksum for Large Files

You can compute the MD5 checksum in chunks, as demonstrated, for example,
in "Is there a MD5 library that doesn't require the whole input at the same time?".

Here is a possible implementation using Swift (now updated for Swift 5):

import Foundation
import CommonCrypto

func md5File(url: URL) -> Data? {

    let bufferSize = 1024 * 1024

    do {
        // Open file for reading:
        let file = try FileHandle(forReadingFrom: url)
        defer {
            file.closeFile()
        }

        // Create and initialize MD5 context:
        var context = CC_MD5_CTX()
        CC_MD5_Init(&context)

        // Read up to `bufferSize` bytes, until EOF is reached, and update MD5 context:
        while autoreleasepool(invoking: {
            let data = file.readData(ofLength: bufferSize)
            if data.count > 0 {
                data.withUnsafeBytes {
                    _ = CC_MD5_Update(&context, $0.baseAddress, numericCast(data.count))
                }
                return true // Continue
            } else {
                return false // End of file
            }
        }) { }

        // Compute the MD5 digest:
        var digest: [UInt8] = Array(repeating: 0, count: Int(CC_MD5_DIGEST_LENGTH))
        _ = CC_MD5_Final(&digest, &context)

        return Data(digest)

    } catch {
        print("Cannot open file:", error.localizedDescription)
        return nil
    }
}

The autorelease pool is needed to release the memory returned by
file.readData(); without it, the entire (potentially huge) file
would be loaded into memory. Thanks to Abhi Beckert for noticing that
and providing an implementation.

If you need the digest as a hex-encoded string, change the
return type to String? and replace

return Data(digest)

with

let hexDigest = digest.map { String(format: "%02hhx", $0) }.joined()
return hexDigest
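
On platforms where CryptoKit is available (iOS 13 / macOS 10.15 and later), the same chunked approach can probably be written with Insecure.MD5 instead of the CommonCrypto calls. The following is only a sketch and not part of the original answer; the function name md5HexString(of:bufferSize:) is made up for illustration.

```swift
import Foundation
import CryptoKit

/// Hypothetical helper: streams the file at `url` through MD5 in chunks and
/// returns the digest as a lowercase hex string (nil if the file cannot be opened).
func md5HexString(of url: URL, bufferSize: Int = 1024 * 1024) -> String? {
    guard let file = try? FileHandle(forReadingFrom: url) else { return nil }
    defer { file.closeFile() }

    var hasher = Insecure.MD5()

    // Feed the hasher one chunk at a time; the pool frees each chunk promptly.
    while autoreleasepool(invoking: { () -> Bool in
        let chunk = file.readData(ofLength: bufferSize)
        guard !chunk.isEmpty else { return false } // End of file
        hasher.update(data: chunk)
        return true // Continue
    }) { }

    return hasher.finalize().map { String(format: "%02hhx", $0) }.joined()
}
```

The structure mirrors the CommonCrypto version: creating the hasher corresponds to CC_MD5_Init, update(data:) to CC_MD5_Update, and finalize() to CC_MD5_Final.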

Compute the MD5 checksum for a file while writing the file, in C

This is certainly possible to do. Essentially, you initialise an MD5 calculation before you start writing. Then, whenever you write some data to disk, also pass that to the MD5 update function. After writing all the data, you call a final MD5 function to compute the final digest.

If you don't have any MD5 code handy, RFC 1321 has an MD5 reference implementation included that provides the above operations.
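
The question asks about C, but the init / update / final pattern described above looks the same in any language. Here is a minimal sketch in Swift (the language used elsewhere on this page), assuming CryptoKit is available. The type HashingFileWriter and its methods are invented for this example.

```swift
import Foundation
import CryptoKit

/// Hypothetical wrapper: every chunk written to disk is also fed to the MD5 state.
struct HashingFileWriter {
    private let file: FileHandle
    private var hasher = Insecure.MD5()   // "init" step

    init(url: URL) throws {
        // Create (or truncate) the file, then open it for writing.
        guard FileManager.default.createFile(atPath: url.path, contents: nil) else {
            throw CocoaError(.fileWriteUnknown)
        }
        file = try FileHandle(forWritingTo: url)
    }

    /// "update" step: write the chunk and hash the same bytes.
    mutating func write(_ chunk: Data) {
        file.write(chunk)
        hasher.update(data: chunk)
    }

    /// "final" step: close the file and return the digest as a hex string.
    func finish() -> String {
        file.closeFile()
        return hasher.finalize().map { String(format: "%02hhx", $0) }.joined()
    }
}
```

Because the digest is updated with exactly the bytes passed to write(_:), the file never has to be read back to compute its checksum.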

Calculate md5 on a single 1T file, or on 100 10G files, which one is faster? Or the speed are the same?

As I was trying to say in the comments, it will depend on lots of things like the speed of your disk subsystem, your CPU performance and so on.

Here is an example. Create a 120GB file and check its size:

dd if=/dev/random of=junk bs=1g count=120

ls -lh junk
-rw-r--r--  1 mark  staff   120G  5 Oct 13:34 junk

Checksum in one go:

time md5sum junk
3c8fb0d5397be5a8b996239f1f5ce2f0  junk

real    3m55.713s       <--- 4 minutes
user    3m28.441s
sys     0m24.871s

Checksum in 10GB chunks, with 12 CPU cores in parallel:

time parallel -k --pipepart --recend '' --recstart '' --block 10G -a junk md5sum
29010b411a251ff467a325bfbb665b0d  -
793f02bb52407415b2bfb752827e3845  -
bf8f724d63f972251c2973c5bc73b68f  -
d227dcb00f981012527fdfe12b0a9e0e  -
5d16440053f78a56f6233b1a6849bb8a  -
dacb9fb1ef2b564e9f6373a4c2a90219  -
ba40d6e7d6a32e03fabb61bb0d21843a  -
5a5ee62d91266d9a02a37b59c3e2d581  -
95463c030b73c61d8d4f0e9c5be645de  -
4bcd7d43849b65d98d9619df27c37679  -
92bc1f80d35596191d915af907f4d951  -
44f3cb8a0196ce37c323e8c6215c7771  -

real    1m0.046s      <--- 1 minute
user    4m51.073s
sys     3m51.335s

It takes 1/4 of the time on my machine, but your mileage will vary... depending on your disk subsystem, your CPU etc.
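
For comparison, the same "one digest per block, blocks hashed in parallel" idea could be sketched in Swift roughly as follows. This is an illustration only: blockDigests(of:blockSize:bufferSize:) is a made-up name, CryptoKit is assumed to be available, and no timing claims are made for it. As with the parallel command above, the result is one digest per block; these cannot be combined into the MD5 of the whole file.

```swift
import Foundation
import CryptoKit

/// Hypothetical helper: hashes fixed-size blocks of one file concurrently and
/// returns one hex digest per block, in file order.
func blockDigests(of url: URL, blockSize: Int, bufferSize: Int = 1024 * 1024) throws -> [String] {
    precondition(blockSize > 0 && bufferSize > 0)
    let attributes = try FileManager.default.attributesOfItem(atPath: url.path)
    let fileSize = (attributes[.size] as? NSNumber)?.intValue ?? 0
    let blockCount = (fileSize + blockSize - 1) / blockSize

    var digests = [String](repeating: "", count: blockCount)
    let lock = NSLock()

    DispatchQueue.concurrentPerform(iterations: blockCount) { index in
        // Each worker opens its own handle so that seeks do not interfere.
        guard let file = try? FileHandle(forReadingFrom: url) else { return }
        defer { file.closeFile() }
        file.seek(toFileOffset: UInt64(index * blockSize))

        var hasher = Insecure.MD5()
        var remaining = min(blockSize, fileSize - index * blockSize)
        while remaining > 0 {
            // Read and hash one buffer; the pool releases it before the next read.
            let read = autoreleasepool { () -> Int in
                let chunk = file.readData(ofLength: min(bufferSize, remaining))
                hasher.update(data: chunk)
                return chunk.count
            }
            if read == 0 { break } // Unexpected end of file
            remaining -= read
        }
        let hex = hasher.finalize().map { String(format: "%02hhx", $0) }.joined()

        // Store the result under a lock, since workers run on several threads.
        lock.lock()
        digests[index] = hex
        lock.unlock()
    }
    return digests
}
```

Whether this is actually faster than a single sequential pass depends, as stated above, on the disk subsystem and the CPU.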

Optimize: Calculating MD5 hash of large number of files recursively under a root folder

The stream returned by Files.walk(Path.of(rootDir), depth) cannot be parallelized efficiently (it has no known size, so it is difficult to determine how to slice it for parallel processing).
To improve performance in your case, first collect the result of Files.walk(...) into a list.

So you have to do:

Files.walk(Path.of(rootDir), depth)
        .filter(path -> !Files.isDirectory(path)) // skip directories
        .collect(Collectors.toList())
        .stream()
        .parallel() // on my computer this divides the time needed by 5 (8-core CPU and SSD)
        .map(FileHash::getHash)
        .collect(Collectors.toList());
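
The same "collect first, then parallelize" idea can be sketched in Swift, for consistency with the rest of this page. This is not the Java answer's code: md5OfFiles(under:) is a hypothetical name, CryptoKit is assumed, the depth limit is not modeled, and memory-mapping the files is an assumption that suits read-only local files.

```swift
import Foundation
import CryptoKit

/// Hypothetical helper: collects the regular files under `root` first,
/// then hashes them concurrently. Returns a hex MD5 digest per file URL.
func md5OfFiles(under root: URL) -> [URL: String] {
    // Step 1: walk the tree sequentially and collect the file URLs.
    var files: [URL] = []
    if let walker = FileManager.default.enumerator(at: root,
                                                   includingPropertiesForKeys: [.isRegularFileKey]) {
        for case let url as URL in walker {
            let values = try? url.resourceValues(forKeys: [.isRegularFileKey])
            if values?.isRegularFile == true { // skip directories
                files.append(url)
            }
        }
    }

    // Step 2: the list has a known size, so the work can be split across cores.
    var results: [URL: String] = [:]
    let lock = NSLock()
    DispatchQueue.concurrentPerform(iterations: files.count) { index in
        let url = files[index]
        // Memory-map the file rather than copying it into RAM.
        guard let data = try? Data(contentsOf: url, options: .mappedIfSafe) else { return }
        let hex = Insecure.MD5.hash(data: data).map { String(format: "%02hhx", $0) }.joined()
        lock.lock()
        results[url] = hex
        lock.unlock()
    }
    return results
}
```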

MD5 of Data in Swift 3

    CC_MD5(data.bytes, CC_LONG(data.count), &digest)

As noted, bytes is unavailable because it's dangerous. It's a raw pointer into memory that can vanish. The recommended solution is to use withUnsafeBytes, which promises that the target cannot vanish during the scope of the pointer. From memory, it would look something like this:

data.withUnsafeBytes { bytes in
    CC_MD5(bytes, CC_LONG(data.count), &digest)
}

The point is that the bytes pointer can't escape into scopes where data is no longer valid.
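
In Swift 5 the closure receives an UnsafeRawBufferPointer rather than a typed pointer, so a complete function might look roughly like the sketch below. The function name md5(data:) is chosen for illustration. Note that CC_MD5 is marked deprecated in recent SDKs and that CC_LONG limits this to in-memory data below 4 GB.

```swift
import Foundation
import CommonCrypto

/// Sketch for Swift 5: MD5 of an in-memory Data value via CommonCrypto.
func md5(data: Data) -> Data {
    var digest = [UInt8](repeating: 0, count: Int(CC_MD5_DIGEST_LENGTH))
    data.withUnsafeBytes { (bytes: UnsafeRawBufferPointer) in
        // `bytes` is only valid inside this closure; it must not be stored or returned.
        _ = CC_MD5(bytes.baseAddress, CC_LONG(data.count), &digest)
    }
    return Data(digest)
}
```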

For an example of this with CCHmac, which is pretty similar to MD5, see RNCryptor.

Possible to calculate MD5 (or other) hash with buffered reads?

You use the TransformBlock and TransformFinalBlock methods to process the data in chunks.

// Init
MD5 md5 = MD5.Create();
int offset = 0;

// For each block:
offset += md5.TransformBlock(block, 0, block.Length, block, 0);

// For last block:
md5.TransformFinalBlock(block, 0, block.Length);

// Get the hash code
byte[] hash = md5.Hash;

Note: It works (at least with the MD5 provider) to send all blocks to TransformBlock and then send an empty block to TransformFinalBlock to finalise the process.

How can I hash a file on iOS using swift 3?

Create a cryptographic hash of each file and use that for uniqueness comparisons. SHA-256 is a current hash function, and on iOS with Common Crypto it is quite fast: on an iPhone 6S, SHA-256 processes about 1 GB/second, not counting I/O time. If you need fewer bytes, just truncate the hash.

An example using Common Crypto (Swift 3):

For hashing a string:

func sha256(string: String) -> Data {
    let messageData = string.data(using:String.Encoding.utf8)!
    var digestData = Data(count: Int(CC_SHA256_DIGEST_LENGTH))

    _ = digestData.withUnsafeMutableBytes {digestBytes in
        messageData.withUnsafeBytes {messageBytes in
            CC_SHA256(messageBytes, CC_LONG(messageData.count), digestBytes)
        }
    }
    return digestData
}
let testString = "testString"
let testHash = sha256(string:testString)
print("testHash: \(testHash.map { String(format: "%02hhx", $0) }.joined())")

let testHashBase64 = testHash.base64EncodedString()
print("testHashBase64: \(testHashBase64)")

Output:

testHash: 4acf0b39d9c4766709a3689f553ac01ab550545ffa4544dfc0b2cea82fba02a3

testHashBase64: Ss8LOdnEdmcJo2ifVTrAGrVQVF/6RUTfwLLOqC+6AqM=

Note: Add to your Bridging Header:

#import <CommonCrypto/CommonCrypto.h>

For hashing data:

func sha256(data: Data) -> Data {
    var digestData = Data(count: Int(CC_SHA256_DIGEST_LENGTH))

    _ = digestData.withUnsafeMutableBytes {digestBytes in
        data.withUnsafeBytes {messageBytes in
            CC_SHA256(messageBytes, CC_LONG(data.count), digestBytes)
        }
    }
    return digestData
}

let testData: Data = "testString".data(using: .utf8)!
print("testData: \(testData.map { String(format: "%02hhx", $0) }.joined())")
let testHash = sha256(data:testData)
print("testHash: \(testHash.map { String(format: "%02hhx", $0) }.joined())")

Output:

testData: 74657374537472696e67

testHash: 4acf0b39d9c4766709a3689f553ac01ab550545ffa4544dfc0b2cea82fba02a3
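
The examples above hash a string and a Data value. Since the question is about files, here is a hedged sketch that reads a file through InputStream and optionally truncates the digest, as suggested earlier. It assumes CryptoKit (iOS 13 and later) rather than Common Crypto, and sha256File(at:truncatedTo:) is a made-up name.

```swift
import Foundation
import CryptoKit

/// Hypothetical helper: SHA-256 of a file read through InputStream,
/// optionally truncated to the first `byteCount` bytes of the digest.
func sha256File(at url: URL, truncatedTo byteCount: Int = SHA256Digest.byteCount) -> Data? {
    guard let stream = InputStream(url: url) else { return nil }
    stream.open()
    defer { stream.close() }

    let bufferSize = 64 * 1024
    var buffer = [UInt8](repeating: 0, count: bufferSize)
    var hasher = SHA256()

    while true {
        let read = stream.read(&buffer, maxLength: bufferSize)
        if read < 0 { return nil }   // Read error
        if read == 0 { break }       // End of file
        hasher.update(data: Data(buffer[0..<read]))
    }

    // Truncating simply keeps a prefix of the full digest.
    return Data(hasher.finalize().prefix(byteCount))
}
```

Keep in mind that a shorter prefix makes accidental collisions between different files more likely, so truncate only as far as the uniqueness requirement allows.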

Also see Martin's link.
