Get the MD5 hash of big files in Python

Break the file into 8192-byte chunks (or some other multiple of 64 bytes) and feed them to MD5 consecutively using update().

This takes advantage of the fact that MD5 processes its input in 64-byte blocks (8192 is 64×128). Since you're not reading the entire file into memory, this won't use much more than 8192 bytes of memory at a time.

In Python 3.8+ you can do:

import hashlib

with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes
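
If you're on a Python version before 3.8 (no := walrus operator), the same loop can be written with iter() and a sentinel value:

import hashlib

file_hash = hashlib.md5()
with open("your_filename.txt", "rb") as f:
    # iter() keeps calling f.read(8192) until it returns b"" at end of file
    for chunk in iter(lambda: f.read(8192), b""):
        file_hash.update(chunk)
print(file_hash.hexdigest())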

How to hash a big file without having to manually process chunks of data?

Are you sure you actually need to optimize this? I did some profiling, and on my computer there's not a lot to gain when the chunksize is not ridiculously small:

import os
import timeit

filename = "large.txt"
with open(filename, 'w') as f:
    f.write('x' * 100*1000*1000)  # create a 100 MB file

setup = '''
import hashlib

def md5(filename, chunksize):
    m = hashlib.md5()
    with open(filename, 'rb') as f:
        while chunk := f.read(chunksize):
            m.update(chunk)
    return m.hexdigest()
'''

for i in range(16):
    chunksize = 32 * 2**i
    print('chunksize:', chunksize)
    print(timeit.Timer(f'md5("{filename}", {chunksize})', setup=setup).repeat(2, 2))

os.remove(filename)

which prints:

chunksize: 32
[1.3256129720248282, 1.2988303459715098]
chunksize: 64
[0.7864588440279476, 0.7887071970035322]
chunksize: 128
[0.5426529520191252, 0.5496777250082232]
chunksize: 256
[0.43311091500800103, 0.43472746800398454]
chunksize: 512
[0.36928231100318953, 0.37598425400210544]
chunksize: 1024
[0.34912850096588954, 0.35173907200805843]
chunksize: 2048
[0.33507052797358483, 0.33372197503922507]
chunksize: 4096
[0.3222631579847075, 0.3201586640207097]
chunksize: 8192
[0.33291386102791876, 0.31049903703387827]
chunksize: 16384
[0.3095061599742621, 0.3061956529854797]
chunksize: 32768
[0.3073280190001242, 0.30928074003895745]
chunksize: 65536
[0.30916607001563534, 0.3033451830269769]
chunksize: 131072
[0.3083479679771699, 0.3039141249610111]
chunksize: 262144
[0.3087183449533768, 0.30319386802148074]
chunksize: 524288
[0.29915712698129937, 0.29429047100711614]
chunksize: 1048576
[0.2932401319849305, 0.28639856696827337]

This suggests that you can just choose a large, but not insane, chunk size, e.g. 1 MB.
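
And to address the question in the heading directly: on Python 3.11+ you can skip the manual chunking entirely and let the standard library do the buffered reading with hashlib.file_digest():

import hashlib

with open("large.txt", "rb") as f:
    # file_digest() performs the chunked read loop internally
    digest = hashlib.file_digest(f, "md5")
print(digest.hexdigest())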

Calculate MD5 and SHA1 simultaneously on large file

You already have the hardest part done. You just have to feed the chunks you read to another hasher:

import hashlib

def calculate_hashes(fname):
    hash_md5 = hashlib.md5()
    hash_sha1 = hashlib.sha1()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(2 ** 20), b""):
            # Each chunk is read once and fed to both hashers.
            hash_md5.update(chunk)
            hash_sha1.update(chunk)
    return hash_md5.hexdigest(), hash_sha1.hexdigest()
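
The same pattern extends to any number of digests. A minimal sketch using hashlib.new() (the algorithm names here are just examples):

import hashlib

def calculate_digests(fname, algorithms=("md5", "sha1", "sha256")):
    # One hasher per algorithm, all fed from a single pass over the file.
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(2 ** 20), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}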

Generating an MD5 checksum of a file

You can use hashlib.md5()

Note that sometimes you won't be able to fit the whole file in memory. In that case, you'll have to read chunks of 4096 bytes sequentially and feed them to the hash object's update() method:

import hashlib

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

Note: hash_md5.hexdigest() returns the hex string representation of the digest. If you just need the packed bytes, use return hash_md5.digest() instead, so you don't have to convert back.
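
To illustrate the difference, hashing a short byte string here instead of a file:

import hashlib

h = hashlib.md5(b"hello")
print(h.digest())     # 16 raw bytes: b']A@*\xbcK*v\xb9q\x9d\x91\x10\x17\xc5\x92'
print(h.hexdigest())  # the same digest as a 32-character hex string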

How to grab all files in a folder and get their MD5 hash in Python?

glob.glob returns a list of file names. Just iterate over the list with a for loop:

import glob
import hashlib

filenames = glob.glob("/root/PycharmProjects/untitled1/*.exe")

for filename in filenames:
    with open(filename, 'rb') as inputfile:
        data = inputfile.read()
        print(filename, hashlib.md5(data).hexdigest())

Notice that this can potentially exhaust your memory if you happen to have a large file in that directory, so it is better to read the file in smaller chunks (adapted here for 1 MiB blocks):

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(2 ** 20), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

for filename in filenames:
    print(filename, md5(filename))
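
If you also need files in subdirectories, plain glob patterns don't recurse. On Python 3.5+ you can combine a ** pattern with recursive=True (reusing the directory and the md5() helper from above):

import glob

# '**' with recursive=True matches .exe files at any depth below the directory
filenames = glob.glob("/root/PycharmProjects/untitled1/**/*.exe", recursive=True)
for filename in filenames:
    print(filename, md5(filename))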

Compare local and remote file MD5 hashes in Python

OK, it looks like I found a solution, so I will post it here :)

First you need to add an .htaccess file to the directory on your server where your files live.

Content of the .htaccess file:

ContentDigest On

Now that you have set this up, the server should send a Content-MD5 field in its HTTP response headers.

It will look something like this:

'Content-MD5': '7dVTxeHRktvI0Wh/7/4ZOQ=='
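
Note that Content-MD5 is the base64 encoding of the raw 16-byte digest, not of the hex string. If you ever need to compare it against a hex digest, you can decode it first (using the header value from the example above):

import base64

header_value = "7dVTxeHRktvI0Wh/7/4ZOQ=="
# Decode the base64 header back to the raw 16-byte digest, then render it as hex
print(base64.b64decode(header_value).hex())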

Now for the Python part: I modified my code so it can compare this HTTP header value with the local MD5 checksum.

# -*- coding: utf-8 -*-

import base64
import hashlib

import requests

def md5Checksum(filePath, url):
    if url is None:
        # Hash the local file in 8192-byte chunks.
        m = hashlib.md5()
        with open(filePath, 'rb') as fh:
            while True:
                data = fh.read(8192)
                if not data:
                    break
                m.update(data)
        # Encode the raw MD5 digest to base64 so it matches the header format.
        return base64.b64encode(m.digest()).decode('ascii')
    else:
        # Fetch only the HTTP headers from the server.
        r = requests.head(url)
        return r.headers['Content-MD5']  # take only the Content-MD5 string

def compare():
    local = md5Checksum("projectg502th.pak.zip", None)
    remote = md5Checksum(None, "http://127.0.0.1/md5/projectg502th.pak.zip")

    if local == remote:
        print("No need to download the file")
    else:
        print("Download the file")

print("checksum_local :", md5Checksum("projectg_ziinf.pak.zip", None))
print("checksum_remote : ", md5Checksum(None, "http://127.0.0.1/md5/projectg_ziinf.pak.zip"))

compare()

Output:

checksum_local : 7dVTxeHRktvI0Wh/7/4ZOQ==
checksum_remote :  7dVTxeHRktvI0Wh/7/4ZOQ==
No need to download the file

I hope this helps ;)


