Hashing a File in Python

Python 3.7 : Hashing a binary file

There are a couple of things you can tweak here.

You don't need to decode the bytes returned by .read(), because md5() is expecting bytes in the first place, not str:

>>> import hashlib
>>> h = hashlib.md5(open('dump.rdb', 'rb').read()).hexdigest()
>>> h
'9a7bf9d3fd725e8b26eee3c31025b18e'

This means you can remove the line buffer = buffer.decode('UTF-8') from your function.

You'll also need to return hash if you want to use the results of the function.

Lastly, you need to pass the raw block of bytes to .update(), not its hex digest (which is a str); see the docs' example.
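
As a quick sanity check, here is a small illustrative snippet (not from the original question) showing that .update() accepts bytes but rejects the str returned by .hexdigest():

import hashlib

h = hashlib.md5()
h.update(b"some raw bytes")   # fine: bytes go in
try:
    h.update(h.hexdigest())   # hexdigest() returns a str
except TypeError as exc:
    print("update() rejects str:", exc)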

Putting it all together:

def hash_file(filename: str, blocksize: int = 4096) -> str:
    hsh = hashlib.md5()
    with open(filename, "rb") as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            hsh.update(buf)
    return hsh.hexdigest()

(The above is an example using a Redis .rdb dump binary file.)
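
For instance, calling the helper on the same dump should give the same digest as the one-liner above:

>>> hash_file('dump.rdb')
'9a7bf9d3fd725e8b26eee3c31025b18e'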

How to hash a big file without having to manually process chunks of data?

Are you sure you actually need to optimize this? I did some profiling, and on my computer there's not a lot to gain when the chunksize is not ridiculously small:

import os
import timeit

filename = "large.txt"
with open(filename, 'w') as f:
    f.write('x' * 100*1000*1000)  # Create 100 MB file

setup = '''
import hashlib

def md5(filename, chunksize):
    m = hashlib.md5()
    with open(filename, 'rb') as f:
        while chunk := f.read(chunksize):
            m.update(chunk)
    return m.hexdigest()
'''

for i in range(16):
    chunksize = 32 * 2**i
    print('chunksize:', chunksize)
    print(timeit.Timer(f'md5("{filename}", {chunksize})', setup=setup).repeat(2, 2))

os.remove(filename)

which prints:

chunksize: 32
[1.3256129720248282, 1.2988303459715098]
chunksize: 64
[0.7864588440279476, 0.7887071970035322]
chunksize: 128
[0.5426529520191252, 0.5496777250082232]
chunksize: 256
[0.43311091500800103, 0.43472746800398454]
chunksize: 512
[0.36928231100318953, 0.37598425400210544]
chunksize: 1024
[0.34912850096588954, 0.35173907200805843]
chunksize: 2048
[0.33507052797358483, 0.33372197503922507]
chunksize: 4096
[0.3222631579847075, 0.3201586640207097]
chunksize: 8192
[0.33291386102791876, 0.31049903703387827]
chunksize: 16384
[0.3095061599742621, 0.3061956529854797]
chunksize: 32768
[0.3073280190001242, 0.30928074003895745]
chunksize: 65536
[0.30916607001563534, 0.3033451830269769]
chunksize: 131072
[0.3083479679771699, 0.3039141249610111]
chunksize: 262144
[0.3087183449533768, 0.30319386802148074]
chunksize: 524288
[0.29915712698129937, 0.29429047100711614]
chunksize: 1048576
[0.2932401319849305, 0.28639856696827337]

This suggests that you can just choose a large, but not insane, chunksize, e.g. 1 MB.
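
If the manual chunk loop is what you want to avoid in the first place and you are on Python 3.11 or newer, hashlib.file_digest() does the chunked reading for you; a minimal sketch using the same test file:

import hashlib

with open("large.txt", "rb") as f:          # must be opened in binary mode
    digest = hashlib.file_digest(f, "md5")  # reads the file in chunks internally
print(digest.hexdigest())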

Calculate and add a hash to a file in Python

from Crypto.Hash import HMAC
secret_key = b"Don't tell anyone"
h = HMAC.new(secret_key)
text = "whatever you want in the file"
## or: text = open("your_file_without_hash_yet").read()
h.update(text.encode())
with open("file_with_hash", "w") as fh:
    fh.write(text)
    fh.write(h.hexdigest())

Now, as some people tried to point out (though they seemed confused), you need to remember that this file has the hash at the end of it, and that the hash itself is not part of what gets hashed. So when you want to check the file, you would do something along the lines of:

end_len = len(h.hexdigest())
all_text = open("file_with_hash").read()
text, expected_hmac = all_text[:-end_len], all_text[-end_len:]
h = HMAC.new(secret_key)
h.update(text.encode())
if h.hexdigest() != expected_hmac:
    raise ValueError("Somebody messed with your file!")

It should be clear though that this alone doesn't ensure your file hasn't been changed; the typical use case is to encrypt your file, but take the hash of the plaintext. That way, if someone changes the hash (at the end of the file) or tries changing any of the characters in the message (the encrypted portion), things will mismatch and you will know something was changed.

A malicious actor won't be able to change the file AND fix the hash to match, because they would need to change some data and then rehash everything with your secret key. As long as no one knows your secret key, they won't know how to recreate the correct hash.
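
As a small illustrative sketch of that property (the keys and message here are made up), only someone holding the same secret key can reproduce the digest:

from Crypto.Hash import HMAC

msg = b"important file contents"
good = HMAC.new(b"Don't tell anyone", msg).hexdigest()
forged = HMAC.new(b"attacker's guess", msg).hexdigest()
tampered = HMAC.new(b"Don't tell anyone", msg + b"!").hexdigest()

print(good != forged)     # True: wrong key, different digest
print(good != tampered)   # True: changed data, different digest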

How do I calculate the MD5 checksum of a file in Python?

Regarding your error and what's missing in your code: m is a name that is not defined inside the getmd5() function.

No offence, I know you are a beginner, but your code is all over the place. Let's look at your issues one by one :)

First, you are not using the hashlib.md5.hexdigest() method correctly. Please refer to the explanation of the hashlib functions in the Python documentation. The correct way to return the MD5 of a provided string is to do something like this:

>>> import hashlib
>>> hashlib.md5(b"example string").hexdigest()
'2a53375ff139d9837e93a38a279d63e5'

However, you have a bigger problem here. You are calculating the MD5 of the file name string, when in reality MD5 should be calculated from the file's contents. You basically need to read the file contents and pipe them through MD5. My next example is not very efficient, but something like this:

>>> import hashlib
>>> hashlib.md5(open('filename.exe','rb').read()).hexdigest()
'd41d8cd98f00b204e9800998ecf8427e'

As you can clearly see, the second MD5 hash is totally different from the first one. The reason is that we are pushing the contents of the file through, not just the file name.

A simple solution could look something like this:

# Import the hashlib library (the md5 function is part of it)
import hashlib

# File to check
file_name = 'filename.exe'

# Correct original MD5 goes here
original_md5 = '5d41402abc4b2a76b9719d911017c592'

# Open and read the file, then calculate MD5 on its contents
with open(file_name, 'rb') as file_to_check:
    # read contents of the file
    data = file_to_check.read()
    # pipe contents of the file through
    md5_returned = hashlib.md5(data).hexdigest()

# Finally compare the original MD5 with the freshly calculated one
if original_md5 == md5_returned:
    print("MD5 verified.")
else:
    print("MD5 verification failed!")

Please look at the post Python: Generating a MD5 checksum of a file. It explains in detail a couple of ways this can be achieved efficiently.

Best of luck.

How to salt a generated hash from a file in Python

Try this instead:

from Crypto.Hash import SHA256

SALT = "random string"

def sign(file):
    with open(private_key_path, 'rb') as f:
        key = f.read()
    hash_ = SHA256.new(file.read())
    hash_.update(SALT.encode())
    # do signing stuff
    return signature

According to the official hashlib documentation:

hash.update(data) updates the hash object with the bytes-like object (data).

This means that SHA256.new() actually creates a Python object, and .update() is a method of this object that updates the object's internal state. It doesn't return anything, and hence nothing will be stored in the hash variable in your second snippet.
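
To see this concretely, here is a small hashlib sketch (PyCryptodome's SHA256 objects behave the same way in this respect):

import hashlib

h = hashlib.sha256(b"file contents")
result = h.update(b"random string")  # update() mutates h in place...
print(result)                        # ...and returns None
print(h.hexdigest())                 # the digest lives on the object itself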

For more information, please take a look at this answer.

Hashing an .iso file in Python from user input; can hash the string corresponding to the path, but not the actual file

I solved the problem in question by doing a couple of things:

I removed userInput.encode() from result = hashlib.sha512(userInput.encode()), leaving result = hashlib.sha512()

And I appended the following loop to my script, after the line isoFile = open(userInput, 'rb'):

while True:
    readIso = isoFile.read(1024)

    if readIso:
        result.update(readIso)
    else:
        hexHash = result.hexdigest()
        break
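
Putting those changes together, the relevant part of the script would look roughly like this (a sketch; the prompt text and surrounding script are assumed, only the hashing logic comes from the description above):

import hashlib

userInput = input("Path to the .iso file: ")  # assumed prompt wording
result = hashlib.sha512()                     # no userInput.encode() here
isoFile = open(userInput, 'rb')

while True:
    readIso = isoFile.read(1024)

    if readIso:
        result.update(readIso)
    else:
        hexHash = result.hexdigest()
        break

isoFile.close()
print(hexHash)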

