Why Do the Md5 Hashes of Two Tarballs of the Same File Differ

Why do the md5 hashes of two tarballs of the same file differ?

tar czf outfile infiles is equivalent to

tar cf - infiles | gzip > outfile

The files differ because gzip stores its input filename and modification time in the header of the compressed file. When the input is a pipe, it stores an empty string as the filename and the current time as the modification time.

But it also has a --no-name option, which tells it not to put the name and timestamp into the file. So if you write the expanded command explicitly, instead of using the -z option to tar, you can make use of this option.

tar cf - testfile | gzip --no-name > a.tar.gz
tar cf - testfile | gzip --no-name > b.tar.gz

I tested this on OS X 10.6.8 and it works.
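The same idea can be tried from Python. This is a minimal sketch, assuming the standard-library gzip module is an acceptable stand-in for the gzip command: passing mtime=0 and an empty filename plays the role of --no-name. The helper name gzip_bytes and the sample payload are made up for illustration.

```python
import gzip
import hashlib
import io


def gzip_bytes(data: bytes, mtime=None) -> bytes:
    """Compress data in memory; mtime=None lets gzip stamp the current time."""
    buf = io.BytesIO()
    # filename="" keeps the name field out of the header, as with piped input.
    with gzip.GzipFile(filename="", fileobj=buf, mode="wb", mtime=mtime) as gz:
        gz.write(data)
    return buf.getvalue()


payload = b"same content\n" * 100  # hypothetical sample data

# With a fixed timestamp the output depends only on the payload.
fixed_a = gzip_bytes(payload, mtime=0)
fixed_b = gzip_bytes(payload, mtime=0)
print(hashlib.md5(fixed_a).hexdigest() == hashlib.md5(fixed_b).hexdigest())

# Bytes 4-7 of a gzip stream hold the MTIME field (little-endian).
print(int.from_bytes(fixed_a[4:8], "little"))
```

Calling gzip_bytes without mtime stamps the current time instead, so two runs made in different seconds produce different bytes even though they decompress to the same data.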

tar package has different checksum for exactly the same content

The archives you provided contain pax extended headers.
A quick glance at their structure reveals that they differ in these two fields:

  1. The process ID of the pax process (as part of a name for the extended header in the ustar header block, and consequently the checksum for this ustar header block).
  2. The atime (access time) in the extended header.

One of the workarounds you can use for reproducible archive creation is to enforce the old unix ustar format (rather than the pax/posix format):

tar --format=ustar -cf package.tar folder

The other choice is to manually set the extended name and delete the atime while preserving the pax format:

tar --format=pax --pax-option=exthdr.name=%d/PaxHeaders/%f,delete=atime -cf package.tar folder

Now the md5sum should be the same for both archives.
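To see the effect of an atime record without the tar command, here is a rough Python sketch using the standard tarfile module. It only illustrates the atime part: Python does not put a process ID into the extended header name, so that difference is not reproduced here. The member name, the atime values, and the helper names make_tar and pax_records are all invented for the demo.

```python
import io
import tarfile

MEMBER = "folder/file.txt"  # hypothetical member name


def make_tar(atime: str, fmt: int) -> bytes:
    """Build a one-file archive in memory, attaching an atime pax record."""
    data = b"hello\n"
    info = tarfile.TarInfo(MEMBER)
    info.size = len(data)
    info.mtime = 0  # fixed, so only atime can vary between archives
    info.pax_headers = {"atime": atime}  # only written when fmt is PAX_FORMAT
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tf:
        tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def pax_records(tar_bytes: bytes) -> dict:
    """Map each member name to the pax extended-header records stored for it."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tf:
        return {m.name: dict(m.pax_headers) for m in tf.getmembers()}


pax_a = make_tar("1700000000.5", tarfile.PAX_FORMAT)
pax_b = make_tar("1700000099.5", tarfile.PAX_FORMAT)
print(pax_a == pax_b)              # the atime records make the bytes differ
print(pax_records(pax_a)[MEMBER])
print(pax_records(pax_b)[MEMBER])

ustar_a = make_tar("1700000000.5", tarfile.USTAR_FORMAT)
ustar_b = make_tar("1700000099.5", tarfile.USTAR_FORMAT)
print(ustar_a == ustar_b)          # ustar has no extended header to differ in
```

Listing the pax records of each archive this way is also a quick check for which fields differ before deciding between the ustar and pax workarounds.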

Why do two results of 'tar cvf' on the same project not match?

When you hash hello, the hash is calculated over the content of the file only.

But hello.tar contains, besides the hello file itself, a header that includes fields such as the modification time of hello. Those timestamps change the content (and therefore the hash) of hello.tar.

For the layout of the tar header, see the documentation of the tar format.
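A small Python sketch of this effect, assuming the standard tarfile module and two arbitrary timestamps chosen for the demo: the file content is identical, only the mtime written into the header differs.

```python
import hashlib
import io
import tarfile


def tar_with_mtime(data: bytes, mtime: int) -> bytes:
    """Pack data as a single member named 'hello' with the given mtime."""
    info = tarfile.TarInfo("hello")
    info.size = len(data)
    info.mtime = mtime
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


content = b"hello world\n"
first = tar_with_mtime(content, 1_600_000_000)   # arbitrary timestamps
second = tar_with_mtime(content, 1_600_000_060)

# The payload hash is the same, the archive hashes are not.
print(hashlib.md5(content).hexdigest())
print(hashlib.md5(first).hexdigest())
print(hashlib.md5(second).hexdigest())

# The mtime field sits at offset 136 of the 512-byte header, as octal text.
print(first[136:147], second[136:147])
```

The header checksum field changes along with the mtime, so even a one-second difference alters more than one byte of the archive.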

How to compare two tarballs' content

tarsum is almost what you need. Take its output, run it through sort so the ordering is identical for both archives, and then compare the two with diff. That should get you a basic implementation going, and it would be easy enough to pull those steps into the main program by modifying the Python code to do the whole job.
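The whole job can be sketched in Python along these lines. This is not tarsum's actual code or output format, just an assumption of what such a tool does: hash every regular file in the archive, sort the lines, and diff them. The function names and the sample archives are made up.

```python
import difflib
import hashlib
import io
import tarfile


def digest_lines(fileobj) -> list:
    """Return sorted 'sha256  name' lines for the regular files in a tar stream."""
    lines = []
    with tarfile.open(fileobj=fileobj, mode="r:*") as tf:
        for member in tf:
            if member.isfile():
                data = tf.extractfile(member).read()
                lines.append(f"{hashlib.sha256(data).hexdigest()}  {member.name}")
    return sorted(lines)


def build_tar(files: dict, mtime: int) -> io.BytesIO:
    """Hypothetical helper: build a small in-memory archive for the demo."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime
            tf.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


old = digest_lines(build_tar({"a.txt": b"one", "b.txt": b"two"}, mtime=1))
same = digest_lines(build_tar({"b.txt": b"two", "a.txt": b"one"}, mtime=2))
changed = digest_lines(build_tar({"a.txt": b"one", "b.txt": b"TWO"}, mtime=1))

print(old == same)  # member order and timestamps are ignored
print("\n".join(difflib.unified_diff(old, changed, "old", "changed", lineterm="")))
```

For archives on disk, pass open("a.tar.gz", "rb") instead of the in-memory buffers; the "r:*" mode lets tarfile detect the compression.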

How to create an archive that keeps the same md5 hash for identical content in Python?

As a workaround, you can use bzip2 compression instead. It does not seem to have this problem, since the bzip2 format stores neither a timestamp nor a filename:

import tarfile

tar1 = tarfile.open("one.tar.bz2", "w:bz2")
tar1.add("bin")
tar1.close()

tar2 = tarfile.open("two.tar.bz2", "w:bz2")
tar2.add("bin")
tar2.close()

Running md5sum gives:

martin@martin-UX305UA:~/test$ md5sum one.tar.bz2 two.tar.bz2 
e9ec2fd4fbdfae465d43b2f5ecaecd2f one.tar.bz2
e9ec2fd4fbdfae465d43b2f5ecaecd2f two.tar.bz2
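If gzip has to be kept, another option is to open the gzip stream yourself and hand it to tarfile. A minimal sketch, assuming that zeroing the gzip timestamp and leaving out the stored filename is enough for your case; the tar members still carry the mtime of the files on disk, so the hashes only match while those files are untouched. The function names and the temporary "bin" directory are made up for the demo.

```python
import gzip
import hashlib
import os
import tarfile
import tempfile


def write_tar_gz(src_dir: str, out_path: str) -> None:
    """Write src_dir to out_path as .tar.gz with a zeroed gzip timestamp."""
    with open(out_path, "wb") as raw:
        # filename="" and mtime=0 keep the output name and the clock
        # out of the gzip header.
        with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=0) as gz:
            with tarfile.open(fileobj=gz, mode="w") as tar:
                tar.add(src_dir, arcname=os.path.basename(src_dir))


def md5_of(path: str) -> str:
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "bin")  # stand-in for the real "bin" directory
    os.mkdir(src)
    with open(os.path.join(src, "tool.sh"), "w") as fh:
        fh.write("echo hi\n")

    one = os.path.join(tmp, "one.tar.gz")
    two = os.path.join(tmp, "two.tar.gz")
    write_tar_gz(src, one)
    write_tar_gz(src, two)
    digest_one, digest_two = md5_of(one), md5_of(two)

print(digest_one == digest_two)
```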

is it normal to have different md5 for apache maven

The file was most likely corrupted during download, so you have to download it again. That is of course assuming you are certain the md5 hash corresponds to that file.

A cryptographic hash function is deterministic: the same piece of data always produces the same hash, so you cannot get two different hashes for identical data. However, you could get the same hash for two different pieces of data, although that is quite unlikely to happen by accident. That is called a hash collision.
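Checking a download against the published value can be done with a few lines of Python. This is only a sketch; the function names are made up, and the path and published hash you pass in are your own.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published_md5: str) -> bool:
    """Compare against the published value, ignoring case and whitespace."""
    return file_md5(path) == published_md5.strip().lower()
```

If matches_published returns False for a file you are sure belongs to that hash, the download is damaged and should be repeated.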

TAR.GZ file shows in 'git status' after tar.gz

After much consideration I have decided to do a comparison on the old and new file to ensure that the new file is only used once it's confirmed that the files are in fact different. I'm working to test this more fully but think that this may "fix" the issue.

import hashlib
import os
import shutil
import tarfile


def cleanPath(path):
    # Normalise to Windows-style path separators
    path = path.replace('/', '\\')
    return path


def untar(tar_path):
    print(f'Trying to unzip... {tar_path}')
    try:
        tar_path = cleanPath(tar_path)
        # path_date = tar_path.split("\\")[-1].replace(".tar.gz","")
        path_date = tar_path.replace('.tar.gz', '')
        print(path_date)
        # open the archive and extract it into a folder next to it
        with tarfile.open(tar_path) as file:
            file.extractall(path_date)
    except Exception as ex:
        print("Unzip Failure")
        print(ex)
        return False
    return True


def file_md5(path):
    # Return the MD5 of a file, or None if it cannot be read
    try:
        with open(path, 'rb') as handle:
            return hashlib.md5(handle.read()).hexdigest()
    except Exception as ex:
        print(ex)
        return None


def tar(tar_path):
    print(f'Trying to zip...{tar_path}')
    # Assumption: the old archive is "<folder>.tar.gz", matching untar().
    # The original snippet used "_new.tar.gz" for both paths, which would
    # compare the new file with itself and then delete it.
    old_file_path = f"{tar_path}.tar.gz"
    tar_file = f"{tar_path}_new.tar.gz"
    # TAR the folder into a new archive
    try:
        print("tar_path")
        print(tar_path)
        # Note: 'w:gz' stamps the current time into the gzip header,
        # so the hashes below can differ even for identical content.
        with tarfile.open(tar_file, 'w:gz') as tar_handle:
            for r, d, f in os.walk(tar_path):
                for gz_file in f:
                    # tar_handle.add(file)
                    tar_handle.add(os.path.join(r, gz_file), gz_file)
        try:
            shutil.rmtree(tar_path)
        except Exception as ex:
            print(ex)
    except Exception as ex:
        print("Zip Failure")
        print(ex)
        return False

    # Check the MD5 of each side, leave the old file if possible
    old_hash = file_md5(old_file_path)
    new_hash = file_md5(tar_file)
    if new_hash is None:
        return False

    if new_hash == old_hash:
        # leave the old file
        os.remove(tar_file)
    else:
        # remove the old file
        try:
            os.remove(old_file_path)
        except Exception as ex:
            print(ex)
        # rename the new file to the old file name
        os.rename(tar_file, old_file_path)
    return True

Python tarfile and zipfile producing archives with different MD5 for 2 identical files

Whilst the payload of the two archives may be identical, the underlying structure of the archives is different, and compression only adds to those differences.

Zip and Tar are both archiving formats, and they can both be combined with compression; more often than not, they are. The combinations of differing compression algorithms and fundamentally different underlying format structure will result in different MD5s.

--

In this case, the last modification time and names of the underlying files are different, even though the contents of the files are the same; this results in a different MD5.
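For the zip side, the timestamp and the member order can be pinned explicitly. A minimal sketch, assuming in-memory content and a fixed date of 1980-01-01 (the earliest date the zip format can store); make_zip and the sample files are made up for illustration.

```python
import io
import zipfile

FIXED_DATE = (1980, 1, 1, 0, 0, 0)  # earliest timestamp a zip entry can carry


def make_zip(files: dict, date_time=FIXED_DATE) -> bytes:
    """Write files (name -> bytes) in sorted order with a fixed timestamp."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in sorted(files):
            info = zipfile.ZipInfo(name, date_time=date_time)
            info.compress_type = zipfile.ZIP_DEFLATED
            zf.writestr(info, files[name])
    return buf.getvalue()


files = {"b.txt": b"two", "a.txt": b"one"}  # hypothetical sample content
zip_a = make_zip(files)
zip_b = make_zip(files)
print(zip_a == zip_b)  # same names, same data, same timestamp

later = make_zip(files, date_time=(2021, 6, 1, 12, 0, 0))
print(zip_a == later)  # only the stored timestamp differs
```

ZipFile.write on a path copies the file's modification time from disk instead, which is where the differing MD5s come from when the same content is archived twice.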

Why do downloads for various projects have hashcodes or checksums?

When downloading files where integrity is critical (an ISO of a Linux distribution, for example), I tend to md5sum the download just in case.

The source may be trusted, but you never know when your own NIC's hardware may start to malfunction.
