Download and Write .tar.gz Files Without Corruption

Download and write .tar.gz files without corruption

I've successfully downloaded and extracted GZip files with this code:

require 'open-uri'
require 'zlib'

open('tarball.tar', 'wb') do |local_file|  # binary mode so the decompressed tar bytes are written untouched
  open('http://github.com/jashkenas/coffee-script/tarball/master/tarball.tar.gz') do |remote_file|
    local_file.write(Zlib::GzipReader.new(remote_file).read)
  end
end
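
For comparison, here is a minimal Python sketch of the same idea, assuming the tarball URL from the question still resolves: stream the remote file and write the raw bytes in binary mode, and only decompress after the download has finished, so a text-mode write can't corrupt the archive.

import shutil
import urllib.request

# URL taken from the question; it may have moved, so treat it as a placeholder
url = 'http://github.com/jashkenas/coffee-script/tarball/master/tarball.tar.gz'
with urllib.request.urlopen(url) as remote, open('tarball.tar.gz', 'wb') as local:
    shutil.copyfileobj(remote, local)  # copy in chunks; 'wb' avoids any text-mode translation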

The tar.gz file transfer via SCP client causes corruption of the file

I believe your code does not wait for the tar command to complete, so you are downloading an incomplete file.

See Wait until task is completed on Remote Machine through Python.

Try this:

stdin, stdout, stderr = client.exec_command(command)
print('Compression started')
stdout.channel.recv_exit_status() # Wait for tar to complete
print('Compression is done. Downloading files from {} to {}'.format(tar_save_path, save_path))
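
For context, here is a rough sketch of the whole flow with paramiko; the host, credentials, paths, and tar command are placeholders, not values from the question. The point is to block on the exit status of tar and only then fetch the archive over SFTP.

import paramiko

# Hypothetical values -- replace with your own host, credentials, and paths
host, user = 'example.com', 'me'
tar_save_path = '/tmp/data.tar.gz'   # remote path written by tar
save_path = 'data.tar.gz'            # local destination
command = 'tar -czf {} /var/data'.format(tar_save_path)

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(host, username=user)

stdin, stdout, stderr = client.exec_command(command)
exit_status = stdout.channel.recv_exit_status()  # block until tar finishes
if exit_status == 0:
    sftp = client.open_sftp()
    sftp.get(tar_save_path, save_path)           # download only after tar completed
    sftp.close()
client.close()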

How can I check if a .tar (not .tar.gz) file is corrupt or not in Ubuntu?

When asking a question regarding Linux, try to state the distro you are encountering the error on. It might help others better understand the problem and, in turn, help you better.

OK, now on to the answer.

You can use 7zip to easily test whether an archive is corrupted or not. If 7zip, which is distributed via the p7zip package, isn't already installed, install it with one of the following commands matching your Linux distribution.

Ubuntu: sudo apt install p7zip

Fedora/CentOS/RHEL: sudo yum install p7zip

Arch/Manjaro: sudo pacman -S p7zip

To check whether it's been successfully installed, run 7z. You should see an output similar to this:

7-Zip [64] 17.03 : Copyright (c) 1999-2020 Igor Pavlov : 2017-08-28
p7zip Version 17.03 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs x64)

Usage: 7z <command> [<switches>...] <archive_name> [<file_names>...]

<Commands>
a : Add files to archive
b : Benchmark
d : Delete files from archive
e : Extract files from archive (without using directory names)
h : Calculate hash values for files
i : Show information about supported formats
l : List contents of archive
rn : Rename files in archive
t : Test integrity of archive
u : Update files to archive
x : eXtract files with full paths

<Switches>
-- : Stop switches parsing
@listfile : set path to listfile that contains file names
-ai[r[-|0]]{@listfile|!wildcard} : Include archives
-ax[r[-|0]]{@listfile|!wildcard} : eXclude archives
-ao{a|s|t|u} : set Overwrite mode
-an : disable archive_name field
-bb[0-3] : set output log level
-bd : disable progress indicator
-bs{o|e|p}{0|1|2} : set output stream for output/error/progress line
-bt : show execution time statistics
-i[r[-|0]]{@listfile|!wildcard} : Include filenames
-m{Parameters} : set compression Method
-mmt[N] : set number of CPU threads
-mx[N] : set compression level: -mx1 (fastest) ... -mx9 (ultra)
-o{Directory} : set Output directory
-p{Password} : set Password
-r[-|0] : Recurse subdirectories
-sa{a|e|s} : set Archive name mode
-scc{UTF-8|WIN|DOS} : set charset for for console input/output
-scs{UTF-8|UTF-16LE|UTF-16BE|WIN|DOS|{id}} : set charset for list files
-scrc[CRC32|CRC64|SHA1|SHA256|*] : set hash function for x, e, h commands
-sdel : delete files after compression
-seml[.] : send archive by email
-sfx[{name}] : Create SFX archive
-si[{name}] : read data from stdin
-slp : set Large Pages mode
-slt : show technical information for l (List) command
-snh : store hard links as links
-snl : store symbolic links as links
-sni : store NT security information
-sns[-] : store NTFS alternate streams
-so : write data to stdout
-spd : disable wildcard matching for file names
-spe : eliminate duplication of root folder for extract command
-spf : use fully qualified file paths
-ssc[-] : set sensitive case mode
-ssw : compress shared files
-stl : set archive timestamp from the most recently modified file
-stm{HexMask} : set CPU thread affinity mask (hexadecimal number)
-stx{Type} : exclude archive type
-t{Type} : Set type of archive
-u[-][p#][q#][r#][x#][y#][z#][!newArchiveName] : Update options
-v{Size}[b|k|m|g] : Create volumes
-w[{path}] : assign Work directory. Empty path means a temporary directory
-x[r[-|0]]{@listfile|!wildcard} : eXclude filenames
-y : assume Yes on all queries

Now, as you can see, you can run the following command to test whether a particular archive is corrupted. Replace archive_name with the filename of the .tar file you want to test.

7z t archive_name.tar

In my case, the archive test.tar contains 2 files and is healthy. The following is the output I received.

7-Zip [64] 17.03 : Copyright (c) 1999-2020 Igor Pavlov : 2017-08-28
p7zip Version 17.03 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs x64)

Scanning the drive for archives:
1 file, 10240 bytes (10 KiB)

Testing archive: test.tar
--
Path = test.tar
Type = tar
Physical Size = 10240
Headers Size = 2048
Code Page = UTF-8

Everything is Ok

Files: 2
Size: 7980
Compressed: 10240

You can see that "Everything is Ok" was printed, which means the archive is healthy.


Now I've made an intentionally corrupted copy of that same archive, test1.tar. Let's see what 7z t test1.tar outputs.

7-Zip [64] 17.03 : Copyright (c) 1999-2020 Igor Pavlov : 2017-08-28
p7zip Version 17.03 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs x64)

Scanning the drive for archives:
1 file, 10224 bytes (10 KiB)

Testing archive: test1.tar
ERROR: test1.tar
test1.tar
Open ERROR: Can not open the file as [tar] archive

ERRORS:
Is not archive

Can't open as archive: 1
Files: 0
Size: 0
Compressed: 0

You can see that it prints messages like "Can not open the file as [tar] archive" and "Is not archive". You will get the same or similar messages if your archive is corrupted.
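
If you want to run this check from a script rather than by hand, a small sketch (assuming 7z is on your PATH) is to call 7z t through subprocess and look at the exit code, which is 0 when the archive tests clean:

import subprocess

def tar_is_healthy(path):
    """Return True if `7z t` reports the archive as OK (exit code 0)."""
    result = subprocess.run(['7z', 't', path],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0

print(tar_is_healthy('test.tar'))   # True for the healthy archive above
print(tar_is_healthy('test1.tar'))  # False for the corrupted copy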

PHP download file corruption

It worked when I moved the code to a separate .php file.

Skip broken archives (.tar.gz) when using 'tarfile'

If I understand your question correctly, you might be looking for modification like this one:

import os
import tarfile

files = os.listdir('G:\\A')
for file in files:
    id = file.split('.')
    try:
        with tarfile.open('G:\\A\\' + file, 'r:gz') as tar:
            tar.extractall(path='G:\\A\\Extracted\\' + id[0])
    except tarfile.ReadError:  # reading the tarfile failed
        continue  # move on to the next one

I'm not sure how your files are corrupted or what sort of error you would see, so you may need to catch a different exception.

Open corrupt tar file with Python

Reliably downloading large files over bad connections is not easy. If HTTP range requests are supported, you can resume the download after a broken connection.

A good start is to use the requests library and read the remote file as a stream. However, disconnects and resumes might still have to be handled by you.

See this question for how to use that API
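
As a concrete starting point, here is a minimal sketch with requests (the URL and filename are placeholders) that streams the response to disk in chunks instead of loading it into memory; resuming a broken download via Range headers is left out here.

import requests

url = 'https://example.com/archive.tar.gz'  # placeholder URL
with requests.get(url, stream=True) as response:
    response.raise_for_status()
    with open('archive.tar.gz', 'wb') as f:
        for chunk in response.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            f.write(chunk)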

But please make sure that the file is indeed a tar. You can use libmagic for file format detection.
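
For example, with the python-magic binding to libmagic (an assumption about which binding you use; others exist), you can check the actual content type instead of trusting the extension:

import magic  # the python-magic binding to libmagic

mime = magic.from_file('h5.gz', mime=True)
print(mime)  # e.g. 'application/gzip' for a gzip file, 'application/x-tar' for a plain tar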

That file extension suggests a gzip file, not a tar.

import gzip

# Read the decompressed contents; the context manager closes the file for you
with gzip.open('h5.gz', 'rb') as f:
    file_content = f.read()

How to check if a Unix .tar.gz file is a valid file without uncompressing?

What about just getting a listing of the tarball and throwing away the output, rather than extracting the files?

tar -tzf my_tar.tar.gz >/dev/null

Edited as per comment. Thanks zrajm!

Edit as per comment. Thanks Frozen Flame! This test in no way implies integrity of the data. Because it was designed as a tape archival utility, most implementations of tar will allow multiple copies of the same file!
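
If you would rather run the same kind of check from Python, a rough equivalent is to walk the member list with the tarfile module and treat any read error as corruption; this decompresses the stream but writes nothing to disk.

import tarfile

def tarball_is_valid(path):
    """Read every member header of a .tar.gz without extracting anything."""
    try:
        with tarfile.open(path, 'r:gz') as tar:
            for member in tar:
                pass
        return True
    except (tarfile.TarError, EOFError, OSError):
        return False

print(tarball_is_valid('my_tar.tar.gz'))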


