How to Determine a Tar Archive's Format

On Linux, you can use the file command to inspect the signature of the uncompressed archive:

$ touch foo                                 # create test file
$ tar --format=posix -cf posix.tar foo      # create test posix archive
$ tar --format=gnu   -cf gnu.tar   foo      # create test gnu archive
$ file posix.tar gnu.tar
posix.tar: POSIX tar archive
gnu.tar:   POSIX tar archive (GNU)

If the archive is compressed, decompress it first, because file won't peer beyond the compression layer:

$ touch foo                                 # create test file
$ tar --format=posix -czf posix.tar.gz foo  # create test gzip posix archive
$ tar --format=gnu   -czf gnu.tar.gz   foo  # create test gzip gnu archive
$ file posix.tar.gz gnu.tar.gz              # show output when compressed
posix.tar.gz: gzip compressed data
gnu.tar.gz:   gzip compressed data
$ gunzip posix.tar.gz                       # decompress to posix.tar
$ gunzip gnu.tar.gz                         # decompress to gnu.tar
$ file posix.tar gnu.tar                    # show output after decompression
posix.tar: POSIX tar archive
gnu.tar:   POSIX tar archive (GNU)

Or, check the compressed archives without saving the decompressed file by piping the output directly to file's standard input:

$ gunzip --stdout posix.tar.gz | file -
/dev/stdin: POSIX tar archive
$ gunzip --stdout gnu.tar.gz | file -
/dev/stdin: POSIX tar archive (GNU)

The GNU format is derived from an early draft of the POSIX ustar format, which is why file reports it as both.
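If you would rather not shell out to file, the same peek can be sketched in Python. This is only an illustration under a few assumptions: peek_tar_magic is a made-up helper name, the archive is gzip-compressed, and only the first 512-byte header block is inspected.

```python
import gzip


def peek_tar_magic(path):
    """Return the magic and version fields (byte offsets 257-264) from the
    first header of a gzip-compressed tar archive.

    Hypothetical helper for illustration; nothing is written to disk.
    """
    with gzip.open(path, "rb") as stream:
        header = stream.read(512)  # only the first block is decompressed
    if len(header) < 512:
        raise ValueError("too short to contain a tar header")
    return header[257:265]
```

A POSIX (ustar) archive should give "ustar" plus a NUL followed by "00", while an archive in GNU format should give "ustar" followed by two spaces and a NUL.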

For the nitty-gritty details, see the description of the archive format in the GNU tar manual.

How to check whether a file is in tar format?

Check the magic bytes at offset 257. If they match "ustar" including the null terminator, the file is probably a tar.

See: http://www.gnu.org/software/tar/manual/html_node/Standard.html

/* tar Header Block, from POSIX 1003.1-1990.  */

/* POSIX header.  */

struct posix_header
{                              /* byte offset */
  char name[100];               /*   0 */
  char mode[8];                 /* 100 */
  char uid[8];                  /* 108 */
  char gid[8];                  /* 116 */
  char size[12];                /* 124 */
  char mtime[12];               /* 136 */
  char chksum[8];               /* 148 */
  char typeflag;                /* 156 */
  char linkname[100];           /* 157 */
  char magic[6];                /* 257 */
  char version[2];              /* 263 */
  char uname[32];               /* 265 */
  char gname[32];               /* 297 */
  char devmajor[8];             /* 329 */
  char devminor[8];             /* 337 */
  char prefix[155];             /* 345 */
                                /* 500 */
};

#define TMAGIC   "ustar"        /* ustar and a null */
#define TMAGLEN  6
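As a rough sketch of that check in Python (looks_like_tar and its strict parameter are names I made up for this example): the strict form compares all six bytes including the NUL, so it rejects old GNU archives, whose magic is "ustar" followed by spaces. Passing strict=False compares only the five letters.

```python
TMAGIC = b"ustar\x00"   # "ustar" and a NUL, as in the header above
MAGIC_OFFSET = 257      # byte offset of the magic field


def looks_like_tar(path, strict=True):
    """Hypothetical helper: report whether the file has ustar magic."""
    with open(path, "rb") as f:
        f.seek(MAGIC_OFFSET)
        magic = f.read(len(TMAGIC))
    if strict:
        return magic == TMAGIC          # "ustar" including the NUL
    return magic[:5] == b"ustar"        # also accepts the old GNU magic
```

A match only means the file is probably a tar; pre-POSIX (v7) archives have no magic at all, so a miss does not prove the opposite.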

TAR file format issue

In my opinion none of your examples is the correct one, at least not for the POSIX format.

As you can read here:

/* tar Header Block, from POSIX 1003.1-1990. */
/* POSIX header */

struct posix_header {   /* byte offset */
  char name[100];               /*   0 */
  char mode[8];                 /* 100 */
  char uid[8];                  /* 108 */
  char gid[8];                  /* 116 */
  char size[12];                /* 124 */
  char mtime[12];               /* 136 */
  char chksum[8];               /* 148 */
  char typeflag;                /* 156 */
  char linkname[100];           /* 157 */
  char magic[6];                /* 257 */
  char version[2];              /* 263 */
  char uname[32];               /* 265 */
  char gname[32];               /* 297 */
  char devmajor[8];             /* 329 */
  char devminor[8];             /* 337 */
  char prefix[155];             /* 345 */
};

#define TMAGIC   "ustar"        /* ustar and a null */
#define TMAGLEN  6
#define TVERSION "00"           /* 00 and no null */
#define TVERSLEN 2

The format of your first example (Scenario 1) seems to match the old GNU header format:

/* OLDGNU_MAGIC uses both magic and version fields, which are contiguous.
   Found in an archive, it indicates an old GNU header format, which will be
   hopefully become obsolescent.  With OLDGNU_MAGIC, uname and gname are
   valid, though the header is not truly POSIX conforming */

#define OLDGNU_MAGIC "ustar  "  /* 7 chars and a null */

In both your second and third examples (Scenario 2 and Scenario 3), the version field holds an unexpected value. According to the documentation above, it should be the two ASCII characters 00 (bytes 0x30 0x30), so this field is most likely ignored.
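To make the distinction concrete, here is a small Python sketch that sorts a 512-byte header block into those cases. The function name and the returned labels are my own invention; only the byte values come from the definitions quoted above.

```python
def classify_header(block):
    """Hypothetical classifier for the magic/version fields of a tar header.

    `block` is the raw 512-byte header; magic is at 257-262, version at 263-264.
    """
    magic_and_version = block[257:265]
    if magic_and_version == b"ustar\x0000":      # TMAGIC + TVERSION
        return "posix"
    if magic_and_version == b"ustar  \x00":      # OLDGNU_MAGIC
        return "old-gnu"
    if magic_and_version[:6] == b"ustar\x00":    # right magic, odd version
        return "ustar magic, unexpected version %r" % magic_and_version[6:]
    return "no ustar magic"
```

Feeding it the first 512 bytes of each of your three files should show which scenario falls into which bucket.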

How to check if a Unix .tar.gz file is a valid file without uncompressing?

What about just listing the contents of the tarball and throwing away the output, rather than extracting the files?

tar -tzf my_tar.tar.gz >/dev/null

Edited as per comment. Thanks zrajm!

Edited as per comment. Thanks Frozen Flame! This test in no way implies integrity of the data. Because tar was designed as a tape archival utility, most implementations allow multiple copies of the same file in one archive!
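If you need the same check from a script, a rough Python equivalent using the standard tarfile module might look like this. The function name is made up, and like the tar -t test it only shows that the member headers can be walked. It may miss damage near the end of the archive and says nothing about the integrity of the data.

```python
import tarfile
import zlib


def can_list_tar_gz(path):
    """Hypothetical helper: walk every member header without extracting."""
    try:
        with tarfile.open(path, "r:gz") as archive:
            for _member in archive:
                pass  # discard the listing, like >/dev/null
    except (tarfile.TarError, OSError, EOFError, zlib.error):
        return False
    return True
```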

How to check if file is tar file in Bash shell?

The file command can determine the file type:

file my.tar

If it is a tar file, it will output:

my.tar: POSIX tar archive (GNU)

Then you can use grep to check whether the output contains "tar archive":

file my.tar | grep -q 'tar archive' && echo "I'm tar" || echo "I'm not tar"

If the file does not exist, file prints the following (and still exits with code 0):

do-not-exist.txt: cannot open `do-not-exist.txt' (No such file or directory).

You could use a case statement to handle several types of files.
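The same kind of dispatch can be sketched without file by looking at the leading bytes yourself. This Python version is only an illustration: sniff_type and its return labels are hypothetical, and it knows just three signatures (gzip starts with 1f 8b, bzip2 with "BZh", and tar has "ustar" at offset 257).

```python
def sniff_type(path):
    """Hypothetical helper: guess the container type from magic bytes."""
    try:
        with open(path, "rb") as f:
            head = f.read(512)
    except OSError:
        return "unreadable"        # e.g. the file does not exist
    if head[:2] == b"\x1f\x8b":
        return "gzip"
    if head[:3] == b"BZh":
        return "bzip2"
    if head[257:262] == b"ustar":
        return "tar"
    return "unknown"
```

Unlike the grep pipeline, a missing file gets its own answer here instead of being reported as "not tar".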

How to determine if data is valid tar file without a file?

Say your uploaded data is contained in the bytes object data.

import io
import tarfile

bio = io.BytesIO(data)           # wrap the raw bytes in a file-like object
try:
    # tarfile.open() also detects gzip/bzip2/xz compression transparently
    tf = tarfile.open(fileobj=bio)
    # process the file....
except tarfile.TarError:
    print("Not a tar file")

There are additional complexities such as handling different tar file formats and compression. More info is available in the tarfile documentation.

Why does GNU tar --format=pax produce ustar archives?

pax Interchange Format:

A pax archive tape or file produced in the -x pax format shall contain
a series of blocks. The physical layout of the archive shall be
identical to the ustar format described in ustar Interchange Format.

"ustar" followed by 1 zero/NUL byte is the value of the magic field indicating the type of the archive:

The magic field is the specification that this archive was output in
this archive format. If this field contains ustar (the five
characters from the ISO/IEC 646:1991 standard IRV shown followed by
NUL), …

Of course, that applies only to a conforming pax utility, but I'd expect GNU tar to lay out pax-format archives the same way a conforming pax implementation does.
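You can see the same layout with a different writer. The sketch below uses Python's tarfile module rather than GNU tar, so it only illustrates the point about the format and is not evidence about GNU tar itself; first_header_of_pax_archive is a made-up helper.

```python
import io
import tarfile


def first_header_of_pax_archive(name):
    """Hypothetical helper: build a one-entry pax archive in memory and
    return its first 512-byte header block."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        tar.addfile(tarfile.TarInfo(name))  # zero-length regular file
    return buf.getvalue()[:512]


header = first_header_of_pax_archive("a" * 150)  # name too long for ustar
magic = header[257:263]      # still the ustar magic
typeflag = header[156:157]   # 'x' marks a pax extended header
```

With a name that does not fit the 100-byte name field, the first block is a pax extended header (typeflag x), yet its magic field is still "ustar" plus NUL. What marks an archive as pax is the typeflag of the extended headers, not the magic.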

Read .tar entries in a specific order (C#, SharpZipLib)

You would need to write your own tar decoder. It is up to you to say if you would consider this to be "easy" or not. The tar format is pretty simple.

You would need to first scan through the tar file to find all the headers, saving the file name and the offset and length of the file data for each. Then you could seek back and forth to the offset of any file to read its contents.
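To give an idea of how little is involved, here is a minimal sketch of that scan in Python rather than C#. It assumes a plain uncompressed ustar archive and deliberately ignores the prefix field, GNU long-name entries, pax extended headers and base-256 sizes; build_tar_index and read_entry are names invented for this example.

```python
def build_tar_index(path):
    """Map each entry name to (data_offset, size) for an uncompressed tar."""
    index = {}
    with open(path, "rb") as f:
        offset = 0
        while True:
            header = f.read(512)
            if len(header) < 512 or header == b"\0" * 512:
                break  # end-of-archive marker (or a truncated file)
            # name: bytes 0-99, NUL-terminated; size: bytes 124-135, octal
            name = header[0:100].split(b"\0", 1)[0].decode("utf-8", "replace")
            size_field = header[124:136].strip(b"\0 ")
            size = int(size_field, 8) if size_field else 0
            data_offset = offset + 512
            index[name] = (data_offset, size)
            # Entry data is padded up to a multiple of 512 bytes.
            offset = data_offset + (size + 511) // 512 * 512
            f.seek(offset)
    return index


def read_entry(path, index, name):
    """Seek straight to one entry's data and read it."""
    data_offset, size = index[name]
    with open(path, "rb") as f:
        f.seek(data_offset)
        return f.read(size)
```

Once the index exists, entries can be read in any order, for example read_entry(path, index, "second.txt") before read_entry(path, index, "first.txt").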

This would be much more difficult if the tar file were compressed, e.g. if it were a .tar.gz file, as opposed to a .tar file.

The tar format is documented here.

Update:

In a comment, the OP revealed that it is actually a .tar.bz2 file. As noted, that requires additional work to be able to randomly access entries. In addition to building an index to the tar contents, the entire .bz2 file needs to be read to build an index to the compression entry points, which do not correspond to where files start in the tar archive. Then to access a file you first would go to the closest bzip2 entry point that precedes the start of that file data, and decompress from there until you arrive at and then read out that data.

It would be easier to simply rearchive and recompress the files into the zip format, which is designed to randomly access and extract individual entries.
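A rough sketch of that conversion, again in Python rather than C#, with a hypothetical function name. It copies regular files only, reads each one fully into memory, and drops metadata such as permissions and timestamps, so treat it as a starting point.

```python
import tarfile
import zipfile


def tar_to_zip(tar_path, zip_path):
    """Hypothetical helper: repack a (possibly compressed) tar as a zip."""
    # "r:*" lets tarfile detect gzip, bzip2 or xz compression on its own.
    with tarfile.open(tar_path, "r:*") as tar, \
            zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for member in tar:
            if not member.isfile():
                continue  # this sketch skips directories, links, devices
            with tar.extractfile(member) as src:
                zf.writestr(member.name, src.read())
```

After the one sequential pass, the zip's central directory lets you open entries in whatever order you like.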

tar: Unrecognized archive format error when trying to unpack flower_photos.tgz, TF tutorials on OSX

Apparently the new instructions on the TensorFlow website run without issues.

I just tried the instructions posted in How to Retrain Inception's Final Layer for New Categories:

curl -O http://download.tensorflow.org/example_images/flower_photos.tgz

tar xzf flower_photos.tgz

It worked without any problems.

How to extract filename.tar.gz file

If file filename.tar.gz gives this message: POSIX tar archive,
the archive is a tar, not a GZip archive.

Unpack a plain tar without the z option, which is only for gzipped (compressed) archives:

mv filename.tar.gz filename.tar # optional
tar xvf filename.tar

Or try a generic unpacker like unp (https://packages.qa.debian.org/u/unp.html), a script for unpacking a wide variety of archive formats.

Determine the file type:

$ file ~/Downloads/filename.tbz2
/User/Name/Downloads/filename.tbz2: bzip2 compressed data, block size = 400k
