Opening a .tar.gz File with a Single Command

Opening a .tar.gz file with a single command

tar xzf file.tar.gz

The letters are:

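x - extract the archive's contents
z - decompress the input with gzip
f - read the archive from the named file (file.tar.gz) instead of the default tape device or stdin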
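
The same letters combine with a few others. As a minimal sketch, assuming a tar that understands gzip (GNU tar or bsdtar): c creates an archive, t lists it without extracting, and -C picks the directory to extract into. The names demo_src, demo.tar.gz and demo_out are made up for illustration.

```shell
cd "$(mktemp -d)"                    # work in a scratch directory

# Build a small throwaway archive so the commands below have something to act on.
mkdir -p demo_src
printf 'hello\n' > demo_src/a.txt
tar czf demo.tar.gz demo_src         # c = create, z = gzip, f = archive file name

# List the contents without extracting anything (t = list).
tar tzf demo.tar.gz

# Extract into a separate directory (-C changes directory before extracting).
mkdir -p demo_out
tar xzf demo.tar.gz -C demo_out
```

Listing first is a cheap way to see whether the archive unpacks into its own top-level directory before extracting it.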

R: Read single file from within a tar.gz directory

It is possible, but I don't know of any clean implementation (one may exist). Below is some very basic R code that should work in many cases; for example, it assumes each file name, including its full path inside the archive, fits in the 100-byte name field of the tar header. In a way it just re-implements "untar" very crudely, but in such a way that it can locate the desired file inside the gzipped archive.

The first problem is that a gzipped file should only be read sequentially from the start. Using "seek()" to reposition the file pointer to the desired file is, unfortunately, erratic on a gzipped connection.
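
Reading forward and throwing bytes away is the usual workaround. The same idea can be sketched from a shell: decompress the whole stream, discard everything before the wanted offset, and keep only the next few bytes. The file name and the offset and size values are made up for illustration, and head -c is assumed to be available (it is on GNU and BSD systems).

```shell
cd "$(mktemp -d)"                            # scratch directory

# A tiny gzipped file with known content.
printf 'abcdefghijklmnopqrstuvwxyz' > letters.txt
gzip -c letters.txt > letters.txt.gz

offset=10                                    # bytes to skip (0-based offset)
size=5                                       # bytes to keep

# gzip -dc decompresses sequentially to stdout; tail -c +N starts at byte N
# (1-based), so +$((offset + 1)) drops the first $offset bytes; head -c keeps $size.
gzip -dc letters.txt.gz | tail -c +"$((offset + 1))" | head -c "$size" > slice.txt
```

Nothing here seeks inside the compressed file; everything before the offset is still decompressed, just not kept.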

ParseTGZ<- function(archname){
  # open tgz archive
  tf <- gzfile(archname, open='rb')
  on.exit(close(tf))
  fnames <- list()
  offset <- 0
  nfile <- 0
  while (TRUE) {
    # go to beginning of entry
    # never use "seek" to re-locate in a gzipped file!
    if (seek(tf) != offset) readBin(tf, what="raw", n= offset - seek(tf))
    # read file name
    fName <- rawToChar(readBin(tf, what="raw", n=100))
    if (nchar(fName)==0) break
    nfile <- nfile + 1
    fnames <- c(fnames, fName)
    attr(fnames[[nfile]], "offset") <- offset+512
    # read size, first skip 24 bytes (file permissions etc)
    # again, we only use readBin, not seek()
    readBin(tf, what="raw", n=24)
    # file size is encoded as a length 12 octal string, 
    # with the last character being '\0' (so 11 actual characters)
    sz <- readChar(tf, nchars=11) 
    # convert string to number of bytes
    sz <- sum(as.numeric(strsplit(sz,'')[[1]])*8^(10:0))
    attr(fnames[[nfile]], "size") <- sz
#    cat(sprintf('entry %s, %i bytes\n', fName, sz))
    # advance to the next entry: the data is padded to a multiple of 512 bytes,
    # and the entry header itself takes another 512 bytes
    offset <- offset + 512*(ceiling(sz/512) + 1)
  }
  # return a named list of character strings carrying "offset" and "size" attributes
  names(fnames) <- fnames
  return(fnames)
}

This will give you the exact position and length of all files in the tar.gz archive.
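
The header arithmetic can be checked by hand from a shell. This is only a sketch, assuming tar writes a plain ustar/GNU-style header with a zero-padded octal size and that the first entry is the file itself (no extended header in front of it). The file names are made up for illustration.

```shell
cd "$(mktemp -d)"                       # scratch directory
printf 'hello world\n' > note.txt       # 12 bytes of data
tar czf demo.tar.gz note.txt
gzip -dc demo.tar.gz > demo.tar         # plain tar, so dd can address bytes directly

# Bytes 0-99 of the header hold the NUL-padded file name.
name=$(dd if=demo.tar bs=1 count=100 2>/dev/null | tr -d '\000')

# Bytes 124-134 hold the size as 11 octal digits.
size_octal=$(dd if=demo.tar bs=1 skip=124 count=11 2>/dev/null)
size=$((0$size_octal))                  # a leading 0 makes the shell read it as octal

# Data starts right after the 512-byte header and is padded to a 512-byte boundary.
data_offset=512
next_header=$(( 512 * ((size + 511) / 512 + 1) ))

echo "$name: $size bytes at offset $data_offset, next header at $next_header"
```

The next_header line is the same calculation as the offset update in ParseTGZ, written with integer arithmetic instead of ceiling().
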
Now the next step is to actually extract a single file. You may be able to do this by using a "gzfile" connection directly, but here I will use a rawConnection(). This presumes your files fit into memory.

extractTGZ <- function(archfile, filename) {
  # this function returns a raw vector
  # containing the desired file
  fp <- ParseTGZ(archfile)
  offset <- attributes(fp[[filename]])$offset
  fsize <- attributes(fp[[filename]])$size
  gzf <- gzfile(archfile, open="rb")
  on.exit(close(gzf))
  # jump to the byte position, don't use seek()
  # may be a bad idea on really large archives...
  readBin(gzf, what="raw", n=offset)
  # now read the data into a raw vector
  result <- readBin(gzf, what="raw", n=fsize)
  result
}

Now, finally:

ff <- rawConnection(extractTGZ("myarchive", "myfile"))

Now you can treat ff as if it were (a connection pointing to) your file. But it only exists in memory.
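
For comparison, outside R the tar command itself can stream a single member to stdout without writing it to disk. This is a sketch, not part of the R approach above; it assumes GNU tar or bsdtar, both of which accept O for "extract to stdout". The archive layout and file names are made up for illustration.

```shell
cd "$(mktemp -d)"                       # scratch directory

# Build a demo archive containing two files.
mkdir -p data
printf 'id,value\n1,10\n' > data/small.csv
printf 'other\n' > data/other.txt
tar czf myarchive.tar.gz data

# x = extract, z = gunzip, O = write the member to stdout instead of to disk.
# Naming the member after the archive limits extraction to that one file.
first_line=$(tar xzOf myarchive.tar.gz data/small.csv | head -n 1)
echo "$first_line"
```

As with the R version, tar still has to decompress the stream up to the requested member.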

Unzip tar.gz in Windows

7-Zip can do that: http://www.7-zip.org/

It has a documented command line. I use it every day via scripts.

Plus, it is free and comes in 32-bit and 64-bit versions.

Programmatically extract tar.gz in a single step (on Windows with 7-Zip)

7z e example.tar.gz && 7z x example.tar

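Use && to chain the two commands on one line; the second runs only if the first succeeds. The first command unpacks the gzip layer to example.tar, and the second extracts the tar archive. The portable 7-Zip is enough (you only need 7z.exe and 7z.dll).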
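
The two 7z calls map onto the two layers of a .tar.gz: a gzip stream wrapped around a tar archive. Here is a sketch of the same two-step structure with gzip and tar rather than 7-Zip (so it says nothing about 7z's own options), using made-up file names:

```shell
cd "$(mktemp -d)"                       # scratch directory

# Build a demo archive, then remove the original so extraction is visible.
printf 'payload\n' > file.txt
tar czf example.tar.gz file.txt
rm file.txt

# Step 1 strips the gzip layer; step 2 unpacks the tar layer.
# && runs the second command only if the first one succeeded.
gzip -dc example.tar.gz > example.tar && tar xf example.tar
```

If the first step fails, for example on a corrupt download, the second step never runs against a half-written example.tar.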

Extract .tar.gz file on Windows

tar is a Unix archive utility. 7-Zip will unpack that tarball.

As cricket_007 points out, 39 years after AT&T Bell Labs compiled tar, Microsoft used someone else's code to include tar/curl with Windows 10.

unzip a tar.gz file?

fn <- "http://s.wordpress.org/resources/survey/wp2011-survey.tar.gz"
download.file(fn,destfile="tmp.tar.gz")
untar("tmp.tar.gz",list=TRUE)  ## check contents
untar("tmp.tar.gz")
## or, if you just want to extract the target file:
untar("tmp.tar.gz",files="wp2011-survey/anon-data.csv")
X <- read.csv("wp2011-survey/anon-data.csv")

Tom Wenseleers points out that the archive package can help with this:

library(archive)
library(readr)
read_csv(archive_read("tmp.tar.gz", file = 3), col_types = cols())

and that archive::archive_extract("tmp.tar.gz", files="wp2011-survey/anon-data.csv") is quite a bit faster than base R's built-in untar, especially for large archives. It supports the 'tar', 'ZIP', '7-zip', 'RAR', 'CAB', 'gzip', 'bzip2', 'compress', 'lzma' and 'xz' formats.
