Opening a .tar.gz File with a Single Command

tar xzf file.tar.gz

The letters are:

  • x - extract
  • z - gunzip the input
  • f - read from a file, not stdin
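To try the flags out safely, here is a minimal sketch that builds a throwaway archive first and then lists and extracts it. It assumes GNU tar or bsdtar is on the PATH; the names demo_src, demo.tar.gz and demo_out are made up for illustration.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
mkdir "$workdir/demo_src"
printf 'hello\n' > "$workdir/demo_src/hello.txt"

# c = create, z = gzip the output, f = archive file name.
# -C changes into $workdir first, so the archive stores the relative path demo_src/...
tar czf "$workdir/demo.tar.gz" -C "$workdir" demo_src

# t = list the contents without extracting anything.
listing=$(tar tzf "$workdir/demo.tar.gz")
printf '%s\n' "$listing"

# x = extract; -C picks the target directory, which must already exist.
mkdir "$workdir/demo_out"
tar xzf "$workdir/demo.tar.gz" -C "$workdir/demo_out"
cat "$workdir/demo_out/demo_src/hello.txt"
```

Listing with t before extracting with x is a cheap way to see whether the archive unpacks into its own directory or straight into the current one.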

R: Read single file from within a tar.gz directory

It is possible, but I don't know of any clean implementation (one may exist). Below is some very basic R code that should work in many cases (for example, the full path of each file inside the archive should be less than 100 characters, because that is all the name field of a tar header holds). In a way, it's just re-implementing "untar" in an extremely crude way, but it records where each file sits inside the gzipped archive so that you can go straight to the one you want.

The first problem is that you should only read a gzipped file from the start. Using "seek()" to re-position the file pointer to the desired file is, unfortunately, erratic in a gzipped file.

ParseTGZ <- function(archname) {
  # open the tgz archive as a decompressing binary connection
  tf <- gzfile(archname, open = "rb")
  on.exit(close(tf))
  fnames <- list()
  offset <- 0
  nfile <- 0
  while (TRUE) {
    # go to the beginning of the entry by reading and discarding bytes
    # never use "seek" to re-locate in a gzipped file!
    if (seek(tf) != offset) readBin(tf, what = "raw", n = offset - seek(tf))
    # read the file name (100 bytes, NUL-padded)
    fName <- rawToChar(readBin(tf, what = "raw", n = 100))
    # an empty name means we reached the end-of-archive padding
    if (nchar(fName) == 0) break
    nfile <- nfile + 1
    fnames <- c(fnames, fName)
    # the file data starts right after the 512-byte entry header
    attr(fnames[[nfile]], "offset") <- offset + 512
    # read size, first skip 24 bytes (file permissions etc.)
    # again, we only use readBin, not seek()
    readBin(tf, what = "raw", n = 24)
    # file size is encoded as a length 12 octal string,
    # with the last character being '\0' (so 11 actual characters)
    sz <- readChar(tf, nchars = 11)
    # convert the octal string to a number of bytes
    sz <- sum(as.numeric(strsplit(sz, '')[[1]]) * 8^(10:0))
    attr(fnames[[nfile]], "size") <- sz
    # cat(sprintf('entry %s, %i bytes\n', fName, sz))
    # go to the next entry: data is padded to a multiple of 512 bytes
    # don't forget the entry header (=512)
    offset <- offset + 512 * (ceiling(sz / 512) + 1)
  }
  # return a named list of character strings with "offset" and "size" attributes
  names(fnames) <- fnames
  return(fnames)
}

This will give you the exact position and length of all files in the tar.gz archive.
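The header layout that ParseTGZ relies on (name in the first 100 bytes, size as 11 octal digits starting at byte 124, data starting after the 512-byte header) can be checked from a shell. This is only a sketch: it assumes GNU tar or bsdtar plus the common `head -c` and `dd` utilities, and the file names are invented for the example.

```shell
workdir=$(mktemp -d)
printf 'twelve bytes' > "$workdir/note.txt"   # exactly 12 bytes, no trailing newline
tar czf "$workdir/demo.tar.gz" -C "$workdir" note.txt

# Decompress from the start and keep only the first 512-byte header block.
gzip -dc "$workdir/demo.tar.gz" | head -c 512 > "$workdir/header.bin"

# Bytes 0-99: entry name, NUL-padded (tr strips the padding).
entry_name=$(head -c 100 "$workdir/header.bin" | tr -d '\0')

# Bytes 124-134: file size as 11 octal digits.
size_octal=$(dd if="$workdir/header.bin" bs=1 skip=124 count=11 2>/dev/null)

# printf reads a number with a leading 0 as octal, so this yields decimal bytes.
size_bytes=$(printf '%d' "0$size_octal")

echo "$entry_name: $size_bytes bytes, data starts at offset 512"
```

The skip of 124 is the same arithmetic as in the R code: 100 bytes of name plus 24 bytes of mode, uid and gid fields.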
Now the next step is to actually extract a single file. You may be able to do this by using a "gzfile" connection directly, but here I will use a rawConnection(). This presumes the file fits into memory.

extractTGZ <- function(archfile, filename) {
  # this function returns a raw vector
  # containing the desired file
  fp <- ParseTGZ(archfile)
  offset <- attributes(fp[[filename]])$offset
  fsize <- attributes(fp[[filename]])$size
  gzf <- gzfile(archfile, open = "rb")
  on.exit(close(gzf))
  # jump to the byte position by reading and discarding, don't use seek()
  # may be a bad idea on really large archives...
  readBin(gzf, what = "raw", n = offset)
  # now read the data into a raw vector
  result <- readBin(gzf, what = "raw", n = fsize)
  result
}

Now, finally:

ff <- rawConnection(extractTGZ("myarchive", "myfile"))

Now you can treat ff as if it were (a connection pointing to) your file. But it only exists in memory.
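Outside R, the same "one file, never written to disk" idea can be sketched with tar itself, which can stream a single member to stdout. This assumes GNU tar or bsdtar (both accept -O); the archive and member names below are hypothetical.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/data"
printf 'id,value\n1,42\n' > "$workdir/data/table.csv"
printf 'unrelated\n' > "$workdir/data/other.txt"
tar czf "$workdir/myarchive.tar.gz" -C "$workdir" data

# x = extract, z = gunzip, O = write member contents to stdout, f = archive name.
# Naming a member after the archive restricts extraction to that member.
contents=$(tar -xzOf "$workdir/myarchive.tar.gz" data/table.csv)
printf '%s\n' "$contents"
```

tar still has to decompress the stream from the beginning to reach the member, so this saves disk writes rather than decompression time.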

Unzip tar.gz in Windows

7-Zip can do that: http://www.7-zip.org/

It has a documented command line. I use it every day via scripts.

Plus: it is free and comes in 32-bit and 64-bit versions.

Programmatically extract tar.gz in a single step (on Windows with 7-Zip)


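7z e example.tar.gz && 7z x example.tar

Use && to combine the two commands in one step: the first unpacks example.tar from the gzip file, and the second runs only if the first succeeds and extracts the tar. Use the portable version of 7-Zip (you only need 7z.exe and 7z.dll).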

Extract .tar.gz file on Windows

tar is a Unix archive utility. 7-Zip will unpack that tarball.

As cricket_007 points out, 39 years after AT&T Bell Labs compiled tar, Microsoft used someone else's code to include tar/curl with Windows 10.

unzip a tar.gz file?


fn <- "http://s.wordpress.org/resources/survey/wp2011-survey.tar.gz"
download.file(fn,destfile="tmp.tar.gz")
untar("tmp.tar.gz",list=TRUE) ## check contents
untar("tmp.tar.gz")
## or, if you just want to extract the target file:
untar("tmp.tar.gz",files="wp2011-survey/anon-data.csv")
X <- read.csv("wp2011-survey/anon-data.csv")

Tom Wenseleers points out that the archive package can help with this:

library(archive)
library(readr)
read_csv(archive_read("tmp.tar.gz", file = 3), col_types = cols())

and that archive::archive_extract("tmp.tar.gz", files="wp2011-survey/anon-data.csv") is quite a bit faster than the built-in base R untar (especially for large archives). The archive package supports 'tar', 'ZIP', '7-zip', 'RAR', 'CAB', 'gzip', 'bzip2', 'compress', 'lzma' and 'xz' formats.