Opening a .tar.gz file with a single command
tar xzf file.tar.gz
The letters are:
- x - extract
- z - gunzip the input
- f - read from the named file, not stdin
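To see these letters at work, here is a minimal sketch that builds a throwaway archive, lists it, and extracts it. The names demo, hello.txt, demo.tar.gz and out are made up for illustration. The c (create) and t (list) letters and the -C option are not mentioned above, but they are standard tar options.

```shell
#!/bin/sh
# Hypothetical names throughout: demo/, hello.txt, demo.tar.gz, out/
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Build a small archive to experiment with (c = create).
mkdir demo
printf 'hello\n' > demo/hello.txt
tar czf demo.tar.gz demo

# t = list the contents without extracting anything.
listing=$(tar tzf demo.tar.gz)
printf '%s\n' "$listing"

# x = extract; -C picks the target directory, which must already exist.
mkdir out
tar xzf demo.tar.gz -C out
extracted=$(cat out/demo/hello.txt)
printf '%s\n' "$extracted"
```

Note that f takes the archive name as its argument, so it has to be the last letter in the bundle.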
R: Read single file from within a tar.gz directory
It is possible, but I don't know of any clean implementation (one may exist). Below is some very basic R code that should work in many cases, provided the file names (including the full path inside the archive) are shorter than 100 characters. In a way, it's just re-implementing "untar" in an extremely crude way, but in such a way that it will point to the desired file in a gzipped file.
The first problem is that you should only read a gzipped file from the start. Using "seek()" to re-position the file pointer to the desired file is, unfortunately, erratic in a gzipped file.
ParseTGZ <- function(archname){
  # open tgz archive
  tf <- gzfile(archname, open='rb')
  on.exit(close(tf))
  fnames <- list()
  offset <- 0
  nfile <- 0
  while (TRUE) {
    # go to beginning of entry
    # never use "seek" to re-locate in a gzipped file!
    if (seek(tf) != offset) readBin(tf, what="raw", n= offset - seek(tf))
    # read file name
    fName <- rawToChar(readBin(tf, what="raw", n=100))
    if (nchar(fName)==0) break
    nfile <- nfile + 1
    fnames <- c(fnames, fName)
    attr(fnames[[nfile]], "offset") <- offset+512
    # read size, first skip 24 bytes (file permissions etc)
    # again, we only use readBin, not seek()
    readBin(tf, what="raw", n=24)
    # file size is encoded as a length 12 octal string,
    # with the last character being '\0' (so 11 actual characters)
    sz <- readChar(tf, nchars=11)
    # convert string to number of bytes
    sz <- sum(as.numeric(strsplit(sz,'')[[1]])*8^(10:0))
    attr(fnames[[nfile]], "size") <- sz
    # cat(sprintf('entry %s, %i bytes\n', fName, sz))
    # go to the next entry
    # don't forget entry header (=512)
    offset <- offset + 512*(ceiling(sz/512) + 1)
  }
  # return a named list of character strings with attributes
  names(fnames) <- fnames
  return(fnames)
}
This will give you the exact position and length of all files in the tar.gz archive.
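The header layout this parser relies on (name in the first 100 bytes, then 24 bytes of other fields, then the size as an octal string at byte offset 124) can be checked from the command line. The following is only a sketch under assumptions: hello.txt and demo.tar.gz are made-up names, the first header in the archive is assumed to belong to a regular file, and the size field is assumed to hold 11 zero-padded octal digits, as GNU tar writes it.

```shell
#!/bin/sh
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample: an archive holding one 6-byte file.
printf 'hello\n' > hello.txt
tar czf demo.tar.gz hello.txt

# Decompress from the start and keep only the first 512-byte header block.
gzip -dc demo.tar.gz | head -c 512 > header.bin

# Bytes 0..99: entry name, padded with NUL bytes.
name=$(dd if=header.bin bs=1 count=100 2>/dev/null | tr -d '\000')

# Bytes 124..134: entry size as 11 octal digits (byte 135 is the terminator).
size_octal=$(dd if=header.bin bs=1 skip=124 count=11 2>/dev/null)
size=$((0$size_octal))   # a leading 0 makes the shell read the digits as octal

# Data starts right after the 512-byte header and is padded to whole blocks,
# so the next header sits at 512 * (ceiling(size / 512) + 1).
next=$(( 512 * ((size + 511) / 512 + 1) ))

# Reach the data by reading from the start and discarding bytes, never seeking.
data=$(gzip -dc demo.tar.gz | tail -c +513 | head -c "$size")

printf 'name=%s size=%s next_header=%s data=%s\n' "$name" "$size" "$next" "$data"
```

The last pipeline mirrors what the R code does with readBin: decompress from the beginning and throw away everything before the wanted offset.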
Now the next step is to actually extract a single file. You may be able to do this by using a "gzfile" connection directly, but here I will use a rawConnection(). This presumes your files fit into memory.
extractTGZ <- function(archfile, filename) {
  # this function returns a raw vector
  # containing the desired file
  fp <- ParseTGZ(archfile)
  offset <- attributes(fp[[filename]])$offset
  fsize <- attributes(fp[[filename]])$size
  gzf <- gzfile(archfile, open="rb")
  on.exit(close(gzf))
  # jump to the byte position, don't use seek()
  # may be a bad idea on really large archives...
  readBin(gzf, what="raw", n=offset)
  # now read the data into a raw vector
  result <- readBin(gzf, what="raw", n=fsize)
  result
}
Now, finally:
ff <- rawConnection(extractTGZ("myarchive", "myfile"))
Now you can treat ff as if it were (a connection pointing to) your file. But it only exists in memory.
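For comparison, outside R the tar command itself can stream a single member to standard output without writing it to disk, using the O (--to-stdout) option. This is a minimal sketch with made-up names (demo, data.csv, readme.txt, demo.tar.gz), not a replacement for the R approach above.

```shell
#!/bin/sh
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical archive with two members.
mkdir demo
printf 'a,b\n1,2\n' > demo/data.csv
printf 'notes\n' > demo/readme.txt
tar czf demo.tar.gz demo

# x = extract, z = gunzip, O = write member contents to stdout, f = archive file.
# Only the named member is emitted; nothing new is written to disk.
content=$(tar xzOf demo.tar.gz demo/data.csv)
printf '%s\n' "$content"
```

As with the R version, tar still has to decompress the archive from the start to reach the member, so this does not avoid the cost of reading a large archive.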
Unzip tar.gz in Windows
7-Zip can do that: http://www.7-zip.org/
It has a documented command line. I use it every day via scripts.
Plus: it is free and has 32 and 64 bit versions.
Programmatically extract tar.gz in a single step (on Windows with 7-Zip)
7z e example.tar.gz && 7z x example.tar
Use && to combine two commands in one step. Use the portable 7-Zip (you will need 7z.exe and 7z.dll only).
Extract .tar.gz file on Windows
tar is a Linux archive utility. 7-Zip will unpack that tarball.
As cricket_007 points out, 39 years after AT&T Bell Labs compiled tar, Microsoft used someone else's code to include tar/curl with Windows 10.
unzip a tar.gz file?
fn <- "http://s.wordpress.org/resources/survey/wp2011-survey.tar.gz"
download.file(fn,destfile="tmp.tar.gz")
untar("tmp.tar.gz",list=TRUE) ## check contents
untar("tmp.tar.gz")
## or, if you just want to extract the target file:
untar("tmp.tar.gz",files="wp2011-survey/anon-data.csv")
X <- read.csv("wp2011-survey/anon-data.csv")
Tom Wenseleers points out that the archive package can help with this:
library(archive)
library(readr)
read_csv(archive_read("tmp.tar.gz", file = 3), col_types = cols())
and that archive::archive_extract("tmp.tar.gz", files="wp2011-survey/anon-data.csv") is quite a bit faster than the built-in base R untar (especially for large archives). It supports 'tar', 'ZIP', '7-zip', 'RAR', 'CAB', 'gzip', 'bzip2', 'compress', 'lzma' and 'xz' formats.