extract bz2 file and open ncdf using R
This seems to work the way you want it to:
library(ncdf4)
library(R.utils)
URL <- "ftp://podaac-ftp.jpl.nasa.gov/allData/ghrsst/data/L4/GLOB/NCDC/AVHRR_OI/1982/001/19820101-NCDC-L4LRblend-GLOB-v01-fv02_0-AVHRR_OI.nc.bz2"
bzfil <- basename(URL)
if (!file.exists(bzfil)) download.file(URL, bzfil)
fil <- bunzip2(bzfil, overwrite=TRUE, remove=FALSE)
nc <- nc_open(fil)
summary(nc)
##             Length Class  Mode
## filename    1      -none- character
## writable    1      -none- logical
## id          1      -none- numeric
## safemode    1      -none- logical
## format      1      -none- character
## is_GMT      1      -none- logical
## groups      1      -none- list
## fqgn2Rindex 1      -none- list
## ndims       1      -none- numeric
## natts       1      -none- numeric
## dim         3      -none- list
## unlimdimid  1      -none- numeric
## nvars       1      -none- numeric
## var         4      -none- list
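For readers working outside R, the decompress-and-keep-the-archive step can be sketched with Python's standard library. This is only an illustration: the helper name `bunzip2_keep` is hypothetical, and opening the resulting NetCDF file would still need a separate library that is not shown here.

```python
import bz2
import shutil
from pathlib import Path


def bunzip2_keep(bz_path, overwrite=True):
    """Decompress `bz_path` next to itself and keep the .bz2 file.

    Hypothetical helper, loosely mirroring
    bunzip2(bzfil, overwrite=TRUE, remove=FALSE) from the R answer.
    """
    bz_path = Path(bz_path)
    if bz_path.suffix != ".bz2":
        raise ValueError(f"expected a .bz2 file, got {bz_path.name}")
    out_path = bz_path.with_suffix("")  # strips only the trailing ".bz2"
    if out_path.exists() and not overwrite:
        raise FileExistsError(out_path)
    # Copy in chunks so a large file is never held in memory all at once.
    with bz2.open(bz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

Calling `bunzip2_keep("data.nc.bz2")` would write `data.nc` alongside the archive and return its path.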
How do I extract all the data from a bzip2 archive with C?
This is my source code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <bzlib.h>

int
bunzip_one(FILE *f) {
    int bzError;
    BZFILE *bzf;
    char buf[4096];

    bzf = BZ2_bzReadOpen(&bzError, f, 0, 0, NULL, 0);
    if (bzError != BZ_OK) {
        fprintf(stderr, "E: BZ2_bzReadOpen: %d\n", bzError);
        return -1;
    }

    while (bzError == BZ_OK) {
        int nread = BZ2_bzRead(&bzError, bzf, buf, sizeof buf);
        if (bzError == BZ_OK || bzError == BZ_STREAM_END) {
            size_t nwritten = fwrite(buf, 1, nread, stdout);
            if (nwritten != (size_t) nread) {
                fprintf(stderr, "E: short write\n");
                return -1;
            }
        }
    }

    if (bzError != BZ_STREAM_END) {
        fprintf(stderr, "E: bzip error after read: %d\n", bzError);
        return -1;
    }

    BZ2_bzReadClose(&bzError, bzf);
    return 0;
}

int
bunzip_many(const char *fname) {
    FILE *f;

    f = fopen(fname, "rb");
    if (f == NULL) {
        perror(fname);
        return -1;
    }

    fseek(f, 0, SEEK_SET);
    if (bunzip_one(f) == -1)
        return -1;

    fseek(f, 42, SEEK_SET); /* hello.bz2 is 42 bytes long in my case */
    if (bunzip_one(f) == -1)
        return -1;

    fclose(f);
    return 0;
}

int
main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: bunz <fname>\n");
        return EXIT_FAILURE;
    }
    return bunzip_many(argv[1]) != 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}
- I cared very much about proper error checking. For example, I made sure that bzError was BZ_OK or BZ_STREAM_END before trying to access the buffer. The documentation clearly says that for other values of bzError the returned number is undefined.
- It shouldn't frighten you that about 50 percent of the code is concerned with error handling. That's how it should be. Expect errors everywhere.
- The code still has some bugs. In case of errors it doesn't release the resources (f, bzf) properly.
And these are the commands I used for testing:
$ echo hello > hello
$ echo world > world
$ bzip2 hello
$ bzip2 world
$ cat hello.bz2 world.bz2 > helloworld.bz2
$ gcc -W -Wall -Os -o bunz bunz.c -lbz2
$ ls -l *.bz2
-rw-r--r-- 1 roland None 42 Oct 12 09:26 hello.bz2
-rw-r--r-- 1 roland None 86 Oct 12 09:36 helloworld.bz2
-rw-r--r-- 1 roland None 44 Oct 12 09:26 world.bz2
$ ./bunz.exe helloworld.bz2
hello
world
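The C program above hardcodes the offset 42 to find the second stream. As a rough sketch of how the stream boundary can be discovered instead, the Python standard library's `BZ2Decompressor` reports the bytes left over after a stream ends in its `unused_data` attribute (libbzip2 offers `BZ2_bzReadGetUnused` for a similar purpose). The function name `decompress_all_streams` is made up for this illustration. In current Python versions, `bz2.decompress` already handles concatenated streams on its own.

```python
import bz2


def decompress_all_streams(data: bytes) -> bytes:
    """Decompress every bzip2 stream found back to back in `data`."""
    out = []
    while data:
        decomp = bz2.BZ2Decompressor()  # one decompressor per stream
        out.append(decomp.decompress(data))
        if not decomp.eof:
            raise ValueError("truncated bzip2 stream")
        # Bytes after the end of this stream belong to the next one.
        data = decomp.unused_data
    return b"".join(out)


# Same setup as the shell session: two archives concatenated into one blob.
blob = bz2.compress(b"hello\n") + bz2.compress(b"world\n")
print(decompress_all_streams(blob).decode(), end="")
```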
how to decompress .tar.bz2 in memory with python
For generic bz2 decompression, the BZ2File class may be used.
from bz2 import BZ2File
with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    content = f.read()
content should contain the decompressed contents of the file.
However, given that this is a tar file (an archive file that is normally extracted to disk as a directory of files), the tarfile module could be used instead, and it has extended mode flags for handling bz2. Assuming the target file contains a res_test.csv, the following can be used:
import tarfile
tf = tarfile.open('/app/tmp/res_test.tar.bz2', 'r:bz2')
csvfile = tf.extractfile('res_test.csv').read()
The r:bz2 flag opens the tar archive in a way that makes it possible to seek backwards. This matters because the alternative stream mode, r|bz2, only reads forward, which makes it impractical to call extractfile on arbitrary members. The second line calls extractfile, which returns a file-like object for 'res_test.csv', and read() then returns its contents (bytes in Python 3).
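To make the difference concrete, here is a minimal sketch of what working in the stream mode r|bz2 looks like: members have to be consumed in archive order, while the loop is positioned on them. The function name `iter_members_streaming` and the sample file names are invented for this example.

```python
import io
import tarfile


def iter_members_streaming(fileobj):
    """Yield (name, data) for each regular file, reading strictly forward."""
    # 'r|bz2' treats fileobj as a non-seekable stream of bzip2-compressed tar data.
    with tarfile.open(fileobj=fileobj, mode="r|bz2") as tf:
        for member in tf:
            if member.isfile():
                # Read now: once the loop advances, this member's data is behind us.
                yield member.name, tf.extractfile(member).read()


# Build a small two-file archive in memory to demonstrate the iteration.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:bz2") as tf:
    for name, payload in (("a.txt", b"A"), ("b.txt", b"BB")):
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))

members = list(iter_members_streaming(io.BytesIO(buf.getvalue())))
```

Stream mode is mainly useful when the source cannot seek at all, such as a pipe or a socket. For a file on disk, r:bz2 is usually the simpler choice.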
The transparent open mode ('r:*') is typically recommended, however, so that no failure is encountered if the input tar file is compressed using gzip instead.
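A small sketch of the transparent mode, done entirely in memory: the same reader handles bzip2, gzip, and uncompressed archives. The helper names `make_archive` and `read_member` and the sample CSV payload are assumptions made for the demonstration.

```python
import io
import tarfile


def make_archive(mode: str) -> bytes:
    """Build a tiny in-memory tar archive; mode is e.g. 'w:bz2' or 'w:gz'."""
    payload = b"a,b\n1,2\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tf:
        info = tarfile.TarInfo(name="res_test.csv")
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_member(archive: bytes, name: str) -> bytes:
    """Open with 'r:*' so the compression is detected automatically."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:*") as tf:
        member = tf.extractfile(name)
        if member is None:  # directories and special members carry no data
            raise KeyError(name)
        return member.read()


# One reader, three differently compressed inputs.
results = [read_member(make_archive(m), "res_test.csv") for m in ("w:bz2", "w:gz", "w")]
```

Because the archive bytes are wrapped in `io.BytesIO`, nothing is written to disk, which matches the "in memory" part of the question.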
Naturally, the tarfile module's open function also accepts a fileobj argument, which lets it work on arbitrary stream objects. If the file was already opened using BZ2File, this can also be used:
with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    tf = tarfile.open(fileobj=f, mode='r:')
    csvfile = tf.extractfile('res_test.csv').read()
Using R to download zipped data file, extract, and import data
Zip archives are actually more a 'filesystem' with content metadata etc. See help(unzip) for details. So to do what you sketch out above you need to
- Create a temp. file name (e.g. tempfile())
- Use download.file() to fetch the file into the temp. file
- Use unz() to extract the target file from the temp. file
- Remove the temp file via unlink()
which in code (thanks for basic example, but this is simpler) looks like
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)
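For comparison, the same four steps (temp file, download, read one member, clean up) might look roughly like this in Python. Both function names are hypothetical, and `urlretrieve` is just one of several ways to fetch a file.

```python
import os
import tempfile
import urllib.request
import zipfile


def read_zip_member(zip_path, member):
    """Return the bytes of one member without unpacking the whole archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.read(member)


def fetch_zip_member(url, member):
    """Download `url` to a temp file, read `member`, then delete the temp file."""
    fd, tmp = tempfile.mkstemp(suffix=".zip")
    os.close(fd)  # urlretrieve reopens the path itself
    try:
        urllib.request.urlretrieve(url, tmp)
        return read_zip_member(tmp, member)
    finally:
        os.unlink(tmp)  # counterpart of unlink(temp) in the R version
```

The `try`/`finally` ensures the temp file is removed even if the download or the member lookup fails.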
Compressed (.z) or gzipped (.gz) or bzip2ed (.bz2) files are just the file and those you can read directly from a connection. So get the data provider to use that instead :)
julia: how to read a bz2 compressed text file
I don't know about anything automatic but this is how you could (create and) read a bz2 compressed file:
using CodecBzip2 # after ] add CodecBzip2
# Creating a dummy bz2 file
mystring = "Hello StackOverflow!"
mystring_compressed = transcode(Bzip2Compressor, mystring)
write("testfile.bz2", mystring_compressed)
# Reading and uncompressing it
compressed = read("testfile.bz2")
plain = transcode(Bzip2Decompressor, compressed)
String(plain) # "Hello StackOverflow!"
There are also streaming variants available. For more see CodecBzip2.jl.
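The CodecBzip2.jl streaming API is not shown here. As a rough point of comparison, this is what streaming a bz2-compressed text file line by line looks like with Python's standard library; the function name `iter_bz2_lines` is made up for the sketch.

```python
import bz2


def iter_bz2_lines(path, encoding="utf-8"):
    """Yield decoded lines from a bzip2-compressed text file, one at a time."""
    # Text mode ("rt") decompresses incrementally as lines are requested,
    # so the whole file never has to fit in memory.
    with bz2.open(path, "rt", encoding=encoding) as fh:
        for line in fh:
            yield line.rstrip("\n")
```

A caller would simply loop over it, for example `for line in iter_bz2_lines("testfile.bz2"): ...`.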