Extract Bz2 File in R

extract bz2 file and open ncdf using R

This seems to work the way you want it to:

library(ncdf4)
library(R.utils)

URL <- "ftp://podaac-ftp.jpl.nasa.gov/allData/ghrsst/data/L4/GLOB/NCDC/AVHRR_OI/1982/001/19820101-NCDC-L4LRblend-GLOB-v01-fv02_0-AVHRR_OI.nc.bz2"
bzfil <- basename(URL)
if (!file.exists(bzfil)) download.file(URL, bzfil)

fil <- bunzip2(bzfil, overwrite=TRUE, remove=FALSE)

nc <- nc_open(fil)
summary(nc)

##             Length Class  Mode
## filename    1      -none- character
## writable    1      -none- logical
## id          1      -none- numeric
## safemode    1      -none- logical
## format      1      -none- character
## is_GMT      1      -none- logical
## groups      1      -none- list
## fqgn2Rindex 1      -none- list
## ndims       1      -none- numeric
## natts       1      -none- numeric
## dim         3      -none- list
## unlimdimid  1      -none- numeric
## nvars       1      -none- numeric
## var         4      -none- list
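
If you need the same "unpack, but keep the archive" step outside R, here is a rough Python sketch using only the standard library. The function name bunzip2_keep and the demo file name are made up for illustration. This is not how R.utils implements bunzip2, just something that behaves similarly to the call with remove=FALSE.

```python
import bz2
import os
import shutil
import tempfile


def bunzip2_keep(src, overwrite=False):
    """Decompress src ("name.bz2") next to itself and return the new path.

    The compressed file is left in place (hypothetical helper).
    """
    if not src.endswith(".bz2"):
        raise ValueError("expected a .bz2 file name")
    dest = src[: -len(".bz2")]
    if os.path.exists(dest) and not overwrite:
        raise FileExistsError(dest)
    # Copy in chunks so a large file never has to fit in memory.
    with bz2.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


# Demo on a throwaway file; the content is a placeholder, not real NetCDF.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "example.nc.bz2")
with open(src, "wb") as fh:
    fh.write(bz2.compress(b"not really NetCDF" * 100))

out = bunzip2_keep(src)
```

After the call, out points at the decompressed file and the .bz2 file is still on disk, so the download check in the R code would keep skipping the download.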

How do I extract all the data from a bzip2 archive with C?

This is my source code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <bzlib.h>

int
bunzip_one(FILE *f) {
    int bzError;
    BZFILE *bzf;
    char buf[4096];

    bzf = BZ2_bzReadOpen(&bzError, f, 0, 0, NULL, 0);
    if (bzError != BZ_OK) {
        fprintf(stderr, "E: BZ2_bzReadOpen: %d\n", bzError);
        return -1;
    }

    while (bzError == BZ_OK) {
        int nread = BZ2_bzRead(&bzError, bzf, buf, sizeof buf);
        if (bzError == BZ_OK || bzError == BZ_STREAM_END) {
            size_t nwritten = fwrite(buf, 1, nread, stdout);
            if (nwritten != (size_t) nread) {
                fprintf(stderr, "E: short write\n");
                return -1;
            }
        }
    }

    if (bzError != BZ_STREAM_END) {
        fprintf(stderr, "E: bzip error after read: %d\n", bzError);
        return -1;
    }

    BZ2_bzReadClose(&bzError, bzf);
    return 0;
}

int
bunzip_many(const char *fname) {
    FILE *f;

    f = fopen(fname, "rb");
    if (f == NULL) {
        perror(fname);
        return -1;
    }

    fseek(f, 0, SEEK_SET);
    if (bunzip_one(f) == -1)
        return -1;

    fseek(f, 42, SEEK_SET); /* hello.bz2 is 42 bytes long in my case */
    if (bunzip_one(f) == -1)
        return -1;

    fclose(f);
    return 0;
}

int
main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: bunz <fname>\n");
        return EXIT_FAILURE;
    }
    return bunzip_many(argv[1]) != 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

  • I cared very much about proper error checking. For example, I made sure that bzError was BZ_OK or BZ_STREAM_END before trying to access the buffer. The documentation clearly says that for other values of bzError the returned number is undefined.
  • It shouldn't frighten you that about 50 percent of the code is concerned with error handling. That's how it should be. Expect errors everywhere.
  • The code still has some bugs. In case of errors it doesn't release the resources (f, bzf) properly.

And these are the commands I used for testing:

$ echo hello > hello
$ echo world > world
$ bzip2 hello
$ bzip2 world
$ cat hello.bz2 world.bz2 > helloworld.bz2
$ gcc -W -Wall -Os -o bunz bunz.c -lbz2
$ ls -l *.bz2
-rw-r--r-- 1 roland None 42 Oct 12 09:26 hello.bz2
-rw-r--r-- 1 roland None 86 Oct 12 09:36 helloworld.bz2
-rw-r--r-- 1 roland None 44 Oct 12 09:26 world.bz2
$ ./bunz.exe helloworld.bz2
hello
world
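
The hard-coded offset of 42 only works for this particular test file. The general idea is to ask the decompressor where one stream ended and start the next one from the leftover bytes. As a language-neutral illustration, here is a minimal Python sketch of that loop with the standard bz2 module; the function name bunzip_all is made up. In C, libbzip2 offers BZ2_bzReadGetUnused for the same purpose, but I have not reworked the C program above to use it.

```python
import bz2


def bunzip_all(data):
    """Decompress every bzip2 stream concatenated in data (bytes)."""
    out = []
    while data:
        decomp = bz2.BZ2Decompressor()
        out.append(decomp.decompress(data))
        if not decomp.eof:
            raise ValueError("truncated bzip2 stream")
        # Whatever follows the end of this stream belongs to the next one.
        data = decomp.unused_data
    return b"".join(out)


# Same shape as helloworld.bz2 above: two streams, back to back.
combined = bz2.compress(b"hello\n") + bz2.compress(b"world\n")
result = bunzip_all(combined)
```

In everyday Python code bz2.decompress already handles multi-stream input by itself; the loop is spelled out only to show the mechanism.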

how to decompress .tar.bz2 in memory with python

For generic bz2 decompression, the BZ2File class may be used.

from bz2 import BZ2File
with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    content = f.read()

content should then hold the decompressed contents of the file as bytes; here that is the raw tar archive, not yet the files inside it.

However, given that this is a tar file (an archive file that is normally extracted to disk as a directory of files), the tarfile module could be used instead, and it has extended mode flags for handling bz2. Assuming the target file contains a res_test.csv, the following can be used:

import tarfile

tf = tarfile.open('/app/tmp/res_test.tar.bz2', 'r:bz2')
csvfile = tf.extractfile('res_test.csv').read()

The r:bz2 mode opens the tar archive with random access, so tarfile can seek backwards. That matters because the alternative stream mode, r|bz2, reads the archive strictly front to back, which makes it impractical to look a member up by name and extract it afterwards. The second line calls extractfile, which returns a file-like object for 'res_test.csv'; read() then returns its contents as bytes (a str in Python 2).

The transparent mode ('r:*') is usually the better choice, however: tarfile detects the compression itself, so the same code keeps working if the input tar file turns out to be compressed with gzip instead.
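
To see the transparent mode in action without touching the disk, here is a small self-contained sketch. The helper names make_archive and read_member and the CSV payload are made up for the demo; the tarfile calls themselves are standard library.

```python
import io
import tarfile


def make_archive(compression):
    """Build a one-member tar archive in memory; compression is 'bz2' or 'gz'."""
    payload = b"a,b\n1,2\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:" + compression) as tf:
        info = tarfile.TarInfo(name="res_test.csv")
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_member(archive_bytes, name):
    # 'r:*' lets tarfile work out the compression on its own.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tf:
        return tf.extractfile(name).read()


from_bz2 = read_member(make_archive("bz2"), "res_test.csv")
from_gz = read_member(make_archive("gz"), "res_test.csv")
```

Both calls go through the same read_member code path and return the same bytes, which is the point of 'r:*'.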

Naturally, tarfile.open also works at a lower level: it accepts an arbitrary file object through its fileobj argument. If the file has already been opened with BZ2File, that object can be passed in directly:

import tarfile
from bz2 import BZ2File

with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    tf = tarfile.open(fileobj=f, mode='r:')
    csvfile = tf.extractfile('res_test.csv').read()
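
If the source is a stream that cannot seek at all (a pipe or a socket, say), the r|bz2 stream mode is still usable as long as each member is read at the moment it is reached. A minimal sketch of that pattern follows; read_all_members and the two member names are invented for the demo, and a BytesIO object stands in for the real stream.

```python
import io
import tarfile


def read_all_members(stream):
    """Return {member name: bytes} from a bz2-compressed tar stream."""
    contents = {}
    # 'r|bz2' never seeks, so each member is read as soon as it is reached.
    with tarfile.open(fileobj=stream, mode="r|bz2") as tf:
        for member in tf:
            if member.isfile():
                contents[member.name] = tf.extractfile(member).read()
    return contents


# Build a small two-member archive in memory to exercise the function.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:bz2") as tf:
    for name, payload in [("a.csv", b"x\n1\n"), ("b.csv", b"y\n2\n")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))

files = read_all_members(io.BytesIO(buf.getvalue()))
```

The trade-off is that everything ends up in one dict in memory, which is fine for small archives and a poor fit for very large ones.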

Using R to download zipped data file, extract, and import data

Zip archives are really more of a 'filesystem', with content, metadata, etc. See help(unzip) for details. So to do what you sketch out above you need to

  1. Create a temporary file name (e.g. with tempfile())
  2. Use download.file() to fetch the file into the temp. file
  3. Use unz() to extract the target file from temp. file
  4. Remove the temp file via unlink()

which in code (thanks for the basic example, but this is simpler) looks like:

temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)

Compressed (.z), gzipped (.gz) or bzip2ed (.bz2) files, by contrast, each hold just the one file, and those you can read directly from a connection. So get the data provider to use that instead :)
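
For comparison, the four steps listed above map onto Python's standard library roughly as follows. This is only a sketch: read_zip_member is a made-up name, and the demo serves a locally built archive through a file:// URL so that it runs without network access (a1.zip and a1.dat are just borrowed from the R example as names).

```python
import os
import shutil
import tempfile
import urllib.request
import zipfile
from pathlib import Path


def read_zip_member(url, member):
    """Download a zip archive to a temp file and return one member as bytes."""
    fd, tmp_path = tempfile.mkstemp(suffix=".zip")       # step 1
    try:
        with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(url) as resp:
            shutil.copyfileobj(resp, tmp)                 # step 2
        with zipfile.ZipFile(tmp_path) as zf:
            return zf.read(member)                        # step 3
    finally:
        os.unlink(tmp_path)                               # step 4


# Demo without network access: a local archive behind a file:// URL.
workdir = tempfile.mkdtemp()
local_zip = os.path.join(workdir, "a1.zip")
with zipfile.ZipFile(local_zip, "w") as zf:
    zf.writestr("a1.dat", "1 2 3\n4 5 6\n")

table = read_zip_member(Path(local_zip).as_uri(), "a1.dat")
```

The try/finally makes sure the temporary file is removed even when the download or the extraction fails.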

julia: how to read a bz2 compressed text file

I don't know about anything automatic, but this is how you could (create and) read a bz2-compressed file:

using CodecBzip2 # after ] add CodecBzip2

# Creating a dummy bz2 file
mystring = "Hello StackOverflow!"
mystring_compressed = transcode(Bzip2Compressor, mystring)
write("testfile.bz2", mystring_compressed)

# Reading and uncompressing it
compressed = read("testfile.bz2")
plain = transcode(Bzip2Decompressor, compressed)
String(plain) # "Hello StackOverflow!"

There are also streaming variants available. For more see CodecBzip2.jl.
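
To illustrate what streaming buys you in general, here is a minimal sketch in Python rather than Julia, using only the standard bz2 module. It says nothing about the CodecBzip2.jl API; the file name is simply reused from the example above and the second line of text is made up.

```python
import bz2
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "testfile.bz2")

# Write a small compressed text file.
with bz2.open(path, "wt", encoding="utf-8") as fh:
    fh.write("Hello StackOverflow!\n")
    fh.write("second line\n")

# Read it back lazily; the file is decompressed piece by piece while
# iterating, instead of being loaded whole as with read() + transcode.
with bz2.open(path, "rt", encoding="utf-8") as fh:
    lines = [line.rstrip("\n") for line in fh]
```

For a text file too large to hold in memory, the list comprehension would be replaced by processing each line inside the loop.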


