R Reading in a Zip Data File Without Unzipping It

If your zip file is called Sales.zip and contains only a file called Sales.dat, I think you can simply do the following (assuming the file is in your working directory):

# nrows = 10 reads only the first 10 rows; drop it to read the whole file
data <- read.table(unz("Sales.zip", "Sales.dat"), nrows = 10, header = TRUE, quote = "\"", sep = ",")

R Reading in a zip data file without unzipping it (loss of information)

You can also always use fread() from data.table. You can execute arbitrary shell commands from the file argument to handle the unzip, and it won't auto-coerce your timestamps by default either, so you shouldn't have the truncation issue. The vignette "Convenience features of fread" has some great examples.

(Bonus: it's significantly faster than readr, and absolutely blows it out of the water if you install the development v1.10.5 version off GitHub with multi-threading in fread.)

library(data.table)

myData <- fread("gunzip -c foo.txt.gz")
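The same idea can be applied to a .zip archive rather than a .gz file. This is only a sketch: it assumes an `unzip` command-line tool is on the PATH, and `fread_zip_member` is a hypothetical helper name, not part of data.table. Recent data.table versions expose an explicit `cmd` argument for shell commands, which is what the sketch uses.

```r
# Hypothetical helper: stream one member of a zip archive into fread()
# without extracting it to disk. Assumes the `unzip` CLI is installed.
fread_zip_member <- function(zip_file, member, ...) {
  # `unzip -p` writes the member's contents to stdout;
  # shQuote() protects file names that contain spaces
  cmd <- paste("unzip -p", shQuote(zip_file), shQuote(member))
  data.table::fread(cmd = cmd, ...)
}

# Usage, with the file names from the first answer (assumed to exist):
# sales <- fread_zip_member("Sales.zip", "Sales.dat")
```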

When reading in data from a zip-file in R, it corrupts the previous read-in data

The problem is the default behavior of the read_delim() function. To improve performance, the data is loaded lazily, meaning it is only read from the file when it is actually needed.

So in actuality the return value from "f_get_data" is just a pointer to the data. In this case it is a pointer to your temporary file, which is overwritten on each call to the function.

To solve this, set lazy to FALSE in the read_delim() function call.

df <- read_delim(unzip(zip_file, files = data), delim = ",", lazy = FALSE) %>%
  mutate(year = i + 2015)
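For context, here is one way the whole reader function might look. This is a sketch under assumptions, not the asker's actual "f_get_data": the name `read_zip_member_eager` and its arguments are hypothetical. Besides setting lazy = FALSE, it extracts each archive into its own fresh directory, so successive calls never write to the same path.

```r
# Hypothetical helper: read one member of a zip archive fully into memory.
read_zip_member_eager <- function(zip_file, member, year_value) {
  # A fresh extraction directory per call; unzip() creates it if needed
  out_dir <- tempfile("unzipped_")
  on.exit(unlink(out_dir, recursive = TRUE), add = TRUE)
  path <- unzip(zip_file, files = member, exdir = out_dir)

  # lazy = FALSE forces the full read now, so the extracted file
  # can safely be deleted when the function exits
  df <- readr::read_delim(path, delim = ",", lazy = FALSE, show_col_types = FALSE)
  dplyr::mutate(df, year = year_value)
}
```

Because every call returns a fully materialized data frame, results from earlier calls stay valid after later archives are read.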

Reading a zip file in R without knowing the csv file name within it

Why don't you try using unzip to find the filename inside the ZIP archive:

zipdf <- unzip(zip_file, list = TRUE)
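# the following line assumes the archive has only a single file
# (R vectors are 1-indexed, so the first entry is [1])
csv_file <- zipdf$Name[1]

# read the member through unz() so the archive never has to be extracted
your_df <- read.table(unz(zip_file, csv_file), skip = 10, nrows = 10, header = TRUE, quote = "\"", sep = ",")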
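If the archive might contain other entries (a README, a directory entry), the lookup can be made a little more defensive. The following is a minimal base-R sketch; `read_single_csv_from_zip` is a hypothetical name, and it assumes exactly one entry ends in .csv.

```r
# Hypothetical helper: find the single CSV inside a zip archive and read it
# through a connection, without extracting anything to disk.
read_single_csv_from_zip <- function(zip_file, ...) {
  # list = TRUE returns a data frame with columns Name, Length and Date
  listing <- unzip(zip_file, list = TRUE)
  csv_names <- listing$Name[grepl("\\.csv$", listing$Name, ignore.case = TRUE)]

  if (length(csv_names) != 1L) {
    stop("Expected exactly one CSV in the archive, found ", length(csv_names))
  }

  # read.csv() opens the unz() connection and closes it when done
  read.csv(unz(zip_file, csv_names), ...)
}
```

Extra arguments such as `skip` or `nrows` are passed straight through to read.csv().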

Using R to download zipped data file, extract, and import data

Zip archives are actually more a 'filesystem' with content metadata etc. See help(unzip) for details. So to do what you sketch out above you need to

  1. Create a temporary file name (e.g. with tempfile())
  2. Use download.file() to fetch the archive into the temporary file
  3. Use unz() to open a connection to the target file inside the archive
  4. Remove the temporary file via unlink()

which in code (thanks for basic example, but this is simpler) looks like

temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)
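If you do this more than once, the four steps can be wrapped in a small function. This is just a sketch; `read_table_from_zip_url` is a hypothetical name, and using on.exit() is a design choice that makes sure the temporary file is removed even when the download or the read fails.

```r
# Hypothetical helper: download a zip archive, read one member, clean up.
read_table_from_zip_url <- function(url, member, ...) {
  temp <- tempfile(fileext = ".zip")
  # Runs when the function exits, whether it succeeds or errors
  on.exit(unlink(temp), add = TRUE)

  # mode = "wb" keeps the archive intact as a binary file on Windows
  download.file(url, temp, mode = "wb", quiet = TRUE)

  # read.table() opens the unz() connection and closes it afterwards
  read.table(unz(temp, member), ...)
}

# Usage with the URL and member name from the example above:
# data <- read_table_from_zip_url("http://www.newcl.org/data/zipfiles/a1.zip", "a1.dat")
```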

Compressed (.z) or gzipped (.gz) or bzip2ed (.bz2) files are just the file and those you can read directly from a connection. So get the data provider to use that instead :)
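To illustrate the single-file case, here is a small self-contained round trip through a gzipped CSV using only base R. The data and the file name are made up for the example.

```r
# Made-up example data
original <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.5))

# Write a gzipped CSV through a gzfile() connection
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
write.csv(original, con, row.names = FALSE)
close(con)

# Read it straight back from the compressed file, no manual decompression
round_trip <- read.csv(gzfile(gz_path))

unlink(gz_path)
```

bzfile() plays the same role for .bz2 files.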


