Automate Zip File Reading in R


You can use unzip to extract the archive. I mention this only because it is not clear from your question whether you knew that. As for reading the file: once you have extracted it to a temporary directory (?tempdir), use list.files to find the files that were dumped there. In your case this is just one file, the one you need. Reading it with read.csv is then straightforward:

l <- list.files(temp_path, full.names = TRUE)  # full.names so read.csv gets the complete path
read.csv(l[1])

assuming your tempdir location is stored in temp_path.
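The steps above can be sketched end to end as follows. The file names and sample data here are made up for illustration, and zip() shells out to a system `zip` utility, which must be installed:

```r
# Create a small CSV and zip it up, just so the sketch is self-contained
csv_in <- file.path(tempdir(), "example.csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), csv_in, row.names = FALSE)
archive <- file.path(tempdir(), "archive.zip")
zip(archive, csv_in, flags = "-j")  # -j drops the directory part of the path

# Extract into a temporary directory, then read whatever landed there
temp_path <- file.path(tempdir(), "extracted")
unzip(archive, exdir = temp_path)

l <- list.files(temp_path, full.names = TRUE)
dat <- read.csv(l[1])
```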

R Reading in a zip data file without unzipping it

If your zip file is called Sales.zip and contains only a file called Sales.dat, I think you can simply do the following (assuming the file is in your working directory):

data <- read.table(unz("Sales.zip", "Sales.dat"), nrows = 10, header = TRUE, quote = "\"", sep = ",")

Using R to download zipped data file, extract, and import data

Zip archives are really more of a 'filesystem', with content metadata etc. See help(unzip) for details. So to do what you sketch out above, you need to:

  1. Create a temporary file name (e.g. via tempfile())
  2. Use download.file() to fetch the archive into that temporary file
  3. Use unz() to extract the target file from the temporary file
  4. Remove the temporary file via unlink()

which in code (thanks for the basic example, but this is simpler) looks like:

temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)

Compressed (.z), gzipped (.gz), or bzip2ed (.bz2) files, by contrast, are just a single compressed file, and those you can read directly from a connection. So get the data provider to use one of those formats instead :)
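To illustrate that last point: a gzip file really can be read straight from a connection, with no extraction step. The file name and data below are made up:

```r
# Write a small gzipped CSV so the sketch is self-contained
gz_path <- file.path(tempdir(), "sales.csv.gz")
con <- gzfile(gz_path, "w")
write.csv(data.frame(id = 1:2, amount = c(9.5, 3.2)), con, row.names = FALSE)
close(con)

# read.csv decompresses transparently via the gzfile() connection
sales <- read.csv(gzfile(gz_path))
```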

How can I extract multiple zip files and read those csvs in R?

I often use the ldply function from the plyr package to read or do stuff with multiple files.

library(plyr)

# get all the zip files (note: pattern takes a regex, not a glob)
zipF <- list.files(path = "/your/path/here/", pattern = "\\.zip$", full.names = TRUE)

# unzip all your files
ldply(.data = zipF, .fun = unzip, exdir = outDir)

As Richard pointed out, this is not complete (too much coffee in the morning is also not good).

# get the csv files (full.names so read.csv can find them)
csv_files <- list.files(path = outDir, pattern = "\\.csv$", full.names = TRUE)

# read the csv files
my_data <- ldply(.data = csv_files, .fun = read.csv)

I liked Joel's comment a lot. I'm so used to the plyr package that I forgot you can also use the sapply function. It may even be the better choice!
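Here is what the read-and-combine step looks like in base R, following that remark. outDir and the sample files below are made up so the sketch is self-contained:

```r
# Create a couple of dummy CSVs to stand in for the unzipped files
outDir <- file.path(tempdir(), "csv_out")
dir.create(outDir, showWarnings = FALSE)
for (i in 1:2)
  write.csv(data.frame(id = i, val = i * 10),
            file.path(outDir, paste0("part-", i, ".csv")), row.names = FALSE)

# lapply + do.call(rbind, ...) replaces ldply here
csv_files <- list.files(outDir, pattern = "\\.csv$", full.names = TRUE)
my_data <- do.call(rbind, lapply(csv_files, read.csv))
```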

Extract certain files from .zip

Thanks to a comment from @user20650:

Use two calls to unzip. The first, with list=TRUE, just retrieves the $Name of each file in the archive. The second, with files=, extracts only the files whose names match the pattern.

library(data.table)

zipped_csv_names <- grep('\\.csv$', unzip('some_archive.zip', list = TRUE)$Name,
                         ignore.case = TRUE, value = TRUE)
unzip('some_archive.zip', files = zipped_csv_names)
comb_tbl <- rbindlist(lapply(zipped_csv_names,
                             function(x) cbind(fread(x, sep = ',', header = TRUE,
                                                     stringsAsFactors = FALSE),
                                               file_nm = x)),
                      fill = TRUE)

R Reading in a zip data file without unzipping it (loss of information)

You can also always use fread() from data.table. You can execute arbitrary shell commands via the file argument to handle the unzip, and it won't auto-coerce your timestamps by default either, so you shouldn't have the truncation issue. The vignette Convenience features of fread has some great examples.

(Bonus: it's significantly faster than readr, and absolutely blows it out of the water if you install the development v1.10.5 version off GitHub, with multi-threading in fread.)

library(data.table)

myData <- fread("gunzip -c foo.txt.gz")
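One note on that call: since data.table 1.11.0 the shell command goes in the cmd= argument rather than file=. The self-contained sketch below writes a small gzipped CSV first; it assumes data.table is installed and `gunzip` is on your PATH:

```r
library(data.table)

# Create a small gzipped CSV to stand in for foo.txt.gz
gz <- file.path(tempdir(), "foo.csv.gz")
con <- gzfile(gz, "w")
writeLines(c("a,b", "1,2", "3,4"), con)
close(con)

# Modern form: shell command via cmd=
myData <- fread(cmd = paste("gunzip -c", shQuote(gz)))
```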

Batch reading compressed CSV files in R

I suspect all that was missing was the correct path to the file in the archive: neither "members_GE_FL.csv" nor "./files/member_database/members/state/members_GE_FL.csv" will work.

But "files/member_database/members/state/members_GE_FL.csv" (without the initial dot) should.

For the sake of completeness, here is a complete example:

Let's create some dummy data: three files named out-1.csv, out-2.csv, and out-3.csv, zipped into dummy-archive.zip:

if (!dir.exists("data")) dir.create("data")
if (!dir.exists("data/dummy-files")) dir.create("data/dummy-files")
for (i in 1:3)
write.csv(data.frame(foo = 1:2, bar = 7:8), paste0("data/dummy-files/out-", i, ".csv"), row.names = FALSE)
zip("data/dummy-archive.zip", "data/dummy-files")

Now let's assume we're looking for 3 other files, two of which are in the archive, one is not:

files_to_find <- c("out-2.csv", "out-3.csv", "out-4.csv")

List the files in the archive, and name them for the sake of clarity:

files_in_archive <- unzip("data/dummy-archive.zip", list = TRUE)$Name
files_in_archive <- setNames(files_in_archive, basename(files_in_archive))

#                  dummy-files                    out-2.csv
#          "data/dummy-files/" "data/dummy-files/out-2.csv"
#                    out-3.csv                    out-1.csv
# "data/dummy-files/out-3.csv" "data/dummy-files/out-1.csv"

Find which of the files we're looking for are in the archive, and read them as you intended to (with read.csv(unz(...))):

i <- basename(files_in_archive) %in% files_to_find
res <- lapply(files_in_archive[i], function(f) read.csv(unz("data/dummy-archive.zip", f)))

# $`out-2.csv`
# foo bar
# 1 1 7
# 2 2 8
#
# $`out-3.csv`
# foo bar
# 1 1 7
# 2 2 8

Clean-up:

unlink(c("data/dummy-files/", "data/dummy-archive.zip"), recursive = TRUE)

