Automate zip file reading in R
You can use unzip() to extract the file. I mention this only because it is not clear from your question whether you knew that. As for reading the file: once you have extracted it to a temporary directory (see ?tempdir), use list.files() to find the files that were dumped into that directory. In your case this is just one file, the file you need. Reading it using read.csv() is then quite straightforward:
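# full.names = TRUE returns paths that include temp_path, so read.csv() can find the file
l = list.files(temp_path, full.names = TRUE)
read.csv(l[1])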
assuming your tempdir location is stored in temp_path.
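The whole extract-then-read flow could be wrapped up roughly as follows. This is only a sketch: the helper name read_first_csv_from_zip is made up for illustration, and it assumes the first extracted file is the CSV you want.

```r
# Hypothetical helper: extract a zip archive into a fresh temporary
# directory and read the first extracted file as CSV.
read_first_csv_from_zip <- function(zip_path) {
  temp_path <- tempfile("unzipped_")  # unique directory name under tempdir()
  dir.create(temp_path)
  # remove the extracted files again once the data has been read
  on.exit(unlink(temp_path, recursive = TRUE), add = TRUE)

  unzip(zip_path, exdir = temp_path)

  # full.names = TRUE keeps the directory in each path;
  # recursive = TRUE also finds files stored in subfolders of the archive
  l <- list.files(temp_path, full.names = TRUE, recursive = TRUE)
  if (length(l) == 0) stop("archive contained no files: ", zip_path)

  read.csv(l[1])
}
```

Usage would then be something like read_first_csv_from_zip("Sales.zip"), where the archive name is just a placeholder.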
R Reading in a zip data file without unzipping it
If your zip file is called Sales.zip and contains only a file called Sales.dat, I think you can simply do the following (assuming the file is in your working directory):
data <- read.table(unz("Sales.zip", "Sales.dat"), nrows = 10, header = TRUE, quote = "\"", sep = ",")
Using R to download zipped data file, extract, and import data
Zip archives are actually more of a 'filesystem', with content metadata etc. See help(unzip) for details. So to do what you sketch out above you need to
- Create a temp. file name (e.g. tempfile())
- Use download.file() to fetch the file into the temp. file
- Use unz() to extract the target file from the temp. file
- Remove the temp. file via unlink()
which in code (thanks for basic example, but this is simpler) looks like
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)
Compressed (.z) or gzipped (.gz) or bzip2ed (.bz2) files are just the file itself, and those you can read directly from a connection. So get the data provider to use that instead :)
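To illustrate reading a gzipped file directly from a connection, here is a small sketch. The file is a throwaway one created in the temporary directory, and the object names are only placeholders.

```r
# Write a small gzipped CSV to a temporary file so the example is self-contained
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
write.csv(data.frame(foo = 1:2, bar = 7:8), con, row.names = FALSE)
close(con)

# Read it back through a gzfile() connection;
# read.csv() opens and closes the connection itself
data_gz <- read.csv(gzfile(gz_path))

# A plain file name works too: file() detects gzip compression when reading
data_gz2 <- read.csv(gz_path)

unlink(gz_path)
```

No temporary extraction step is needed here, which is the point of the remark above about single-file compression formats.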
How can I extract multiple zip files and read those csvs in R?
I often use the ldply function from the plyr package to read or do stuff with multiple files.
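library(plyr)
# get all the zip files (pattern is a regular expression, not a glob)
zipF <- list.files(path = "/your/path/here/", pattern = "\\.zip$", full.names = TRUE)
# unzip all your files into outDir (set outDir to your target directory first);
# l_ply discards the return values, since unzip is called for its side effect
l_ply(.data = zipF, .fun = unzip, exdir = outDir)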
As Richard pointed out, this is not complete (too much coffee in the morning is also not good).
# get the csv files (full.names = TRUE so read.csv gets paths that include outDir)
csv_files <- list.files(path = outDir, pattern = "\\.csv$", full.names = TRUE)
# read the csv files
my_data <- ldply(.data = csv_files, .fun = read.csv)
I liked Joel's comment a lot. I am so used to the plyr package that I forgot you can also use the sapply function. It may even be the better choice!
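A base R version without plyr might look roughly like this. It is a sketch only: the helper name read_all_zipped_csvs is made up, it uses lapply rather than sapply so the result is always a list, and it assumes every CSV has the same columns.

```r
# Hypothetical helper: unzip every archive in zip_dir into out_dir,
# then read all extracted CSV files and stack them into one data frame.
read_all_zipped_csvs <- function(zip_dir, out_dir) {
  zip_files <- list.files(zip_dir, pattern = "\\.zip$", full.names = TRUE)

  # unzip() is called for its side effect; exdir is created if necessary
  invisible(lapply(zip_files, unzip, exdir = out_dir))

  csv_files <- list.files(out_dir, pattern = "\\.csv$",
                          full.names = TRUE, recursive = TRUE)

  # rbind() requires identical column names across files (an assumption here)
  do.call(rbind, lapply(csv_files, read.csv))
}
```

Called as read_all_zipped_csvs("/your/path/here/", outDir), it returns a single data frame, much like the ldply call above.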
Extract certain files from .zip
Thanks to comment from @user20650.
Use two calls to unzip. First with list=TRUE just to get the $Name for the files. Second with files= to extract only the files whose names match the pattern.
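library(data.table)  # provides fread() and rbindlist()
# names of all .csv entries in the archive
zipped_csv_names <- grep('\\.csv$', unzip('some_archive.zip', list=TRUE)$Name,
                         ignore.case=TRUE, value=TRUE)
# extract only those entries
unzip('some_archive.zip', files=zipped_csv_names)
# read each file, tag its rows with the file name, and stack the results
comb_tbl <- rbindlist(lapply(zipped_csv_names,
                             function(x) cbind(fread(x, sep=',', header=TRUE,
                                                     stringsAsFactors=FALSE),
                                               file_nm=x)), fill=TRUE)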
R Reading in a zip data file without unzipping it (loss of information)
You can also always use fread() from data.table. You can execute arbitrary shell commands from the file argument to handle the unzip, and it won't auto coerce your timestamps by default either, so you shouldn't have the truncation issue. The vignette Convenience features of fread has some great examples.
(Bonus, it's significantly faster than readr, and absolutely blows it out of the water if you install the development v1.10.5 version off github with multi-threading in fread.)
library(data.table)
# in recent data.table versions, shell commands are passed via the cmd argument
myData <- fread(cmd = "gunzip -c foo.txt.gz")
Batch reading compressed CSV files in R
I suspect all that was missing was the correct path to the file in the archive: neither "members_GE_FL.csv" nor "./files/member_database/members/state/members_GE_FL.csv" will work.
But "files/member_database/members/state/members_GE_FL.csv" (without the initial dot) should.
For the sake of completeness, here is a complete example:
Let's create some dummy data, three files named out-1.csv, out-2.csv, out-3.csv and zip them in dummy-archive.zip:
if (!dir.exists("data")) dir.create("data")
if (!dir.exists("data/dummy-files")) dir.create("data/dummy-files")
for (i in 1:3)
write.csv(data.frame(foo = 1:2, bar = 7:8), paste0("data/dummy-files/out-", i, ".csv"), row.names = FALSE)
zip("data/dummy-archive.zip", "data/dummy-files")
Now let's assume we're looking for 3 other files, two of which are in the archive, one is not:
files_to_find <- c("out-2.csv", "out-3.csv", "out-4.csv")
List the files in the archive, and name them for the sake of clarity:
files_in_archive <- unzip("data/dummy-archive.zip", list = TRUE)$Name
files_in_archive <- setNames(files_in_archive, basename(files_in_archive))
#                  dummy-files                    out-2.csv
#          "data/dummy-files/" "data/dummy-files/out-2.csv"
#                    out-3.csv                    out-1.csv
# "data/dummy-files/out-3.csv" "data/dummy-files/out-1.csv"
Find the indices of files we're looking for in the archive, and read them like you intended to (with read.csv(unz(....))):
i <- basename(files_in_archive) %in% files_to_find
res <- lapply(files_in_archive[i], function(f) read.csv(unz("data/dummy-archive.zip", f)))
# $`out-2.csv`
#   foo bar
# 1   1   7
# 2   2   8
#
# $`out-3.csv`
#   foo bar
# 1   1   7
# 2   2   8
Clean-up:
unlink(c("data/dummy-files/", "data/dummy-archive.zip"), recursive = TRUE)