Using R to Download Gzipped Data File, Extract, and Import Data

Using R to download gzipped data file, extract, and import data

I like Ramnath's approach, but I would use temp files like so:

tmpdir <- tempdir()

url <- 'http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz'
file <- basename(url)
download.file(url, file)

untar(file, compressed = 'gzip', exdir = tmpdir )
list.files(tmpdir)

The list.files() should produce something like this:

[1] "TicDataDescr.txt" "dictionary.txt"   "ticdata2000.txt"  "ticeval2000.txt"  "tictgts2000.txt" 

which you could parse if you needed to automate this process for a lot of files.

Using R to download zipped data file, extract, and import data

Zip archives are actually more a 'filesystem' with content metadata etc. See help(unzip) for details. So to do what you sketch out above you need to

  1. Create a temp. file name (eg tempfile())
  2. Use download.file() to fetch the file into the temp. file
  3. Use unz() to extract the target file from temp. file
  4. Remove the temp file via unlink()

which in code (thanks for basic example, but this is simpler) looks like

temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)

Compressed (.z) or gzipped (.gz) or bzip2ed (.bz2) files are just the file and those you can read directly from a connection. So get the data provider to use that instead :)

Downloading and extracting .gz data file using R

Some additional options, with base R:

url <- "http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz"
tmp <- tempfile()
##
download.file(url,tmp)
##
data <- read.csv(
gzfile(tmp),
sep="\t",
header=TRUE,
stringsAsFactors=FALSE)
names(data)[1] <- sub("X\\.","",names(data)[1])
##
R> head(data)
mirbase_acc mirna_name gene_id gene_symbol transcript_id ext_transcript_id mirna_alignment
1 MIMAT0000062 hsa-let-7a 5270 SERPINE2 uc002vnu.2 NM_006216 uuGAUAUGUUGGAUGAU-GGAGu
2 MIMAT0000062 hsa-let-7a 494188 FBXO47 uc002hrc.2 NM_001008777 uugaUA-UGUU--GGAUGAUGGAGu
3 MIMAT0000062 hsa-let-7a 80025 PANK2 uc002wkc.2 NM_153638 uugauaUGUUGG-AUGAUGGAgu
4 MIMAT0000062 hsa-let-7a 26036 ZNF451 uc003pdp.2 AK027074 uuGAUAUGUUGGAUGAUGGAGu
5 MIMAT0000062 hsa-let-7a 586 BCAT1 uc001rgd.3 NM_005504 uugaUAUGUUGGAUGAUGGAGu
6 MIMAT0000062 hsa-let-7a 22903 BTBD3 uc002wnz.2 NM_014962 uuGAUAUGUUGGAU-GAUGG-AGu
alignment gene_alignment mirna_start mirna_end gene_start gene_end
1 | :|: ||:|| ||| |||| aaCGGUGAAAUCU-CUAGCCUCu 2 21 495 516
2 || |||: ::||||||||: acaaAUCACAGUUUUUACUACCUUc 2 19 459 483
3 |::||: |||||||| aauuucAUGACUGUACUACCUga 3 17 77 99
4 || || | | ||||||| ccCUCUAGA---UUCUACCUCa 2 21 1282 1300
5 :|| |: |||||||| guagGUAAAGGAAACUACCUCa 2 19 6410 6431
6 || || ||| || ||||| || uaCUUUAAAACAUAUCUACCAUCu 2 21 2265 2288
genome_coordinates conservation align_score seed_cat energy mirsvr_score
1 [hg19:2:224840068-224840089:-] 0.5684 122 0 -14.73 -0.7269
2 [hg19:17:37092945-37092969:-] 0.6464 140 0 -16.38 -0.1156
3 [hg19:20:3904018-3904040:+] 0.6522 139 0 -16.04 -0.2066
4 [hg19:6:56966300-56966318:+] 0.7627 144 7 -14.51 -0.8609
5 [hg19:12:24964511-24964532:-] 0.6775 150 7 -15.09 -0.2735
6 [hg19:20:11906579-11906602:+] 0.5740 131 0 -12.59 -0.2540

Or if you are on a Unix-like system, you could obtain the .txt file (either outside of R or using system or system2 from within R) like this:

[nathan@nrussell tmp]$ url="http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz"
[nathan@nrussell tmp]$ wget "$url" && gunzip human_predictions_S_C_aug2010.txt.gz

and then proceed as above, where you are reading in human_predictions_S_C_aug2010.txt from wherever wget and gunzip were executed,

data <- read.csv(
"~/tmp/human_predictions_S_C_aug2010.txt",
stringsAsFactors=FALSE,
header=TRUE,
sep="\t")

in my case.

Using R to download zipped data file, extract, and import .csv

In order to get your data to download and uncompress, you need to set mode="wb"

download.file("...",temp, mode="wb")
unzip(temp, "gbr_Country_en_csv_v2.csv")
dd <- read.table("gbr_Country_en_csv_v2.csv", sep=",",skip=2, header=T)

It looks like the default is "w" which assumes a text files. If it was a plain csv file this would be fine. But since it's compressed, it's a binary file, hence the "wb". Without the "wb" part, you can't open the zip at all.

unzip a tar.gz file?

fn <- "http://s.wordpress.org/resources/survey/wp2011-survey.tar.gz"
download.file(fn,destfile="tmp.tar.gz")
untar("tmp.tar.gz",list=TRUE) ## check contents
untar("tmp.tar.gz")
## or, if you just want to extract the target file:
untar("tmp.tar.gz",files="wp2011-survey/anon-data.csv")
X <- read.csv("wp2011-survey/anon-data.csv")

Tom Wenseleers points out that the archive package can help with this:

library(archive)
library(readr)
read_csv(archive_read("tmp.tar.gz", file = 3), col_types = cols())

and that archive::archive_extract("tmp.tar.gz", files="wp2011-survey/anon-data.csv") is quite a bit faster than the in-built base R untar (especially for large archives) It supports 'tar', 'ZIP', '7-zip', 'RAR', 'CAB', 'gzip', 'bzip2', 'compress', 'lzma' and 'xz' formats.

R Reading in a zip data file without unzipping it

If your zip file is called Sales.zip and contains only a file called Sales.dat, I think you can simply do the following (assuming the file is in your working directory):

data <- read.table(unz("Sales.zip", "Sales.dat"), nrows=10, header=T, quote="\"", sep=",")

Decompress gz file using R

If you really want to uncompress the file, just use the untar function which does support gzip.
E.g.:

untar('chadwick-0.5.3.tar.gz')

How can I read a tar.gz file directly from a URL into Pandas?

You can use the BytesIO(In-Memory Stream) to keep the data in memory instead of saving the file to local machine.

Also As per the tarfile.open documentation, If fileobj is specified, it is used as an alternative to a file object opened in binary mode for name.

>>> import tarfile
>>> from io import BytesIO
>>>
>>> import requests
>>> import pandas as pd

>>> url = "https://github.com/beoutbreakprepared/nCoV2019/blob/master/latest_data/latestdata.tar.gz?raw=true"
>>> response = requests.get(url, stream=True)
>>> with tarfile.open(fileobj=BytesIO(response.raw.read()), mode="r:gz") as tar_file:
... for member in tar_file.getmembers():
... f = tar_file.extractfile(member)
... df = pd.read_csv(f)
... print(df)


Related Topics



Leave a reply



Submit