R: Reading a Binary File That Is Zipped

Read binary files in R from a zipped file and a known starting position (byte offset)

Here's a bit of a hack that might work for you. First, create a fake binary file:

writeBin(as.raw(1:255), "file.bin")
readBin("file.bin", raw(1), n = 16)
# [1] 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10

And here's the produced zip file:

zip("file.zip", "file.bin")
# adding: file.bin (stored 0%)
readBin("file.zip", raw(1), n = 16)
# [1] 50 4b 03 04 0a 00 02 00 00 00 7b ab 45 4a 87 1f
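The leading bytes 50 4b 03 04 are the ZIP local file header signature ("PK" followed by 03 04), which is why the archive no longer starts with the payload bytes. As a minimal sketch, a check along these lines can tell a zipped file from a raw one before deciding how to read it. The name is_zip is a hypothetical helper, not a base R function, and the demo files are made up for illustration:

```r
# Hypothetical helper: TRUE if the file starts with the ZIP local file header signature.
is_zip <- function(path) {
  magic <- readBin(path, what = "raw", n = 4)
  identical(magic, as.raw(c(0x50, 0x4b, 0x03, 0x04)))
}

# Self-contained demo on two temporary files
plain <- tempfile(fileext = ".bin")
writeBin(as.raw(1:255), plain)
fake_zip <- tempfile(fileext = ".zip")
# Only the signature plus two filler bytes; this is not a valid archive
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x0a, 0x00)), fake_zip)

is_zip(plain)
is_zip(fake_zip)
```

An empty archive starts with a different signature (50 4b 05 06), so this sketch deliberately covers only archives that contain at least one entry.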

This pipes the unzipped bytes through dd, which skips the first 5 bytes and writes the next 4 to a temporary intermediate binary file:

system('sh -c "unzip -p file.zip file.bin | dd of=tempfile.bin bs=1c skip=5c count=4c"')
# 4+0 records in
# 4+0 records out
# 4 bytes copied, 0.00044964 s, 8.9 kB/s
file.info("tempfile.bin")$size
# [1] 4
readBin("tempfile.bin", raw(1), n = 16)
# [1] 06 07 08 09

This method offloads the "expense" of dealing with the size of the stored binary data to the shell pipeline, outside of R.
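If the external tools are not available, a pure-R variant is possible when streaming the skipped bytes through R is acceptable. The sketch below assumes that unz() connections cannot seek (as far as I know, seek() is only implemented for file connections and read-only gzfile connections), so it reads and discards bytes up to the offset. The function name read_from_offset and its chunk argument are hypothetical:

```r
# Hypothetical helper: skip `offset` bytes on a connection by reading and
# discarding them in chunks, then return the next `n` bytes as a raw vector.
read_from_offset <- function(con, offset, n, chunk = 65536L) {
  if (!isOpen(con)) {
    open(con, "rb")
    on.exit(close(con), add = TRUE)  # only close connections opened here
  }
  remaining <- offset
  while (remaining > 0) {
    skipped <- readBin(con, what = "raw", n = min(remaining, chunk))
    if (length(skipped) == 0L) break  # data ended before the offset was reached
    remaining <- remaining - length(skipped)
  }
  readBin(con, what = "raw", n = n)
}

# Demo on an in-memory connection holding the same bytes as file.bin
con <- rawConnection(as.raw(1:255))
res <- read_from_offset(con, offset = 5, n = 4)
close(con)
res

# With the archive from above, the call would be:
# read_from_offset(unz("file.zip", "file.bin"), offset = 5, n = 4)
```

Unlike the dd approach, every skipped byte still passes through R, so this trades the shell dependency for more work inside R when the offset is large.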

This worked on Windows 10 with R 3.3.2. I'm using dd from Git for Windows (version 2.11.0.3, though 2.11.1 is available), and unzip and sh from RTools.

Sys.which(c("dd", "unzip", "sh"))
# dd
# "C:\\PROGRA~1\\Git\\usr\\bin\\dd.exe"
# unzip
# "c:\\Rtools\\bin\\unzip.exe"
# sh
# "c:\\Rtools\\bin\\sh.exe"

Dealing with a binary file in R

library(tidyverse)
library(httr)

tmp <- tempfile()
GET("http://example.com/file.zip", write_disk(tmp))

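# unzip() returns the extracted file paths; this assumes the archive holds a single CSV
df <- unzip(tmp, exdir = tempdir()) %>% read_csv()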

How to download and/or extract data stored in a 'raw' binary zip object within a response object in R?

welcome!

Based on the documentation for the API, the response to the getDataset endpoint has the following schema:

Dataset archive including meta information, the dataset itself is base64 encoded to allow for binary ZIP
transfers.

{
  "status": "OK",
  "dataset": {
    "state_id": 5,
    "session_id": 1624,
    "session_name": "2019-2020 Regular Session",
    "dataset_hash": "1c7d77fe298a4d30ad763733ab2f8c84",
    "dataset_date": "2018-12-23",
    "dataset_size": 317775,
    "mime": "application\/zip",
    "zip": "MIME 64 Encoded Document"
  }
}

We can obtain the data in R with the following code:

library(httr)
library(jsonlite)
library(stringr)
library(maditr)
token <- "" # Your API key
session_id <- 1253L # Obtained from the getDatasetList endpoint
access_key <- "2qAtLbkQiJed9Z0FxyRblu" # Obtained from the getDatasetList endpoint
destfile <- file.path("path", "to", "file.zip") # Modify
response <- str_c("https://api.legiscan.com/?key=",
                  token,
                  "&op=getDataset&id=",
                  session_id,
                  "&access_key=",
                  access_key) %>%
  GET()
status_code(x = response) == 200 # Good
body <- content(x = response,
                as = "text",
                encoding = "UTF-8") %>%
  fromJSON() # This contains some extra metadata
# Decode the base64 payload and write the raw bytes to disk
body %>%
  getElement(name = "dataset") %>%
  getElement(name = "zip") %>%
  base64_dec() %>%
  writeBin(con = destfile)
unzip(zipfile = destfile)

unzip() extracts the files, which in this case look like:

hash.md5 # Can be checked against the metadata
AL/2016-2016_1st_Special_Session/bill/*.json
AL/2016-2016_1st_Special_Session/people/*.json
AL/2016-2016_1st_Special_Session/vote/*.json

As always, wrap your code in functions and profit.
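As one possible way of wrapping the decode step in a function, here is a minimal sketch. It assumes the response has already been parsed with fromJSON() into a list with status and dataset$zip fields, as in the schema above. The name save_dataset_zip is hypothetical, and the demo payload is made up so the example runs offline:

```r
library(jsonlite)

# Hypothetical wrapper: decode the base64 "zip" field of an already parsed
# getDataset response and write it to destfile. Returns destfile invisibly.
save_dataset_zip <- function(parsed, destfile) {
  if (!identical(parsed$status, "OK")) {
    stop("unexpected response status: ", parsed$status)
  }
  zip_bytes <- base64_dec(parsed$dataset$zip)
  writeBin(zip_bytes, con = destfile)
  invisible(destfile)
}

# Offline demo with a made-up payload (four signature bytes, not a real archive)
fake_bytes <- as.raw(c(0x50, 0x4b, 0x03, 0x04))
fake_response <- list(status = "OK",
                      dataset = list(zip = base64_enc(fake_bytes)))
out <- save_dataset_zip(fake_response, tempfile(fileext = ".zip"))
readBin(out, what = "raw", n = 4)
```

Keeping the HTTP request outside the function makes the decoding step easy to test without an API key.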

PS: Here is how the code would look in Julia, as a comparison.

using Base64, HTTP, JSON3
token = "" # Your API key
session_id = 1253 # Obtained from the getDatasetList endpoint
access_key = "2qAtLbkQiJed9Z0FxyRblu" # Obtained from the getDatasetList endpoint
destfile = joinpath("path", "to", "file.zip") # Modify
response = string("https://api.legiscan.com/?",
                  join(["key=$token",
                        "op=getDataset",
                        "id=$session_id",
                        "access_key=$access_key"],
                       "&")) |>
    HTTP.get
@assert response.status == 200
JSON3.read(response.body) |>
    (content -> content.dataset.zip) |>
    base64decode |>
    (data -> write(destfile, data))
run(`unzip $destfile`) # Pass the archive path to unzip as an argument

Read a zip file in R from a subfolder

You can explicitly specify the path within the archive file:

temp <- tempfile()
download.file("http://seanlahman.com/files/database/baseballdatabank-2017.1.zip", temp, mode="wb")
table1 <- unz(temp, "baseballdatabank-2017.1/core/Salaries.csv")
salaries <- read.csv(table1, sep = ",", header = TRUE)
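When the path inside the archive is not known in advance, unzip(zipfile, list = TRUE) returns a data frame whose Name column holds the member paths. The sketch below picks exactly one member by pattern; pick_member is a hypothetical helper, and the member names in the demo are made up rather than taken from the real archive listing:

```r
# Hypothetical helper: return the single member name matching `pattern`,
# and fail loudly when zero or several members match.
pick_member <- function(member_names, pattern) {
  hits <- grep(pattern, member_names, value = TRUE)
  if (length(hits) != 1L) {
    stop("expected exactly one match for '", pattern, "', found ", length(hits))
  }
  hits
}

# Demo on made-up member names
members <- c("archive/core/Salaries.csv", "archive/core/Teams.csv")
pick_member(members, "Salaries\\.csv$")

# With a downloaded archive in `temp`, the calls would be:
# listing  <- unzip(temp, list = TRUE)
# path     <- pick_member(listing$Name, "Salaries\\.csv$")
# salaries <- read.csv(unz(temp, path))
```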

reading binary files with R

Answering your questions directly:

  1. I am having some issues to convert this code...

    What is the problem here? Your code block contains the comment "but it's the same story", but you haven't explained what the story is. If the problem is with the double, try setting readBin(..., size = 8). In your case, the code would become line1 <- c(readBin(to.read, "integer", 2), readBin(to.read, "double", 1, 8)).

  2. How can I read float (in c# i have rb.ReadSingle()) in R?

    Floats are 4 bytes in size in this case (I would presume), so set size = 4 in readBin().

  3. Is there in R a function to memorize the position that you have arrived when you are reading a binary file? So next time you will read it again, you could skip what you have already read (as in c# with BinaryReader)

    An open connection already keeps track of where you are: successive readBin() calls on the same connection continue from where the previous call stopped, so you can leave the connection open between reads instead of reopening the file. For file connections, seek() also reports and sets the position. If you do reopen the file, you could write a small wrapper around readBin() that first discards the bytes you have already read, using a dummy call such as readBin(con = yourinput, what = "raw", n = n), where the integer n is the number of bytes to throw away, and then reads the bytes that follow into a variable of your choice.
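To make the size arguments concrete, here is a small self-contained sketch that writes two integers, one 8-byte double and one 4-byte float to a temporary file and reads them back in order from a single open connection. The file layout is invented for illustration and is not the layout from the question:

```r
path <- tempfile(fileext = ".bin")

# Write a made-up record: two 4-byte integers, one double, one float
out <- file(path, "wb")
writeBin(c(1L, 2L), out)        # integers, 4 bytes each
writeBin(3.5, out)              # double, 8 bytes
writeBin(0.25, out, size = 4)   # stored as a 4-byte float
close(out)

# Read the fields back; each readBin() call continues where the last one stopped
to.read <- file(path, "rb")
ints <- readBin(to.read, "integer", n = 2)
dbl  <- readBin(to.read, "double", n = 1, size = 8)
flt  <- readBin(to.read, "double", n = 1, size = 4)  # counterpart of ReadSingle()
close(to.read)

file.info(path)$size  # 2 * 4 + 8 + 4 bytes
```

The values 3.5 and 0.25 are exactly representable as floats, so they survive the round trip unchanged; other values read with size = 4 come back with single precision only.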

Read binary file by parts

Use the seek() function, just as you would in a C program.

Make a test file:

> cat(LETTERS,file="letters.txt")

Check the contents: uppercase letters separated by spaces:

> system("cat letters.txt") # unix only
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Open:

> con = file("letters.txt","rb")

Go somewhere and read a few:

> seek(con,3)
[1] 0
> readBin(con,"raw",10)
[1] 20 43 20 44 20 45 20 46 20 47

Those are ASCII codes. Go somewhere else and read a few more:

> seek(con,7)
[1] 13
> readBin(con,"raw",10)
[1] 20 45 20 46 20 47 20 48 20 49
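The same steps can be folded into a small function that opens the file, seeks, reads, and closes the connection again. This is only a sketch: read_chunk is a hypothetical name, and it uses a temporary file rather than letters.txt so that it runs anywhere:

```r
# Hypothetical helper: read `n` raw bytes starting at byte `offset` (0-based)
read_chunk <- function(path, offset, n) {
  con <- file(path, "rb")
  on.exit(close(con))
  seek(con, where = offset, origin = "start")
  readBin(con, what = "raw", n = n)
}

path <- tempfile(fileext = ".txt")
cat(LETTERS, file = path)  # "A B C ... Z", letters separated by spaces

first <- read_chunk(path, offset = 3, n = 10)
second <- read_chunk(path, offset = 7, n = 10)
rawToChar(first)
rawToChar(second)
```

Because each call opens and closes its own connection, the caller does not have to remember to call close(con), which the interactive session above leaves open.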

