Read binary files in R from a zipped file and a known starting position (byte offset)
Here's a bit of a hack that might work for you. Here's a fake binary file:
writeBin(as.raw(1:255), "file.bin")
readBin("file.bin", raw(1), n = 16)
# [1] 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10
And here's the produced zip file:
zip("file.zip", "file.bin")
# adding: file.bin (stored 0%)
readBin("file.zip", raw(1), n = 16)
# [1] 50 4b 03 04 0a 00 02 00 00 00 7b ab 45 4a 87 1f
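The first four bytes shown above, 50 4b 03 04, are the standard zip local file header signature ("PK" followed by 03 04). Here is a minimal sketch of checking for that signature before handing a file to unzip. The helper name is_zip_file is hypothetical, not part of base R, and an empty archive starts with a different signature, so treat this as a quick sanity check only:

```r
# Hypothetical helper: TRUE if the file starts with the zip local file header
# signature 50 4b 03 04 ("PK" followed by bytes 03 04).
is_zip_file <- function(path) {
  magic <- readBin(path, what = "raw", n = 4)
  identical(magic, as.raw(c(0x50, 0x4b, 0x03, 0x04)))
}

# Demonstration on two small files written by hand (no real archive needed)
zip_like <- tempfile()
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x0a, 0x00)), zip_like)
not_zip <- tempfile()
writeBin(as.raw(1:16), not_zip)

is_zip_file(zip_like)
is_zip_file(not_zip)
```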
This pipes the unzipped contents to dd, which skips the first 5 bytes and writes the next 4 to a temporary intermediate binary file:
system('sh -c "unzip -p file.zip file.bin | dd of=tempfile.bin bs=1c skip=5c count=4c"')
# 4+0 records in
# 4+0 records out
# 4 bytes copied, 0.00044964 s, 8.9 kB/s
file.info("tempfile.bin")$size
# [1] 4
readBin("tempfile.bin", raw(1), n = 16)
# [1] 06 07 08 09
This method offloads the "expense" of handling the full stored binary data to the shell pipeline, outside of R.
This worked on win10, R-3.3.2. I'm using dd from Git for Windows (version 2.11.0.3, though 2.11.1 is available), and unzip and sh from RTools.
Sys.which(c("dd", "unzip", "sh"))
# dd
# "C:\\PROGRA~1\\Git\\usr\\bin\\dd.exe"
# unzip
# "c:\\Rtools\\bin\\unzip.exe"
# sh
# "c:\\Rtools\\bin\\sh.exe"
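If dd or sh is not available, a similar result can be sketched in pure R. As far as I know, unz() connections do not support seek(), so the sketch below reads and discards bytes in chunks until it reaches the offset. The function name skip_and_read and the 64 KiB chunk size are my own choices. The demonstration uses a gzfile() connection only because it can be created without external tools; a connection opened with unz("file.zip", "file.bin", open = "rb") should be usable the same way.

```r
# Hypothetical helper: discard `offset` bytes from an open binary connection,
# then return the next `n` bytes. Intended for connections that cannot seek.
skip_and_read <- function(con, offset, n, chunk = 65536L) {
  remaining <- offset
  while (remaining > 0) {
    got <- readBin(con, what = "raw", n = min(remaining, chunk))
    if (length(got) == 0L) break   # end of data reached before the offset
    remaining <- remaining - length(got)
  }
  readBin(con, what = "raw", n = n)
}

# Demonstration on compressed data written from R itself
path <- tempfile(fileext = ".gz")
out <- gzfile(path, "wb")
writeBin(as.raw(1:255), out)
close(out)

con <- gzfile(path, "rb")
bytes <- skip_and_read(con, offset = 5, n = 4)
close(con)
bytes
```

Only one chunk plus the requested bytes are held in memory at a time, at the cost of decompressing everything up to the offset.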
dealing with binary file in R
library(tidyverse)
library(httr)
tmp <- tempfile()
GET("http://example.com/file.zip", write_disk(tmp))
df <- unzip(tmp) %>% read_csv()
How to download and/or extract data stored in a 'raw' binary zip object within a response object in R?
welcome!
Based on the documentation for the API, the response to the getDataset endpoint has the following schema:
Dataset archive including meta information, the dataset itself is base64 encoded to allow for binary ZIP transfers.
{
"status": "OK",
"dataset": {
"state_id": 5,
"session_id": 1624,
"session_name": "2019-2020 Regular Session",
"dataset_hash": "1c7d77fe298a4d30ad763733ab2f8c84",
"dataset_date": "2018-12-23",
"dataset_size": 317775,
"mime": "application\/zip",
"zip": "MIME 64 Encoded Document"
}
}
We can obtain the data in R with the following code:
library(httr)
library(jsonlite)
library(stringr)
library(maditr)
token <- "" # Your API key
session_id <- 1253L # Obtained from the getDatasetList endpoint
access_key <- "2qAtLbkQiJed9Z0FxyRblu" # Obtained from the getDatasetList endpoint
destfile <- file.path("path", "to", "file.zip") # Modify
response <- str_c("https://api.legiscan.com/?key=",
token,
"&op=getDataset&id=",
session_id,
"&access_key=",
access_key) %>%
GET()
status_code(x = response) == 200 # Good
body <- content(x = response,
as = "text",
encoding = "utf8") %>%
fromJSON() # This contains some extra metadata
content(x = response,
as = "text",
encoding = "utf8") %>%
fromJSON() %>%
getElement(name = "dataset") %>%
getElement(name = "zip") %>%
base64_dec() %>%
writeBin(con = destfile)
unzip(zipfile = destfile)
unzip() will extract the files, which in this case will look like:
hash.md5 # Can be checked against the metadata
AL/2016-2016_1st_Special_Session/bill/*.json
AL/2016-2016_1st_Special_Session/people/*.json
AL/2016-2016_1st_Special_Session/vote/*.json
As always, wrap your code in functions and profit.
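For instance, the decoding step could be wrapped roughly like this. It is only a sketch: the function name write_dataset_zip is hypothetical, the status check is my own addition, and the demonstration uses a made-up response whose "zip" field holds four stand-in bytes rather than a real archive.

```r
library(jsonlite)

# Hypothetical helper: take the JSON text of a getDataset response, decode the
# base64 "zip" field, and write the resulting bytes to `destfile`.
write_dataset_zip <- function(json_text, destfile) {
  parsed <- fromJSON(json_text)
  stopifnot(identical(parsed$status, "OK"))
  zip_bytes <- base64_dec(parsed$dataset$zip)
  writeBin(zip_bytes, con = destfile)
  invisible(destfile)
}

# Demonstration with a made-up response (4 bytes stand in for a real archive)
fake_bytes <- as.raw(c(0x50, 0x4b, 0x03, 0x04))
fake_json <- sprintf('{"status": "OK", "dataset": {"zip": "%s"}}',
                     base64_enc(fake_bytes))
dest <- tempfile(fileext = ".zip")
write_dataset_zip(fake_json, dest)
readBin(dest, what = "raw", n = 8)
```

With a real response, json_text would be the result of content(response, as = "text"), and unzip(zipfile = destfile) would follow.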
PS: Here is how the code would look in Julia, as a comparison.
using Base64, HTTP, JSON3, CodecZlib
token = "" # Your API key
session_id = 1253 # Obtained from the getDatasetList endpoint
access_key = "2qAtLbkQiJed9Z0FxyRblu" # Obtained from the getDatasetList endpoint
destfile = joinpath("path", "to", "file.zip") # Modify
response = string("https://api.legiscan.com/?",
join(["key=$token",
"op=getDataset",
"id=$session_id",
"access_key=$access_key"],
"&")) |>
HTTP.get
@assert response.status == 200
JSON3.read(response.body) |>
(content -> content.dataset.zip) |>
base64decode |>
(data -> write(destfile, data))
run(`unzip $destfile`) # pipeline(`unzip`, destfile) would redirect unzip's output into destfile
Read a zip file in R from a subfolder
You can explicitly specify the path within the archive file:
temp <- tempfile()
download.file("http://seanlahman.com/files/database/baseballdatabank-2017.1.zip", temp, mode="wb")
table1 <- unz(temp, "baseballdatabank-2017.1/core/Salaries.csv")
salaries <- read.csv(table1, sep = ",", header = TRUE)
reading binary files with R
Answering your questions directly:
I am having some issues to convert this code...
What is the problem here? Your code block contains the comment "but it's the same story", but what is the story? You haven't explained anything here. If your problem is with the double, you should try setting readBin(..., size = 8). In your case, your code would read:
line1 <- c(readBin(to.read, "integer", 2), readBin(to.read, "double", 1, 8))
How can I read float (in c# i have rb.ReadSingle()) in R?
Floats are 4 bytes in size in this case (I would presume), so set size = 4 in readBin().
Is there in R a function to memorize the position that you have arrived when you are reading a binary file? So next time you will read it again, you could skip what you have already read (as in c# with BinaryReader)
As far as I know there is nothing available (more knowledgeable people are welcome to add their inputs). You could, however, easily write a wrapper script for readBin() that does this for you. For instance, you could specify how many bytes you want to discard (i.e., this can correspond to the n bytes that you have already read into R), and read in that many bytes via a dummy readBin() call like so:
readBin(con = yourinput, what = "raw", n = n)
where the integer n indicates the number of bytes you wish to throw away. Thereafter, you could have your wrapper script read the succeeding bytes into a variable of your choice.
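The sizes mentioned above can be checked on a small file written from R. This is only a sketch: the file layout (two 4-byte integers, one 8-byte double, one 4-byte float) and the helper name read_after_skip are made up for illustration, and the skip is implemented with the dummy readBin() call described above.

```r
# Write a small test file: two 4-byte integers, one 8-byte double, one 4-byte float
path <- tempfile(fileext = ".bin")
out <- file(path, "wb")
writeBin(c(1L, 2L), out)
writeBin(3.5, out)
writeBin(0.25, out, size = 4)
close(out)

# Hypothetical wrapper: discard `skip` bytes with a dummy read, then read
# `n` values of type `what` (and byte width `size`, if given).
read_after_skip <- function(path, skip, what, n = 1L, size = NA_integer_) {
  con <- file(path, "rb")
  on.exit(close(con))
  if (skip > 0) readBin(con, what = "raw", n = skip)  # result is thrown away
  readBin(con, what = what, n = n, size = size)
}

ints  <- read_after_skip(path, skip = 0,  what = "integer", n = 2)
dbl   <- read_after_skip(path, skip = 8,  what = "double", size = 8)
float <- read_after_skip(path, skip = 16, what = "double", size = 4)  # like ReadSingle()
```

Reopening the file on every call keeps the sketch simple; if the connection is kept open instead, each readBin() call continues where the previous one stopped.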
Read binary file by parts
Use the seek() function, just as you would in a C program.
Make a test file:
> cat(LETTERS,file="letters.txt")
See what it is - upper case with space sep:
> system("cat letters.txt") # unix only
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Open:
> con = file("letters.txt","rb")
Go somewhere and read a few:
> seek(con,3)
[1] 0
> readBin(con,"raw",10)
[1] 20 43 20 44 20 45 20 46 20 47
Those are ASCII codes. Go somewhere else and read a few more:
> seek(con,7)
[1] 13
> readBin(con,"raw",10)
[1] 20 45 20 46 20 47 20 48 20 49
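The same idea as a small non-interactive sketch. Calling seek() without a position returns the current offset, so it also covers the "where am I" bookkeeping. The file is a throwaway temp file, and since R's own documentation discourages relying on seek() on Windows, treat this as an illustration rather than a portable recipe.

```r
# Same test data as above, written to a temporary file
path <- tempfile()
cat(LETTERS, file = path)

con <- file(path, "rb")
first <- readBin(con, what = "raw", n = 4)   # bytes 0..3
pos <- seek(con)                             # current offset after that read

seek(con, where = 10, origin = "start")      # jump to byte offset 10
part <- rawToChar(readBin(con, what = "raw", n = 5))
close(con)

pos
part
```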