R: Read in Random Rows from File Using Fread or Equivalent

R: Read in random rows from file using fread or equivalent?

Using the tidyverse (as opposed to data.table), you could do:

library(readr)
library(purrr)
library(dplyr)

# n_rows_in_your_file must be defined beforehand; a ballpark count of the
# data rows in your file is enough.

# Read the header once. With skip > 0 the header line is skipped, and
# read_csv would otherwise treat the first sampled row as column names.
col_names <- names(read_csv("data_file", n_max = 0))

# Generate 900 random start positions because we'll grab 10 rows for each
# start, giving us up to 9000 rows in the final sample (chunks may overlap)
start_at <- sample.int(n_rows_in_your_file - 10, 900)

# sort the start positions so the file is read front to back
start_at <- sort(start_at)

# Read in 10 rows at a time, starting at your random numbers,
# binding results rowwise into a single data frame
sample_of_rows <- map_dfr(start_at, ~read_csv("data_file", col_names = col_names, n_max = 10, skip = .x))
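
If you cannot ballpark the row count, it can be counted without loading the whole file. The following is a minimal sketch in base R; `count_lines` and its `chunk_size` argument are hypothetical names introduced here for illustration, and the iris file is only a stand-in for your own data.

```r
# Count the lines of a text file in chunks, so memory use stays bounded.
# count_lines() is a hypothetical helper, not part of readr or data.table.
count_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break
    n <- n + length(chunk)
  }
  n
}

# Demo on a temporary copy of iris (assumption: one record per line)
demo_file <- tempfile(fileext = ".csv")
write.csv(iris, demo_file, row.names = FALSE)
n_rows_in_your_file <- count_lines(demo_file) - 1L  # subtract the header line
```

This assumes no quoted field contains an embedded newline; if one does, the line count will overstate the number of rows.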

Quickest way to read a subset of rows of a CSV

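I think this should work pretty quickly, but let me know since I have not tried it with big data yet. It relies on the shuf command-line tool (GNU coreutils), so it needs a Unix-like shell.

library(data.table)

# write.csv also writes the row names, which become the first column (V1) below
write.csv(iris,"iris.csv")

# use cmd= to make it explicit that the input is a shell command
fread(cmd = "shuf -n 5 iris.csv")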

    V1  V2  V3  V4  V5         V6
1:  37 5.5 3.5 1.3 0.2     setosa
2:  88 6.3 2.3 4.4 1.3 versicolor
3:  84 6.0 2.7 5.1 1.6 versicolor
4: 125 6.7 3.3 5.7 2.1  virginica
5: 114 5.7 2.5 5.0 2.0  virginica

This takes a random sample of N=5 rows from the iris dataset.

Because shuf treats the header like any other line, the header row can end up in the sample. To avoid that, drop the first line before shuffling:

fread(cmd = "tail -n +2 iris.csv | shuf -n 5", header=FALSE)
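
Where shuf is not available (for example on Windows), the same idea can be sketched in plain R. This is an illustrative fallback, not the approach from the answer above: it reads every line into memory first, so it only suits files that fit in RAM. The temporary file and the variable names are assumptions made for the demo.

```r
library(data.table)

# Demo file; row.names = FALSE keeps the row index out of the CSV
demo_file <- tempfile(fileext = ".csv")
write.csv(iris, demo_file, row.names = FALSE)

all_lines <- readLines(demo_file)                  # line 1 is the header
keep <- sort(sample(seq_along(all_lines)[-1], 5))  # 5 random data lines

# Parse the header plus the sampled lines; the header is never sampled as data
sampled <- fread(text = all_lines[c(1, keep)])
```

Because the header line is passed along explicitly, the result keeps the original column names instead of V1, V2, and so on.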

How to read multiple files at once using 'fread' in R

First, you need to list all files that you want to read. Then, you could use a loop to capture the data in a list like so:

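library(data.table)

# pattern is a regular expression, so the dot is escaped to match a literal "."
filelist <- list.files(pattern = "\\.snplist")
datalist <- list()
for (i in seq_along(filelist)) {
  datalist[[i]] <- fread(filelist[i])
}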

Note we use seq_along instead of 1:length(filelist) to avoid errors in case filelist is empty (length 0).
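
The loop can also be written with lapply, and the resulting list stacked into one table. A minimal sketch, assuming the files share a compatible layout; the demo directory, the file contents, and the `source_file` column name are made up for illustration.

```r
library(data.table)

# Hypothetical demo files, written to a temporary directory
demo_dir <- file.path(tempdir(), "snplist_demo")
dir.create(demo_dir, showWarnings = FALSE)
fwrite(data.table(snp = c("rs1", "rs2"), pos = c(101L, 202L)),
       file.path(demo_dir, "a.snplist"))
fwrite(data.table(snp = "rs3", pos = 303L),
       file.path(demo_dir, "b.snplist"))

# Read every matching file into a list, named by file
filelist <- list.files(demo_dir, pattern = "\\.snplist$", full.names = TRUE)
datalist <- lapply(filelist, fread)
names(datalist) <- basename(filelist)

# Stack the list; idcol records which file each row came from
combined <- rbindlist(datalist, idcol = "source_file", fill = TRUE)
```

Like seq_along, lapply simply returns an empty list when no files match.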

Skipping rows starting with specific values while importing a CSV file into R using FREAD

You can read the data with read.csv using fill = TRUE, then keep only the rows whose date column actually holds a date. That removes lines such as '<<<<<<< HEAD' or '=======', which have no valid date. Finally, use readr::type_convert to convert the columns to their appropriate types.

data <- read.csv('https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv', fill = TRUE)
data <- data[grepl('\\d+-\\d+-\\d+', data$date), ]
data <- readr::type_convert(data)
data

#    date       province country       lat  long type      cases
#    <date>     <chr>    <chr>       <dbl> <dbl> <chr>     <int>
#  1 2020-01-22 NA       Afghanistan  33.9  67.7 confirmed     0
#  2 2020-01-23 NA       Afghanistan  33.9  67.7 confirmed     0
#  3 2020-01-24 NA       Afghanistan  33.9  67.7 confirmed     0
#  4 2020-01-25 NA       Afghanistan  33.9  67.7 confirmed     0
#  5 2020-01-26 NA       Afghanistan  33.9  67.7 confirmed     0
#  6 2020-01-27 NA       Afghanistan  33.9  67.7 confirmed     0
#  7 2020-01-28 NA       Afghanistan  33.9  67.7 confirmed     0
#  8 2020-01-29 NA       Afghanistan  33.9  67.7 confirmed     0
#  9 2020-01-30 NA       Afghanistan  33.9  67.7 confirmed     0
# 10 2020-01-31 NA       Afghanistan  33.9  67.7 confirmed     0
# … with 287,772 more rows

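With data.table::fread you can use blank.lines.skip=TRUE. Note that this option only skips empty lines, so marker rows such as '<<<<<<< HEAD' may still need to be filtered out afterwards.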

data <- data.table::fread('https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv', blank.lines.skip=TRUE)
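
Another option is to drop the marker lines before parsing. The sketch below works on a small in-memory example; the sample lines and the marker pattern are assumptions for illustration, and for a real file `raw_lines` would come from readLines on that file.

```r
library(data.table)

# Made-up input imitating a CSV with leftover merge-conflict markers
raw_lines <- c(
  "date,country,cases",
  "<<<<<<< HEAD",
  "2020-01-22,Afghanistan,0",
  "=======",
  "2020-01-23,Afghanistan,0",
  ">>>>>>> branch"
)

# Flag lines that start with one of the three conflict markers
is_marker <- grepl("^(<<<<<<<|=======|>>>>>>>)", raw_lines)

# Parse only the remaining lines
clean <- fread(text = raw_lines[!is_marker])
```

Filtering on line prefixes keeps the header intact and does not depend on any particular column containing a date.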

Randomly sample each data file in my list before rbind-ing them into a data frame using R

You might try the read and write functions from data.table (fread and fwrite). fread automatically detects the separator, the header row, and the column types.

library(data.table)
setwd("C:/Users/mli/Desktop/3S_DMSO")

# pattern is a regular expression, not a glob: match names ending in ".txt"
txt_files_ls <- list.files(pattern = "\\.txt$")
txt_files_df <- lapply(txt_files_ls, fread)

# keep a random 20% of the rows (rounded up) and the first 131 columns of each file
sampled_txt_files_df <- lapply(txt_files_df, function(x) {
  x[sample.int(nrow(x), ceiling(nrow(x) * 0.2)), 1:131]
})

combined_df <- rbindlist(sampled_txt_files_df)
fwrite(combined_df, "3SDMSO_merged.csv", row.names = FALSE)

