R: Read in random rows from file using fread or equivalent?
Using the tidyverse (as opposed to data.table), you could do:
library(readr)
library(purrr)
library(dplyr)
# n_rows_in_your_file must be set beforehand; a ballpark row count is enough.
# Read the header once, because skipping lines below also skips the header line
col_names <- names(read_csv("data_file", n_max = 0))
# Draw 900 random starting offsets; reading 10 rows from each start
# gives up to 9000 rows in total (blocks can overlap, so some rows may repeat)
start_at <- floor(runif(900, min = 1, max = n_rows_in_your_file - 10))
# Sort the offsets so the file is read in order
start_at <- sort(start_at)
# Read 10 rows at a time, starting at each random offset,
# binding the results row-wise into a single data frame
sample_of_rows <- map_dfr(start_at, ~ read_csv("data_file", col_names = col_names, n_max = 10, skip = .x))
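The snippet above assumes you already know roughly how many rows the file has. If you do not, one way to get the count without loading the whole file is to read it in chunks and count lines. This is a minimal base R sketch: count_lines and chunk_size are hypothetical names, not part of readr or data.table, and it assumes no quoted field contains an embedded newline (otherwise lines and rows differ).

```r
# Hypothetical helper: count the lines of a text file chunk by chunk,
# so the whole file never has to sit in memory at once.
count_lines <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    got <- length(readLines(con, n = chunk_size, warn = FALSE))
    if (got == 0L) break   # nothing left to read
    n <- n + got
  }
  n
}

# Demo on a small temporary CSV; subtract 1 for the header line
tmp <- tempfile(fileext = ".csv")
write.csv(iris, tmp, row.names = FALSE)
n_rows_in_your_file <- count_lines(tmp) - 1L
```

On Unix-like systems, a shell tool such as wc -l would likely be faster for very large files.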
Quickest way to read a subset of rows of a CSV
I think this should work pretty quickly, since the sampling is done by the shuf command-line tool rather than in R, but let me know, since I have not tried it with big data yet.
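library(data.table)
write.csv(iris, "iris.csv")
# shuf is a GNU coreutils command; cmd = tells fread to run it in a shell
fread(cmd = "shuf -n 5 iris.csv")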
V1 V2 V3 V4 V5 V6
1: 37 5.5 3.5 1.3 0.2 setosa
2: 88 6.3 2.3 4.4 1.3 versicolor
3: 84 6.0 2.7 5.1 1.6 versicolor
4: 125 6.7 3.3 5.7 2.1 virginica
5: 114 5.7 2.5 5.0 2.0 virginica
This takes a random sample of N = 5 rows from the iris dataset.
Because shuf samples from every line, the header row can end up in the sample. To exclude it, drop the first line before shuffling:
fread(cmd = "tail -n +2 iris.csv | shuf -n 5", header = FALSE)
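The shuf approach depends on a Unix shell, and with header = FALSE the column names are lost. As a rough fallback for systems without shuf, the same idea can be sketched in base R. Note the assumptions: sample_csv_rows is a hypothetical helper, it reads all lines into memory as text (so it is only a sketch for moderately sized files), and it assumes no quoted field contains an embedded newline.

```r
# Hypothetical helper: sample n data rows from a CSV using base R only,
# keeping the header so the column names survive.
sample_csv_rows <- function(path, n) {
  lines <- readLines(path)
  header <- lines[1]
  body <- lines[-1]
  # never ask for more rows than the file has
  picked <- body[sample.int(length(body), min(n, length(body)))]
  read.csv(text = c(header, picked))
}

tmp <- tempfile(fileext = ".csv")
write.csv(iris, tmp, row.names = FALSE)
set.seed(1)   # only so the demo is reproducible
sampled <- sample_csv_rows(tmp, 5)
```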
How to read multiple files once using 'fread' in R
First, you need to list all files that you want to read. Then, you could use a loop to capture the data in a list like so:
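library(data.table)
# pattern is a regular expression, so escape the dot to match it literally
filelist <- list.files(pattern = "\\.snplist")
datalist <- vector("list", length(filelist))
for (i in seq_along(filelist)) {
  datalist[[i]] <- fread(filelist[i])
}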
Note that we use seq_along(filelist) instead of 1:length(filelist) to avoid errors when filelist is empty (length 0).
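To see why this matters, here is a small base R illustration with an empty vector standing in for the result of list.files() when no file matches. The iterations counter is only there for the demonstration.

```r
filelist <- character(0)   # what list.files() returns when nothing matches

1:length(filelist)         # counts down from 1 to 0, so a loop would run twice
seq_along(filelist)        # empty, so a loop body never runs

iterations <- 0
for (i in seq_along(filelist)) iterations <- iterations + 1
```

With 1:length(filelist), the loop would instead call fread(filelist[1]) on a missing value and fail.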
Skipping rows starting with specific values while importing a CSV file into R using FREAD
You can read the data with read.csv and fill = TRUE, keep only the rows whose date column holds a value in date format (which removes lines such as '<<<<<<< HEAD' or '======='), and then use type_convert to convert the columns to their appropriate types.
data <- read.csv('https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv', fill = TRUE)
data <- data[grepl('\\d+-\\d+-\\d+', data$date), ]
data <- readr::type_convert(data)
data
# date province country lat long type cases
# <date> <chr> <chr> <dbl> <dbl> <chr> <int>
# 1 2020-01-22 NA Afghanistan 33.9 67.7 confirmed 0
# 2 2020-01-23 NA Afghanistan 33.9 67.7 confirmed 0
# 3 2020-01-24 NA Afghanistan 33.9 67.7 confirmed 0
# 4 2020-01-25 NA Afghanistan 33.9 67.7 confirmed 0
# 5 2020-01-26 NA Afghanistan 33.9 67.7 confirmed 0
# 6 2020-01-27 NA Afghanistan 33.9 67.7 confirmed 0
# 7 2020-01-28 NA Afghanistan 33.9 67.7 confirmed 0
# 8 2020-01-29 NA Afghanistan 33.9 67.7 confirmed 0
# 9 2020-01-30 NA Afghanistan 33.9 67.7 confirmed 0
#10 2020-01-31 NA Afghanistan 33.9 67.7 confirmed 0
# … with 287,772 more rows
With data.table::fread, you can use blank.lines.skip = TRUE.
data <- data.table::fread('https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv', blank.lines.skip=TRUE)
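Another option, sketched here under assumptions, is to drop the offending lines before parsing, so that the parser never sees them. The example uses a small made-up character vector in place of the real file, and the regular expression assumes the unwanted lines are Git conflict markers made of seven '<', '=' or '>' characters at the start of a line.

```r
# Made-up sample input standing in for readLines() on the real file
raw_lines <- c(
  "date,country,cases",
  "<<<<<<< HEAD",
  "2020-01-22,Afghanistan,0",
  "=======",
  "2020-01-23,Afghanistan,0",
  ">>>>>>> branch"
)

# Flag lines that start with a conflict marker
is_marker <- grepl("^(<{7}|={7}|>{7})", raw_lines)

# Parse only the remaining lines
clean <- read.csv(text = raw_lines[!is_marker])
```

Keep in mind that this keeps the rows from both sides of a conflict, so duplicated rows may still need to be removed afterwards.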
Random Sample each datafile in my list before rbind them into a dataframe using R
You might try using the read and write functions from data.table. fread automatically detects the separator, the column types, and the header row, so it usually needs no extra arguments.
library(data.table)
setwd("C:/Users/mli/Desktop/3S_DMSO")
# pattern is a regular expression, not a glob
txt_files_ls <- list.files(pattern = "\\.txt$")
txt_files_df <- lapply(txt_files_ls, fread)
# keep a random 20% of the rows (rounded up) and columns 1 to 131 of each table
sampled_txt_files_df <- lapply(txt_files_df, function(x) {
  x[sample(nrow(x), ceiling(nrow(x) * 0.2)), 1:131]
})
combined_df <- rbindlist(sampled_txt_files_df)
fwrite(combined_df, "3SDMSO_merged.csv", row.names = FALSE)
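If you want the sample to be reproducible, or need to know which file each row came from, the same pattern can be sketched in base R. The example below is only an illustration: it uses the built-in iris data split by species as a stand-in for the list of files, and the source column name is an assumption.

```r
set.seed(42)   # fix the random draw so the sample is reproducible

# Stand-in for a list of tables read from several files
tables <- split(iris, iris$Species)

sampled <- lapply(names(tables), function(nm) {
  x <- tables[[nm]]
  # keep a random 20% of the rows, rounded up
  out <- x[sample(nrow(x), ceiling(nrow(x) * 0.2)), , drop = FALSE]
  out$source <- nm   # remember where each row came from
  out
})

combined <- do.call(rbind, sampled)
```

With data.table, naming the list and passing idcol to rbindlist should give a similar source column.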