Quickest way to read a subset of rows of a CSV
This should work pretty quickly, but I have not tried it on big data yet, so let me know how it goes.
library(data.table)
write.csv(iris, "iris.csv")
fread("shuf -n 5 iris.csv")
V1 V2 V3 V4 V5 V6
1: 37 5.5 3.5 1.3 0.2 setosa
2: 88 6.3 2.3 4.4 1.3 versicolor
3: 84 6.0 2.7 5.1 1.6 versicolor
4: 125 6.7 3.3 5.7 2.1 virginica
5: 114 5.7 2.5 5.0 2.0 virginica
This takes a random sample of N=5 rows from the iris dataset.
To avoid the chance of sampling the header row, this modification is useful:
fread("tail -n+2 iris.csv | shuf -n 5", header=FALSE)
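If you don't have `shuf` available (e.g. on Windows), the same idea can be sketched in Python with reservoir sampling: the file is read once, the header is skipped, and only N rows are ever held in memory. The function name and arguments here are illustrative, not part of any answer above.

```python
import csv
import random

def sample_csv_rows(path, n, seed=None):
    """Reservoir-sample n data rows from a CSV, skipping the header."""
    rng = random.Random(seed)
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row, like `tail -n+2`
        sample = []
        for i, row in enumerate(reader):
            if i < n:
                sample.append(row)
            else:
                # keep each later row with probability n / (i + 1)
                j = rng.randrange(i + 1)
                if j < n:
                    sample[j] = row
    return sample
```

Like the `shuf` pipeline, this is a single sequential pass, so it works on files far larger than RAM.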
Is it possible to efficiently get a subset of rows from a large fixed-width CSV file?
As I proposed in the comment, you can compress each data field to two bits:
-- 00
AA 01
AB 10
BB 11
That cuts your file size by a factor of 12, so it'll be ~20GB. Considering that your processing is likely IO-bound, you may speed it up by the same factor of 12.
The resulting file will have a fixed record length of 20,000 bytes, so it is easy to calculate the offset to any given record. No newline characters to consider :)
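The packing and offset arithmetic can be sketched in Python as well. This assumes, as the answer above does, 80,000 two-bit fields per record; the names `CODE`, `pack`, and `record_offset` are illustrative.

```python
# Two-bit code for each genotype, as proposed above.
CODE = {"--": 0b00, "AA": 0b01, "AB": 0b10, "BB": 0b11}

FIELDS_PER_RECORD = 80_000
RECORD_BYTES = FIELDS_PER_RECORD * 2 // 8  # 20,000 bytes per record

def pack(genotypes):
    """Pack genotype strings into bytes, four genotypes per byte."""
    out = bytearray()
    acc = bits = 0
    for g in genotypes:
        acc = (acc << 2) | CODE[g]
        bits += 2
        if bits == 8:
            out.append(acc)
            acc = bits = 0
    return bytes(out)

def record_offset(i):
    """Byte offset of the i-th record (0-based) in the packed file."""
    return i * RECORD_BYTES
```

With a fixed record length, jumping to record *i* is a single `seek(record_offset(i))`, with no scanning for newlines.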
Here is how I build that binary file:
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <chrono>

int main()
{
    auto t1 = std::chrono::high_resolution_clock::now();

    std::ifstream src("data.txt", std::ios::binary);
    std::ofstream bin("data.bin", std::ios::binary);

    size_t length = 80'000 * 3 + 9 + 2; // the `2` is the length of CR/LF on my Windows; use `1` for other systems
    std::string str(length, '\0');

    while (src.read(&str[0], length))
    {
        size_t pos = str.find(',') + 1;
        for (int group = 0; group < 2500; ++group) {
            uint64_t compressed(0), field(0);
            for (int i = 0; i < 32; ++i, pos += 3) {
                if (str[pos] == '-')
                    field = 0;
                else if (str[pos] == 'B')
                    field = 3;
                else if (str[pos + 1] == 'B')
                    field = 2;
                else
                    field = 1;
                compressed <<= 2;
                compressed |= field;
            }
            bin.write(reinterpret_cast<char*>(&compressed), sizeof compressed);
        }
    }

    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() << std::endl;

    // clear the `bad` bit set by trying to read past EOF
    src.clear();
    // rewind to the first record
    src.seekg(0);
    src.read(&str[0], length);
    // read the next (second) record
    src.read(&str[0], length);
    // read the forty-second record from the start (skip 41)
    src.seekg(41 * length, std::ios_base::beg);
    src.read(&str[0], length);
    // read the next (forty-third) record
    src.read(&str[0], length);
    // read the fiftieth record (skip 6 from the current position)
    src.seekg(6 * length, std::ios_base::cur);
    src.read(&str[0], length);

    return 0;
}
This can encode about 1,600 records per second, so the whole file will take ~15 minutes. How long does your current processing take?
UPDATE: Added an example of how to read individual records from src. I only managed to make seekg() work in binary mode.
How to get a subset of data from a CSV file for big data (comparing CSVs)?
Read in your data with an appropriate import function like read.csv
and then join it.
library(dplyr)
## read your files (possibly you need to adjust some arguments in read.csv)
file1 <- read.csv("path/to/file1.csv", header = TRUE)
file2 <- read.csv("path/to/file2.csv", header = TRUE)
file2 %>%
  left_join(file1, by = "SYMBOL")
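For comparison, the equivalent left join in Python with pandas looks roughly like this. The column name `SYMBOL` comes from the answer above; the data frames are small illustrative stand-ins for the two CSV files.

```python
import pandas as pd

# Illustrative stand-ins for file1.csv and file2.csv.
file1 = pd.DataFrame({"SYMBOL": ["AAPL", "GOOG"], "name": ["Apple", "Alphabet"]})
file2 = pd.DataFrame({"SYMBOL": ["AAPL", "MSFT"], "price": [190.0, 410.0]})

# Equivalent of dplyr's left_join(file1, by = "SYMBOL"):
# keep every row of file2, pulling in matching columns from file1.
merged = pd.merge(file2, file1, on="SYMBOL", how="left")
```

As with dplyr, keys present only in the right table produce NA (NaN) in the joined columns.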
Strategies for reading in CSV files in pieces?
You could read it into a database using RSQLite, say, and then use an sql statement to get a portion.
If you need only a single portion then read.csv.sql
in the sqldf package will read the data into an SQLite database. First, it creates the database for you, and the data does not go through R, so R's limitations (primarily RAM in this scenario) won't apply. Second, after loading the data into the database, sqldf reads the output of the specified sql statement into R and finally destroys the database. Depending on how fast it works with your data, you might be able to just repeat the whole process for each portion if you have several.
Only one line of code accomplishes all three steps, so it's a no-brainer to just try it.
DF <- read.csv.sql("myfile.csv", sql=..., ...other args...)
See ?read.csv.sql
and ?sqldf
and also the sqldf home page.
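The same load-into-SQLite-then-query idea can be sketched with Python's standard library. Unlike read.csv.sql, this in-memory variant does hold the data in RAM; pass a file path to sqlite3.connect instead of ":memory:" to spill to disk. The function and table names here are illustrative.

```python
import csv
import sqlite3

def query_csv(path, sql, table="data"):
    """Load a CSV into a throwaway SQLite table, run sql, return the rows."""
    con = sqlite3.connect(":memory:")
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        placeholders = ", ".join("?" for _ in header)
        con.execute(f"CREATE TABLE {table} ({cols})")
        con.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)
    rows = con.execute(sql).fetchall()
    con.close()
    return rows
```

Note that values are stored as text (the table declares no column types), so numeric comparisons in the SQL need a CAST.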
How can I partially read a huge CSV file?
Use chunksize:
for df in pd.read_csv('matrix.txt', sep=',', header=None, chunksize=1):
    # do something
To answer your second part, do this:
df = pd.read_csv('matrix.txt', sep=',', header=None, skiprows=1000, nrows=1000)
(Note nrows rather than chunksize here: with chunksize you would get an iterator of chunks, not a DataFrame.) This skips the first 1000 rows and then reads only the next 1000, giving you rows 1000-1999. It is unclear whether you need the end points included, but you can fiddle with the numbers to get what you want.
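The chunked read also generalizes to filtering a huge file piece by piece, keeping only matching rows in memory. This is a sketch; the function name and the predicate are illustrative, not part of the pandas API.

```python
import pandas as pd

def read_matching(path, predicate, chunksize=1000):
    """Stream a CSV in chunks, keeping only rows where predicate holds."""
    kept = []
    for chunk in pd.read_csv(path, header=None, chunksize=chunksize):
        kept.append(chunk[predicate(chunk)])
    return pd.concat(kept, ignore_index=True)
```

Peak memory is bounded by one chunk plus the rows kept so far, rather than the whole file.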