How to Read Huge CSV File into R by Row Condition

How to read a huge CSV file into R by row condition?

You can use the RSQLite package:

library(RSQLite)
# Create/Connect to a database
con <- dbConnect(SQLite(), dbname = "sample_db.sqlite")

# read csv file into sql database
# Warning: this is going to take some time and disk space,
# as your complete CSV file is transferred into an SQLite database.
dbWriteTable(con, name = "sample_table", value = "Your_Big_CSV_File.csv",
             row.names = FALSE, header = TRUE, sep = ",")

# Query your data as you like
yourData <- dbGetQuery(con, "SELECT * FROM sample_table LIMIT 10")

dbDisconnect(con)

Next time you want to access your data you can leave out the dbWriteTable, as the SQLite table is stored on disk.

Note: writing the CSV data to the SQLite file does not load all of the data into memory first, so the memory you end up using is limited to the amount of data your query returns.
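To answer the original question about filtering by a row condition, a minimal sketch for a later session (assuming the database file from above and a hypothetical column called age in your CSV) looks like this:

library(RSQLite)
# Reconnect to the database written earlier; no dbWriteTable() needed
con <- dbConnect(SQLite(), dbname = "sample_db.sqlite")

# Only the rows that satisfy the condition are pulled into memory
subsetData <- dbGetQuery(con, "SELECT * FROM sample_table WHERE age > 30")

dbDisconnect(con)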

How to read only lines that fulfil a condition from a csv into R?

You could use the read.csv.sql function in the sqldf package and filter using SQL select. From the help page of read.csv.sql:

library(sqldf)
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv",
                      sql = "select * from file where `Sepal.Length` > 5", eol = "\n")

How to deal with a 50GB large csv file in r language?

You can use R with SQLite behind the curtains via the sqldf package. Its read.csv.sql function imports the file and lets you query the data however you want, returning only the smaller data frame you need.

The example from the docs:

library(sqldf)

iris2 <- read.csv.sql("iris.csv",
                      sql = "select * from file where Species = 'setosa' ")

I've used this library on VERY large CSV files with good results.
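To shrink a 50 GB file further, you can also restrict the columns in the same query. A minimal sketch, assuming a hypothetical file name and hypothetical columns id and value:

library(sqldf)
# The row filtering and column selection happen inside SQLite,
# so only the small result is returned to R
small <- read.csv.sql("Your_Big_CSV_File.csv",
                      sql = "select id, value from file where value > 100")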

Large csv file fails to fully read in to R data.frame

As, "read.csv()" read up to 1080000 rows, "fread" from library(data.table) should read it with ease. If not, there exists two other options, either try with library(h20) or with "fread" you can use select option to read required columns (or read in two halves, do some cleaning and can merge them back).

R how to read text file based on condition

read.table() isn't going to be able to easily read this. R expects most data to be clean and rectangular.

You can read the data in as a bunch of lines, manipulate those lines into a more regular format, and then parse that with read.table(). For example:

# Read your data file
# xx <- readLines("mydatafile.txt")
# for the sake of a complete example
xx <- scan(text="1:
123,3,2002-09-06
456,2,2005-08-13
789,4,2001-09-20
2:
123,5,2003-05-08
321,1,2004-06-15
432,3,2001-09-11", what=character())

This reads in the lines as just strings. Then you can split them into groups and prepend the item ID as another value on each row:

item_group <- cumsum(grepl("\\d+:", xx))
clean_rows <- unlist(lapply(split(xx, item_group), function(x) {
  item_id <- gsub(":$", ",", x[1])
  paste0(item_id, x[-1])
}))

Then you can parse the data into a data.frame:

read.table(text = clean_rows, sep = ",", col.names = c("itemID", "UserID", "Quantity", "Date"))
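With the sample data above, that call should produce a data frame along these lines:

#   itemID UserID Quantity       Date
# 1      1    123        3 2002-09-06
# 2      1    456        2 2005-08-13
# 3      1    789        4 2001-09-20
# 4      2    123        5 2003-05-08
# 5      2    321        1 2004-06-15
# 6      2    432        3 2001-09-11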

Reading 40 GB csv file into R using bigmemory

I don't know about bigmemory, but to meet your requirements you don't need to read the file into R at all. Simply pipe the file through some bash/awk/sed/python/whatever processing to do the steps you want, i.e. throw out the NULL lines and randomly select N lines, and then read that result in.

Here's an example using awk (assuming you want 100 random lines from a file that has 1M lines).

df <- read.csv(pipe('awk -F, \'BEGIN{srand(); m = 100; total = 1000000;}
  !/NULL/{if (rand() < m/(total - NR + 1)) {
    print; m--;
    if (m == 0) exit;
  }}\' filename'))

It wasn't obvious to me what you meant by NULL, so I took it literally, but it should be easy to modify the filter to fit your needs.
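For instance, if NULL only matters when it appears in a specific column, a sketch of the modified filter (assuming the value sits in the third comma-separated field) would be:

# Skip rows whose third field is the literal string NULL,
# instead of dropping every line that mentions NULL anywhere
df <- read.csv(pipe('awk -F, \'BEGIN{srand(); m = 100; total = 1000000;}
  $3 != "NULL" {if (rand() < m/(total - NR + 1)) {
    print; m--;
    if (m == 0) exit;
  }}\' filename'))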


