How to Read Only Lines That Fulfil a Condition from a CSV into R

How to read only lines that fulfil a condition from a CSV into R?

You could use the read.csv.sql function from the sqldf package and filter with an SQL select statement. From the help page of read.csv.sql:

library(sqldf)
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv",
                      sql = "select * from file where `Sepal.Length` > 5", eol = "\n")
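
The sql argument accepts arbitrary SELECT syntax against the table alias file, so you can also pick columns and combine conditions. A small sketch against the same iris.csv written above:

# select two columns and combine two conditions in the WHERE clause
iris3 <- read.csv.sql("iris.csv",
                      sql = "select `Sepal.Length`, Species from file
                             where `Sepal.Length` > 5 and Species = 'virginica'",
                      eol = "\n")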

How to read a huge CSV file into R by row condition?

You can use the RSQLite package:

library(RSQLite)
# create/connect to a database file on disk
con <- dbConnect(SQLite(), dbname = "sample_db.sqlite")

# read the CSV file into an SQL table
# Warning: this is going to take some time and disk space,
# as your complete CSV file is transferred into an SQLite database.
dbWriteTable(con, name = "sample_table", value = "Your_Big_CSV_File.csv",
             row.names = FALSE, header = TRUE, sep = ",")

# Query your data as you like
yourData <- dbGetQuery(con, "SELECT * FROM sample_table LIMIT 10")

dbDisconnect(con)

Next time you want to access your data you can leave out the dbWriteTable, as the SQLite table is stored on disk.

Note: writing the CSV data to the SQLite file does not load all the data into memory first, so the memory you use in the end is limited to the amount of data your query returns.
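
Since the filtering happens inside the query, this is also how you apply the row condition. A minimal sketch, reconnecting to the database created above (some_value is a hypothetical column name, not one from your CSV):

# reconnect; the table persists on disk from the earlier dbWriteTable
con <- dbConnect(SQLite(), dbname = "sample_db.sqlite")
# only rows matching the WHERE clause are pulled into R;
# replace 'some_value' with one of your own column names
yourSubset <- dbGetQuery(con,
                         "SELECT * FROM sample_table WHERE some_value > 100")
dbDisconnect(con)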

Reading a messy csv using readLines until a certain row / cell value

I think this is most likely a bad idea, in that it's more likely to slow the process down than to speed it up. That said, if you have a very large file, a large portion of which can be skipped this way, there could be a benefit.

library( readr )

# 'file' here stands for the path to your CSV
line <- 0L
input <- "start"
while( !grepl( "Data", input ) & input != "" ) {
  line <- line + 1L
  input <- read_lines( file, skip = line - 1L, n_max = 1L )
}
line

We read one line at a time. For each line, we check for the text "Data" or a blank line. If either condition is fulfilled, we stop reading, which leaves us with line, a value telling us the first line not to be read in. This way you can then call something like:

df <- read_lines( file, n_max = line - 1L )
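
Note that read_lines() returns a character vector, not a data frame. If you need a data frame, the captured lines can be parsed in a second step; a minimal sketch in base R, assuming the first captured line is a header row:

raw_lines <- read_lines( file, n_max = line - 1L )
# parse the captured lines as CSV (assumes the first line is the header)
df <- read.csv( text = paste( raw_lines, collapse = "\n" ) )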

UPDATE: adding an option to test and read concurrently, as per @konvas's suggestion.

read_with_condition <- function( file, lines.guess = 100L ) {
  line <- 1L
  output <- vector( mode = "character", length = lines.guess )
  input <- "start"
  while( !grepl( "Data", input ) & input != "" ) {
    input <- readr::read_lines( file, skip = line - 1L, n_max = 1L )
    output[line] <- input
    line <- line + 1L
  }
  # discard any unwanted space in the output vector
  # this will also discard the last line to be read in (which failed the test)
  output <- output[ seq_len( line - 2L ) ]
  cat( paste0( "Stopped reading at line ", line - 1L, ".\n" ) )
  return( output )
}

new <- read_with_condition( file, lines.guess = 100L )

So here we are testing the input condition and writing the input line to an object at the same time. You can preallocate space in the output vector with lines.guess (a good guess will speed things up; be generous rather than conservative here), and any excess will be cleaned up at the end. Note this is a function, so the last line new <- ... shows how to call it.

R doesn't read all lines of a CSV

There should be no need to convert your Excel file first. Simply:

install.packages("rio")
rio::import("example.xlsx")

(rio is just a wrapper around various import/export packages and functions, but the default values have worked in 99% of my cases so far.)
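
If your workbook has more than one sheet, rio::import() can select one through its which argument. A quick sketch (the sheet name is made up):

# pick a sheet by position, or by name ("my_sheet" is hypothetical)
rio::import("example.xlsx", which = 2)
rio::import("example.xlsx", which = "my_sheet")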

How to pre-process a CSV file before importing into R?

You can load the data as a text file using readLines(); each line will be stored as a string in a character vector. Then you'll be able to inspect the data and find the structure that best fits your problem.

Here is a code chunk that may help you:

# load environment
library(stringr)

# define the data path
data_path <- '~/Downloads/file.csv'
# load the data as a character vector
data <- readLines(data_path)
# remove the first column, since it does not seem useful
data <- str_remove(data, '^., ')
# detect and keep lines having 3 columns (2 commas)
n_commas <- str_count(data, ',')
data <- data[n_commas == 2]
# get rid of descriptor lines (those mentioning 'num' or 'char')
keep <- !str_detect(data, 'num|char')
data <- data[keep]
# overwrite the file with the cleaned data
writeLines(data, data_path)

# now load the data as a data frame
df <- read.csv(data_path)
# print output
print(df)

Here is the output:

  col_1 col_2 col_3
1     1     a     b
2     2     c     d
3     3     e     f
4     4     g     h
5     5     i     j

The solution is not fully general, but I don't think you can avoid detecting the specific patterns you want to remove or keep in your data.
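
As a slightly more general alternative to counting commas by hand, base R's count.fields() reports the number of fields it finds on each line. A sketch under the same three-column assumption, with data_path pointing at the original messy file:

# count fields per line; blank.lines.skip = FALSE keeps the result
# aligned with readLines(), and NA marks lines count.fields cannot parse
n_fields <- count.fields(data_path, sep = ",", blank.lines.skip = FALSE)
lines <- readLines(data_path)
good <- lines[!is.na(n_fields) & n_fields == 3]
df <- read.csv(text = paste(good, collapse = "\n"))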

Let us know if it helped you!


