Importing only every Nth row from a .csv file in R
For a large data file the best option is to filter out unnecessary rows before they get imported into R. The simplest way to do this is by means of OS commands such as sed, awk, or grep. For example, the following code reads every 4th line from the file:
write.csv(1:1000, file='test.csv')
file.pipe <- pipe("awk 'BEGIN{i=0}{i++;if (i%4==0) print $1}' < test.csv ")
res <- read.csv(file.pipe)
res
> res
X3 X3.1
1 7 7
2 11 11
3 15 15
4 19 19
5 23 23
6 27 27
7 31 31
8 35 35
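The same every-Nth-row selection can be sketched in pure Python, without shelling out to awk. This is a minimal sketch, not part of the original answer; the function name and 1-based-row convention are my own:

```python
import csv

def every_nth_rows(path, n):
    """Return every nth row (1-based: rows n, 2n, 3n, ...) of a CSV file,
    mirroring the awk filter i % n == 0 above."""
    with open(path, newline='') as fh:
        rows = list(csv.reader(fh))
    return rows[n - 1::n]
```

For genuinely large files you would iterate the reader lazily instead of materializing all rows, but the slicing semantics are the same.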
Skip specific rows using read.csv in R
One way to do this is using two read.csv commands: the first one reads the headers and the second one the data:
headers = read.csv(file, skip = 1, header = F, nrows = 1, as.is = T)
df = read.csv(file, skip = 3, header = F)
colnames(df) <- headers
I've created the following text file to test this:
do not read
a,b,c
previous line are headers
1,2,3
4,5,6
The result is:
> df
a b c
1 1 2 3
2 4 5 6
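The same two-pass idea — header on one line, data starting further down, junk lines in between — can be sketched in Python. The line positions here mirror the test file above; the function name is my own:

```python
import csv

def read_with_offset_header(text, header_line, data_start):
    """Parse CSV text whose header sits on line header_line (0-based)
    and whose data starts at line data_start, ignoring everything else."""
    lines = text.splitlines()
    header = next(csv.reader([lines[header_line]]))
    data = list(csv.reader(lines[data_start:]))
    return header, data
```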
How to select every Nth row in a CSV file using Python
Using the csv module, plus itertools.islice() to select 3 rows each time:
import csv
import os.path
from itertools import islice
with open(inputfilename, 'rb') as infh:
    reader = csv.reader(infh)
    for row in reader:
        filename = row[0].replace(' ', '_') + '.csv'
        filename = os.path.join(directory, filename)
        with open(filename, 'wb') as outfh:
            writer = csv.writer(outfh)
            writer.writerow(row)
            writer.writerows(islice(reader, 2))
The writer.writerows(islice(reader, 2)) line takes the next 2 rows from the reader, copying them across to the writer CSV, after first writing the current row (with the date) to the output file.
You may need to adjust the delimiter argument for the csv.reader() and csv.writer() objects; the default is a comma, but you didn't specify the exact format, so perhaps you need to set it to a tab ('\t') instead.
If you are using Python 3, open the files in 'r' and 'w' text mode and set newline='' for both: open(inputfilename, 'r', newline='') and open(filename, 'w', newline='').
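Putting the Python 3 advice together, the script above might look like the sketch below. I have wrapped it in a function so the input file and output directory (free variables in the original snippet) become parameters; group_size generalizes the hard-coded "next 2 rows":

```python
import csv
import os.path
from itertools import islice

def split_csv_by_groups(inputfilename, directory, group_size=3):
    """Split a CSV into files of group_size rows each, naming each output
    file after the first cell of its group's first row (Python 3 text mode)."""
    written = []
    with open(inputfilename, 'r', newline='') as infh:
        reader = csv.reader(infh)
        for row in reader:
            filename = row[0].replace(' ', '_') + '.csv'
            filename = os.path.join(directory, filename)
            with open(filename, 'w', newline='') as outfh:
                writer = csv.writer(outfh)
                writer.writerow(row)
                # copy the next group_size - 1 rows into the same file
                writer.writerows(islice(reader, group_size - 1))
            written.append(filename)
    return written
```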
Skipping rows starting with specific values while importing a CSV file into R using fread
You can read the data with read.csv with fill = TRUE, keep only those rows that have a date-formatted value in the date column (so values like '<<<<<<< HEAD' or '=======' are removed), and use type_convert to convert the columns to their respective types.
data <- read.csv('https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv', fill = TRUE)
data <- data[grepl('\\d+-\\d+-\\d+', data$date), ]
data <- readr::type_convert(data)
data
# date province country lat long type cases
# <date> <chr> <chr> <dbl> <dbl> <chr> <int>
# 1 2020-01-22 NA Afghanistan 33.9 67.7 confirmed 0
# 2 2020-01-23 NA Afghanistan 33.9 67.7 confirmed 0
# 3 2020-01-24 NA Afghanistan 33.9 67.7 confirmed 0
# 4 2020-01-25 NA Afghanistan 33.9 67.7 confirmed 0
# 5 2020-01-26 NA Afghanistan 33.9 67.7 confirmed 0
# 6 2020-01-27 NA Afghanistan 33.9 67.7 confirmed 0
# 7 2020-01-28 NA Afghanistan 33.9 67.7 confirmed 0
# 8 2020-01-29 NA Afghanistan 33.9 67.7 confirmed 0
# 9 2020-01-30 NA Afghanistan 33.9 67.7 confirmed 0
#10 2020-01-31 NA Afghanistan 33.9 67.7 confirmed 0
# … with 287,772 more rows
With data.table::fread you can use blank.lines.skip = TRUE:
data <- data.table::fread('https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv', blank.lines.skip=TRUE)
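The same clean-up — drop any row whose first field isn't date-shaped, so merge-conflict markers disappear — can be sketched in Python. The regex mirrors the grepl pattern above; the function name and the assumption that the date is the first field are mine:

```python
import csv
import re

DATE_RE = re.compile(r'\d+-\d+-\d+')

def rows_with_dates(lines):
    """Keep only CSV rows whose first field looks like a date,
    discarding junk such as '<<<<<<< HEAD' or '======='."""
    out = []
    for row in csv.reader(lines):
        if row and DATE_RE.search(row[0]):
            out.append(row)
    return out
```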
How to copy consecutive lines and skip n lines in a CSV?
Nice challenge. Here is a pure batch solution:
@echo off
setlocal enabledelayedexpansion
REM following code to produce some data for testing:
(
echo date,hour,temp
echo 20181231,24,99
for /l %%a in (1,1,9) do @for /l %%b in (1,1,24) do @echo 2019010%%a,%%b,!random:~-2!
for /l %%a in (1,1,9) do @for /l %%b in (1,1,24) do @echo 2019011%%a,%%b,!random:~-2!
for /l %%a in (1,1,9) do @for /l %%b in (1,1,24) do @echo 2019012%%a,%%b,!random:~-2!
)>hourdata-test.csv
REM code to extract desired values
REM expected hour-pairs: 1,2 - 10,11 - 19,20 - 4,5 - 13,14 - 22,23 - 7,8 - 16,17 : repeat
(for /f "tokens=1,* delims=:" %%a in ('findstr /n "^" hourdata-test.csv') do (
  set /a "x=%%a %% 9"
  if !x! == 3 echo %%b
  if !x! == 4 echo %%b
))>ninerdata.csv
The trick is to use the line numbers, calculate them modulo 9, and then simply compare the resulting value. Skipping the first two lines is achieved by printing only the lines whose modulo is 3 or 4.
A full year of data should take less than 2 seconds.
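The modulo trick translates directly to any language: with 1-based line numbers, keep the lines whose number mod 9 falls in the wanted set. A minimal sketch in Python (function name and parameters are mine; header handling is left to the caller, as in the batch script):

```python
def select_by_modulo(lines, period=9, keep=(3, 4)):
    """Keep lines whose 1-based line number modulo `period` is in `keep`,
    reproducing the batch script's every-9-lines selection pattern."""
    return [line for i, line in enumerate(lines, start=1)
            if i % period in keep]
```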
R- import CSV file, all data fall into one (the first) column
Excel, in its English version at least, may use a comma as separator, so you may want to try
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=",")
I once had a similar problem where the header had a long entry containing a character that read.csv mistook for a column separator. In reality, it was part of a long name that wasn't quoted properly.
Try skipping the header and see if the problem persists:
x1 <- read.csv(file = "1energy.csv", skip = 1, head = FALSE, sep=";")
In reply to your comment:
Two things you can do. Simplest one is to assign names manually:
myColNames <- c("col1.name", "col2.name")
names(x1) <- myColNames
The other way is to read just the name row (the first line in your file): read only the first line and split it into a character vector:
nameLine <- readLines(con = "1energy.csv", n = 1)
fileColNames <- unlist(strsplit(nameLine, ";"))
Then see how you can fix the problem and assign names to your x1 data frame. I don't know what exactly is wrong with your first line, so I can't tell you how to fix it.
Yet another, cruder option is to open your csv file in a text editor and edit the column names there.
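When you are unsure whether a file uses ',' or ';', the delimiter can also be guessed programmatically. A sketch using Python's csv.Sniffer (the function name and candidate set are my own); the same diagnosis explains why every field lands in one column when the wrong separator is assumed:

```python
import csv

def detect_delimiter(path, candidates=',;\t'):
    """Guess a CSV file's delimiter from a sample of its contents
    using csv.Sniffer."""
    with open(path, newline='') as fh:
        sample = fh.read(4096)
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
```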
Importing and extracting a random sample from a large .CSV in R
I don't think there is a good R tool to read a file in a random way (perhaps it could be added as an extension to read.table, or to fread from the data.table package).
Using perl you can easily do this task. For example, to read 1% of your file in a random way, you can do this:
xx= system(paste("perl -ne 'print if (rand() < .01)'",big_file),intern=TRUE)
Here I am calling it from R using system. xx now contains only 1% of your file.
You can wrap all this in a function:
read_partial_rand <- function(big_file, percent) {
  cmd <- paste0("perl -ne 'print if (rand() < ", percent, ")'")
  cmd <- paste(cmd, big_file)
  system(cmd, intern = TRUE)
}
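The same rand() < p streaming filter is easy to write in pure Python, avoiding the perl dependency. Each line is kept independently with probability p (Bernoulli sampling), so the sample size is only approximately p times the file size; the function name and seed parameter are my own:

```python
import random

def sample_lines(lines, p, seed=None):
    """Keep each line independently with probability p, mirroring
    perl -ne 'print if (rand() < p)'."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < p]
```

To stream a large file without loading it, pass the open file handle itself as `lines`.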
How to read a CSV file every other row
You could read them all into memory with numpy and keep every other row:
import numpy as np
import pandas as pd
data = np.loadtxt(filename, delimiter=',')
data = pd.DataFrame(data[::2])
The last bit, [::2], means "take every second element".
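If the file is too large to load fully, pandas can skip rows at read time instead: read_csv's skiprows accepts a callable on the 0-based row index. A sketch (the function name is mine; it assumes row 0 is the header and keeps every second data row after it):

```python
import io
import pandas as pd

def read_every_other_row(source):
    """Read a CSV keeping the header (row 0) and every second row after it.
    skiprows returns True for rows to drop."""
    return pd.read_csv(source, skiprows=lambda i: i > 0 and i % 2 == 0)
```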