Importing Only Every Nth Row from a .CSV File in R

Importing only every Nth row from a .csv file in R

For a large data file, the best option is to filter out unnecessary rows before they get imported into R. The simplest way to do this is by means of OS commands such as sed, awk, or grep. The following code reads every 4th line from the file:

# Create a test file: a header line followed by 1000 data rows
write.csv(1:1000, file = 'test.csv')

# awk prints every 4th line of the file (lines 4, 8, 12, ...);
# with the default whitespace field separator, $1 is the whole line here
file.pipe <- pipe("awk 'BEGIN {i = 0} {i++; if (i % 4 == 0) print $1}' < test.csv")
res <- read.csv(file.pipe)
res

> res
  X3 X3.1
1  7    7
2 11   11
3 15   15
4 19   19
5 23   23
6 27   27
7 31   31
8 35   35
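
Note that the awk filter above also discards the original header line, so read.csv treats the first kept row ("3",3) as column names, which is why the columns come out as X3 and X3.1. A small variant (a sketch) keeps the header by always printing line 1 and selecting every 4th data line:

# Keep the header (NR == 1) plus every 4th data line (rows 4, 8, 12, ...)
file.pipe <- pipe("awk 'NR == 1 || (NR - 1) % 4 == 0' test.csv")
res <- read.csv(file.pipe)
head(res)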

Skip specific rows using read.csv in R

One way to do this is to use two read.csv commands: the first one reads the headers and the second one the data:

# Read only the header line (line 2 of the file) as character data
headers <- read.csv(file, skip = 1, header = FALSE, nrows = 1, as.is = TRUE)
# Read the data, skipping the first three lines
df <- read.csv(file, skip = 3, header = FALSE)
# headers is a one-row data frame, so flatten it before assigning
colnames(df) <- unlist(headers)

I've created the following text file to test this:

do not read
a,b,c
previous line are headers
1,2,3
4,5,6

The result is:

> df
  a b c
1 1 2 3
2 4 5 6
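
An equivalent approach (a sketch) is to read the raw lines first and drop the unwanted ones by position before parsing:

# Drop line 1 ("do not read") and line 3 ("previous line are headers");
# line 2 (a,b,c) then becomes the header when the rest is parsed
lines <- readLines(file)
df <- read.csv(text = lines[-c(1, 3)])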

How to select every Nth row in a CSV file using Python

Using the csv module, plus itertools.islice() to select 3 rows each time:

import csv
import os.path
from itertools import islice

with open(inputfilename, 'rb') as infh:
    reader = csv.reader(infh)
    for row in reader:
        # Build an output filename from the first column of the row
        filename = row[0].replace(' ', '_') + '.csv'
        filename = os.path.join(directory, filename)
        with open(filename, 'wb') as outfh:
            writer = csv.writer(outfh)
            writer.writerow(row)
            # Copy the next 2 rows from the reader to this output file
            writer.writerows(islice(reader, 2))

The writer.writerows(islice(reader, 2)) line takes the next 2 rows from the reader and copies them across to the output CSV, after the current row (with the date) has been written to the output file.

You may need to adjust the delimiter argument for the csv.reader() and csv.writer() objects; the default is a comma, but you didn't specify the exact format and perhaps you need to set it to a '\t' tab instead.

If you are using Python 3, open the files in 'r' and 'w' text mode and set newline='' for both: open(inputfilename, 'r', newline='') and open(filename, 'w', newline='').

Skipping rows starting with specific values while importing a CSV file into R using fread

You can read the data with read.csv using fill = TRUE, keep only the rows whose date column actually contains a date (so values like '<<<<<<< HEAD' or '=======' are removed), and then use readr::type_convert to convert the columns to their respective types.

data <- read.csv('https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv', fill = TRUE)
# Keep only rows whose date column matches the yyyy-mm-dd pattern
data <- data[grepl('\\d+-\\d+-\\d+', data$date), ]
# Re-infer column types now that the junk rows are gone
data <- readr::type_convert(data)
data

#    date       province country         lat  long type      cases
#    <date>     <chr>    <chr>         <dbl> <dbl> <chr>     <int>
#  1 2020-01-22 NA       Afghanistan    33.9  67.7 confirmed     0
#  2 2020-01-23 NA       Afghanistan    33.9  67.7 confirmed     0
#  3 2020-01-24 NA       Afghanistan    33.9  67.7 confirmed     0
#  4 2020-01-25 NA       Afghanistan    33.9  67.7 confirmed     0
#  5 2020-01-26 NA       Afghanistan    33.9  67.7 confirmed     0
#  6 2020-01-27 NA       Afghanistan    33.9  67.7 confirmed     0
#  7 2020-01-28 NA       Afghanistan    33.9  67.7 confirmed     0
#  8 2020-01-29 NA       Afghanistan    33.9  67.7 confirmed     0
#  9 2020-01-30 NA       Afghanistan    33.9  67.7 confirmed     0
# 10 2020-01-31 NA       Afghanistan    33.9  67.7 confirmed     0
# … with 287,772 more rows

With data.table::fread you can use blank.lines.skip = TRUE:

data <- data.table::fread('https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv', blank.lines.skip=TRUE)
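
Note that blank.lines.skip = TRUE only drops empty lines; rows containing stray values such as '<<<<<<< HEAD' would still need the same date filter as above (a sketch):

# Apply the same date-pattern filter to the fread result
data <- data[grepl('\\d+-\\d+-\\d+', data$date), ]
data <- readr::type_convert(data)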

How to copy consecutive lines and skip N lines in a CSV?

Nice challenge. Here is a pure batch solution:

@echo off
setlocal enabledelayedexpansion

REM the following code produces some data for testing:
(
  echo date,hour,temp
  echo 20181231,24,99
  for /l %%a in (1,1,9) do @for /l %%b in (1,1,24) do @echo 2019010%%a,%%b,!random:~-2!
  for /l %%a in (1,1,9) do @for /l %%b in (1,1,24) do @echo 2019011%%a,%%b,!random:~-2!
  for /l %%a in (1,1,9) do @for /l %%b in (1,1,24) do @echo 2019012%%a,%%b,!random:~-2!
)>hourdata-test.csv

REM code to extract the desired values
REM expected hour pairs: 1,2 - 10,11 - 19,20 - 4,5 - 13,14 - 22,23 - 7,8 - 16,17 : repeat

(for /f "tokens=1,* delims=:" %%a in ('findstr /n "^" hourdata-test.csv') do (
  set /a "x=%%a %% 9"
  if !x! == 3 echo %%b
  if !x! == 4 echo %%b
))>ninerdata.csv

The trick is to use the line numbers, compute them modulo 9, and compare the result. Keeping remainders 3 and 4 selects two consecutive data lines and then skips the next seven; it also skips the two header lines at the top of the file.

A full year of data should take less than 2 seconds.
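
For comparison, the same modulo selection can be written in R (a sketch, assuming the same hourdata-test.csv layout):

# Keep lines whose 1-based line number is 3 or 4 modulo 9
lines <- readLines("hourdata-test.csv")
keep <- (seq_along(lines) %% 9) %in% c(3, 4)
writeLines(lines[keep], "ninerdata.csv")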

R: import CSV file, all data falls into one (the first) column

Excel, at least in its English version, may use a comma as the separator, so you may want to try:

x1 <- read.csv(file = "1energy.csv", header = TRUE, sep = ",")

I once had a similar problem where the header had a long entry containing a character that read.csv mistook for a column separator; in reality it was part of a long name that wasn't quoted properly. Try skipping the header and see if the problem persists:

x1 <- read.csv(file = "1energy.csv", skip = 1, header = FALSE, sep = ";")

In reply to your comment, there are two things you can do. The simplest is to assign the names manually:

myColNames <- c("col1.name", "col2.name")
names(x1) <- myColNames

The other way is to read just the name row (the first line in your file) and split it into a character vector:

nameLine <- readLines(con = "1energy.csv", n = 1)
fileColNames <- unlist(strsplit(nameLine, ";"))

From there you can see how to fix the problem and then assign the names to your x1 data frame. I don't know exactly what is wrong with your first line, so I can't tell you how to fix it.

Yet another, cruder option is to open your csv file in a text editor and edit the column names by hand.
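
Putting the two pieces together (a sketch, assuming a semicolon-separated file and that the recovered names match the column count):

# Read the data without the problematic header line, then attach the
# column names recovered from that first line
x1 <- read.csv("1energy.csv", skip = 1, header = FALSE, sep = ";")
nameLine <- readLines(con = "1energy.csv", n = 1)
names(x1) <- unlist(strsplit(nameLine, ";"))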

Importing and extracting a random sample from a large .CSV in R

I don't think there is a good R tool to read a file in a random way (perhaps this could become an extension to read.table, or to fread in the data.table package).

Using Perl you can do this task easily. For example, to read a random 1% of your file:

xx <- system(paste("perl -ne 'print if (rand() < .01)'", big_file), intern = TRUE)

Here I am calling it from R using system. xx now contains only about 1% of your file's lines.

You can wrap all this in a function:

read_partial_rand <- function(big_file, percent) {
  # Keep each line with probability `percent` (a value between 0 and 1)
  cmd <- paste0("perl -ne 'print if (rand() < ", percent, ")'")
  cmd <- paste(cmd, big_file)
  system(cmd, intern = TRUE)
}
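
Usage might look like this (a sketch, assuming a file named big_file.csv; note that the header line is itself only kept with probability percent, so it is safest to parse the sample without headers):

# Sample roughly 1% of the lines, then parse them into a data frame
xx <- read_partial_rand("big_file.csv", 0.01)
sample_df <- read.csv(text = xx, header = FALSE)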

How to read a CSV file every other row

You could read it all into memory with numpy and keep every other row:

import numpy as np
import pandas as pd

# loadtxt splits on whitespace by default; pass delimiter=',' for a CSV
data = np.loadtxt(filename, delimiter=',')
data = pd.DataFrame(data[::2])

The last bit, [::2], means "take every second element", starting from the first row (rows 0, 2, 4, ...).
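
For completeness, the same every-other-row selection in R (a sketch, assuming the path is stored in filename):

# A recycled logical index keeps rows 1, 3, 5, ...
df <- read.csv(filename)
df <- df[c(TRUE, FALSE), ]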


