Conditional Row Read of CSV in Pandas

Speed up this conditional row read of csv file in Pandas?

We could initially read just the column we want to filter on, on the assumption that reading a single column significantly reduces the parsing overhead.

# read only the column used for the mask
df_indx = (pd.read_csv(filename, error_bad_lines=False, usecols=['Accident_Index'])
           [lambda x: x['Accident_Index'].str.startswith('2005')])

Since the values are sorted in the input file, we can then use the skiprows and nrows parameters to read only the matching block of rows:

df_data = pd.read_csv(filename, error_bad_lines=False,
                      header=None, names=['Accident_Index', 'data'],
                      skiprows=df_indx.index[0] + 1,  # +1 so the header line is skipped too
                      nrows=df_indx.shape[0])

This gives the subset of the data we want, and we don't need to fetch the column names separately.
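If the matching rows were not sorted contiguously, a chunked read would be an alternative: stream the file in pieces and keep only matching rows, so the full file is never parsed into memory at once. A minimal sketch, using a small inline sample in place of the real file (the 'Accident_Index' column layout is assumed from the question):

```python
import io
import pandas as pd

# Inline sample standing in for the real CSV file
# (hypothetical layout: an 'Accident_Index' column plus one data column).
csv_text = (
    "Accident_Index,data\n"
    "200401,a\n"
    "200501,b\n"
    "200502,c\n"
    "200601,d\n"
)

# Read the file in chunks and keep only rows matching the condition.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)
df_data = pd.concat(
    chunk[chunk['Accident_Index'].astype(str).str.startswith('2005')]
    for chunk in chunks
)
```

With a real file, you would pass the filename instead of the `StringIO` buffer and pick a much larger `chunksize`.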


Pandas read_csv skiprows with conditional statements

No. skiprows will not allow you to drop rows based on their content/values.

Based on the Pandas documentation:

skiprows : list-like, int or callable, optional
    Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
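To illustrate the point: the callable receives the row index, not the row contents, so it can only express positional rules. A small sketch with an inline sample (made-up data):

```python
import io
import pandas as pd

csv_text = "a,b\n1,2\n3,4\n5,6\n7,8\n"

# The lambda sees only the 0-based line index: keep the header (index 0)
# and skip every second data row. It never sees the row's values.
df = pd.read_csv(io.StringIO(csv_text),
                 skiprows=lambda i: i > 0 and i % 2 == 0)
```

To drop rows by value you must read them first and filter afterwards (or filter chunk by chunk with `chunksize`).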

Pandas read_csv() conditionally skipping header row

If the headers in your CSV files follow a similar pattern, you can sniff the first line to decide whether or not to skip it.

filename = '/path/to/file.csv'
# skip the first line if it contains the telltale 'Created in' marker
skiprows = int('Created in' in next(open(filename)))
df = pd.read_csv(filename, skiprows=skiprows)

Good practice would be to use a context manager, so you could also do this:

filename = '/path/to/file.csv'
skiprows = 0
with open(filename, 'r') as f:
    for line in f:
        if line.startswith('Created '):
            skiprows = 1
        break  # only the first line needs checking
df = pd.read_csv(filename, skiprows=skiprows)

Reading csv row-wise and matching conditions in Python

import csv

seen = set()
with open("tmp.csv", "r") as f:
    for line in csv.reader(f, delimiter=","):
        row = tuple(line)  # csv.reader yields lists, which are unhashable
        if row in seen:
            break
        else:
            seen.add(row)

Depending on what you are looking to do, you might also find this approach useful: How can I filter lines on load in Pandas read_csv function?

How to read specific rows and columns, which satisfy some condition, from file while initializing a dataframe in Pandas?

There's no direct/easy way of doing that (that I know of)!

The first idea that comes to mind is to read just the header row of the csv, then build the list of desired columns with a list comprehension:

columnsOfInterest = [c for c in pd.read_csv(filename, nrows=0).columns if 'node' in c]

You'll now have the column names (and hence their positions), so you can read only those from your csv via usecols.

However, for the second part of your condition, which needs to calculate the mean, you'll unfortunately have to read all of the data for those columns, run the mean calculation, and then keep only the columns of interest (where the mean is > 0). That's the limit of my knowledge; maybe someone else has a way of doing this and can help you out. Good luck!
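Putting both steps together as a sketch, with an inline sample in place of the real file (the 'node' column names are made up for illustration):

```python
import io
import pandas as pd

# Inline sample standing in for the real CSV (hypothetical column names).
csv_text = (
    "node_a,node_b,other\n"
    "1,-2,9\n"
    "3,-4,9\n"
)

# Step 1: read only the header row to pick columns by name.
header = pd.read_csv(io.StringIO(csv_text), nrows=0)
columnsOfInterest = [c for c in header.columns if 'node' in c]

# Step 2: read just those columns, then keep the ones whose mean is > 0.
df = pd.read_csv(io.StringIO(csv_text), usecols=columnsOfInterest)
df = df.loc[:, df.mean() > 0]
```

With a real file, replace the `StringIO` buffer with the filename; only the 'node' columns are ever read in step 2.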


