How to Filter Lines on Load in Pandas read_csv Function

How can I filter lines on load in Pandas read_csv function?

There isn't an option to filter the rows before the CSV file is loaded into a pandas object.

You can either load the file and then filter using df[df['field'] > constant], or, if the file is very large and you are worried about running out of memory, use an iterator and apply the filter as you concatenate chunks of the file, e.g.:

import pandas as pd

# Read the file in 1000-row chunks, keeping only the matching rows from each.
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])
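
Note: passing chunksize already makes read_csv return an iterable reader, so iterator=True is optional here; it is mainly useful when you want to pull variable-sized chunks with get_chunk().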

You can vary the chunksize to suit your available memory.

How do I Filter a Pandas DataFrame After using read_csv() or read_excel()

Try this:

pd.read_csv('file.csv').query('Age >= 21')
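
If the threshold lives in a Python variable, query can reference it with the @ prefix. A minimal sketch (min_age is just an illustrative name):

import pandas as pd

min_age = 21  # hypothetical threshold variable
df = pd.read_csv('file.csv').query('Age >= @min_age')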

Filter rows from a CSV that have only a beginning quote, but no end quote, for a column

I tried on_bad_lines='warn' while reading the CSV: a warning is emitted and the bad row is skipped. Hope this helps.

import pandas as pd

# Bad lines trigger a warning on stderr and are dropped from the result.
df = pd.read_csv('c:\\bad_data.txt', on_bad_lines='warn', delimiter='|', engine='python')
display(df)


If you don't want the warning, use on_bad_lines='skip' instead.
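
For example, with the same file:

df = pd.read_csv('c:\\bad_data.txt', on_bad_lines='skip', delimiter='|', engine='python')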

Also, if you would prefer to capture the warning messages (to report on them or store them in a file), you can use redirect_stderr. Here is a sample in case anyone is looking for it.

import pandas as pd
from contextlib import redirect_stderr
import io

errorlist = []
# Redirect stderr to an in-memory buffer so we can report on it.
f = io.StringIO()
with redirect_stderr(f):
    df = pd.read_csv('c:\\bad_data.txt', on_bad_lines='warn', delimiter='|', engine='python')
if f.getvalue():
    errorlist.append("Line skipped due to parsing error : {}".format(f.getvalue()))

# To write the warnings straight to a file instead, redirect to a file handle:
# with open('c:\\errors.txt', 'w') as stderr, redirect_stderr(stderr):
#     df = pd.read_csv('c:\\bad_data.txt', on_bad_lines='warn', delimiter='|', engine='python')


How to read one line then skip three in pandas.read_csv(), until the end of the text file?

I agree with @sammywemmy's comment about the csv module for more granular control; it will make things WAY easier to modify and customize. That said, the callable option for skiprows did intrigue me, so here is how you could do it. (Note that we have to add edge cases to pick up the column header and first row.)

>>> def skip_func(x):
...     if x in (0, 1):
...         return False
...     else:
...         return (x-1) % 4 != 0

>>> pd.read_csv(filepath, skiprows=skip_func)
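
For comparison, here is a minimal sketch of the csv-module route @sammywemmy suggested, assuming the same keep-one-skip-three pattern (the file name is a placeholder):

import csv

with open('data.txt', newline='') as fh:  # hypothetical file name
    reader = csv.reader(fh)
    header = next(reader)  # keep the column header
    # enumerate starts at 0 on the first data row: keep it, then skip three.
    rows = [row for i, row in enumerate(reader) if i % 4 == 0]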

Python read csv file and filter data

The fields of a CSV row are strings, so you need int(row[1]) for the comparison to work correctly. I also recommend a list comprehension for the filtering, or pandas for speed (see the sketch after the code below). next(csv_reader) reads one row up front to capture the header as well.

Note: use newline='' with the csv module as documented to avoid blank lines between each row.

import csv

alpha_min = 110
alpha_max = 125

with open('test.csv', 'r', newline='') as input_file:
    csv_reader = csv.reader(input_file)
    header = next(csv_reader)
    results = [row for row in csv_reader if alpha_min < int(row[1]) < alpha_max]

with open('output.csv', 'w', newline='') as output_file:
    csv_writer = csv.writer(output_file)
    csv_writer.writerow(header)
    csv_writer.writerows(results)
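
For the pandas route mentioned above, a hedged equivalent; since the real header name isn't shown, I select the second column positionally, matching row[1]:

import pandas as pd

alpha_min, alpha_max = 110, 125
df = pd.read_csv('test.csv')
col = df.columns[1]  # second column, the counterpart of row[1] above
df[(df[col] > alpha_min) & (df[col] < alpha_max)].to_csv('output.csv', index=False)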

Python: Read Pandas Dataframe from csv File, Make Filtered Output to Another File as csv

Don't attempt to write each line individually; DataFrames have a to_csv method.

df = pd.read_csv('input.csv')
# some filtering logic, for example:
filtered_df = df[df['col a'] == 2]
filtered_df.to_csv('output.csv')
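
Note that to_csv writes the DataFrame's index as the first column by default; pass index=False if you don't want that extra column in output.csv.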

pandas read_csv and keep only certain rows (python)

I think you would need to find the number of lines first, like this.

num_lines = sum(1 for line in open('myfile.txt'))

Then you would need to build the list of row indices to skip, i.e. everything not in index_list:

to_exclude = [i for i in range(num_lines) if i not in index_list]

and then load your data:

pd.read_csv(path, skiprows=to_exclude)
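
Alternatively, skiprows also accepts a callable (as in the skip_func example earlier), which avoids counting the lines first:

# Note: line 0 is the header row, so include 0 in index_list to keep the column names.
pd.read_csv(path, skiprows=lambda i: i not in index_list)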

