How can I filter lines on load in Pandas read_csv function?
There isn't an option to filter the rows before the CSV file is loaded into a pandas object.
You can either load the file and then filter using df[df['field'] > constant], or, if you have a very large file and are worried about running out of memory, use an iterator and apply the filter as you concatenate the chunks of your file, e.g.:
import pandas as pd
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])
You can vary the chunksize to suit your available memory.
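For anyone who wants to try the chunked approach without a file on disk, here is a minimal sketch. The in-memory CSV text, the column names, and the threshold value are made up for illustration; only the read_csv/concat pattern comes from the answer above.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'file.csv' so the sketch runs on its own.
csv_text = "field,name\n" + "\n".join(f"{i},row{i}" for i in range(10)) + "\n"
constant = 6  # assumed threshold, not from the original question

# Passing chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk (here 4 rows) is held in memory at a time.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

# Filter each chunk as it arrives, then stitch the survivors together.
filtered = pd.concat(
    [chunk[chunk["field"] > constant] for chunk in reader],
    ignore_index=True,
)

print(filtered)
```

Only the rows that pass the filter are kept, so peak memory is roughly one chunk plus the filtered result.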
How do I Filter a Pandas DataFrame After using read_csv() or read_excel()
Try this:
pd.read_csv('file.csv').query('Age >= 21')
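If the cutoff lives in a Python variable, query can reference it with the @ prefix. A small sketch, assuming made-up data and a hypothetical min_age variable:

```python
import io

import pandas as pd

# Hypothetical data standing in for 'file.csv'.
csv_text = "Name,Age\nAnn,34\nBob,19\nCid,21\n"

min_age = 21  # assumed cutoff

# '@min_age' tells query() to look the name up in the surrounding Python scope.
adults = pd.read_csv(io.StringIO(csv_text)).query("Age >= @min_age")

print(adults)
```

Note that the whole file is still parsed first; query only filters the resulting DataFrame.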
Filter rows from a CSV that have only a starting quote, but no end quote, for a column
I tried on_bad_lines='warn' while reading the CSV; now I get a warning and the malformed row is skipped. Hope this helps.
df = pd.read_csv('c:\\bad_data.txt', on_bad_lines='warn', delimiter='|', engine='python')
display(df)
If you don't want the warning, you can use on_bad_lines='skip' instead.
Also, if you would prefer to store the warning messages in a file, you can use redirect_stderr. I am providing a sample in case anyone is looking for it.
import io
from contextlib import redirect_stderr
import pandas as pd
errorlist = []
# Redirect stderr to an in-memory buffer so we can report on it.
f = io.StringIO()
with redirect_stderr(f):
    df = pd.read_csv('c:\\bad_data.txt', on_bad_lines='warn', delimiter='|', engine='python')
if f.getvalue():
    errorlist.append("Line skipped due to parsing error : {}".format(f.getvalue()))
    # Also store the captured messages in a file.
    with open('c:\\errors.txt', 'w') as err_file:
        err_file.write(f.getvalue())
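Another way to keep track of rejected rows, assuming pandas 1.4 or later, is to pass a callable as on_bad_lines (this only works with engine='python'). The sketch below uses made-up pipe-delimited data with one row that has an extra field; collect_bad_line and bad_lines are hypothetical names, not part of pandas.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited data; the third line has one field too many.
raw = "id|name\n1|alpha\n2|beta|extra\n3|gamma\n"

bad_lines = []  # fields of every row pandas could not fit into the columns


def collect_bad_line(fields):
    """Record the offending fields; returning None tells pandas to drop the row."""
    bad_lines.append(fields)
    return None


df = pd.read_csv(
    io.StringIO(raw),
    delimiter="|",
    engine="python",  # callables for on_bad_lines require the python engine
    on_bad_lines=collect_bad_line,
)

print(df)
print(bad_lines)
```

This avoids redirecting stderr at all, and the callable could also return a repaired list of fields instead of None if you would rather fix the row than drop it.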
How to read one line, then skip three, with pandas.read_csv(), until the end of the text file?
I agree with @sammywemmy's comment about csv module for more granular control, it will make things WAY easier to modify and customize. That said, the callable option did intrigue me, so here is how you could do it. (Note that we have to add edge cases to pick up the column header and first row.)
>>> def skip_func(x):
...     if x in (0, 1):
...         return False
...     else:
...         return (x-1) % 4 != 0
...
>>> pd.read_csv(filepath, skiprows=skip_func)
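To check which lines the callable actually keeps, here is a small self-contained sketch. The in-memory data (a header plus the numbers 1 to 12) is made up so that each value equals its line number in the file.

```python
import io

import pandas as pd


def skip_func(x):
    # x is the 0-based line number in the file; returning True skips that line.
    if x in (0, 1):
        return False  # keep the header (0) and the first data line (1)
    # Keep lines 5, 9, 13, ... and skip the three lines in between.
    return (x - 1) % 4 != 0


# Hypothetical data: line n of the file holds the value n.
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 13)) + "\n"

df = pd.read_csv(io.StringIO(csv_text), skiprows=skip_func)

print(df)
```

With this input the kept values are 1, 5 and 9, i.e. one line read, three skipped.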
Python read csv file and filter data
The fields of a csv row are strings, so you need int(row[1]) for the comparison to work correctly. I also recommend a list comprehension for the filtering, or pandas for speed. next(csv_reader) will read one row to capture the headers as well.
Note: use newline='' with the csv module, as documented, to avoid blank lines between each row.
import csv
alpha_min = 110
alpha_max = 125
with open('test.csv','r',newline='') as input_file:
    csv_reader = csv.reader(input_file)
    header = next(csv_reader)
    results = [row for row in csv_reader if alpha_min < int(row[1]) < alpha_max]
with open('output.csv','w',newline='') as output_file:
    csv_writer = csv.writer(output_file)
    csv_writer.writerow(header)
    csv_writer.writerows(results)
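For comparison, the pandas route mentioned above might look roughly like this. It is only a sketch: the column names and values are invented, and the data is read from and written to in-memory buffers rather than test.csv and output.csv.

```python
import io

import pandas as pd

alpha_min = 110
alpha_max = 125

# Hypothetical contents standing in for test.csv; the second column is filtered on.
csv_text = "name,alpha\na,105\nb,110\nc,118\nd,124\ne,125\n"

df = pd.read_csv(io.StringIO(csv_text))

# pandas parses the column as integers, so no int() conversion is needed.
alpha = df.iloc[:, 1]
results = df[(alpha > alpha_min) & (alpha < alpha_max)]

# Write header and rows in one call; index=False leaves out the row index.
out = io.StringIO()
results.to_csv(out, index=False)

print(out.getvalue())
```

The strict comparisons mirror alpha_min < int(row[1]) < alpha_max from the csv version, so both boundary values are excluded.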
Python: Read Pandas Dataframe from csv File, Make Filtered Output to Another File as csv
Don't attempt to write each line individually; DataFrames have a to_csv method.
import pandas as pd
df = pd.read_csv('input.csv')
# some filtering logic, for example:
filtered_df = df[df['col a'] == 2]
# pass index=False if you don't want the row index written as an extra column
filtered_df.to_csv('output.csv')
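One thing to watch for is that to_csv writes the DataFrame index as a first, unnamed column by default. A small sketch with made-up data (calling to_csv without a path returns the CSV text as a string, which makes the difference easy to see):

```python
import pandas as pd

# Hypothetical data in place of input.csv.
df = pd.DataFrame({"col a": [1, 2, 2], "col b": ["x", "y", "z"]})
filtered_df = df[df["col a"] == 2]

# Default: the original row labels (1 and 2) are written as the first column.
with_index = filtered_df.to_csv()

# index=False writes only the data columns.
without_index = filtered_df.to_csv(index=False)

print(with_index)
print(without_index)
```

If the output is meant to be read back with read_csv, index=False usually gives the cleaner round trip.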
pandas read_csv and keep only certain rows (python)
I think you would need to find the number of lines first, like this:
with open(path) as f:
    num_lines = sum(1 for line in f)
Then you would need to build the list of line numbers that are not in index_list:
to_exclude = [i for i in range(num_lines) if i not in index_list]
and then load your data:
pd.read_csv(path, skiprows=to_exclude)
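A possible shortcut that avoids counting the lines first is to pass a callable to skiprows and skip every line number that is not in index_list. The sketch below uses made-up data and assumes index_list holds 0-based file line numbers, with 0 being the header line; if the header is not in the list, the first kept line would be treated as the header.

```python
import io

import pandas as pd

# Hypothetical file contents: line 0 is the header, lines 1-5 hold a-e.
csv_text = "letter\na\nb\nc\nd\ne\n"

# Assumed line numbers to keep (0 keeps the header line).
index_list = [0, 2, 4]
keep = set(index_list)  # set membership tests are fast for large lists

# The callable receives each 0-based line number; True means "skip it".
df = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i not in keep)

print(df)
```

This reads the file in a single pass, which matters for large files where counting lines up front doubles the I/O.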