pandas read_csv and filter columns with usecols
The solution lies in understanding these two keyword arguments:

- names is only necessary when there is no header row in your file and you want to specify other arguments (such as usecols) using column names rather than integer indices.
- usecols is supposed to provide a filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading.

header=0 is sufficient on its own, and additionally passing names appears to confuse pd.read_csv. Removing names from the second call gives the desired output:
import pandas as pd
from io import StringIO  # on Python 2: from StringIO import StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
                 header=0,
                 index_col=["date", "loc"],
                 usecols=["date", "loc", "x"],
                 parse_dates=["date"])
Which gives us:

                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5
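For contrast, if the same data arrived without a header row, names is exactly what you would reach for: it supplies the labels that usecols and index_col then refer to. A minimal sketch under that assumption:

```python
import pandas as pd
from io import StringIO

# Same rows as above, but with no header line in the file
csv = """bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5"""

df = pd.read_csv(StringIO(csv),
                 header=None,                          # no header row in the file
                 names=["dummy", "date", "loc", "x"],  # so we name the columns ourselves
                 index_col=["date", "loc"],
                 usecols=["date", "loc", "x"],
                 parse_dates=["date"])
print(df)
```

Here names labels all four file columns, and usecols then filters by those labels exactly as it did by header names above.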
Pandas read_csv usecols= won't allow filtering by string and index at the same time
For me, filtering with a callable works:
f = lambda x: 'M6A=F' in x or 'Unnamed: 0' in x
df = pd.read_csv('hist1day.csv', usecols=f, index_col=[0], skiprows=[2])
print(df)
M6A=F M6A=F.1 M6A=F.2 \
NaN Open High Low
2009-03-20 0.6830999851226807 0.6830999851226807 0.6830999851226807
2009-03-23 0.6995000243186951 0.6995000243186951 0.6995000243186951
2009-03-24 0.6998000144958496 0.7056999802589417 0.6912999749183655
2009-03-25 0.6941999793052673 0.7014999985694885 0.6894999742507935
... ... ...
2022-08-04 0.6949999928474426 0.699400007724762 0.6938999891281128
2022-08-05 0.6972000002861023 0.6978999972343445 0.6873999834060669
2022-08-08 0.6899999976158142 0.7013000249862671 0.6899999976158142
2022-08-09 0.6987000107765198 0.6998000144958496 0.6956999897956848
2022-08-10 0.7081999778747559 0.7091000080108643 0.7080000042915344
M6A=F.3 M6A=F.4 M6A=F.5
NaN Close Adj Close Volume
2009-03-20 0.6830999851226807 0.6830999851226807 0.0
2009-03-23 0.6995000243186951 0.6995000243186951 392.0
2009-03-24 0.6963000297546387 0.6963000297546387 588.0
2009-03-25 0.6901999711990356 0.6901999711990356 616.0
... ... ...
2022-08-04 0.6985999941825867 0.6985999941825867 6604.0
2022-08-05 0.691100001335144 0.691100001335144 9935.0
2022-08-08 0.6984000205993652 0.6984000205993652 8102.0
2022-08-09 0.695900022983551 0.695900022983551 8102.0
2022-08-10 0.7085000276565552 0.7085000276565552 305.0
[3373 rows x 6 columns]
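The same callable idea can be shown on a tiny in-memory CSV (the data below is invented for illustration): usecols receives each header string and keeps the column when the function returns True.

```python
import pandas as pd
from io import StringIO

csv = """date,M6A=F,M6A=F.1,other
2009-03-20,0.68,0.68,x
2009-03-23,0.70,0.70,y"""

# Keep the date column plus anything whose name contains "M6A=F"
keep = lambda name: "M6A=F" in name or name == "date"
df = pd.read_csv(StringIO(csv), usecols=keep, index_col="date")
print(df.columns.tolist())  # ['M6A=F', 'M6A=F.1']
```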
How to select specific columns from read_csv which start with a specific word?
You can pass a callable to usecols: it is applied to each column header, and only the columns for which it returns True are read:
df = pd.read_csv('data.csv', usecols=lambda col: col.startswith('A_') or col.startswith('X_'))
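As a self-contained sketch (with made-up column names): str.startswith also accepts a tuple of prefixes, which shortens the check.

```python
import pandas as pd
from io import StringIO

csv = """A_one,A_two,B_one,X_one
1,2,3,4
5,6,7,8"""

# Keep only columns whose header starts with "A_" or "X_"
df = pd.read_csv(StringIO(csv),
                 usecols=lambda col: col.startswith(("A_", "X_")))
print(df.columns.tolist())  # ['A_one', 'A_two', 'X_one']
```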
Issue trying to read and filter columns using pd.read_csv when columns have same names
Since you're reading a csv-file that was produced from a dataframe with multi-index columns, you have to take that into account when reading it back into a dataframe.
Try something like the following:
import pandas as pd
from pandas_datareader import data as pdr
import yfinance as yf

yf.pdr_override()

tickers = "MYM=F M6A=F"
hist_data = pdr.get_data_yahoo(
    tickers, period="1mo", interval="5m", prepost=True, group_by="ticker"
)

# Writing to csv-file
hist_data.to_csv("hist_data.csv")

# Reading back from csv-file
hist_data = pd.read_csv("hist_data.csv", index_col=0, header=[0, 1])

# Selecting the M6A=F/Volume-column:
volume = hist_data[[("M6A=F", "Volume")]]
print(volume)
The first change is to set an index column by using index_col=0 (obviously the first one here). The second, header=[0, 1], makes sure that the first two rows are used to build the multi-index columns. From the read_csv documentation:

    header : int, list of int, None, default 'infer'
    ... The header can be a list of integers that specify row locations for a multi-index on the columns ...

Result:
                           M6A=F
                          Volume
Datetime
2022-06-06 09:40:00-04:00 0.0
2022-06-06 09:45:00-04:00 67.0
2022-06-06 09:50:00-04:00 36.0
2022-06-06 09:55:00-04:00 18.0
2022-06-06 10:00:00-04:00 61.0
... ...
2022-07-06 09:20:00-04:00 47.0
2022-07-06 09:25:00-04:00 12.0
2022-07-06 09:30:00-04:00 7.0
2022-07-06 09:31:10-04:00 0.0
2022-07-06 09:31:20-04:00 NaN
[6034 rows x 1 columns]
(I've used double brackets here, hist_data[[("M6A=F", "Volume")]], to get a dataframe that shows the column label. If you don't need that, use single brackets: hist_data[("M6A=F", "Volume")].)

How can I filter lines on load in Pandas read_csv function?
There isn't an option to filter the rows before the CSV file is loaded into a pandas object.
You can either load the file and then filter using df[df['field'] > constant], or, if you have a very large file and are worried about running out of memory, use an iterator and apply the filter as you concatenate chunks of your file, e.g.:
import pandas as pd
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])
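As a self-contained sketch of that chunked pattern (the column name field comes from the snippet above; the data and the threshold 6 are invented):

```python
import pandas as pd
from io import StringIO

csv = """field,value
1,a
5,b
12,c
7,d"""

# Read two rows at a time and keep only rows where field > 6
reader = pd.read_csv(StringIO(csv), chunksize=2)
df = pd.concat(chunk[chunk["field"] > 6] for chunk in reader)
print(df["field"].tolist())  # [12, 7]
```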
You can vary the chunksize to suit your available memory. See here for more details.

Is there a way to read required AND optional columns with usecols in pandas?
I'm still unsure whether this specific question has an answer within the pandas API, but I've figured out a way to solve the problem. I mentioned in the comments that I am running a for loop that reads in different datasets with this condition for usecols. As @Michael S. suggested, try and except is the way to go here.

First, run through the datasets with only the required columns, and append each dataset name that passes to a list:
# list of required columns
requiredList = ['required_col1', 'required_col2']
finalDatasets = []

for i, dataName in enumerate(database):
    try:
        data = pd.read_csv(dataName, usecols=requiredList)
    except ValueError:  # raised when required columns are missing
        continue
    else:
        finalDatasets.append(dataName)
Since this new list excludes the datasets that do not have the required columns, you can now use a lambda to pick up the optional columns as well:

columnsList = ['required_col1', 'required_col2', 'optional_col1', 'optional_col2']
for i, dataName in enumerate(finalDatasets):
    # read in data
    try:
        data = pd.read_csv(dataName, usecols=lambda x: x in columnsList)
    except ValueError:
        continue
    else:
        ...  # {insert analysis}
This way, you are able to parse the datasets that have the required columns even when you are unsure whether they have the optional ones (this is where the lambda comes in handy).

pandas: read csv filtering columns by old name and renaming them at the same time
You can use positional indexing in usecols, so if you know the positions in your file you could do:
import pandas as pd

# add header=0 if file.csv already has a header row you want to replace
df = pd.read_csv("file.csv", usecols=[2, 6], names=["two", "six"])
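A sketch with invented file contents: when names has the same length as a positional usecols, pandas applies the new names to the selected columns (pass header=0 so the file's own header row isn't read as data).

```python
import pandas as pd
from io import StringIO

csv = """a,b,c,d,e,f,g
0,1,2,3,4,5,6
7,8,9,10,11,12,13"""

# Select positions 2 and 6 and rename them in the same call
df = pd.read_csv(StringIO(csv), usecols=[2, 6],
                 names=["two", "six"], header=0)
print(df)
```

The two kept columns arrive already renamed as two and six, with no follow-up rename step.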
Read specific columns with pandas or other python module
An easy way to do this is with the pandas library:
import pandas as pd

fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)

# See the keys
print(df.keys())

# See content in 'star_name'
print(df.star_name)
The key here is skipinitialspace, which removes the leading spaces in the header, so ' star_name' becomes 'star_name' and matches the name given in usecols.
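A self-contained version of the same fix, with invented data that has a space after each comma:

```python
import pandas as pd
from io import StringIO

csv = """star_name, ra
Sirius, 101.287
Vega, 279.234"""

fields = ["star_name", "ra"]
# Without skipinitialspace the second header would be ' ra' and usecols would fail
df = pd.read_csv(StringIO(csv), skipinitialspace=True, usecols=fields)
print(df.columns.tolist())  # ['star_name', 'ra']
```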