Pandas Read_CSV and Filter Columns with Usecols

The solution lies in understanding these two keyword arguments:

  • names is only needed when your file has no header row and you want to refer to columns by name rather than by integer index in other arguments (such as usecols); see the sketch after this list.
  • usecols filters columns before the whole file is parsed into a DataFrame; used properly, there should never be a need to delete columns after reading.
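
For the headerless case, a minimal sketch with a couple of rows in the same shape as the example below, but with the header row removed:

import pandas as pd
from io import StringIO

headerless = """bar,20090101,a,1
bar,20090102,a,3"""

# With no header row, names supplies the labels that usecols can then refer to
df = pd.read_csv(StringIO(headerless),
                 header=None,
                 names=["dummy", "date", "loc", "x"],
                 usecols=["date", "loc", "x"])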

Because your file has a header row, passing header=0 is sufficient, and additionally passing names is what confuses pd.read_csv.

Removing names from the second call gives the desired output:

import pandas as pd
from io import StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
                 header=0,
                 index_col=["date", "loc"],
                 usecols=["date", "loc", "x"],
                 parse_dates=["date"])

Which gives us:

                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5

Pandas read_csv usecols won't allow filtering by string and index at the same time

Passing a callable to usecols works for me here:

import pandas as pd

f = lambda x: 'M6A=F' in x or 'Unnamed: 0' in x
df = pd.read_csv('hist1day.csv', usecols=f, index_col=[0], skiprows=[2])
print(df)
                         M6A=F             M6A=F.1             M6A=F.2  \
NaN                       Open                High                 Low
2009-03-20  0.6830999851226807  0.6830999851226807  0.6830999851226807
2009-03-23  0.6995000243186951  0.6995000243186951  0.6995000243186951
2009-03-24  0.6998000144958496  0.7056999802589417  0.6912999749183655
2009-03-25  0.6941999793052673  0.7014999985694885  0.6894999742507935
...                        ...                 ...                 ...
2022-08-04  0.6949999928474426   0.699400007724762  0.6938999891281128
2022-08-05  0.6972000002861023  0.6978999972343445  0.6873999834060669
2022-08-08  0.6899999976158142  0.7013000249862671  0.6899999976158142
2022-08-09  0.6987000107765198  0.6998000144958496  0.6956999897956848
2022-08-10  0.7081999778747559  0.7091000080108643  0.7080000042915344

                       M6A=F.3             M6A=F.4  M6A=F.5
NaN                      Close           Adj Close   Volume
2009-03-20  0.6830999851226807  0.6830999851226807      0.0
2009-03-23  0.6995000243186951  0.6995000243186951    392.0
2009-03-24  0.6963000297546387  0.6963000297546387    588.0
2009-03-25  0.6901999711990356  0.6901999711990356    616.0
...                        ...                 ...      ...
2022-08-04  0.6985999941825867  0.6985999941825867   6604.0
2022-08-05   0.691100001335144   0.691100001335144   9935.0
2022-08-08  0.6984000205993652  0.6984000205993652   8102.0
2022-08-09   0.695900022983551   0.695900022983551   8102.0
2022-08-10  0.7085000276565552  0.7085000276565552    305.0

[3373 rows x 6 columns]
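
To make the callable's behaviour concrete: pandas invokes it once per header string and keeps the columns for which it returns True. A small illustration with hypothetical header names:

cols = ['Unnamed: 0', 'M6A=F', 'M6A=F.1', 'Datetime']
f = lambda x: 'M6A=F' in x or 'Unnamed: 0' in x
print([c for c in cols if f(c)])  # ['Unnamed: 0', 'M6A=F', 'M6A=F.1']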

How to select specific columns with read_csv that start with a specific word?

You don't have to read the file twice; pass a callable to usecols and pandas evaluates it against each column name, keeping only those for which it returns True:

df = pd.read_csv('data.csv', usecols=lambda col: col.startswith('A_') or col.startswith('X_'))
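
If you'd rather read the file twice (once for the headers only, once for the actual data), a minimal sketch of that variant, assuming the same data.csv:

import pandas as pd

# Pass 1: read only the header row to learn the column names
header = pd.read_csv('data.csv', nrows=0).columns

# Pass 2: read the data, keeping only the matching columns
wanted = [c for c in header if c.startswith(('A_', 'X_'))]
df = pd.read_csv('data.csv', usecols=wanted)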

Issue trying to read and filter columns using pd.read_csv when columns have the same names

Since you're reading a csv-file that was produced from a dataframe with multi-index columns, you have to take that into account when reading it back into a dataframe.

Try something like the following:

import pandas as pd
from pandas_datareader import data as pdr
import yfinance as yf
yf.pdr_override()

tickers = "MYM=F M6A=F"
hist_data = pdr.get_data_yahoo(
    tickers, period="1mo", interval="5m", prepost=True, group_by="ticker"
)

# Writing to csv-file
hist_data.to_csv("hist_data.csv")

# Reading back from csv-file
hist_data = pd.read_csv("hist_data.csv", index_col=0, header=[0, 1])

# Selecting the M6A=F/Volume-column:
volume = hist_data[[("M6A=F", "Volume")]]
print(volume)

The first change is to set an index column with index_col=0 (the first column here). The second, header=[0, 1], makes sure that the first two rows are used to build the multi-index columns. From the documentation:

header : int, list of int, None, default ‘infer’

... The header can be a list of integers that specify row locations for a multi-index on the columns ...

Result:

                             M6A=F
                            Volume
Datetime
2022-06-06 09:40:00-04:00      0.0
2022-06-06 09:45:00-04:00     67.0
2022-06-06 09:50:00-04:00     36.0
2022-06-06 09:55:00-04:00     18.0
2022-06-06 10:00:00-04:00     61.0
...                            ...
2022-07-06 09:20:00-04:00     47.0
2022-07-06 09:25:00-04:00     12.0
2022-07-06 09:30:00-04:00      7.0
2022-07-06 09:31:10-04:00      0.0
2022-07-06 09:31:20-04:00      NaN

[6034 rows x 1 columns]

(I've used double brackets here hist_data[[("M6A=F", "Volume")]] to get a dataframe that shows the column label. If you don't need that, use single brackets hist_data[("M6A=F", "Volume")] etc.)
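
Two related selections that may also be useful here (standard multi-index column indexing, not specific to this dataset): hist_data["M6A=F"] returns every column for that ticker as a regular dataframe, and hist_data.xs("Volume", axis=1, level=1) pulls the Volume column for all tickers at once.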

How can I filter lines on load in Pandas read_csv function?

There isn't an option to filter the rows before the CSV file is loaded into a pandas object.

You can either load the file and then filter with df[df['field'] > constant], or, if the file is very large and you are worried about running out of memory, use an iterator and apply the filter as you concatenate chunks of the file, e.g.:

import pandas as pd
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])

You can vary the chunksize to suit your available memory; see the pandas IO documentation on iterating through files chunk by chunk for more details.
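
On newer pandas versions (1.2+), the reader returned when you pass chunksize can also be used as a context manager, which closes the file for you. A minimal sketch of the same filter (constant is a placeholder threshold):

import pandas as pd

constant = 100  # hypothetical threshold

with pd.read_csv('file.csv', chunksize=1000) as reader:
    df = pd.concat(chunk[chunk['field'] > constant] for chunk in reader)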

Is there a way to read required AND optional columns with usecols in pandas?

I'm still unsure whether this specific question has an answer within the pandas API, but I've found a way to solve the problem. As mentioned in the comments, I'm looping over several datasets and applying this usecols condition to each one.

As @Michael S. suggested, try and except is the way to go here.

First, run through each dataset with the required columns, appending the names of those that load successfully to a list.

import pandas as pd

# list of required columns
requiredList = ['required_col1', 'required_col2']
finalDatasets = []
for dataName in database:
    try:
        data = pd.read_csv(dataName, usecols=requiredList)
    except ValueError:
        # pandas raises ValueError when a usecols name is missing from the file
        continue
    else:
        finalDatasets.append(dataName)

Since this new list has gotten rid of the datasets that do not have the required columns, you can now run the lambda function to parse through optional columns:

columnsList = ['required_col1', 'required_col2', 'optional_col1', 'optional_col2']
for dataName in finalDatasets:
    # read in data; optional columns that are absent simply aren't selected
    try:
        data = pd.read_csv(dataName, usecols=lambda x: x in columnsList)
    except Exception:
        continue
    else:
        ...  # {insert analysis}

This way you can parse the datasets that definitely have the required columns even when you are unsure whether they also have the optional ones (this is where the lambda comes in handy).
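
As an alternative to the try/except filter, a minimal sketch that checks the header row directly (reusing the requiredList and columnsList names from above):

import pandas as pd

def read_if_valid(path):
    # Read only the header row to check for the required columns
    header = set(pd.read_csv(path, nrows=0).columns)
    if not set(requiredList).issubset(header):
        return None
    # Safe to read: required columns exist, optional ones are picked up if present
    return pd.read_csv(path, usecols=lambda c: c in columnsList)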

pandas: read csv filtering columns by old name and renaming them at the same time

You can pass positional indices to usecols, so if you know the column positions in your file you can select and rename in one call:

import pandas as pd

df = pd.read_csv("file.csv", usecols=[2,6], names=["two", "six"])

Read specific columns with pandas or other python module

An easy way to do this is with the pandas library:

import pandas as pd
fields = ['star_name', 'ra']

df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print(df.keys())
# See content in 'star_name'
print(df.star_name)

The key here is skipinitialspace, which removes the whitespace after the delimiter, so the header ' star_name' becomes 'star_name'.
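
To see why that matters, a minimal sketch with an inline file whose header has a space after each comma:

import pandas as pd
from io import StringIO

csv = "star_name, ra\nSirius, 101.287"

# Without skipinitialspace the second header would be ' ra', and usecols=['ra'] would fail
df = pd.read_csv(StringIO(csv), skipinitialspace=True, usecols=['star_name', 'ra'])
print(df.columns.tolist())  # ['star_name', 'ra']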


