How to select all columns whose names start with X in a pandas DataFrame
Just perform a list comprehension to create your columns:
In [28]:
filter_col = [col for col in df if col.startswith('foo')]
filter_col
Out[28]:
['foo.aa', 'foo.bars', 'foo.fighters', 'foo.fox', 'foo.manchu']
In [29]:
df[filter_col]
Out[29]:
foo.aa foo.bars foo.fighters foo.fox foo.manchu
0 1.0 0 0 2 NA
1 2.1 0 1 4 0
2 NaN 0 NaN 1 0
3 4.7 0 0 0 0
4 5.6 0 0 0 0
5 6.8 1 0 5 0
Another method is to create a Series from the columns and use the vectorised str method startswith:
In [33]:
df[df.columns[pd.Series(df.columns).str.startswith('foo')]]
Out[33]:
foo.aa foo.bars foo.fighters foo.fox foo.manchu
0 1.0 0 0 2 NA
1 2.1 0 1 4 0
2 NaN 0 NaN 1 0
3 4.7 0 0 0 0
4 5.6 0 0 0 0
5 6.8 1 0 5 0
To get what you want, add the following to mask out the values that don't meet your == 1 criterion:
In [36]:
df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]]==1]
Out[36]:
bar.baz foo.aa foo.bars foo.fighters foo.fox foo.manchu nas.foo
0 NaN 1 NaN NaN NaN NaN NaN
1 NaN NaN NaN 1 NaN NaN NaN
2 NaN NaN NaN NaN 1 NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN 1 NaN NaN NaN NaN
EDIT
OK, after seeing what you want, the convoluted answer is this:
In [72]:
df.loc[df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]] == 1].dropna(how='all', axis=0).index]
Out[72]:
bar.baz foo.aa foo.bars foo.fighters foo.fox foo.manchu nas.foo
0 5.0 1.0 0 0 2 NA NA
1 5.0 2.1 0 1 4 0 0
2 6.0 NaN 0 NaN 1 0 1
5 6.8 6.8 1 0 5 0 0
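The same row selection can be written more directly with a boolean row mask. This is a minimal sketch, assuming a small hypothetical frame: the column names echo the example above, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical data, not the frame from the question.
df = pd.DataFrame({
    "bar.baz": [5, 5, 6, 7],
    "foo.aa": [1.0, 2.1, 3.3, 4.7],
    "foo.fox": [2, 1, 4, 0],
    "nas.foo": [0, 0, 1, 1],
})

# Columns whose names start with "foo".
foo_cols = [col for col in df.columns if col.startswith("foo")]

# True for each row where at least one "foo" column equals 1.
row_mask = (df[foo_cols] == 1).any(axis=1)

# Keep those rows, with all columns intact.
result = df.loc[row_mask]
print(result)
```

Note that nas.foo is ignored here because its name only ends with "foo".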
Find column whose name contains a specific string
Just iterate over DataFrame.columns. In this example you end up with a list of the column names that match:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)
Output:
['hey spke', 'no', 'spike-2', 'spiked-in']
['spike-2', 'spiked-in']
Explanation:
df.columns returns the column names. [col for col in df.columns if 'spike' in col] iterates over df.columns with the variable col and adds it to the resulting list if col contains 'spike'. This syntax is a list comprehension.
If you only want the resulting data set with the columns that match you can do this:
df2 = df.filter(regex='spike')
print(df2)
Output:
spike-2 spiked-in
0 1 7
1 2 8
2 3 9
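DataFrame.filter can also distinguish "contains" from "starts with". The sketch below assumes a slightly different, hypothetical set of column names (the "no spike" column is added for illustration) to show like= next to an anchored regex=.

```python
import pandas as pd

# Hypothetical data; "no spike" is added to show the difference.
df = pd.DataFrame({
    "spike-2": [1, 2, 3],
    "hey spke": [4, 5, 6],
    "spiked-in": [7, 8, 9],
    "no spike": [10, 11, 12],
})

# like= keeps labels that contain the substring anywhere.
contains_spike = df.filter(like="spike")

# regex= searches within the label, so anchor with ^ to match only the start.
starts_with_spike = df.filter(regex="^spike")

print(list(contains_spike.columns))
print(list(starts_with_spike.columns))
```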
Selecting columns with startswith in pandas
Converting to a Series is not necessary, but if you want to add the result to another list of columns, convert the output to a list:
cols = df.columns[df.columns.str.startswith('t')].tolist()
df = df[['score','obs'] + cols].rename(columns = {'treatment':'treat'})
Another idea is to build two masks and chain them with | (bitwise OR).
Notice:
In your solution the column names are filtered from the original column names before the rename, so it is necessary to rename afterwards:
m1 = df.columns.str.startswith('t')
m2 = df.columns.isin(['score','obs'])
df = df.loc[:, m1 | m2].rename(columns = {'treatment':'treat'})
print (df)
obs treat score tr tk
0 1 0 strong 1 6
1 2 1 weak 2 7
2 3 0 normal 3 8
3 1 1 weak 4 9
4 2 0 strong 5 10
If you need to rename first, you have to assign the result back so that the filter sees the renamed column names:
df = df.rename(columns = {'treatment':'treat'})
df = df.loc[:, df.columns.str.startswith('t') | df.columns.isin(['score','obs'])]
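For reference, here is a self-contained sketch of the two-mask approach. The frame is hypothetical: only the column names follow the answer above, and the "other" column is added to show something being dropped.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({
    "obs": [1, 2, 3],
    "treatment": [0, 1, 0],
    "score": ["strong", "weak", "normal"],
    "tr": [1, 2, 3],
    "other": [9, 9, 9],
})

m1 = df.columns.str.startswith("t")      # names starting with "t"
m2 = df.columns.isin(["score", "obs"])   # explicitly listed names

# Combine the masks element-wise, select, then rename.
selected = df.loc[:, m1 | m2].rename(columns={"treatment": "treat"})
print(list(selected.columns))
```

Because the masks are positional, the selected columns keep their original order.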
How to select all columns that start with durations or shape?
You could use the startswith method of the columns' str accessor:
df = data[data.columns[data.columns.str.startswith('durations') | data.columns.str.startswith('shape')]]
df.fillna(0)
Or you could use the contains method:
df = data.iloc[:, data.columns.str.contains('durations.*|shape.*')]
df.fillna(0)
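Since Python's built-in str.startswith accepts a tuple of prefixes, several prefixes can be tested in one pass with a list comprehension. This is a sketch on hypothetical data: the column names and values below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data with one missing value.
data = pd.DataFrame({
    "durations_a": [1.0, None],
    "shape_x": [3, 4],
    "reshape_y": [5, 6],
    "other": [7, 8],
})

# str.startswith accepts a tuple and returns True if any prefix matches.
prefixes = ("durations", "shape")
cols = [col for col in data.columns if col.startswith(prefixes)]

# fillna returns a new frame, so keep the result.
df = data[cols].fillna(0)
print(df)
```

A contains-based filter on 'shape' would also pick up reshape_y here, which is why a prefix test may be the safer choice when the question is about how names start.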
Get the specified set of columns from pandas dataframe
You can use this as the list of columns to select, so:
df_final = df[[col for col in df if col.startswith('column')]]
The origin of the list of strings does not matter: as long as you pass a list of strings to the subscript, this will normally work.
select columns based on columns names containing a specific string in pandas
alternative methods:
In [13]: df.loc[:, df.columns.str.startswith('alp')]
Out[13]:
alp1 alp2
0 0.357564 0.108907
1 0.341087 0.198098
2 0.416215 0.644166
3 0.814056 0.121044
4 0.382681 0.110829
5 0.130343 0.219829
6 0.110049 0.681618
7 0.949599 0.089632
8 0.047945 0.855116
9 0.561441 0.291182
In [14]: df.loc[:, df.columns.str.contains('alp')]
Out[14]:
alp1 alp2
0 0.357564 0.108907
1 0.341087 0.198098
2 0.416215 0.644166
3 0.814056 0.121044
4 0.382681 0.110829
5 0.130343 0.219829
6 0.110049 0.681618
7 0.949599 0.089632
8 0.047945 0.855116
9 0.561441 0.291182
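The two calls return the same columns above only because every match happens to sit at the start of the name. A small sketch with a hypothetical "salp" column shows where they differ:

```python
import pandas as pd

# Hypothetical single-row frame; only the column names matter here.
df = pd.DataFrame({"alp1": [1], "alp2": [2], "salp": [3], "beta": [4]})

starts = df.loc[:, df.columns.str.startswith("alp")]   # prefix match
contains = df.loc[:, df.columns.str.contains("alp")]   # match anywhere

print(list(starts.columns))
print(list(contains.columns))
```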
Selecting columns whose names match regex
If you convert the column Index object to a Series, you can use .str to perform vectorized string operations (like regex searches):
>>> df.columns
Index([u'id', u'0_date', u'0_hr', u'1_date', u'1_hr'], dtype='objec
>>> df.columns.to_series().str
<pandas.core.strings.StringMethods object at 0xa2b56cc>
>>> df.columns.to_series().str.contains("date")
id False
0_date True
0_hr False
1_date True
1_hr False
dtype: bool
>>> df.loc[:, df.columns.to_series().str.contains("date")]
0_date 1_date
1 21-Jan 2-Mar
In this case, I might use endswith:
>>> df.loc[:, df.columns.to_series().str.endswith("date")]
0_date 1_date
1 21-Jan 2-Mar
(Personally, I think Index objects should grow a .str which is basically .to_series().str, to make this a little cleaner.)
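Recent pandas versions do expose .str on Index objects, so the to_series() step can usually be skipped. The sketch below assumes such a version is installed; the row values are hypothetical, and the pattern is one possible regex for "digits, underscore, date".

```python
import pandas as pd

# Hypothetical one-row frame using the column names from the example above.
df = pd.DataFrame(
    [[1, "21-Jan", 5, "2-Mar", 7]],
    columns=["id", "0_date", "0_hr", "1_date", "1_hr"],
)

# Regex: one or more digits, an underscore, then "date" at the end.
mask = df.columns.str.contains(r"^\d+_date$", regex=True)

date_cols = df.loc[:, mask]
print(list(date_cols.columns))
```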
python pandas selecting columns from a dataframe via a list of column names
You can remove one pair of []:
df_new = df[list]
It is also better to use a name other than list, e.g. L, so that the built-in list is not shadowed:
df_new = df[L]
Your code looks like it works; I have only tried to simplify it:
L = []
for x in df.columns:
    if "_" not in x[-3:]:
        L.append(x)
print(L)
List comprehension:
print ([x for x in df.columns if not "_" in x[-3:]])
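To see the filter end to end, here is a sketch on an empty frame with hypothetical column names. It keeps every column whose last three characters contain no underscore and then selects those columns with a single pair of brackets.

```python
import pandas as pd

# Hypothetical column names; the frame has no rows.
df = pd.DataFrame(columns=["a_1", "b_22", "name", "long_name", "x_"])

# x[-3:] is the last three characters (or the whole name if it is shorter).
L = [x for x in df.columns if "_" not in x[-3:]]

df_new = df[L]
print(L)
```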
how to select all columns that starts with a common label
First grab the column names with df.columns, then filter down to just the column names you want with .filter(_.startsWith("colF")). This gives you an Array of Strings. But the String version of select takes select(String, String*). Luckily select for columns is select(Column*), so convert the Strings into Columns with .map(df(_)), and finally turn the Array of Columns into a var arg with : _*.
df.select(df.columns.filter(_.startsWith("colF")).map(df(_)) : _*).show
This filter could be made more complex (same as Pandas). It is however a rather ugly solution (IMO):
df.select(df.columns.filter(x => (x.equals("colA") || x.startsWith("colF"))).map(df(_)) : _*).show
If the list of other columns is fixed you could also merge a fixed array of columns names with filtered array.
df.select((Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_)) : _*).show
Drop columns whose name contains a specific string from pandas DataFrame
import pandas as pd
import numpy as np
array = np.random.random((2, 4))
df = pd.DataFrame(array, columns=('Test1', 'toto', 'test2', 'riri'))
print(df)
Test1 toto test2 riri
0 0.923249 0.572528 0.845464 0.144891
1 0.020438 0.332540 0.144455 0.741412
# Keep only the columns whose lower-cased name does not start with 'test'
cols = [c for c in df.columns if c.lower()[:4] != 'test']
df = df[cols]
print(df)
toto riri
0 0.572528 0.144891
1 0.332540 0.741412
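The same drop can be expressed with an inverted boolean mask instead of a list comprehension. This is a sketch assuming a pandas version where Index has the .str accessor; the random values are placeholders.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.random((2, 4)),
    columns=["Test1", "toto", "test2", "riri"],
)

# Case-insensitive: lower-case the labels before testing the prefix.
is_test = df.columns.str.lower().str.startswith("test")

# ~ inverts the mask, keeping every column that is not a "test" column.
kept = df.loc[:, ~is_test]
print(list(kept.columns))
```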