How to Select All Columns Whose Names Start with X in a Pandas Dataframe


Just use a list comprehension to build your list of columns:

In [28]:

filter_col = [col for col in df if col.startswith('foo')]
filter_col
Out[28]:
['foo.aa', 'foo.bars', 'foo.fighters', 'foo.fox', 'foo.manchu']
In [29]:

df[filter_col]
Out[29]:
  foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu
0    1.0         0             0        2         NaN
1    2.1         0             1        4           0
2    NaN         0           NaN        1           0
3    4.7         0             0        0           0
4    5.6         0             0        0           0
5    6.8         1             0        5           0
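For a self-contained run, here is a minimal sketch; the frame is a hypothetical reconstruction with fewer columns than the question's, just to show the pattern:

```python
import pandas as pd
import numpy as np

# Hypothetical sample frame; only the 'foo' prefix matters
df = pd.DataFrame({
    'foo.aa': [1.0, 2.1, np.nan, 4.7, 5.6, 6.8],
    'foo.bars': [0, 0, 0, 0, 0, 1],
    'bar.baz': [5.0, 5.0, 6.0, 5.0, 5.6, 6.8],
})

# iterating a DataFrame yields its column labels
filter_col = [col for col in df if col.startswith('foo')]
print(filter_col)
print(df[filter_col])
```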

Another method is to create a series from the columns and use the vectorised str method startswith:

In [33]:

df[df.columns[pd.Series(df.columns).str.startswith('foo')]]
Out[33]:
  foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu
0    1.0         0             0        2         NaN
1    2.1         0             1        4           0
2    NaN         0           NaN        1           0
3    4.7         0             0        0           0
4    5.6         0             0        0           0
5    6.8         1             0        5           0

To achieve what you want, you then need to filter out the values that don't meet your == 1 criterion:

In [36]:

df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]]==1]
Out[36]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu  nas.foo
0      NaN       1       NaN           NaN      NaN         NaN      NaN
1      NaN     NaN       NaN             1      NaN         NaN      NaN
2      NaN     NaN       NaN           NaN        1         NaN      NaN
3      NaN     NaN       NaN           NaN      NaN         NaN      NaN
4      NaN     NaN       NaN           NaN      NaN         NaN      NaN
5      NaN     NaN         1           NaN      NaN         NaN      NaN

EDIT

OK, after seeing what you want, the (admittedly convoluted) answer is this:

In [72]:

df.loc[df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]] == 1].dropna(how='all', axis=0).index]
Out[72]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu  nas.foo
0      5.0     1.0         0             0        2         NaN      NaN
1      5.0     2.1         0             1        4           0        0
2      6.0     NaN         0           NaN        1           0        1
5      6.8     6.8         1             0        5           0        0
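The same row filter (keep the rows where any foo column equals 1) can be expressed more readably with a boolean mask and any(axis=1); a sketch against a small hypothetical frame:

```python
import pandas as pd
import numpy as np

# Hypothetical sample data
df = pd.DataFrame({
    'bar.baz': [5.0, 5.0, 6.0, 5.0],
    'foo.aa': [1.0, 2.1, np.nan, 4.7],
    'foo.bars': [0, 0, 0, 1],
})

foo_cols = df.columns[df.columns.str.startswith('foo')]
# True for any row in which at least one foo column is exactly 1
mask = (df[foo_cols] == 1).any(axis=1)
print(df.loc[mask])
```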

Find column whose name contains a specific string

Just iterate over DataFrame.columns; here is an example in which you end up with a list of matching column names:

import pandas as pd

data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)

spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)

Output:

['spike-2', 'hey spke', 'spiked-in', 'no']
['spike-2', 'spiked-in']

Explanation:

  1. df.columns returns an Index of the column names
  2. [col for col in df.columns if 'spike' in col] iterates over df.columns with the variable col and adds col to the resulting list if it contains 'spike'. This syntax is a list comprehension.

If you only want the resulting data set with the columns that match you can do this:

df2 = df.filter(regex='spike')
print(df2)

Output:

   spike-2  spiked-in
0        1          7
1        2          8
2        3          9
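DataFrame.filter also accepts like= for a plain substring match, which avoids regex escaping when the pattern is literal text:

```python
import pandas as pd

data = {'spike-2': [1, 2, 3], 'hey spke': [4, 5, 6],
        'spiked-in': [7, 8, 9], 'no': [10, 11, 12]}
df = pd.DataFrame(data)

# like= does a simple substring test on the column labels
df2 = df.filter(like='spike')
print(list(df2.columns))
```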

Selecting columns with startswith in pandas

Converting to a Series is not necessary; if you want to add the result to another list of columns, convert the output to a list:

cols =  df.columns[df.columns.str.startswith('t')].tolist()

df = df[['score','obs'] + cols].rename(columns = {'treatment':'treat'})

Another idea is to use two masks and chain them with | (bitwise OR):

Notice:

In your solution the column names are filtered from the original names, i.e. before the rename, so it is necessary to rename afterwards.

m1 = df.columns.str.startswith('t')
m2 = df.columns.isin(['score','obs'])

df = df.loc[:, m1 | m2].rename(columns = {'treatment':'treat'})
print(df)
   obs  treat   score  tr  tk
0    1      0  strong   1   6
1    2      1    weak   2   7
2    3      0  normal   3   8
3    1      1    weak   4   9
4    2      0  strong   5  10

If you need to rename first, you have to reassign back, so that you filter by the renamed column names:

df = df.rename(columns = {'treatment':'treat'})
df = df.loc[:, df.columns.str.startswith('t') | df.columns.isin(['score','obs'])]
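An end-to-end sketch of the two-mask approach, using a small hypothetical frame matching the printed output above:

```python
import pandas as pd

# Hypothetical data matching the columns in the example above
df = pd.DataFrame({
    'obs': [1, 2, 3],
    'treatment': [0, 1, 0],
    'score': ['strong', 'weak', 'normal'],
    'tr': [1, 2, 3],
    'tk': [6, 7, 8],
})

m1 = df.columns.str.startswith('t')      # 'treatment', 'tr', 'tk'
m2 = df.columns.isin(['score', 'obs'])   # the fixed extra columns
out = df.loc[:, m1 | m2].rename(columns={'treatment': 'treat'})
print(list(out.columns))
```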

How to select all columns that start with durations or shape?

You could use the columns' string method startswith:

df = data[data.columns[data.columns.str.startswith('durations') | data.columns.str.startswith('shape')]]
df = df.fillna(0)

Or you could use the contains method with a regex alternation:

df = data.loc[:, data.columns.str.contains('durations|shape')]
df = df.fillna(0)
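A runnable sketch of this selection; the column names here are hypothetical, since only the prefixes matter:

```python
import pandas as pd
import numpy as np

# Hypothetical columns; only the 'durations'/'shape' prefixes matter
data = pd.DataFrame({
    'durations_a': [1.0, np.nan],
    'shape_x': [np.nan, 2.0],
    'other': [3.0, 4.0],
})

mask = data.columns.str.startswith('durations') | data.columns.str.startswith('shape')
df = data.loc[:, mask].fillna(0)   # note: fillna returns a new frame
print(df)
```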

Get the specified set of columns from pandas dataframe

You can use such a list comprehension as the list of columns to select:

df_final = df[[col for col in df if col.startswith('column')]]

The "origin" of the list of strings is of no importance: as long as you pass a list of strings to the subscript, this will work.
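In other words, any list of label strings works as a subscript, however it was built; a minimal sketch with made-up column names:

```python
import pandas as pd

# Hypothetical frame
df = pd.DataFrame({'column_a': [1], 'column_b': [2], 'other': [3]})

wanted = [col for col in df if col.startswith('column')]
df_final = df[wanted]              # same as df[['column_a', 'column_b']]
print(list(df_final.columns))
```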

select columns based on columns names containing a specific string in pandas

Alternative methods:

In [13]: df.loc[:, df.columns.str.startswith('alp')]
Out[13]:
       alp1      alp2
0  0.357564  0.108907
1  0.341087  0.198098
2  0.416215  0.644166
3  0.814056  0.121044
4  0.382681  0.110829
5  0.130343  0.219829
6  0.110049  0.681618
7  0.949599  0.089632
8  0.047945  0.855116
9  0.561441  0.291182

In [14]: df.loc[:, df.columns.str.contains('alp')]
Out[14]:
       alp1      alp2
0  0.357564  0.108907
1  0.341087  0.198098
2  0.416215  0.644166
3  0.814056  0.121044
4  0.382681  0.110829
5  0.130343  0.219829
6  0.110049  0.681618
7  0.949599  0.089632
8  0.047945  0.855116
9  0.561441  0.291182

Selecting columns whose names match regex

If you convert the column Index object to a series, you can use .str to perform vectorized string operations (like regex searches):

>>> df.columns
Index([u'id', u'0_date', u'0_hr', u'1_date', u'1_hr'], dtype='object')
>>> df.columns.to_series().str
<pandas.core.strings.StringMethods object at 0xa2b56cc>
>>> df.columns.to_series().str.contains("date")
id False
0_date True
0_hr False
1_date True
1_hr False
dtype: bool
>>> df.loc[:, df.columns.to_series().str.contains("date")]
0_date 1_date
1 21-Jan 2-Mar

In this case, I might use endswith:

>>> df.loc[:, df.columns.to_series().str.endswith("date")]
0_date 1_date
1 21-Jan 2-Mar

(Personally, I think Index objects should grow a .str which is basically .to_series().str, to make this a little cleaner.)
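Index objects in current pandas do expose .str directly, so the to_series() step can be dropped; a sketch reconstructing a frame like the one above:

```python
import pandas as pd

# Hypothetical reconstruction of the example frame
df = pd.DataFrame([[1, '21-Jan', 10, '2-Mar', 11]],
                  columns=['id', '0_date', '0_hr', '1_date', '1_hr'])

# Index has .str in modern pandas, so no to_series() is needed
out = df.loc[:, df.columns.str.endswith('date')]
print(list(out.columns))
```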

python pandas selecting columns from a dataframe via a list of column names

You can remove one []:

df_new = df[list]

It is also better to use a name other than list (which shadows the builtin), e.g. L:

df_new = df[L]

It looks like it works; I have only simplified it:

L = []
for x in df.columns:
    if "_" not in x[-3:]:
        L.append(x)
print(L)

The same as a list comprehension:

print([x for x in df.columns if "_" not in x[-3:]])
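A self-contained sketch (the column names are hypothetical) of keeping only the labels whose last three characters contain no underscore:

```python
import pandas as pd

# Hypothetical column names: 'beta_x' has '_' in its last three characters
df = pd.DataFrame(columns=['alpha', 'beta_x', 'gamma'])

L = [x for x in df.columns if '_' not in x[-3:]]
print(L)
```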

how to select all columns that starts with a common label

First grab the column names with df.columns, then filter down to just the ones you want with .filter(_.startsWith("colF")). This gives you an Array of Strings. But select takes select(String, String*); luckily there is also an overload select(Column*), so convert the Strings into Columns with .map(df(_)), and finally turn the Array of Columns into a vararg with : _*.

df.select(df.columns.filter(_.startsWith("colF")).map(df(_)) : _*).show

This filter could be made more complex (same as Pandas). It is however a rather ugly solution (IMO):

df.select(df.columns.filter(x => (x.equals("colA") || x.startsWith("colF"))).map(df(_)) : _*).show 

If the list of other columns is fixed you could also merge a fixed array of columns names with filtered array.

df.select((Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_)) : _*).show

Drop columns whose name contains a specific string from pandas DataFrame

import pandas as pd
import numpy as np

array = np.random.random((2, 4))
df = pd.DataFrame(array, columns=('Test1', 'toto', 'test2', 'riri'))
print(df)

      Test1      toto     test2      riri
0  0.923249  0.572528  0.845464  0.144891
1  0.020438  0.332540  0.144455  0.741412

cols = [c for c in df.columns if c.lower()[:4] != 'test']
df = df[cols]
print(df)

       toto      riri
0  0.572528  0.144891
1  0.332540  0.741412
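The same selection can also be written as a boolean mask on the column Index; a sketch with a seeded generator (the seed value is arbitrary) so the run is reproducible:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility
df = pd.DataFrame(rng.random((2, 4)),
                  columns=['Test1', 'toto', 'test2', 'riri'])

# keep columns whose lowercased name does not start with 'test'
df = df.loc[:, ~df.columns.str.lower().str.startswith('test')]
print(list(df.columns))
```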

