How to Select All Columns Whose Names Start with X in a Pandas Dataframe


Just use a list comprehension to build your list of columns:

In [28]:

filter_col = [col for col in df if col.startswith('foo')]
filter_col
Out[28]:
['foo.aa', 'foo.bars', 'foo.fighters', 'foo.fox', 'foo.manchu']
In [29]:

df[filter_col]
Out[29]:
  foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu
0    1.0         0             0        2         NaN
1    2.1         0             1        4           0
2    NaN         0           NaN        1           0
3    4.7         0             0        0           0
4    5.6         0             0        0           0
5    6.8         1             0        5           0
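For a self-contained run, here is a minimal sketch; the frame is a hypothetical reconstruction with fewer columns than the question's, just to show the pattern:

```python
import pandas as pd
import numpy as np

# Hypothetical sample frame; only the 'foo' prefix matters
df = pd.DataFrame({
    'foo.aa': [1.0, 2.1, np.nan, 4.7, 5.6, 6.8],
    'foo.bars': [0, 0, 0, 0, 0, 1],
    'bar.baz': [5.0, 5.0, 6.0, 5.0, 5.6, 6.8],
})

# iterating a DataFrame yields its column labels
filter_col = [col for col in df if col.startswith('foo')]
print(filter_col)
print(df[filter_col])
```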

Another method is to create a series from the columns and use the vectorised str method startswith:

In [33]:

df[df.columns[pd.Series(df.columns).str.startswith('foo')]]
Out[33]:
  foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu
0    1.0         0             0        2         NaN
1    2.1         0             1        4           0
2    NaN         0           NaN        1           0
3    4.7         0             0        0           0
4    5.6         0             0        0           0
5    6.8         1             0        5           0

To achieve what you want, you then need to filter out the values that don't meet your == 1 criterion:

In [36]:

df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]]==1]
Out[36]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu  nas.foo
0      NaN       1       NaN           NaN      NaN         NaN      NaN
1      NaN     NaN       NaN             1      NaN         NaN      NaN
2      NaN     NaN       NaN           NaN        1         NaN      NaN
3      NaN     NaN       NaN           NaN      NaN         NaN      NaN
4      NaN     NaN       NaN           NaN      NaN         NaN      NaN
5      NaN     NaN         1           NaN      NaN         NaN      NaN

EDIT

OK, after seeing what you want, the (admittedly convoluted) answer is this:

In [72]:

df.loc[df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]] == 1].dropna(how='all', axis=0).index]
Out[72]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu  nas.foo
0      5.0     1.0         0             0        2         NaN      NaN
1      5.0     2.1         0             1        4           0        0
2      6.0     NaN         0           NaN        1           0        1
5      6.8     6.8         1             0        5           0        0
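The same row filter (keep the rows where any foo column equals 1) can be expressed more readably with a boolean mask and any(axis=1); a sketch against a small hypothetical frame:

```python
import pandas as pd
import numpy as np

# Hypothetical sample data
df = pd.DataFrame({
    'bar.baz': [5.0, 5.0, 6.0, 5.0],
    'foo.aa': [1.0, 2.1, np.nan, 4.7],
    'foo.bars': [0, 0, 0, 1],
})

foo_cols = df.columns[df.columns.str.startswith('foo')]
# True for any row in which at least one foo column is exactly 1
mask = (df[foo_cols] == 1).any(axis=1)
print(df.loc[mask])
```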

Find column whose name contains a specific string

Just iterate over DataFrame.columns; here is an example in which you end up with a list of matching column names:

import pandas as pd

data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)

spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)

Output:

['spike-2', 'hey spke', 'spiked-in', 'no']
['spike-2', 'spiked-in']

Explanation:

  1. df.columns returns an Index of the column names
  2. [col for col in df.columns if 'spike' in col] iterates over df.columns with the variable col and adds col to the resulting list if it contains 'spike'. This syntax is a list comprehension.

If you only want the resulting data set with the columns that match you can do this:

df2 = df.filter(regex='spike')
print(df2)

Output:

   spike-2  spiked-in
0        1          7
1        2          8
2        3          9
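DataFrame.filter also accepts like= for a plain substring match, which avoids regex escaping when the pattern is literal text:

```python
import pandas as pd

data = {'spike-2': [1, 2, 3], 'hey spke': [4, 5, 6],
        'spiked-in': [7, 8, 9], 'no': [10, 11, 12]}
df = pd.DataFrame(data)

# like= does a simple substring test on the column labels
df2 = df.filter(like='spike')
print(list(df2.columns))
```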

Selecting columns with startswith in pandas

Converting to a Series is not necessary; if you want to add the result to another list of columns, convert the output to a list:

cols =  df.columns[df.columns.str.startswith('t')].tolist()

df = df[['score','obs'] + cols].rename(columns = {'treatment':'treat'})

Another idea is to use two masks and chain them with | (bitwise OR):

Notice:

In your solution the column names are filtered from the original names, i.e. before the rename, so it is necessary to rename afterwards.

m1 = df.columns.str.startswith('t')
m2 = df.columns.isin(['score','obs'])

df = df.loc[:, m1 | m2].rename(columns = {'treatment':'treat'})
print(df)
   obs  treat   score  tr  tk
0    1      0  strong   1   6
1    2      1    weak   2   7
2    3      0  normal   3   8
3    1      1    weak   4   9
4    2      0  strong   5  10

If you need to rename first, you have to reassign back, so that you filter by the renamed column names:

df = df.rename(columns = {'treatment':'treat'})
df = df.loc[:, df.columns.str.startswith('t') | df.columns.isin(['score','obs'])]
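An end-to-end sketch of the two-mask approach, using a small hypothetical frame matching the printed output above:

```python
import pandas as pd

# Hypothetical data matching the columns in the example above
df = pd.DataFrame({
    'obs': [1, 2, 3],
    'treatment': [0, 1, 0],
    'score': ['strong', 'weak', 'normal'],
    'tr': [1, 2, 3],
    'tk': [6, 7, 8],
})

m1 = df.columns.str.startswith('t')      # 'treatment', 'tr', 'tk'
m2 = df.columns.isin(['score', 'obs'])   # the fixed extra columns
out = df.loc[:, m1 | m2].rename(columns={'treatment': 'treat'})
print(list(out.columns))
```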

How to select all columns that start with durations or shape?

You could use the columns' string method startswith:

df = data[data.columns[data.columns.str.startswith('durations') | data.columns.str.startswith('shape')]]
df = df.fillna(0)

Or you could use the contains method with a regex alternation:

df = data.loc[:, data.columns.str.contains('durations|shape')]
df = df.fillna(0)
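A runnable sketch of this selection; the column names here are hypothetical, since only the prefixes matter:

```python
import pandas as pd
import numpy as np

# Hypothetical columns; only the 'durations'/'shape' prefixes matter
data = pd.DataFrame({
    'durations_a': [1.0, np.nan],
    'shape_x': [np.nan, 2.0],
    'other': [3.0, 4.0],
})

mask = data.columns.str.startswith('durations') | data.columns.str.startswith('shape')
df = data.loc[:, mask].fillna(0)   # note: fillna returns a new frame
print(df)
```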

Get the specified set of columns from pandas dataframe

You can use such a list comprehension as the list of columns to select:

df_final = df[[col for col in df if col.startswith('column')]]

The "origin" of the list of strings is of no importance: as long as you pass a list of strings to the subscript, this will work.
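In other words, any list of label strings works as a subscript, however it was built; a minimal sketch with made-up column names:

```python
import pandas as pd

# Hypothetical frame
df = pd.DataFrame({'column_a': [1], 'column_b': [2], 'other': [3]})

wanted = [col for col in df if col.startswith('column')]
df_final = df[wanted]              # same as df[['column_a', 'column_b']]
print(list(df_final.columns))
```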

select columns based on columns names containing a specific string in pandas

Alternative methods:

In [13]: df.loc[:, df.columns.str.startswith('alp')]
Out[13]:
       alp1      alp2
0  0.357564  0.108907
1  0.341087  0.198098
2  0.416215  0.644166
3  0.814056  0.121044
4  0.382681  0.110829
5  0.130343  0.219829
6  0.110049  0.681618
7  0.949599  0.089632
8  0.047945  0.855116
9  0.561441  0.291182

In [14]: df.loc[:, df.columns.str.contains('alp')]
Out[14]:
       alp1      alp2
0  0.357564  0.108907
1  0.341087  0.198098
2  0.416215  0.644166
3  0.814056  0.121044
4  0.382681  0.110829
5  0.130343  0.219829
6  0.110049  0.681618
7  0.949599  0.089632
8  0.047945  0.855116
9  0.561441  0.291182

Selecting columns whose names match regex

If you convert the column Index object to a series, you can use .str to perform vectorized string operations (like regex searches):

>>> df.columns
Index([u'id', u'0_date', u'0_hr', u'1_date', u'1_hr'], dtype='object')
>>> df.columns.to_series().str
<pandas.core.strings.StringMethods object at 0xa2b56cc>
>>> df.columns.to_series().str.contains("date")
id False
0_date True
0_hr False
1_date True
1_hr False
dtype: bool
>>> df.loc[:, df.columns.to_series().str.contains("date")]
0_date 1_date
1 21-Jan 2-Mar

In this case, I might use endswith:

>>> df.loc[:, df.columns.to_series().str.endswith("date")]
0_date 1_date
1 21-Jan 2-Mar

(Personally, I think Index objects should grow a .str which is basically .to_series().str, to make this a little cleaner.)
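Index objects in current pandas do expose .str directly, so the to_series() step can be dropped; a sketch reconstructing a frame like the one above:

```python
import pandas as pd

# Hypothetical reconstruction of the example frame
df = pd.DataFrame([[1, '21-Jan', 10, '2-Mar', 11]],
                  columns=['id', '0_date', '0_hr', '1_date', '1_hr'])

# Index has .str in modern pandas, so no to_series() is needed
out = df.loc[:, df.columns.str.endswith('date')]
print(list(out.columns))
```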

python pandas selecting columns from a dataframe via a list of column names

You can remove one []:

df_new = df[list]

It is also better to use a name other than list (which shadows the builtin), e.g. L:

df_new = df[L]

It looks like it works; I have only simplified it:

L = []
for x in df.columns:
    if "_" not in x[-3:]:
        L.append(x)
print(L)

The same as a list comprehension:

print([x for x in df.columns if "_" not in x[-3:]])
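A self-contained sketch (the column names are hypothetical) of keeping only the labels whose last three characters contain no underscore:

```python
import pandas as pd

# Hypothetical column names: 'beta_x' has '_' in its last three characters
df = pd.DataFrame(columns=['alpha', 'beta_x', 'gamma'])

L = [x for x in df.columns if '_' not in x[-3:]]
print(L)
```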

how to select all columns that starts with a common label

First grab the column names with df.columns, then filter down to just the ones you want with .filter(_.startsWith("colF")). This gives you an Array of Strings. But select takes select(String, String*); luckily there is also an overload select(Column*), so convert the Strings into Columns with .map(df(_)), and finally turn the Array of Columns into a vararg with : _*.

df.select(df.columns.filter(_.startsWith("colF")).map(df(_)) : _*).show

This filter could be made more complex (same as Pandas). It is however a rather ugly solution (IMO):

df.select(df.columns.filter(x => (x.equals("colA") || x.startsWith("colF"))).map(df(_)) : _*).show 

If the list of other columns is fixed you could also merge a fixed array of columns names with filtered array.

df.select((Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_)) : _*).show

Drop columns whose name contains a specific string from pandas DataFrame

import pandas as pd
import numpy as np

array = np.random.random((2, 4))
df = pd.DataFrame(array, columns=('Test1', 'toto', 'test2', 'riri'))
print(df)

      Test1      toto     test2      riri
0  0.923249  0.572528  0.845464  0.144891
1  0.020438  0.332540  0.144455  0.741412

cols = [c for c in df.columns if c.lower()[:4] != 'test']
df = df[cols]
print(df)

       toto      riri
0  0.572528  0.144891
1  0.332540  0.741412
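The same selection can also be written as a boolean mask on the column Index; a sketch with a seeded generator (the seed value is arbitrary) so the run is reproducible:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility
df = pd.DataFrame(rng.random((2, 4)),
                  columns=['Test1', 'toto', 'test2', 'riri'])

# keep columns whose lowercased name does not start with 'test'
df = df.loc[:, ~df.columns.str.lower().str.startswith('test')]
print(list(df.columns))
```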

