Find Column Name in Pandas That Matches an Array

Approach #1

Here's one vectorized approach leveraging NumPy broadcasting -

df.columns[(df.values == np.asarray(x)[:,None]).all(0)]

Sample run -

In [367]: df
Out[367]:
   0  1  2  3  4  5  6  7  8  9
0  7  1  2  6  2  1  7  2  0  6
1  5  4  3  3  2  1  1  1  5  5
2  7  7  2  2  5  4  6  6  5  7
3  0  5  4  1  5  7  8  2  2  4
4  7  1  0  4  5  4  3  2  8  6

In [368]: x = df.iloc[:,2].values.tolist()

In [369]: x
Out[369]: [2, 3, 2, 4, 0]

In [370]: df.columns[(df.values == np.asarray(x)[:,None]).all(0)]
Out[370]: Int64Index([2], dtype='int64')
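As a minimal, self-contained sketch of the broadcasting trick (the DataFrame and x here are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [1, 2, 3]})
x = [4, 5, 6]

# np.asarray(x)[:, None] is an (n_rows, 1) column vector, broadcast against
# every cell of df; .all(0) keeps only columns where every row matched
matches = df.columns[(df.values == np.asarray(x)[:, None]).all(0)]
print(matches.tolist())  # ['b']
```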

Approach #2

Alternatively, here's another using the concept of views -

def view1D(a, b): # a, b are 2D arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()

df1D_arr, x1D = view1D(df.values.T,np.asarray(x)[None])
out = np.flatnonzero(df1D_arr==x1D)

Sample run -

In [442]: df
Out[442]:
   0  1  2  3  4  5  6  7  8  9
0  7  1  2  6  2  1  7  2  0  6
1  5  4  3  3  2  1  1  1  5  5
2  7  7  2  2  5  4  6  6  5  7
3  0  5  4  1  5  7  8  2  2  4
4  7  1  0  4  5  4  3  2  8  6

In [443]: x = df.iloc[:,5].values.tolist()

In [444]: df1D_arr, x1D = view1D(df.values.T,np.asarray(x)[None])

In [445]: np.flatnonzero(df1D_arr==x1D)
Out[445]: array([5])
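A runnable sketch of the same views-based idea (the data is invented; x's dtype is forced to match df's so the void views line up):

```python
import numpy as np
import pandas as pd

def view1D(a, b):  # a, b are 2D arrays with the same row length
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    # collapse each row into a single void scalar so rows compare as units
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [1, 2, 3]})
x = [4, 5, 6]

# transpose so each column of df becomes one row, then compare row-views
df1D_arr, x1D = view1D(df.values.T, np.asarray(x, dtype=df.values.dtype)[None])
print(np.flatnonzero(df1D_arr == x1D))  # [1]
```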

Find column names when row element meets a criteria Pandas

You can use a lambda expression to filter the series, and if you want a list instead of an Index object as the result, you can call .tolist() on it:

(df.loc['sumCols'] == 0)[lambda x: x].index.tolist()
# ['d']

Or:

df.loc['sumCols'][lambda x: x == 0].index.tolist()
# ['d']

Without explicitly creating the sumCols row, if you want to check which column has a sum of zero, you can do:

df.sum()[lambda x: x == 0].index.tolist()
# ['d']

Check rows:

df.sum(axis = 1)[lambda x: x == 0].index.tolist()
# []

Note: The lambda-expression approach is as fast as the vectorized method for subsetting, is functional in style, and can easily be written as a one-liner if you prefer.
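A small runnable sketch of the lambda-filtering idiom (the data is invented):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [0, 0], 'c': [3, 4]})

# columns whose sum is zero
zero_cols = df.sum()[lambda x: x == 0].index.tolist()
print(zero_cols)  # ['b']

# rows whose sum is zero (none here)
zero_rows = df.sum(axis=1)[lambda x: x == 0].index.tolist()
print(zero_rows)  # []
```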

Retrieve name of column from its Index in Pandas

I think you need to index the column names by position (Python counts from 0, so for the fourth column you need 3):

colname = df.columns[pos]

Sample:

df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})

print (df)
   A  B  C  D  E  F
0  1  4  7  1  5  7
1  2  5  8  3  3  4
2  3  6  9  5  6  3

pos = 3
colname = df.columns[pos]
print (colname)
D

pos = [3,5]
colname = df.columns[pos]
print (colname)
Index(['D', 'F'], dtype='object')
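The same positional lookup, as a tiny self-contained sketch (column names invented):

```python
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'D': [4]})

# a single position returns the label itself
colname = df.columns[2]
print(colname)  # C

# a list of positions returns an Index of labels
colnames = df.columns[[1, 3]]
print(colnames.tolist())  # ['B', 'D']
```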

pandas to match columns names based on a list object based check

Use -

print(df[[i for i in matchObj if i in df.columns]])

Output

   equity01  equity02
0         1         4
1         2         5
2         3         6

Explanation

The list comprehension [i for i in matchObj if i in df.columns] keeps only the names from matchObj that are actually present in df.columns and ignores the rest. Hope that helps.
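A runnable sketch of the same filter (matchObj and the column names here are made up; 'bond01' is deliberately absent from df):

```python
import pandas as pd

df = pd.DataFrame({'equity01': [1, 2, 3], 'equity02': [4, 5, 6]})
matchObj = ['equity01', 'equity02', 'bond01']  # 'bond01' is not a column of df

# keep only the names that exist as columns
keep = [i for i in matchObj if i in df.columns]
print(keep)  # ['equity01', 'equity02']

sub = df[keep]  # safe subsetting, no KeyError for missing names
```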

Python match a column name based on a column value in another dataframe

It looks like you can do a map:

df_full['quantile_05'] = df_full['Industry'].map(df_industry['profit_sales'].unstack()[0.5])

Output:

             Industry  quantile_05
INDEX
0          Service     0.003375
1          Service     0.003375
2            Trade     0.001715
3          Service     0.003375
4    Manufacturing     0.002032

If you want all three quantiles, you can do a merge as suggested by Kyle:

df_full.merge(df_industry['profit_sales'].unstack(),
              left_on=['Industry'],
              right_index=True,
              how='left')

Output:

             Industry      0.25       0.5      0.75
INDEX
0          Service -0.012660  0.003375  0.064102
1          Service -0.012660  0.003375  0.064102
2            Trade       NaN  0.001715  0.018705
3          Service -0.012660  0.003375  0.064102
4    Manufacturing -0.012373  0.002032  0.010331
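A minimal reproduction of the map-based lookup, assuming df_industry holds per-industry quantiles of profit_sales as a MultiIndex Series (as the unstack() call implies; the numbers below are invented):

```python
import pandas as pd

# hypothetical per-industry quantile table: MultiIndex (Industry, quantile)
idx = pd.MultiIndex.from_product([['Service', 'Trade'], [0.25, 0.5, 0.75]])
df_industry = pd.DataFrame({'profit_sales': [-0.01, 0.003, 0.06,
                                             -0.02, 0.002, 0.02]}, index=idx)

df_full = pd.DataFrame({'Industry': ['Service', 'Trade', 'Service']})

# unstack() pivots the quantile level into columns; [0.5] is then a
# Series indexed by Industry, which map() can use as a lookup table
df_full['quantile_05'] = df_full['Industry'].map(
    df_industry['profit_sales'].unstack()[0.5])
print(df_full['quantile_05'].tolist())  # [0.003, 0.002, 0.003]
```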

Get column index from column name in python pandas

Sure, you can use .get_loc():

In [45]: df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})

In [46]: df.columns
Out[46]: Index(['apple', 'orange', 'pear'], dtype='object')

In [47]: df.columns.get_loc("pear")
Out[47]: 2

although to be honest I don't often need this myself. Usually access by name does what I want it to (df["pear"], df[["apple", "orange"]], or maybe df.columns.isin(["orange", "pear"])), although I can definitely see cases where you'd want the index number.
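A tiny self-contained sketch of get_loc (column names invented):

```python
import pandas as pd

df = pd.DataFrame({'pear': [1, 2], 'apple': [3, 4], 'orange': [5, 6]})

# positional index of a column label
pos = df.columns.get_loc('apple')
print(pos)  # 1
```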

Get column name where value is something in pandas dataframe

Here is one, perhaps inelegant, way to do it:

df_result = pd.DataFrame(ts, columns=['value'])

Set up a function which grabs the column name which contains the value (from ts):

def get_col_name(row):
    b = (df.loc[row.name] == row['value'])
    return b.index[b.argmax()]

For each row, this tests which elements equal the value and extracts the column name of the first True.

And apply it (row-wise):

In [3]: df_result.apply(get_col_name, axis=1)
Out[3]:
1979-01-01 00:00:00 col5
1979-01-01 06:00:00 col3
1979-01-01 12:00:00 col1
1979-01-01 18:00:00 col1

i.e. use df_result['Column'] = df_result.apply(get_col_name, axis=1).

Note: there is quite a lot going on in get_col_name so perhaps it warrants some further explanation:

In [4]: row = df_result.iloc[0] # an example row to pass to get_col_name

In [5]: row
Out[5]:
value 1181.220328
Name: 1979-01-01 00:00:00

In [6]: row.name # use to get rows of df
Out[6]: <Timestamp: 1979-01-01 00:00:00>

In [7]: df.loc[row.name]
Out[7]:
col5 1181.220328
col4 912.154923
col3 648.848635
col2 390.986156
col1 138.185861
Name: 1979-01-01 00:00:00

In [8]: b = (df.loc[row.name] == row['value'])
# checks whether each element equals row['value'] = 1181.220328

In [9]: b
Out[9]:
col5 True
col4 False
col3 False
col2 False
col1 False
Name: 1979-01-01 00:00:00

In [10]: b.argmax() # index of a True value
Out[10]: 0

In [11]: b.index[b.argmax()] # the index value (column name)
Out[11]: 'col5'

There may well be a more efficient way to do this...
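On current pandas (where .ix and .irow no longer exist), one hedged alternative sketch does the same lookup with a row-wise boolean mask and idxmax (the data below is invented):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 9], 'col2': [5, 2]},
                  index=pd.to_datetime(['1979-01-01', '1979-01-02']))
ts = pd.Series([5, 9], index=df.index)  # the value to locate in each row

# compare each row against its target value; idxmax(axis=1) then returns
# the column label of the first True in each row's boolean mask
result = df.eq(ts, axis=0).idxmax(axis=1)
print(result.tolist())  # ['col2', 'col1']
```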

Find the column name which has the maximum value for each row

You can use idxmax with axis=1 to find the column with the greatest value on each row:

>>> df.idxmax(axis=1)
0    Communications
1          Business
2    Communications
3    Communications
4          Business
dtype: object

To create the new column 'Max', use df['Max'] = df.idxmax(axis=1).

To find the row index at which the maximum value occurs in each column, use df.idxmax() (or equivalently df.idxmax(axis=0)).
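A small runnable sketch of both directions (the data is invented):

```python
import pandas as pd

df = pd.DataFrame({'Communications': [83, 62], 'Business': [73, 97]})

# column label of each row's maximum
row_max = df.idxmax(axis=1)
print(row_max.tolist())  # ['Communications', 'Business']

# row label of each column's maximum
col_max = df.idxmax()
print(col_max.tolist())  # [0, 1]
```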

How to select all columns whose names start with X in a pandas DataFrame

Just perform a list comprehension to create your columns:

In [28]:

filter_col = [col for col in df if col.startswith('foo')]
filter_col
Out[28]:
['foo.aa', 'foo.bars', 'foo.fighters', 'foo.fox', 'foo.manchu']
In [29]:

df[filter_col]
Out[29]:
   foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu
0     1.0         0             0        2          NA
1     2.1         0             1        4           0
2     NaN         0           NaN        1           0
3     4.7         0             0        0           0
4     5.6         0             0        0           0
5     6.8         1             0        5           0

Another method is to create a series from the columns and use the vectorised str method startswith:

In [33]:

df[df.columns[pd.Series(df.columns).str.startswith('foo')]]
Out[33]:
   foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu
0     1.0         0             0        2          NA
1     2.1         0             1        4           0
2     NaN         0           NaN        1           0
3     4.7         0             0        0           0
4     5.6         0             0        0           0
5     6.8         1             0        5           0

In order to achieve what you want, you need to add the following to filter out the values that don't meet your ==1 criterion:

In [36]:

df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]]==1]
Out[36]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu  nas.foo
0      NaN       1       NaN           NaN      NaN         NaN      NaN
1      NaN     NaN       NaN             1      NaN         NaN      NaN
2      NaN     NaN       NaN           NaN        1         NaN      NaN
3      NaN     NaN       NaN           NaN      NaN         NaN      NaN
4      NaN     NaN       NaN           NaN      NaN         NaN      NaN
5      NaN     NaN         1           NaN      NaN         NaN      NaN

EDIT

OK, after seeing what you want, the convoluted answer is this:

In [72]:

df.loc[df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]] == 1].dropna(how='all', axis=0).index]
Out[72]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox  foo.manchu  nas.foo
0      5.0     1.0         0             0        2          NA       NA
1      5.0     2.1         0             1        4           0        0
2      6.0     NaN         0           NaN        1           0        1
5      6.8     6.8         1             0        5           0        0
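As a self-contained sketch of the prefix selection (column names invented), including df.filter as a built-in equivalent:

```python
import pandas as pd

df = pd.DataFrame({'foo.aa': [1, 2], 'foo.fox': [2, 4], 'bar.baz': [5, 5]})

# list comprehension over the column labels
filter_col = [col for col in df if col.startswith('foo')]
print(filter_col)  # ['foo.aa', 'foo.fox']

# built-in alternative: df.filter with a regex anchored at the start
same = df.filter(regex='^foo').columns.tolist()
print(same)  # ['foo.aa', 'foo.fox']
```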

