Find column name in pandas that matches an array
Approach #1
Here's one vectorized approach leveraging NumPy broadcasting -
df.columns[(df.values == np.asarray(x)[:,None]).all(0)]
Sample run -
In [367]: df
Out[367]:
0 1 2 3 4 5 6 7 8 9
0 7 1 2 6 2 1 7 2 0 6
1 5 4 3 3 2 1 1 1 5 5
2 7 7 2 2 5 4 6 6 5 7
3 0 5 4 1 5 7 8 2 2 4
4 7 1 0 4 5 4 3 2 8 6
In [368]: x = df.iloc[:,2].values.tolist()
In [369]: x
Out[369]: [2, 3, 2, 4, 0]
In [370]: df.columns[(df.values == np.asarray(x)[:,None]).all(0)]
Out[370]: Int64Index([2], dtype='int64')
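As a self-contained sketch of the broadcasting idea (with a small hypothetical frame and made-up column labels), the comparison builds an (n_rows, n_cols) boolean array in one shot and then reduces it column-wise:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical labels; x holds the values of some column
df = pd.DataFrame({'a': [7, 5, 7], 'b': [1, 4, 7], 'c': [2, 3, 2]})
x = [2, 3, 2]

# Broadcast x (shape (n_rows, 1)) against df.values (shape (n_rows, n_cols));
# a column matches when every element of its comparison column is True
mask = (df.values == np.asarray(x)[:, None]).all(0)
print(df.columns[mask].tolist())  # ['c']
```

Note that this compares all columns against `x` simultaneously, so it avoids any Python-level loop over columns.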
Approach #2
Alternatively, here's another using the concept of views -
def view1D(a, b): # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()

df1D_arr, x1D = view1D(df.values.T, np.asarray(x)[None])
out = np.flatnonzero(df1D_arr == x1D)
Sample run -
In [442]: df
Out[442]:
0 1 2 3 4 5 6 7 8 9
0 7 1 2 6 2 1 7 2 0 6
1 5 4 3 3 2 1 1 1 5 5
2 7 7 2 2 5 4 6 6 5 7
3 0 5 4 1 5 7 8 2 2 4
4 7 1 0 4 5 4 3 2 8 6
In [443]: x = df.iloc[:,5].values.tolist()
In [444]: df1D_arr, x1D = view1D(df.values.T,np.asarray(x)[None])
In [445]: np.flatnonzero(df1D_arr==x1D)
Out[445]: array([5])
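To make the view trick concrete, here is a runnable sketch on a small hypothetical frame. Each row of the transposed array (i.e. each column of df) is packed into a single np.void scalar, so whole columns compare with one equality test each:

```python
import numpy as np
import pandas as pd

def view1D(a, b):
    # Pack each row of a (and of b) into one np.void scalar covering
    # the row's raw bytes, so rows compare element-free with ==
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()

df = pd.DataFrame({'a': [7, 5], 'b': [1, 4], 'c': [2, 3]})
x = [1, 4]  # the values of column 'b'

# Transpose so that each *column* of df becomes one packed row
df1D_arr, x1D = view1D(df.values.T, np.asarray(x)[None])
print(np.flatnonzero(df1D_arr == x1D))  # positional index of the match
```

The `ascontiguousarray` calls matter: `df.values.T` is not C-contiguous, and the void view requires contiguous memory.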
Find column names when row element meets a criteria Pandas
You can use a lambda expression to filter the series, and if you want a list instead of an index as the result, you can call .tolist() on the index object:
(df.loc['sumCols'] == 0)[lambda x: x].index.tolist()
# ['d']
Or:
df.loc['sumCols'][lambda x: x == 0].index.tolist()
# ['d']
Without explicitly creating the sumCols row, if you want to check which column has a sum of zero, you can do:
df.sum()[lambda x: x == 0].index.tolist()
# ['d']
Check rows:
df.sum(axis = 1)[lambda x: x == 0].index.tolist()
# []
Note: the lambda expression approach is about as fast as the vectorized method for subsetting, is functional in style, and can easily be written as a one-liner if you prefer.
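Putting the column-wise and row-wise checks together on a small hypothetical frame (column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical frame where column 'd' sums to zero
df = pd.DataFrame({'a': [1, 2], 'b': [3, 0], 'c': [5, 6], 'd': [0, 0]})

# Columns whose sum is zero: index the sums Series with a boolean lambda
zero_cols = df.sum()[lambda x: x == 0].index.tolist()
print(zero_cols)  # ['d']

# Rows whose sum is zero (none in this frame)
zero_rows = df.sum(axis=1)[lambda x: x == 0].index.tolist()
print(zero_rows)  # []
```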
Retrieve name of column from its Index in Pandas
I think you need to index column names by position (Python counts from 0, so the fourth column needs 3):
colname = df.columns[pos]
Sample:
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
pos = 3
colname = df.columns[pos]
print (colname)
D
pos = [3,5]
colname = df.columns[pos]
print (colname)
Index(['D', 'F'], dtype='object')
pandas to match columns names based on a list object based check
Use -
print(df[[i for i in matchObj if i in df.columns]])
Output
equity01 equity02
0 1 4
1 2 5
2 3 6
Explanation
[i for i in matchObj if i in df.columns] only fetches the columns that are present in df and ignores the rest.
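A runnable sketch of the same filtering, with hypothetical data matching the output above (the 'bond01' entry stands in for a name absent from df):

```python
import pandas as pd

# Hypothetical frame and match list; 'bond01' is not a column of df
df = pd.DataFrame({'equity01': [1, 2, 3], 'equity02': [4, 5, 6]})
matchObj = ['equity01', 'bond01', 'equity02']

# Keep only the names that actually exist as columns
present = [i for i in matchObj if i in df.columns]
print(df[present])
```

`df.columns.intersection(matchObj)` achieves the same filtering at the Index level.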
Python match a column name based on a column value in another dataframe
It looks like you can do a map:
df_full['quantile_05'] = df_full['Industry'].map(df_industry['profit_sales'].unstack()[0.5])
Output:
Industry quantile_05
INDEX
0 Service 0.003375
1 Service 0.003375
2 Trade 0.001715
3 Service 0.003375
4 Manufacturing 0.002032
If you want all three quantiles, you can do a merge
as suggested by Kyle:
df_full.merge(df_industry['profit_sales'].unstack(),
              left_on=['Industry'],
              right_index=True,
              how='left')
Output:
Industry 0.25 0.5 0.75
INDEX
0 Service -0.012660 0.003375 0.064102
1 Service -0.012660 0.003375 0.064102
2 Trade NaN 0.001715 0.018705
3 Service -0.012660 0.003375 0.064102
4 Manufacturing -0.012373 0.002032 0.010331
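Since the question's df_industry isn't shown, here is a hedged, self-contained reconstruction: the per-industry quantile table is assumed to come from a groupby-quantile, and the map/merge steps then work exactly as above. All data and names here are hypothetical:

```python
import pandas as pd

# Hypothetical stand-ins for the question's frames
df_full = pd.DataFrame({'Industry': ['Service', 'Trade', 'Service']})
profits = pd.DataFrame({'Industry': ['Service', 'Service', 'Trade', 'Trade'],
                        'profit_sales': [1.0, 5.0, 1.0, 2.0]})

# Rows: Industry, columns: quantile levels (0.25, 0.5, 0.75)
q = profits.groupby('Industry')['profit_sales'].quantile([0.25, 0.5, 0.75]).unstack()

# Map a single quantile column back onto each row of df_full
df_full['quantile_05'] = df_full['Industry'].map(q[0.5])

# Or attach all three quantiles at once via a left merge on the index
out = df_full.merge(q, left_on='Industry', right_index=True, how='left')
print(out)
```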
Get column index from column name in python pandas
Sure, you can use .get_loc():
In [45]: df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
In [46]: df.columns
Out[46]: Index([apple, orange, pear], dtype=object)
In [47]: df.columns.get_loc("pear")
Out[47]: 2
although to be honest I don't often need this myself. Usually access by name does what I want it to (df["pear"], df[["apple", "orange"]], or maybe df.columns.isin(["orange", "pear"])), although I can definitely see cases where you'd want the index number.
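Note the old transcript shows the columns alphabetically sorted (early pandas sorted dict keys on construction); modern pandas preserves insertion order, so the same call gives 0 here. get_indexer handles several names at once:

```python
import pandas as pd

# Modern pandas keeps dict insertion order, so 'pear' is column 0 here
df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})

pos = df.columns.get_loc("pear")
print(pos)  # 0

# Positions of several columns at once
positions = df.columns.get_indexer(["apple", "orange"])
print(positions)  # [1 2]
```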
Get column name where value is something in pandas dataframe
Here is one, perhaps inelegant, way to do it:
df_result = pd.DataFrame(ts, columns=['value'])
Set up a function which grabs the column name that contains the value (from ts):
def get_col_name(row):
    b = (df.loc[row.name] == row['value'])
    return b.index[b.argmax()]
For each row, this tests which elements equal the value and extracts the column name of a True.
And apply it (row-wise):
In [3]: df_result.apply(get_col_name, axis=1)
Out[3]:
1979-01-01 00:00:00 col5
1979-01-01 06:00:00 col3
1979-01-01 12:00:00 col1
1979-01-01 18:00:00 col1
i.e. use df_result['Column'] = df_result.apply(get_col_name, axis=1).
Note: there is quite a lot going on in get_col_name, so perhaps it warrants some further explanation:
In [4]: row = df_result.iloc[0] # an example row to pass to get_col_name
In [5]: row
Out[5]:
value 1181.220328
Name: 1979-01-01 00:00:00
In [6]: row.name # use to get rows of df
Out[6]: <Timestamp: 1979-01-01 00:00:00>
In [7]: df.loc[row.name]
Out[7]:
col5 1181.220328
col4 912.154923
col3 648.848635
col2 390.986156
col1 138.185861
Name: 1979-01-01 00:00:00
In [8]: b = (df.loc[row.name] == row['value'])
# checks whether each element equals row['value'] = 1181.220328
In [9]: b
Out[9]:
col5 True
col4 False
col3 False
col2 False
col1 False
Name: 1979-01-01 00:00:00
In [10]: b.argmax() # index of a True value
Out[10]: 0
In [11]: b.index[b.argmax()] # the index value (column name)
Out[11]: 'col5'
There may well be a more efficient way to do this...
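For reference, a compact modern version of the same idea on hypothetical data: on a boolean Series, idxmax returns the label of the first True directly, replacing the b.index[b.argmax()] two-step (and df.loc replaces the long-removed df.ix):

```python
import pandas as pd

# Hypothetical frame and target values keyed by the same index
df = pd.DataFrame({'col1': [10, 30], 'col2': [20, 10]},
                  index=pd.to_datetime(['1979-01-01', '1979-01-02']))
df_result = pd.DataFrame({'value': [20, 30]}, index=df.index)

def get_col_name(row):
    # Boolean Series over df's columns; idxmax gives the first True label
    b = (df.loc[row.name] == row['value'])
    return b.idxmax()

res = df_result.apply(get_col_name, axis=1)
print(res.tolist())  # ['col2', 'col1']
```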
Find the column name which has the maximum value for each row
You can use idxmax with axis=1 to find the column with the greatest value on each row:
>>> df.idxmax(axis=1)
0 Communications
1 Business
2 Communications
3 Communications
4 Business
dtype: object
To create the new column 'Max', use df['Max'] = df.idxmax(axis=1)
.
To find the row index at which the maximum value occurs in each column, use df.idxmax()
(or equivalently df.idxmax(axis=0)
).
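A minimal runnable sketch, assuming a small frame with made-up scores:

```python
import pandas as pd

# Hypothetical scores per subject
df = pd.DataFrame({'Communications': [83, 62], 'Business': [73, 97]})

# Column label of each row's maximum, stored in a new 'Max' column
df['Max'] = df.idxmax(axis=1)
print(df['Max'].tolist())  # ['Communications', 'Business']
```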
How to select all columns whose names start with X in a pandas DataFrame
Just perform a list comprehension to create your columns:
In [28]:
filter_col = [col for col in df if col.startswith('foo')]
filter_col
Out[28]:
['foo.aa', 'foo.bars', 'foo.fighters', 'foo.fox', 'foo.manchu']
In [29]:
df[filter_col]
Out[29]:
foo.aa foo.bars foo.fighters foo.fox foo.manchu
0 1.0 0 0 2 NA
1 2.1 0 1 4 0
2 NaN 0 NaN 1 0
3 4.7 0 0 0 0
4 5.6 0 0 0 0
5 6.8 1 0 5 0
Another method is to create a series from the columns and use the vectorised str method startswith:
In [33]:
df[df.columns[pd.Series(df.columns).str.startswith('foo')]]
Out[33]:
foo.aa foo.bars foo.fighters foo.fox foo.manchu
0 1.0 0 0 2 NA
1 2.1 0 1 4 0
2 NaN 0 NaN 1 0
3 4.7 0 0 0 0
4 5.6 0 0 0 0
5 6.8 1 0 5 0
In order to achieve what you want, you need to add the following to filter the values that don't meet your ==1 criteria:
In [36]:
df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]]==1]
Out[36]:
bar.baz foo.aa foo.bars foo.fighters foo.fox foo.manchu nas.foo
0 NaN 1 NaN NaN NaN NaN NaN
1 NaN NaN NaN 1 NaN NaN NaN
2 NaN NaN NaN NaN 1 NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN 1 NaN NaN NaN NaN
EDIT
OK, after seeing what you want, the (admittedly convoluted) answer is this:
In [72]:
df.loc[df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]] == 1].dropna(how='all', axis=0).index]
Out[72]:
bar.baz foo.aa foo.bars foo.fighters foo.fox foo.manchu nas.foo
0 5.0 1.0 0 0 2 NA NA
1 5.0 2.1 0 1 4 0 0
2 6.0 NaN 0 NaN 1 0 1
5 6.8 6.8 1 0 5 0 0
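As an aside, in current pandas the Series wrapper around df.columns is unnecessary, since Index objects expose .str directly; df.filter with an anchored regex is another equivalent. A quick sketch on a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'foo.aa': [1.0, 2.1], 'foo.bars': [0, 1], 'bar.baz': [5, 5]})

# List comprehension, as above
cols1 = df[[c for c in df if c.startswith('foo')]].columns.tolist()

# Index.str works without wrapping the columns in a Series
cols2 = df.loc[:, df.columns.str.startswith('foo')].columns.tolist()

# filter() with a regex anchored at the start of the label
cols3 = df.filter(regex=r'^foo').columns.tolist()

print(cols1, cols2, cols3)  # all three give ['foo.aa', 'foo.bars']
```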