Pandas - find index of value anywhere in DataFrame
Supposing your DataFrame looks like the following:
      0       1            2      3    4
0     a      er          tfr    sdf   34
1    rt     tyh          fgd    thy  rer
2     1       2            3      4    5
3     6       7            8      9   10
4   dsf     wew  security_id   name  age
5   dfs    bgbf          121  jason   34
6  dddp    gpot         5754   mike   37
7  fpoo  werwrw          342   jack   31
Do the following (note that get_value was deprecated and later removed from pandas; df.iat is the current positional accessor):
for row in range(df.shape[0]):  # df is the DataFrame
    for col in range(df.shape[1]):
        if df.iat[row, col] == 'security_id':
            print(row, col)
            break
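The nested loop works, but it touches every cell from Python. A vectorized sketch of the same lookup, using a small made-up frame; np.where returns the row and column positions of every matching cell at once:

```python
import numpy as np
import pandas as pd

# Toy frame (made up) standing in for the one above
df = pd.DataFrame([['a', 'er', 'tfr'],
                   ['dsf', 'security_id', 'name'],
                   ['121', 'jason', '34']])

# Vectorized alternative to the nested loop: compare every cell at once
rows, cols = np.where(df.to_numpy() == 'security_id')
print(rows[0], cols[0])  # row 1, column 1 in this toy frame
```

Unlike the loop, this returns all matches, not just the first one per row.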
Search for a value anywhere in a pandas DataFrame
You can perform equality comparison on the entire DataFrame:
df[df.eq(var1).any(axis=1)]
Python Pandas: Get index of rows which column matches certain value
df.iloc[i] returns the ith row of df. Here i does not refer to an index label; it is a 0-based positional index.
In contrast, the attribute index returns actual index labels, not numeric row positions:
df.index[df['BoolCol'] == True].tolist()
or equivalently,
df.index[df['BoolCol']].tolist()
You can see the difference quite clearly by playing with a DataFrame whose non-default index is not equal to the rows' numerical positions:
df = pd.DataFrame({'BoolCol': [True, False, False, True, True]},
                  index=[10, 20, 30, 40, 50])
In [53]: df
Out[53]:
BoolCol
10 True
20 False
30 False
40 True
50 True
[5 rows x 1 columns]
In [54]: df.index[df['BoolCol']].tolist()
Out[54]: [10, 40, 50]
If you want to use the index,
In [56]: idx = df.index[df['BoolCol']]
In [57]: idx
Out[57]: Int64Index([10, 40, 50], dtype='int64')
then you can select the rows using loc instead of iloc:
In [58]: df.loc[idx]
Out[58]:
BoolCol
10 True
40 True
50 True
[3 rows x 1 columns]
Note that loc can also accept boolean arrays:
In [55]: df.loc[df['BoolCol']]
Out[55]:
BoolCol
10 True
40 True
50 True
[3 rows x 1 columns]
If you have a boolean array, mask, and need ordinal index values, you can compute them using np.flatnonzero:
In [110]: np.flatnonzero(df['BoolCol'])
Out[110]: array([0, 3, 4])
Use df.iloc to select rows by ordinal index:
In [113]: df.iloc[np.flatnonzero(df['BoolCol'])]
Out[113]:
BoolCol
10 True
40 True
50 True
Finding the index for a value in a Pandas Dataframe
You're essentially looking for two conditions. For the first condition, you want the given value to be greater than 0.1:
df['value'].gt(0.1)
For the second condition, you want the previous non-null value to be less than 0.1:
df['value'].ffill().shift().lt(0.1)
Now, combine the two conditions with the & operator, reverse the resulting Boolean indexer, and use idxmax to find the first instance (i.e., the last in the original order) where your condition holds:
(df['value'].gt(0.1) & df['value'].ffill().shift().lt(0.1))[::-1].idxmax()
Which gives the expected index value.
The above method assumes that at least one value satisfies the condition you've described. If your data may not contain such a value, you may want to use any to verify that a solution exists:
# Build the condition.
cond = (df['value'].gt(0.1) & df['value'].ffill().shift().lt(0.1))[::-1]

# Check if the condition is met anywhere.
if cond.any():
    idx = cond.idxmax()
else:
    idx = ???  # no matching index exists; pick a sensible fallback
In your question, you've specified both inequalities as strict. What happens for a value exactly equal to 0.1? You may want to change one of gt/lt to ge/le to account for this.
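To make the recipe concrete, here is a minimal sketch on made-up data, where the values cross 0.1 after a NaN that ffill() carries over:

```python
import numpy as np
import pandas as pd

# Toy series (made up): the crossing happens at label 3, where 0.3 > 0.1
# and the previous non-null value (0.08, carried over the NaN) is < 0.1.
df = pd.DataFrame({'value': [0.05, 0.08, np.nan, 0.3, 0.2]})

cond = (df['value'].gt(0.1) & df['value'].ffill().shift().lt(0.1))[::-1]
if cond.any():
    idx = cond.idxmax()
    print(idx)  # 3
```

Note that ffill().shift() is what lets the NaN at label 2 be skipped: the "previous" value seen at label 3 is 0.08, not NaN.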
speed up pandas search for a certain value not in the whole df
Just to make a full answer out of my comment:
With -1 not in test1.values you can check whether -1 occurs anywhere in your DataFrame.
Regarding performance, this still needs to check every single value, which in your case is 10^5 * 10^2 = 10^7 comparisons. All you save is the cost of the summation and the additional comparison of those results.
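As a small sketch of that membership check (the test1 frame here is a tiny stand-in for the real 10^5 x 10^2 one):

```python
import numpy as np
import pandas as pd

# Small stand-in for test1
test1 = pd.DataFrame(np.arange(12).reshape(3, 4))

print(-1 not in test1.values)   # True: -1 does not occur anywhere
test1.iloc[1, 2] = -1
print(-1 not in test1.values)   # False once -1 is present
```

On a NumPy array, `x in arr` is equivalent to `(arr == x).any()`, which is why this works on the .values of the frame.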
How to find Value at specific index in an array in a dataframe?
This line isn't doing what you think it's doing:
w=AvgT.SRi[maxsa]
You are accessing the value of SRi in row maxsa of the dataframe -- that is, you are getting the whole list. I assume you are getting an IndexError because in at least one instance, the argmax of SAi is higher than the number of rows in your dataframe.
Try replacing that line with this:
w=AvgT.SRi[index][maxsa]
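A minimal sketch of that fix, assuming (as a hypothetical reconstruction of the question's setup) that each row of AvgT stores a list in SAi and a matching list in SRi:

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction: per-row lists of scores and labels
AvgT = pd.DataFrame({
    'SAi': [[0.2, 0.9, 0.4], [0.7, 0.1]],
    'SRi': [['a', 'b', 'c'], ['d', 'e']],
})

for index in AvgT.index:
    maxsa = np.argmax(AvgT.SAi[index])   # position of the max within this row's list
    w = AvgT.SRi[index][maxsa]           # element of this row's list, not the whole list
    print(w)                             # 'b', then 'd'
```

The key point is that AvgT.SRi[index] selects one row's list first; only then is maxsa a valid position into that list.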
Get the indexes for the top 3 values from a dataframe row (using a fast implementation)
You can use np.sort with axis=1, apply [:, ::-1] to reverse the sort order, and then [:, :3] to select the first 3 columns of the array. Then recreate the DataFrame:
# input
import numpy as np
import pandas as pd

np.random.seed(3)
df = pd.DataFrame(np.random.randint(0, 100, 100).reshape(10, 10),
                  columns=list('abcdefghij'))
# sort
top3 = pd.DataFrame(np.sort(df, axis=1)[:, ::-1][:,:3])
print(top3)
0 1 2
0 74 72 56
1 96 93 81
2 90 90 69
3 97 79 62
4 94 78 64
5 85 71 63
6 99 91 80
7 96 95 61
8 91 90 74
9 88 60 56
EDIT: OP changed the question to ask for the column names of the top 3 values per row; that can be done with argsort and slicing into the column names:
print(pd.DataFrame(df.columns.to_numpy()
[np.argsort(df.to_numpy(), axis=1)][:, -1:-4:-1]))
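A sketch verifying that slicing the argsort output does pick the per-row top-3 column names, using the same seed and frame as above:

```python
import numpy as np
import pandas as pd

np.random.seed(3)
df = pd.DataFrame(np.random.randint(0, 100, 100).reshape(10, 10),
                  columns=list('abcdefghij'))

# argsort gives ascending positions; [:, -1:-4:-1] takes the last three
# reversed, i.e. the column positions of the three largest values per row.
top3_cols = pd.DataFrame(df.columns.to_numpy()
                         [np.argsort(df.to_numpy(), axis=1)][:, -1:-4:-1])

# Looking those columns up row-wise reproduces the sorted top-3 values.
print(top3_cols.head(1))
```

Fancy-indexing the (10,) array of column names with the (10, 10) argsort result broadcasts to a (10, 10) array of names, which the final slice trims to the top three per row.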