How to select all elements greater than a given value in a dataframe
Just correct the condition inside criteria. Since the second column is labeled "1", you should write df.iloc[:,1].
Example:
import pandas as pd
import numpy as np
b = np.array([[1, 2, 3, 7], [1, 99, 20, 63]])
df = pd.DataFrame(b.T)  # just creating the dataframe
criteria = df[df.iloc[:, 1] >= 60]
print(criteria)
Why?
It seems like the cause lies in the type produced by the condition. Let's inspect:
Case 1:
type(df.iloc[:, 1] >= 60)
Returns pandas.core.series.Series, so it gives
df[df.iloc[:, 1] >= 60]
#out:
   0   1
1  2  99
3  7  63
Case 2:
type(df.iloc[:, 1:2] >= 60)
Returns a pandas.core.frame.DataFrame, and gives
df[df.iloc[:, 1:2] >= 60]
#out:
     0     1
0  NaN   NaN
1  NaN  99.0
2  NaN   NaN
3  NaN  63.0
Therefore it changes the way the boolean index is processed.
Always keep in mind that 3 is a scalar, while 3:4 is a slice.
For more info, it is always good to take a look at the official docs on Pandas indexing.
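As an aside, if you literally want every element greater than the value (rather than whole rows), comparing the entire frame behaves just like Case 2; a minimal sketch reusing the df built above:
# Element-wise selection: cells that fail the test become NaN
print(df[df >= 60])
# Equivalent spelling using DataFrame.where
print(df.where(df >= 60))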
Selecting all values greater than a number in a pandas data frame
First filter only the columns with Year in their names using DataFrame.filter, compare all rows, and then use DataFrame.any to test for at least one matched value per row:
df1 = df[(df.filter(like='Year') > 2000000).any(axis=1)]
print(df1)
   Country Country_Code  Year_1979  Year_1999   Year_2013
1   Angola          AGO  8641521.0   15949766  25998340.0
2  Albania          ALB  2617832.0    3108778   2895092.0
Or compare all columns except the first 2, selected by position with DataFrame.iloc:
df1 = df[(df.iloc[:, 2:] > 2000000).any(axis=1)]
print(df1)
   Country Country_Code  Year_1979  Year_1999   Year_2013
1   Angola          AGO  8641521.0   15949766  25998340.0
2  Albania          ALB  2617832.0    3108778   2895092.0
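For reference, here is a minimal sketch of an input that would reproduce the rows printed above; the original df is not shown in full, so the first row (Aruba) and its numbers are assumed purely for illustration:
import pandas as pd

df = pd.DataFrame({
    'Country': ['Aruba', 'Angola', 'Albania'],       # row 0 is hypothetical
    'Country_Code': ['ABW', 'AGO', 'ALB'],
    'Year_1979': [59980.0, 8641521.0, 2617832.0],    # row 0 values are made up
    'Year_1999': [89005, 15949766, 3108778],
    'Year_2013': [103187.0, 25998340.0, 2895092.0],
})

# Both selections keep only rows where at least one Year_* value exceeds 2,000,000
print(df[(df.filter(like='Year') > 2000000).any(axis=1)])
print(df[(df.iloc[:, 2:] > 2000000).any(axis=1)])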
How do I select and store columns greater than a number in pandas?
Sample DF:
In [79]: df = pd.DataFrame(np.random.randint(5, 15, (10, 3)), columns=list('abc'))
In [80]: df
Out[80]:
    a   b   c
0   6  11  11
1  14   7   8
2  13   5  11
3  13   7  11
4  13   5   9
5   5  11   9
6   9   8   6
7   5  11  10
8   8  10  14
9   7  14  13
Present only those rows where b > 10:
In [81]: df[df.b > 10]
Out[81]:
   a   b   c
0  6  11  11
5  5  11   9
7  5  11  10
9  7  14  13
Minimums (for all columns) for the rows satisfying the b > 10 condition:
In [82]: df[df.b > 10].min()
Out[82]:
a     5
b    11
c     9
dtype: int32
Minimum (for the b column) for the rows satisfying the b > 10 condition:
In [84]: df.loc[df.b > 10, 'b'].min()
Out[84]: 11
UPDATE: Starting from Pandas 0.20.1, the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers.
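To actually store a selection (as the question title asks), just assign it to a name; a minimal sketch reusing the sample df, where filtered and subset are purely illustrative variable names:
# Store the rows where b > 10 for later use
filtered = df[df.b > 10]

# Or store only the columns you care about for those rows, via .loc
subset = df.loc[df.b > 10, ['a', 'c']]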
How to select all rows which contain values greater than a threshold?
There is absolutely no need for the double transposition - you can simply call any along the column index (supplying axis=1 or axis='columns') on your Boolean matrix.
df[(df > threshold).any(axis=1)]
Example
>>> df = pd.DataFrame(np.random.randint(0, 100, 50).reshape(5, 10))
>>> df
    0   1   2   3   4   5   6   7   8   9
0  45  53  89  63  62  96  29  56  42   6
1   0  74  41  97  45  46  38  39   0  49
2  37   2  55  68  16  14  93  14  71  84
3  67  45  79  75  27  94  46  43   7  40
4  61  65  73  60  67  83  32  77  33  96
>>> df[(df > 95).any(axis=1)]
    0   1   2   3   4   5   6   7   8   9
0  45  53  89  63  62  96  29  56  42   6
1   0  74  41  97  45  46  38  39   0  49
4  61  65  73  60  67  83  32  77  33  96
Transposing as your self-answer does is just an unnecessary performance hit.
df = pd.DataFrame(np.random.randint(0, 100, 10**8).reshape(10**4, 10**4))
# standard way
%timeit df[(df > 95).any(axis=1)]
1 loop, best of 3: 8.48 s per loop
# transposing
%timeit df[df.T[(df.T > 95)].any()]
1 loop, best of 3: 13 s per loop
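As a side note, the same pattern with DataFrame.all keeps only rows where every value exceeds the threshold, should you need that instead:
# At least one value per row above the threshold (as above)
df[(df > threshold).any(axis=1)]

# Every value in the row above the threshold
df[(df > threshold).all(axis=1)]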
Comparing values of a dataframe against a given value
Your solution is failing because df['MAG'] >= 6.4 returns a pd.Series of the same size as df['MAG'], holding one boolean per row instead of a single boolean. Hence the ambiguity, hence the error. You need to check 'MAG' and create the new timeWindow values at the same time. See the following:
Starting a new DataFrame:
import pandas as pd
df = pd.DataFrame([5.3, 4.2, 7.8, 9.2], columns=["MAG"])
- You can use pd.Series.apply(), which does not require any external library:
df["timeWindow"] = df.MAG.apply(
lambda x: x * (10 ** 0.032) + 2.7389 if x >= 6.4
else x * (10 ** 0.5409) - 0.547
)
print(df)
#    MAG  timeWindow
# 0  5.3   17.868176
# 1  4.2   14.046158
# 2  7.8   11.135329
# 3  9.2   12.642380
- Or, as suggested by @Quang Hoang, you may also use np.where():
import numpy as np
df["timeWindow"] = np.where(
    df.MAG >= 6.4,
    df.MAG * (10 ** 0.032) + 2.7389,
    df.MAG * (10 ** 0.5409) - 0.547
)
print(df)
#    MAG  timeWindow
# 0  5.3   17.868176
# 1  4.2   14.046158
# 2  7.8   11.135329
# 3  9.2   12.642380
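np.select handles the same two-way split and also scales to more conditions later; a minimal sketch equivalent to the np.where call above:
df["timeWindow"] = np.select(
    [df.MAG >= 6.4],                         # conditions, checked in order
    [df.MAG * (10 ** 0.032) + 2.7389],       # value where the condition holds
    default=df.MAG * (10 ** 0.5409) - 0.547  # value otherwise
)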
Select first occurrence where column value is greater than x for each A (key) in a dataframe
You can first slice the rows that match the condition on D, then groupby A and get the first element of each group:
df[df['D'].ge(4)].groupby('A', sort=False).first()
output:
     B    C  D  E
A
foo  2  2.1  4  5
bar  0  4.1  4  6
baz  0  4.1  5  0
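A close alternative on the same (not shown) df is to drop duplicate A values after the slice; drop_duplicates keeps the first match per key by default and leaves A as a regular column rather than the index:
df[df['D'].ge(4)].drop_duplicates('A')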
Replacing values greater than a number in pandas dataframe
You can use apply with a list comprehension:
df1['A'] = df1['A'].apply(lambda x: [y if y <= 9 else 11 for y in x])
print(df1)
A
2017-01-01 02:00:00 [11, 11, 11]
2017-01-01 03:00:00 [3, 11, 9]
A faster solution is to first convert to a numpy array and then use numpy.where:
a = np.array(df1['A'].values.tolist())
print(a)
[[33 34 39]
 [ 3 43  9]]
df1['A'] = np.where(a > 9, 11, a).tolist()
print(df1)
A
2017-01-01 02:00:00 [11, 11, 11]
2017-01-01 03:00:00 [3, 11, 9]
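Another option, sketched under the assumption that every list in A has the same length, is to expand the lists into a temporary DataFrame, replace with mask, and convert back:
# Expand lists to a 2-D frame, replace values above 9 with 11, rebuild the lists
tmp = pd.DataFrame(df1['A'].tolist(), index=df1.index)
df1['A'] = tmp.mask(tmp > 9, 11).values.tolist()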