How to Select All Elements Greater Than a Given Value in a DataFrame

How to select all elements greater than a given value in a dataframe

Just correct the condition inside criteria. Since the second column has positional index 1, you should write df.iloc[:, 1].

Example:

import pandas as pd
import numpy as np

b = np.array([[1, 2, 3, 7], [1, 99, 20, 63]])

df = pd.DataFrame(b.T)  # just creating the dataframe

criteria = df[df.iloc[:, 1] >= 60]
print(criteria)

Why?
The cause lies in the type of the object the condition returns. Let's inspect both cases.

Case 1:

type( df.iloc[:,1]>= 60 )

Returns pandas.core.series.Series,
so it gives

df[df.iloc[:, 1] >= 60]

#out:
   0   1
1  2  99
3  7  63

Case 2:

type( df.iloc[:,1:2]>= 60 )

Returns a pandas.core.frame.DataFrame
, and gives

df[df.iloc[:, 1:2] >= 60]

#out:
    0     1
0 NaN   NaN
1 NaN  99.0
2 NaN   NaN
3 NaN  63.0

So the type of the mask changes how the indexing is processed: a boolean Series selects whole rows, while a boolean DataFrame masks element-wise.

Always keep in mind that 3 is a scalar index, while 3:4 is a slice.

For more info, it is always good to take a look at the official pandas indexing documentation.
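The scalar-vs-slice distinction can be checked directly; this minimal sketch (same toy data as above) confirms the two mask types and their effect on filtering:

```python
import numpy as np
import pandas as pd

b = np.array([[1, 2, 3, 7], [1, 99, 20, 63]])
df = pd.DataFrame(b.T)

series_mask = df.iloc[:, 1] >= 60    # scalar column index -> boolean Series
frame_mask = df.iloc[:, 1:2] >= 60   # slice column index  -> boolean DataFrame

print(type(series_mask).__name__)    # Series
print(type(frame_mask).__name__)     # DataFrame

# A Series mask drops non-matching rows; a DataFrame mask keeps the
# original shape and fills non-matching cells with NaN.
print(df[series_mask].shape)   # (2, 2)
print(df[frame_mask].shape)    # (4, 2)
```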

Selecting all values greater than a number in a pandas data frame

First filter only the columns with Year in their names using DataFrame.filter, compare against the threshold, and then use DataFrame.any to test for at least one matching value per row:

df1 = df[(df.filter(like='Year') > 2000000).any(axis=1)]
print (df1)
   Country Country_Code  Year_1979  Year_1999   Year_2013
1   Angola          AGO  8641521.0   15949766  25998340.0
2  Albania          ALB  2617832.0    3108778   2895092.0

Or compare all columns except the first 2, selected by position with DataFrame.iloc:

df1 = df[(df.iloc[:, 2:] > 2000000).any(axis=1)]
print (df1)
   Country Country_Code  Year_1979  Year_1999   Year_2013
1   Angola          AGO  8641521.0   15949766  25998340.0
2  Albania          ALB  2617832.0    3108778   2895092.0
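The original frame isn't shown, so here is a self-contained sketch with invented data in the same shape (the Aruba row is made up so the filter has something to drop); both approaches should agree:

```python
import pandas as pd

# Hypothetical data matching the question's column layout;
# the Aruba row is invented so the filter has something to drop.
df = pd.DataFrame({
    'Country': ['Aruba', 'Angola', 'Albania'],
    'Country_Code': ['ABW', 'AGO', 'ALB'],
    'Year_1979': [60096.0, 8641521.0, 2617832.0],
    'Year_1999': [89004, 15949766, 3108778],
    'Year_2013': [103187.0, 25998340.0, 2895092.0],
})

# keep rows where at least one Year_* column exceeds 2,000,000
df1 = df[(df.filter(like='Year') > 2000000).any(axis=1)]
print(df1['Country'].tolist())  # ['Angola', 'Albania']

# equivalent: compare everything except the first two columns
df2 = df[(df.iloc[:, 2:] > 2000000).any(axis=1)]
print(df1.equals(df2))  # True
```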

How do I select and store columns greater than a number in pandas?

Sample DF:

In [79]: df = pd.DataFrame(np.random.randint(5, 15, (10, 3)), columns=list('abc'))

In [80]: df
Out[80]:
    a   b   c
0   6  11  11
1  14   7   8
2  13   5  11
3  13   7  11
4  13   5   9
5   5  11   9
6   9   8   6
7   5  11  10
8   8  10  14
9   7  14  13

Present only those rows where b > 10:

In [81]: df[df.b > 10]
Out[81]:
   a   b   c
0  6  11  11
5  5  11   9
7  5  11  10
9  7  14  13

Minimums (for all columns) for the rows satisfying the b > 10 condition:

In [82]: df[df.b > 10].min()
Out[82]:
a     5
b    11
c     9
dtype: int32

Minimum (for the b column) for the rows satisfying the b > 10 condition:

In [84]: df.loc[df.b > 10, 'b'].min()
Out[84]: 11

UPDATE: starting from Pandas 0.20.1 the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.
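The migration away from .ix is mechanical; a minimal sketch (with a made-up frame) of the .loc/.iloc replacements for an old .ix lookup:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('abc'))

# pre-0.20 style (now removed):  df.ix[df.b > 3, 'b']
# label-based replacement: .loc with the column label
print(df.loc[df.b > 3, 'b'].min())          # 4

# position-based replacement: .iloc with a boolean numpy array
print(df.iloc[(df.b > 3).values, 1].min())  # 4
```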

How to select all rows which contain values greater than a threshold?

There is absolutely no need for the double transposition - you can simply call any along the column axis (supplying axis=1 or axis='columns') on your Boolean matrix. Note that recent pandas versions no longer accept the bare positional form any(1).

df[(df > threshold).any(axis=1)]

Example

>>> df = pd.DataFrame(np.random.randint(0, 100, 50).reshape(5, 10))

>>> df

    0   1   2   3   4   5   6   7   8   9
0  45  53  89  63  62  96  29  56  42   6
1   0  74  41  97  45  46  38  39   0  49
2  37   2  55  68  16  14  93  14  71  84
3  67  45  79  75  27  94  46  43   7  40
4  61  65  73  60  67  83  32  77  33  96

>>> df[(df > 95).any(axis=1)]

    0   1   2   3   4   5   6   7   8   9
0  45  53  89  63  62  96  29  56  42   6
1   0  74  41  97  45  46  38  39   0  49
4  61  65  73  60  67  83  32  77  33  96

Transposing as your self-answer does is just an unnecessary performance hit.

df = pd.DataFrame(np.random.randint(0, 100, 10**8).reshape(10**4, 10**4))

# standard way
%timeit df[(df > 95).any(axis=1)]
1 loop, best of 3: 8.48 s per loop

# transposing
%timeit df[df.T[(df.T > 95)].any()]
1 loop, best of 3: 13 s per loop

Comparing values of a dataframe for greater than a given value

Your solution fails because df['MAG'] >= 6.4 returns a pd.Series of booleans, one per row, instead of a single boolean. Hence the ambiguity, hence the error. You need to check 'MAG' and create the new timeWindow values element-wise, both at the same time. See the following:

Starting a new DataFrame:

import pandas as pd
df = pd.DataFrame([5.3, 4.2, 7.8, 9.2], columns=["MAG"])
1. You can use pd.Series.apply(), which does not require any external library:

df["timeWindow"] = df.MAG.apply(
    lambda x: x * (10 ** 0.032) + 2.7389 if x >= 6.4
    else x * (10 ** 0.5409) - 0.547
)

print(df)
#    MAG  timeWindow
# 0  5.3   17.868176
# 1  4.2   14.046158
# 2  7.8   11.135329
# 3  9.2   12.642380

2. Or, as suggested by @Quang Hoang, you may also use np.where():

import numpy as np
df["timeWindow"] = np.where(
    df.MAG >= 6.4,
    df.MAG * (10 ** 0.032) + 2.7389,
    df.MAG * (10 ** 0.5409) - 0.547
)

print(df)
#    MAG  timeWindow
# 0  5.3   17.868176
# 1  4.2   14.046158
# 2  7.8   11.135329
# 3  9.2   12.642380
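For reference, the ambiguity described above can be reproduced directly; this sketch (same toy MAG values) shows what a plain if raises:

```python
import pandas as pd

df = pd.DataFrame([5.3, 4.2, 7.8, 9.2], columns=["MAG"])

try:
    if df["MAG"] >= 6.4:  # a whole Series in a boolean context
        pass
except ValueError as err:
    # "The truth value of a Series is ambiguous. Use a.empty,
    #  a.bool(), a.item(), a.any() or a.all()."
    print(err)
```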

Select first occurrence where column value is greater than x for each A (key) | dataframe

You can first slice the rows that match the condition on D, then groupby A and get the first element of each group:

df[df['D'].ge(4)].groupby('A', sort=False).first()

output:

     B    C  D  E
A
foo  2  2.1  4  5
bar  0  4.1  4  6
baz  0  4.1  5  0
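The question's frame isn't shown; this sketch invents rows consistent with the output above, giving each key an early row with D < 4 that must be skipped:

```python
import pandas as pd

# Invented data: each key's first row has D < 4, so .first() must
# pick the first row that survives the D >= 4 filter.
df = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar', 'baz', 'baz'],
    'B': [1, 2, 3, 0, 4, 0],
    'C': [1.1, 2.1, 3.1, 4.1, 1.1, 4.1],
    'D': [1, 4, 2, 4, 3, 5],
    'E': [9, 5, 9, 6, 9, 0],
})

# slice rows matching the condition, then take the first per key
out = df[df['D'].ge(4)].groupby('A', sort=False).first()
print(out)
```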

Replacing values greater than a number in pandas dataframe

You can use apply with a list comprehension:

df1['A'] = df1['A'].apply(lambda x: [y if y <= 9 else 11 for y in x])
print (df1)
                                A
2017-01-01 02:00:00  [11, 11, 11]
2017-01-01 03:00:00    [3, 11, 9]

A faster solution is to first convert to a numpy array and then use numpy.where:

a = np.array(df1['A'].values.tolist())
print (a)
[[33 34 39]
 [ 3 43  9]]

df1['A'] = np.where(a > 9, 11, a).tolist()
print (df1)
                                A
2017-01-01 02:00:00  [11, 11, 11]
2017-01-01 03:00:00    [3, 11, 9]
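Both routes can be checked end-to-end; this sketch rebuilds the example frame from the output above and verifies that apply and np.where agree:

```python
import numpy as np
import pandas as pd

# Reconstructing the example: a column holding equal-length lists
idx = pd.to_datetime(['2017-01-01 02:00:00', '2017-01-01 03:00:00'])
df1 = pd.DataFrame({'A': [[33, 34, 39], [3, 43, 9]]}, index=idx)

# list-comprehension route
via_apply = df1['A'].apply(lambda x: [y if y <= 9 else 11 for y in x])

# numpy route: lists -> 2-D array -> replace -> back to lists
a = np.array(df1['A'].values.tolist())
via_numpy = np.where(a > 9, 11, a).tolist()

print(via_apply.tolist())  # [[11, 11, 11], [3, 11, 9]]
print(via_apply.tolist() == via_numpy)  # True
```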

