Random Row Selection in Pandas Dataframe

Random row selection in Pandas dataframe

Something like this?

import random

def some(x, n):
    return x.ix[random.sample(x.index, n)]

Note: As of pandas v0.20.0, ix has been deprecated in favour of the stricter loc (label-based) and iloc (position-based) indexers.
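On modern pandas the same idea is a one-liner with the built-in DataFrame.sample method, which avoids ix entirely; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": range(10)})

# pick 3 rows at random, without replacement;
# random_state makes the draw reproducible
subset = df.sample(n=3, random_state=0)
print(len(subset))  # 3
```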

Randomly selecting a subset of rows from a pandas dataframe based on existing column values

I hope this code snippet will work for you

samples = []
for group in df.GroupID.unique():
    s = df.loc[df.GroupID == group].sample(n=1).reset_index(drop=True)
    samples.append(s)

sample = pd.concat(samples, axis=0)

The loop takes each unique 'GroupID' and samples one observation from that subgroup; concatenating the per-group subsamples gives the final sample.
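Note that since pandas 1.1.0 the loop can be replaced by a single GroupBy.sample call; a sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({"GroupID": ["a", "a", "b", "b", "b"],
                   "val": range(5)})

# one random row per GroupID in a single call (pandas >= 1.1.0)
sample = df.groupby("GroupID").sample(n=1).reset_index(drop=True)
print(sample)
```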

Pandas dataframe random row selection per group with a boolean condition

I would first filter out all the rows whose dates don't satisfy that criterion:

In [11]: df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])
Out[11]:
0 2015-04-13 23:25:55
1 2015-04-08 17:57:29
2 2015-04-12 23:29:11
3 2015-04-08 17:57:29
4 2015-02-20 10:33:48
5 2015-02-20 10:33:48
6 2015-02-20 10:33:48
7 2015-02-20 10:33:48
8 2015-04-08 17:57:29
9 2015-04-13 23:25:55
10 2015-04-13 23:25:55
11 2015-04-12 23:29:11
12 2015-04-08 17:57:29
Name: date, dtype: datetime64[ns]

In [12]: df["date"] < df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])
Out[12]:
0 True
1 True
2 False
3 False
4 True
5 False
6 True
7 True
8 True
9 True
10 False
11 True
12 True
Name: date, dtype: bool

In [13]: df_old = df[df["date"] < df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])]

In [14]: df_old
Out[14]:
                   date   name
0   2015-01-31 07:14:39   Dave
1   2014-12-16 22:50:55   Lisa
4   2015-01-30 03:51:12  Simon
6   2014-12-15 23:54:03  Simon
7   2014-12-16 19:53:53  Simon
8   2014-12-18 00:15:02   Lisa
9   2015-04-01 21:36:55   Dave
11  2015-02-18 14:10:40   John
12  2015-02-27 04:56:33   Lisa

Now it becomes a much easier problem: pick a random row by name:

In [21]: df_old.groupby("name").agg(lambda x: x.iloc[np.random.randint(0,len(x))])
Out[21]:
                      date
name
Dave   2015-04-01 21:36:55
John   2015-02-18 14:10:40
Lisa   2014-12-16 22:50:55
Simon  2014-12-15 23:54:03

In [22]: df_old.groupby("name").agg(lambda x: x.iloc[np.random.randint(0,len(x))])
Out[22]:
                      date
name
Dave   2015-01-31 07:14:39
John   2015-02-18 14:10:40
Lisa   2014-12-18 00:15:02
Simon  2014-12-16 19:53:53
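The same pipeline can be sketched end to end on toy data standing in for df and df2a (the names and cutoff dates below are invented, and GroupBy.sample, pandas >= 1.1.0, stands in for the np.random.randint agg):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Dave", "Dave", "Lisa", "Lisa", "Simon"],
    "date": pd.to_datetime(["2015-01-31", "2015-04-13",
                            "2014-12-16", "2015-04-08", "2015-01-30"]),
})
# per-name cutoff, playing the role of df2a["datemax"]
datemax = pd.Series(pd.to_datetime(["2015-04-13", "2015-04-08", "2015-02-01"]),
                    index=["Dave", "Lisa", "Simon"])

# keep only rows strictly older than each name's cutoff
df_old = df[df["date"] < df["name"].map(datemax)]
# pick one random surviving row per name
picks = df_old.groupby("name").sample(n=1, random_state=0)
print(picks)
```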

Randomly select rows from DataFrame Pandas

The built-in sample method provides a frac argument specifying the fraction of rows to include in the sample.

If your DataFrame of people is people_df:

percent_sampled = 27
sample_df = people_df.sample(frac = percent_sampled/100)

people_df['is_selected'] = people_df.index.isin(sample_df.index)
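Put together, with a hypothetical 100-row frame (the column name below is invented):

```python
import pandas as pd

people_df = pd.DataFrame({"name": [f"person_{i}" for i in range(100)]})

percent_sampled = 27
sample_df = people_df.sample(frac=percent_sampled / 100, random_state=0)

# flag the rows that made it into the sample
people_df["is_selected"] = people_df.index.isin(sample_df.index)
print(people_df["is_selected"].sum())  # 27
```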

Pandas: select rows by random groups while keeping all of the group's variables

You need to create the random sample of ids first and then compare the original id column with Series.isin in boolean indexing:

#number of groups
N = 2
df2 = df1[df1['id'].isin(df1['id'].drop_duplicates().sample(N))]
print (df2)
  id      std  number
0  A      1.0       1
1  A      0.0      12
5  C    134.0      90
6  C   1234.0     100
7  C  12345.0     111

Or:

N = 2
#replace=False prevents the same id from being drawn twice
df2 = df1[df1['id'].isin(np.random.choice(df1['id'].unique(), N, replace=False))]
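A runnable sketch of the isin approach on invented toy data (Series.sample already draws without replacement):

```python
import pandas as pd

df1 = pd.DataFrame({"id": list("AABBCCC"),
                    "number": [1, 12, 2, 3, 90, 100, 111]})

N = 2
# sample N distinct group ids
chosen = df1["id"].drop_duplicates().sample(N, random_state=1)
# keep every row belonging to one of the chosen groups
df2 = df1[df1["id"].isin(chosen)]
print(df2["id"].nunique())  # 2
```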

Randomly select rows from Pandas DataFrame based on multiple criteria

I am not sure whether I understood the question correctly, but this answer should at least point you in the right direction.
If it is not what you are looking for, let me know.

import pandas as pd
#your dataframe
maindf = {'PM Owner': ['A', 'B','C','A','E','F'], 'Risk Tier': [1,3,1,1,1,2],'sam' :['A0','B0','C0','D0','E0','F0']}
Maindf = pd.DataFrame(data=maindf)


#what you are looking for
filterdf = {'PM Owner': ['A' ], 'Risk Tier': [ 1 ]}
Filterdf = pd.DataFrame(data=filterdf)


#Filtering
NewMaindf = Maindf[Maindf[['PM Owner', 'Risk Tier']].astype(str).sum(axis=1).isin(
    Filterdf[['PM Owner', 'Risk Tier']].astype(str).sum(axis=1))]
#just one sample
print(NewMaindf.sample())
#whole dataset after filtering
print(NewMaindf)

Result :

  PM Owner  Risk Tier sam
3        A          1  D0

  PM Owner  Risk Tier sam
0        A          1  A0
3        A          1  D0
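An alternative to the string-concatenation trick (my suggestion, not part of the answer) is an inner merge on the key columns, which matches on the real values instead of concatenated strings:

```python
import pandas as pd

Maindf = pd.DataFrame({'PM Owner': ['A', 'B', 'C', 'A', 'E', 'F'],
                       'Risk Tier': [1, 3, 1, 1, 1, 2],
                       'sam': ['A0', 'B0', 'C0', 'D0', 'E0', 'F0']})
Filterdf = pd.DataFrame({'PM Owner': ['A'], 'Risk Tier': [1]})

# inner merge keeps exactly the rows matching on both key columns
NewMaindf = Maindf.merge(Filterdf, on=['PM Owner', 'Risk Tier'])
print(NewMaindf.sample(n=1))
```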

Random selection of a row from a pandas DataFrame with weights

You should scale the weights so that the sampling matches the desired distribution:

weights = {-1:0.1, 0:0.4, 1:0.5}

scaled_weights = (pd.Series(weights) / df.label.value_counts(normalize=True))

df.sample(n=1, weights=df.label.map(scaled_weights) )

Test the distribution with 10,000 samples:

(df.sample(n=10000, replace=True, random_state=1,
weights=df.label.map(scaled_weights))
.label.value_counts(normalize=True)
)

Output:

 1    0.5060
 0    0.3979
-1    0.0961
Name: label, dtype: float64
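The rescaling trick can be checked end to end on invented data whose empirical label frequencies deliberately differ from the target:

```python
import pandas as pd

# data is 50% label -1, 30% label 0, 20% label 1
df = pd.DataFrame({"label": [-1] * 50 + [0] * 30 + [1] * 20})
weights = {-1: 0.1, 0: 0.4, 1: 0.5}  # target distribution

# divide the target probabilities by the empirical label frequencies
scaled_weights = pd.Series(weights) / df.label.value_counts(normalize=True)

# the resulting sample distribution approximates the target weights
dist = (df.sample(n=10000, replace=True, random_state=1,
                  weights=df.label.map(scaled_weights))
          .label.value_counts(normalize=True))
print(dist)
```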

How to randomly select rows from a data set using pandas?

I think you can use sample, either for 9k rows or for 25% of the rows:

df.sample(n=9000)

Or:

df.sample(frac=0.25)

Another solution creates a random sample of the index with numpy.random.choice and then selects with loc (the index has to be unique); pass replace=False to sample without replacement, matching sample's behaviour:

df = df.loc[np.random.choice(df.index, size=9000, replace=False)]

Solution if the index is not unique:

df = df.iloc[np.random.choice(np.arange(len(df)), size=9000, replace=False)]

