Random row selection in Pandas dataframe
Something like this?
import random
def some(x, n):
    return x.ix[random.sample(x.index, n)]
Note: as of pandas v0.20.0, ix has been deprecated in favour of loc for label-based indexing.
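On modern pandas you can skip manual index sampling entirely and use DataFrame.sample. A minimal sketch (the toy DataFrame and its column names are invented for illustration):

```python
import pandas as pd

# Toy DataFrame; the column names are made up for this example
df = pd.DataFrame({"a": range(10), "b": list("abcdefghij")})

# Draw 3 rows uniformly at random; random_state makes the draw reproducible
subset = df.sample(n=3, random_state=0)
print(subset)
```

sample draws without replacement by default, matching the behaviour of the random.sample version above.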
Randomly selecting a subset of rows from a pandas dataframe based on existing column values
I hope this code snippet will work for you
samples = []
for group in df.GroupID.unique():
    s = df.loc[df.GroupID == group].sample(n=1).reset_index(drop=True)
    samples.append(s)
sample = pd.concat(samples, axis=0)
The code takes each 'GroupID' and samples one observation from that subgroup. Concatenating the subsamples (one per GroupID) gives the final sample.
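Since pandas 1.1, GroupBy.sample can replace the loop and draw per group in a single call. A sketch with an invented GroupID column:

```python
import pandas as pd

# Invented data: two groups of unequal size
df = pd.DataFrame({"GroupID": ["x", "x", "y", "y", "y"],
                   "value": [1, 2, 3, 4, 5]})

# One random row per GroupID, no explicit loop needed
sample = df.groupby("GroupID").sample(n=1, random_state=0)
print(sample)
```

The result has exactly one row per distinct GroupID, the same as the loop-and-concat version.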
Pandas dataframe random row selection per group with a boolean condition
I would first filter out all the dates which don't satisfy that criterion:
In [11]: df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])
Out[11]:
0 2015-04-13 23:25:55
1 2015-04-08 17:57:29
2 2015-04-12 23:29:11
3 2015-04-08 17:57:29
4 2015-02-20 10:33:48
5 2015-02-20 10:33:48
6 2015-02-20 10:33:48
7 2015-02-20 10:33:48
8 2015-04-08 17:57:29
9 2015-04-13 23:25:55
10 2015-04-13 23:25:55
11 2015-04-12 23:29:11
12 2015-04-08 17:57:29
Name: date, dtype: datetime64[ns]
In [12]: df["date"] < df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])
Out[12]:
0 True
1 True
2 False
3 False
4 True
5 False
6 True
7 True
8 True
9 True
10 False
11 True
12 True
Name: date, dtype: bool
In [13]: df_old = df[df["date"] < df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])]
In [14]: df_old
Out[14]:
date name
0 2015-01-31 07:14:39 Dave
1 2014-12-16 22:50:55 Lisa
4 2015-01-30 03:51:12 Simon
6 2014-12-15 23:54:03 Simon
7 2014-12-16 19:53:53 Simon
8 2014-12-18 00:15:02 Lisa
9 2015-04-01 21:36:55 Dave
11 2015-02-18 14:10:40 John
12 2015-02-27 04:56:33 Lisa
Now it becomes a much easier problem: pick a random row by name:
In [21]: df_old.groupby("name").agg(lambda x: x.iloc[np.random.randint(0,len(x))])
Out[21]:
date
name
Dave 2015-04-01 21:36:55
John 2015-02-18 14:10:40
Lisa 2014-12-16 22:50:55
Simon 2014-12-15 23:54:03
In [22]: df_old.groupby("name").agg(lambda x: x.iloc[np.random.randint(0,len(x))])
Out[22]:
date
name
Dave 2015-01-31 07:14:39
John 2015-02-18 14:10:40
Lisa 2014-12-18 00:15:02
Simon 2014-12-16 19:53:53
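The agg lambda above works, but on pandas 1.1+ GroupBy.sample expresses the same per-name draw more directly. A sketch with invented data standing in for df_old:

```python
import pandas as pd

# Invented stand-in for the filtered df_old above
df_old = pd.DataFrame({
    "name": ["Dave", "Dave", "Lisa", "Simon"],
    "date": pd.to_datetime(["2015-01-31", "2015-04-01",
                            "2014-12-16", "2015-01-30"]),
})

# One random surviving row per name
picked = df_old.groupby("name").sample(n=1, random_state=0)
print(picked)
```

Unlike the agg version, this keeps the original row index and all columns, not just the aggregated one.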
Randomly select rows from DataFrame Pandas
The built-in sample function provides a frac argument giving the fraction of rows contained in the sample. If your DataFrame of people is people_df:
percent_sampled = 27
sample_df = people_df.sample(frac = percent_sampled/100)
people_df['is_selected'] = people_df.index.isin(sample_df.index)
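Put together as a runnable sketch (the contents of people_df are invented):

```python
import pandas as pd

# Invented stand-in for people_df: 100 rows
people_df = pd.DataFrame({"name": [f"p{i}" for i in range(100)]})

percent_sampled = 27
sample_df = people_df.sample(frac=percent_sampled / 100, random_state=0)

# Flag which rows of the original frame made it into the sample
people_df["is_selected"] = people_df.index.isin(sample_df.index)
print(people_df["is_selected"].sum())  # 27
```

frac rounds to the nearest whole number of rows, so 27% of 100 rows yields exactly 27.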
Pandas: select rows by random groups while keeping all of the group's variables
You need to create random ids first, then compare the original id column with Series.isin in boolean indexing:
#number of groups
N = 2
df2 = df1[df1['id'].isin(df1['id'].drop_duplicates().sample(N))]
print (df2)
id std number
0 A 1.0 1
1 A 0.0 12
5 C 134.0 90
6 C 1234.0 100
7 C 12345.0 111
Or, with numpy (note replace=False, otherwise the same id could be drawn twice and you would end up with fewer than N groups):
N = 2
df2 = df1[df1['id'].isin(np.random.choice(df1['id'].unique(), N, replace=False))]
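The approach can be checked end to end; a sketch with invented data:

```python
import pandas as pd

# Invented data: three groups A, B, C
df1 = pd.DataFrame({"id": ["A", "A", "B", "B", "C", "C"],
                    "number": [1, 12, 2, 3, 90, 100]})

N = 2
# Draw N distinct ids, then keep every row belonging to those ids
chosen = df1["id"].drop_duplicates().sample(N, random_state=0)
df2 = df1[df1["id"].isin(chosen)]
print(df2)
```

Because drop_duplicates runs first, the result always contains exactly N whole groups with all of their rows.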
Randomly select rows from Pandas DataFrame based on multiple criteria
I am not sure I understood the question correctly, but at least this answer may help others give you one.
If this is not what you are looking for, let me know.
import pandas as pd
#your dataframe
maindf = {'PM Owner': ['A', 'B','C','A','E','F'], 'Risk Tier': [1,3,1,1,1,2],'sam' :['A0','B0','C0','D0','E0','F0']}
Maindf = pd.DataFrame(data=maindf)
#what you are looking for
filterdf = {'PM Owner': ['A' ], 'Risk Tier': [ 1 ]}
Filterdf = pd.DataFrame(data=filterdf)
#Filtering
NewMaindf= (Maindf[Maindf[['PM Owner','Risk Tier']].astype(str).sum(axis = 1).isin(
Filterdf[['PM Owner','Risk Tier']].astype(str).sum(axis = 1))])
#Just one sample
print( (NewMaindf).sample())
#whole dataset after filtering
print( (NewMaindf) )
Result :
PM Owner Risk Tier sam
3 A 1 D0
PM Owner Risk Tier sam
0 A 1 A0
3 A 1 D0
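As an alternative to the string-concatenation trick (which can collide, e.g. 'A' + '11' and 'A1' + '1' both become 'A11'), an inner merge on the two key columns filters on the exact pair. A sketch using the same data:

```python
import pandas as pd

Maindf = pd.DataFrame({"PM Owner": ["A", "B", "C", "A", "E", "F"],
                       "Risk Tier": [1, 3, 1, 1, 1, 2],
                       "sam": ["A0", "B0", "C0", "D0", "E0", "F0"]})
Filterdf = pd.DataFrame({"PM Owner": ["A"], "Risk Tier": [1]})

# Keep only rows whose (PM Owner, Risk Tier) pair appears in Filterdf
NewMaindf = Maindf.merge(Filterdf, on=["PM Owner", "Risk Tier"])

# Just one random sample from the filtered rows
print(NewMaindf.sample(n=1))
```

The inner join compares the columns by value, so no type coercion to string is needed.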
Random selection of a row from a pandas DataFrame with weights
You should scale the weights so they match the expected distribution:
weights = {-1:0.1, 0:0.4, 1:0.5}
scaled_weights = (pd.Series(weights) / df.label.value_counts(normalize=True))
df.sample(n=1, weights=df.label.map(scaled_weights) )
Test distribution with 10000 samples
(df.sample(n=10000, replace=True, random_state=1,
weights=df.label.map(scaled_weights))
.label.value_counts(normalize=True)
)
Output:
1 0.5060
0 0.3979
-1 0.0961
Name: label, dtype: float64
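The snippet above assumes an existing df with a label column; a self-contained sketch with an invented label distribution:

```python
import pandas as pd

# Invented data: labels occur with frequencies 0.3 / 0.3 / 0.4
df = pd.DataFrame({"label": [-1] * 30 + [0] * 30 + [1] * 40})

weights = {-1: 0.1, 0: 0.4, 1: 0.5}
# Divide the target probabilities by the empirical frequencies
scaled_weights = pd.Series(weights) / df["label"].value_counts(normalize=True)

# Draw with replacement and check the resulting distribution
drawn = df.sample(n=5000, replace=True, random_state=1,
                  weights=df["label"].map(scaled_weights))
dist = drawn["label"].value_counts(normalize=True)
print(dist)
```

The sampled frequencies land close to the target 0.1 / 0.4 / 0.5, regardless of how unbalanced the original labels were.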
How to randomly select rows from a data set using pandas?
I think you can use sample - either 9k rows or 25% of the rows:
df.sample(n=9000)
Or:
df.sample(frac=0.25)
Another solution: create a random sample of the index with numpy.random.choice and then select with loc (the index has to be unique; replace=False avoids drawing the same row twice):
df = df.loc[np.random.choice(df.index, size=9000, replace=False)]
Solution if the index is not unique:
df = df.iloc[np.random.choice(np.arange(len(df)), size=9000, replace=False)]