Python: Random selection per group
import numpy as np

size = 2        # sample size
replace = True  # with replacement, so groups smaller than `size` still work
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace), :]
df.groupby('Group_Id', as_index=False).apply(fn)
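A runnable sketch of the snippet above; the toy frame below (Group_Id values and names) is invented for illustration, not taken from the question:

```python
import numpy as np
import pandas as pd

# toy data -- Group_Id and Name values are made up
df = pd.DataFrame({
    'Group_Id': [1, 1, 1, 2, 2, 3],
    'Name': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF'],
})

size = 2        # sample size
replace = True  # with replacement, so the single-row group 3 still yields 2 rows
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace), :]
sampled = df.groupby('Group_Id', as_index=False).apply(fn)
```

Because sampling is with replacement, every group contributes exactly `size` rows, even groups with fewer than `size` members.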
Random sampling of groups after pandas groupby
Using iterrows from pandas, you can iterate over DataFrame rows as (index, Series) pairs and get what you want:
new_df = df.groupby(['Nationality', 'Sex'], as_index=False).size()
for _, row in new_df.iterrows():
    print(df[(df.Nationality == row.Nationality) & (df.Sex == row.Sex)].sample(20))
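As a runnable sketch with a tiny invented frame (the sample size is dropped from 20 to 1 so the toy data is large enough):

```python
import pandas as pd

# invented toy data
df = pd.DataFrame({
    'Nationality': ['US', 'US', 'US', 'UK', 'UK', 'UK'],
    'Sex': ['F', 'F', 'M', 'F', 'M', 'M'],
    'Value': range(6),
})

# one row per (Nationality, Sex) combination, with the group size
new_df = df.groupby(['Nationality', 'Sex'], as_index=False).size()
for _, row in new_df.iterrows():
    subset = df[(df.Nationality == row.Nationality) & (df.Sex == row.Sex)]
    print(subset.sample(1))  # n=1 here; the original uses 20 on a larger frame
```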
pandas: groupby two columns and get random selection of groups such that each value in the first column will be represented by a single group
Another option is to shuffle your two columns with sample and drop_duplicates by col1, so that you keep only one (col1, col2) pair per col1 value. Then merge the result with df to select all the rows carrying these pairs.
print(df.merge(df[['col1','col2']].sample(frac=1).drop_duplicates('col1')))
col1 col2 val
0 b s 7
1 b s 9
2 b s 11
3 a s 8
4 a s 10
Or with groupby and sample, much the same idea, but selecting only one row per col1 value before merging:
df.merge(df[['col1','col2']].groupby('col1').sample(n=1))
EDIT: to get both the selected rows and the other rows, use the indicator parameter in a left merge, then query each part separately:
m = df.merge(df[['col1','col2']].groupby('col1').sample(1), how='left', indicator=True)
print(m)
select_ = m.query('_merge=="both"')[df.columns]
print(select_)
comp_ = m.query('_merge=="left_only"')[df.columns]
print(comp_)
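Putting the merge-with-indicator idea together as a runnable sketch; the toy frame below is an assumption loosely modeled on the printed output above:

```python
import pandas as pd

# invented toy data loosely modeled on the printed output
df = pd.DataFrame({
    'col1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'col2': ['s', 's', 't', 's', 's', 't'],
    'val':  [8, 10, 12, 7, 9, 11],
})

# keep one random (col1, col2) pair per col1 value
pairs = df[['col1', 'col2']].groupby('col1').sample(n=1)

# left merge with an indicator column, then split selected vs. remaining rows
m = df.merge(pairs, how='left', indicator=True)
select_ = m.query('_merge == "both"')[df.columns]
comp_ = m.query('_merge == "left_only"')[df.columns]
```

Every row of df lands in exactly one of the two partitions, and each col1 value is represented in the selection.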
Python - Pandas random sampling per group
IIUC, the issue is that you do not want to group by the column image name, but if that column is not included in the groupby, you will lose it. You can first create the groupby object:
gb = df.groupby(['type', 'Class'])
Now you can iterate over the groupby blocks using a list comprehension:
blocks = [data.sample(n=1) for _,data in gb]
Now you can concatenate the blocks to reconstruct your randomly sampled DataFrame:
pd.concat(blocks)
Output
Class Value2 image name type
7 A 0.817744 image02 long
17 B 0.199844 image01 long
4 A 0.462691 image01 short
11 B 0.831104 image02 short
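The whole block approach can be sketched end to end on an invented frame with the same columns (all values are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# invented frame mirroring the columns in the question
df = pd.DataFrame({
    'type': ['long'] * 4 + ['short'] * 4,
    'Class': ['A', 'A', 'B', 'B'] * 2,
    'image name': ['image01', 'image02'] * 4,
    'Value2': rng.random(8),
})

gb = df.groupby(['type', 'Class'])
blocks = [data.sample(n=1) for _, data in gb]  # one random row per (type, Class)
sampled = pd.concat(blocks)
```

Because each block keeps every column, the image name column survives without ever being part of the grouping key.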
OR
You can modify your code and keep image name by including it in the columns selected after the groupby, like this:
df.groupby(['type', 'Class'])[['Value2','image name']].apply(lambda s: s.sample(min(len(s),2)))
Value2 image name
type Class
long A 8 0.777962 image01
9 0.757983 image01
B 19 0.100702 image02
15 0.117642 image02
short A 3 0.465239 image02
2 0.460148 image02
B 10 0.934829 image02
11 0.831104 image02
EDIT: Keeping the same image per group
I'm not sure you can avoid an iterative process for this problem. You can loop over the groupby blocks, filter each group down to one randomly chosen image (so the image name stays the same within the group), and then randomly sample rows from that image, like this:
import random
gb = df.groupby(['Class','type'])
ls = []
for index, frame in gb:
    ls.append(frame[frame['image name'] == random.choice(frame['image name'].unique())].sample(n=2))
pd.concat(ls)
Output
Class Value2 image name type
6 A 0.850445 image02 long
7 A 0.817744 image02 long
4 A 0.462691 image01 short
0 A 0.444939 image01 short
19 B 0.100702 image02 long
15 B 0.117642 image02 long
10 B 0.934829 image02 short
14 B 0.721535 image02 short
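A runnable sketch of this loop on an invented frame with the same columns; n=1 is used instead of n=2 so that every image in the toy data has enough rows:

```python
import random

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# invented frame with the same columns; values are made up
df = pd.DataFrame({
    'Class': ['A'] * 6 + ['B'] * 6,
    'type': ['long', 'long', 'long', 'short', 'short', 'short'] * 2,
    'image name': ['image01', 'image02'] * 6,
    'Value2': rng.random(12),
})

gb = df.groupby(['Class', 'type'])
ls = []
for index, frame in gb:
    # pick one image at random, keep only its rows, then sample within them
    img = random.choice(frame['image name'].unique())
    ls.append(frame[frame['image name'] == img].sample(n=1))
result = pd.concat(ls)
```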
Randomly selecting a subset of rows from a pandas dataframe based on existing column values
I hope this code snippet will work for you
samples = []
for group in df.GroupID.unique():
    s = df.loc[df.GroupID == group].sample(n=1).reset_index(drop=True)
    samples.append(s)
sample = pd.concat(samples, axis=0)
The code takes each 'GroupID' and samples one observation from that subgroup, then concatenates the subsamples (one per GroupID) into the final sample.
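For a runnable sketch, assume a toy frame with a GroupID column (the values are invented):

```python
import pandas as pd

# invented toy data
df = pd.DataFrame({
    'GroupID': [1, 1, 2, 2, 3],
    'value': [10, 20, 30, 40, 50],
})

samples = []
for group in df.GroupID.unique():
    # one random observation per GroupID
    s = df.loc[df.GroupID == group].sample(n=1).reset_index(drop=True)
    samples.append(s)
sample = pd.concat(samples, axis=0)
```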
How to randomly select fixed number of rows (if greater) per group else select all rows in pandas?
You can choose to sample only if the group has more than n rows:
n = 2
(df.groupby('Group_Id')
   .apply(lambda x: x.sample(n) if len(x) > n else x)
   .reset_index(drop=True)
)
You can also try shuffling the whole data and using groupby().head():
df.sample(frac=1).groupby('Group_Id').head(2)
Output:
Name Group_Id
5 DEF 3
0 AAA 1
2 BDF 1
3 CCC 2
4 XYZ 2
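Both variants can be sketched on toy data reconstructed from the output above (the name BBB is invented to give group 1 a third row):

```python
import pandas as pd

# data loosely reconstructed from the output; 'BBB' is an assumption
df = pd.DataFrame({
    'Name': ['AAA', 'BBB', 'BDF', 'CCC', 'XYZ', 'DEF'],
    'Group_Id': [1, 1, 1, 2, 2, 3],
})

n = 2
# sample only groups larger than n; smaller groups are kept whole
out1 = (df.groupby('Group_Id')
          .apply(lambda x: x.sample(n) if len(x) > n else x)
          .reset_index(drop=True))

# shuffle everything, then keep the first 2 rows of each group
out2 = df.sample(frac=1).groupby('Group_Id').head(2)
```

Group 1 (3 rows) loses one row; groups 2 and 3 are kept in full, so both results have 5 rows.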
Randomly select 50% of records from 3 different groups for A/B test
Use groupby.sample to choose 50% of the records per group, then assign the labels with np.where:
control = df.groupby('Buyer Intent').sample(frac=0.5).index
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')
# ID Buyer Intent Email Group
# 0 1 Low Intent john@gmail.com Control
# 1 2 Medium Intent jane@gmail.com Control
# 2 3 Medium Intent tom@gmail.com Treatment
# 3 4 Low Intent sara@gmail.com Treatment
# 4 5 High Intent mich@gmail.com Control
# 5 6 High Intent sall@gmail.com Treatment
Note that groupby.sample already randomizes; per the docs, it will "Return a random sample of items from each group." But to shuffle explicitly, you can add DataFrame.sample with frac=1:
# shuffle df
df = df.sample(frac=1)
# same as before
control = df.groupby('Buyer Intent').sample(frac=0.5).index
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')
If you don't have groupby.sample (pandas < 1.1.0), try groupby.apply + DataFrame.sample; the inner level of the resulting MultiIndex holds the original row labels:
control = df.groupby('Buyer Intent').apply(lambda g: g.sample(frac=0.5)).index.get_level_values(-1)
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')
Or groupby.apply + np.random.choice (replace=False so a row is not picked twice):
control = np.concatenate(df.groupby('Buyer Intent').apply(lambda g: np.random.choice(g.index, len(g) // 2, replace=False)).to_numpy())
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')
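A self-contained sketch of the groupby.sample approach, reusing the example rows shown above (pandas >= 1.1 assumed):

```python
import numpy as np
import pandas as pd

# rows taken from the example output above
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Buyer Intent': ['Low Intent', 'Medium Intent', 'Medium Intent',
                     'Low Intent', 'High Intent', 'High Intent'],
    'Email': ['john@gmail.com', 'jane@gmail.com', 'tom@gmail.com',
              'sara@gmail.com', 'mich@gmail.com', 'sall@gmail.com'],
})

# half of each intent group goes to Control, the rest to Treatment
control = df.groupby('Buyer Intent').sample(frac=0.5).index
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')
```

With two records per intent level, each level contributes exactly one row to Control and one to Treatment.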
Pandas: select rows by random groups while keeping all of the group's variables
You need to create random ids first and then compare the original column id with Series.isin in boolean indexing:
#number of groups
N = 2
df2 = df1[df1['id'].isin(df1['id'].drop_duplicates().sample(N))]
print (df2)
id std number
0 A 1.0 1
1 A 0.0 12
5 C 134.0 90
6 C 1234.0 100
7 C 12345.0 111
Or:
N = 2
df2 = df1[df1['id'].isin(np.random.choice(df1['id'].unique(), N))]
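A runnable sketch with a made-up df1 mirroring the columns in the output above:

```python
import pandas as pd

# invented toy data mirroring the columns above
df1 = pd.DataFrame({
    'id': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'number': [1, 12, 2, 3, 4, 90, 100, 111],
})

N = 2  # number of groups to keep
# sample N distinct ids, then keep every row whose id was drawn
df2 = df1[df1['id'].isin(df1['id'].drop_duplicates().sample(N))]
```

Because the filter is on the id column, each selected group is kept whole, with all of its rows and columns.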