Python: Random selection per group
import numpy as np

size = 2        # sample size
replace = True  # with replacement, so groups smaller than `size` still work
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace), :]
df.groupby('Group_Id', as_index=False).apply(fn)
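A runnable sketch of the snippet above; the toy frame below (Group_Id values and names) is invented for illustration, not taken from the question:

```python
import numpy as np
import pandas as pd

# toy data -- Group_Id and Name values are made up
df = pd.DataFrame({
    'Group_Id': [1, 1, 1, 2, 2, 3],
    'Name': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF'],
})

size = 2        # sample size
replace = True  # with replacement, so the single-row group 3 still yields 2 rows
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace), :]
sampled = df.groupby('Group_Id', as_index=False).apply(fn)
```

Because sampling is with replacement, every group contributes exactly `size` rows, even groups with fewer than `size` members.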
Random sampling of groups after pandas groupby
Using iterrows from pandas, you can iterate over DataFrame rows as (index, Series) pairs and get what you want:
new_df = df.groupby(['Nationality', 'Sex'], as_index=False).size()
for _, row in new_df.iterrows():
    print(df[(df.Nationality == row.Nationality) & (df.Sex == row.Sex)].sample(20))
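As a runnable sketch with a tiny invented frame (the sample size is dropped from 20 to 1 so the toy data is large enough):

```python
import pandas as pd

# invented toy data
df = pd.DataFrame({
    'Nationality': ['US', 'US', 'US', 'UK', 'UK', 'UK'],
    'Sex': ['F', 'F', 'M', 'F', 'M', 'M'],
    'Value': range(6),
})

# one row per (Nationality, Sex) combination, with the group size
new_df = df.groupby(['Nationality', 'Sex'], as_index=False).size()
for _, row in new_df.iterrows():
    subset = df[(df.Nationality == row.Nationality) & (df.Sex == row.Sex)]
    print(subset.sample(1))  # n=1 here; the original uses 20 on a larger frame
```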
pandas: groupby two columns and get random selection of groups such that each value in the first column will be represented by a single group
Another option is to shuffle your two columns with sample and drop_duplicates by col1, so that you keep only one (col1, col2) pair per col1 value. Then merge the result with df to select all the rows carrying these pairs.
print(df.merge(df[['col1','col2']].sample(frac=1).drop_duplicates('col1')))
col1 col2 val
0 b s 7
1 b s 9
2 b s 11
3 a s 8
4 a s 10
Or with groupby and sample, much the same idea, but selecting only one row per col1 value before merging:
df.merge(df[['col1','col2']].groupby('col1').sample(n=1))
EDIT: to get both the selected rows and the other rows, use the indicator parameter in a left merge, then query each part separately:
m = df.merge(df[['col1','col2']].groupby('col1').sample(1), how='left', indicator=True)
print(m)
select_ = m.query('_merge=="both"')[df.columns]
print(select_)
comp_ = m.query('_merge=="left_only"')[df.columns]
print(comp_)
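Putting the merge-with-indicator idea together as a runnable sketch; the toy frame below is an assumption loosely modeled on the printed output above:

```python
import pandas as pd

# invented toy data loosely modeled on the printed output
df = pd.DataFrame({
    'col1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'col2': ['s', 's', 't', 's', 's', 't'],
    'val':  [8, 10, 12, 7, 9, 11],
})

# keep one random (col1, col2) pair per col1 value
pairs = df[['col1', 'col2']].groupby('col1').sample(n=1)

# left merge with an indicator column, then split selected vs. remaining rows
m = df.merge(pairs, how='left', indicator=True)
select_ = m.query('_merge == "both"')[df.columns]
comp_ = m.query('_merge == "left_only"')[df.columns]
```

Every row of df lands in exactly one of the two partitions, and each col1 value is represented in the selection.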
Python - Pandas random sampling per group
IIUC, the issue is that you do not want to group by the column image name, but if that column is not included in the groupby, you will lose it. You can first create the groupby object:
gb = df.groupby(['type', 'Class'])
Now you can iterate over the groupby blocks using a list comprehension:
blocks = [data.sample(n=1) for _,data in gb]
Now you can concatenate the blocks to reconstruct your randomly sampled DataFrame:
pd.concat(blocks)
Output
Class Value2 image name type
7 A 0.817744 image02 long
17 B 0.199844 image01 long
4 A 0.462691 image01 short
11 B 0.831104 image02 short
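The whole block approach can be sketched end to end on an invented frame with the same columns (all values are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# invented frame mirroring the columns in the question
df = pd.DataFrame({
    'type': ['long'] * 4 + ['short'] * 4,
    'Class': ['A', 'A', 'B', 'B'] * 2,
    'image name': ['image01', 'image02'] * 4,
    'Value2': rng.random(8),
})

gb = df.groupby(['type', 'Class'])
blocks = [data.sample(n=1) for _, data in gb]  # one random row per (type, Class)
sampled = pd.concat(blocks)
```

Because each block keeps every column, the image name column survives without ever being part of the grouping key.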
OR
You can modify your code and keep image name by including it in the columns selected after the groupby, like this:
df.groupby(['type', 'Class'])[['Value2','image name']].apply(lambda s: s.sample(min(len(s),2)))
Value2 image name
type Class
long A 8 0.777962 image01
9 0.757983 image01
B 19 0.100702 image02
15 0.117642 image02
short A 3 0.465239 image02
2 0.460148 image02
B 10 0.934829 image02
11 0.831104 image02
EDIT: Keeping the same image per group
I'm not sure you can avoid an iterative process for this problem. You can loop over the groupby blocks, filter each group down to one randomly chosen image (so the image name stays the same within the group), and then randomly sample rows from that image, like this:
import random
gb = df.groupby(['Class','type'])
ls = []
for index, frame in gb:
    ls.append(frame[frame['image name'] == random.choice(frame['image name'].unique())].sample(n=2))
pd.concat(ls)
Output
Class Value2 image name type
6 A 0.850445 image02 long
7 A 0.817744 image02 long
4 A 0.462691 image01 short
0 A 0.444939 image01 short
19 B 0.100702 image02 long
15 B 0.117642 image02 long
10 B 0.934829 image02 short
14 B 0.721535 image02 short
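A runnable sketch of this loop on an invented frame with the same columns; n=1 is used instead of n=2 so that every image in the toy data has enough rows:

```python
import random

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# invented frame with the same columns; values are made up
df = pd.DataFrame({
    'Class': ['A'] * 6 + ['B'] * 6,
    'type': ['long', 'long', 'long', 'short', 'short', 'short'] * 2,
    'image name': ['image01', 'image02'] * 6,
    'Value2': rng.random(12),
})

gb = df.groupby(['Class', 'type'])
ls = []
for index, frame in gb:
    # pick one image at random, keep only its rows, then sample within them
    img = random.choice(frame['image name'].unique())
    ls.append(frame[frame['image name'] == img].sample(n=1))
result = pd.concat(ls)
```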
Randomly selecting a subset of rows from a pandas dataframe based on existing column values
I hope this code snippet will work for you
samples = []
for group in df.GroupID.unique():
    s = df.loc[df.GroupID == group].sample(n=1).reset_index(drop=True)
    samples.append(s)
sample = pd.concat(samples, axis=0)
The code takes each 'GroupID' and samples one observation from that subgroup, then concatenates the subsamples (one per GroupID) into the final sample.
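For a runnable sketch, assume a toy frame with a GroupID column (the values are invented):

```python
import pandas as pd

# invented toy data
df = pd.DataFrame({
    'GroupID': [1, 1, 2, 2, 3],
    'value': [10, 20, 30, 40, 50],
})

samples = []
for group in df.GroupID.unique():
    # one random observation per GroupID
    s = df.loc[df.GroupID == group].sample(n=1).reset_index(drop=True)
    samples.append(s)
sample = pd.concat(samples, axis=0)
```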
How to randomly select fixed number of rows (if greater) per group else select all rows in pandas?
You can choose to sample only if the group has more than n rows:
n = 2
(df.groupby('Group_Id')
   .apply(lambda x: x.sample(n) if len(x) > n else x)
   .reset_index(drop=True)
)
You can also try shuffling the whole data and using groupby().head():
df.sample(frac=1).groupby('Group_Id').head(2)
Output:
Name Group_Id
5 DEF 3
0 AAA 1
2 BDF 1
3 CCC 2
4 XYZ 2
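Both variants can be sketched on toy data reconstructed from the output above (the name BBB is invented to give group 1 a third row):

```python
import pandas as pd

# data loosely reconstructed from the output; 'BBB' is an assumption
df = pd.DataFrame({
    'Name': ['AAA', 'BBB', 'BDF', 'CCC', 'XYZ', 'DEF'],
    'Group_Id': [1, 1, 1, 2, 2, 3],
})

n = 2
# sample only groups larger than n; smaller groups are kept whole
out1 = (df.groupby('Group_Id')
          .apply(lambda x: x.sample(n) if len(x) > n else x)
          .reset_index(drop=True))

# shuffle everything, then keep the first 2 rows of each group
out2 = df.sample(frac=1).groupby('Group_Id').head(2)
```

Group 1 (3 rows) loses one row; groups 2 and 3 are kept in full, so both results have 5 rows.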
Randomly select 50% of records from 3 different groups for A/B test
Use groupby.sample to choose 50% of the records per group, then assign the labels with np.where:
control = df.groupby('Buyer Intent').sample(frac=0.5).index
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')
# ID Buyer Intent Email Group
# 0 1 Low Intent john@gmail.com Control
# 1 2 Medium Intent jane@gmail.com Control
# 2 3 Medium Intent tom@gmail.com Treatment
# 3 4 Low Intent sara@gmail.com Treatment
# 4 5 High Intent mich@gmail.com Control
# 5 6 High Intent sall@gmail.com Treatment
Note that groupby.sample already randomizes; per the docs, it will "Return a random sample of items from each group." But to shuffle explicitly, you can add DataFrame.sample with frac=1:
# shuffle df
df = df.sample(frac=1)
# same as before
control = df.groupby('Buyer Intent').sample(frac=0.5).index
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')
If you don't have groupby.sample (pandas < 1.1.0), try groupby.apply + DataFrame.sample; the inner level of the resulting MultiIndex holds the original row labels:
control = df.groupby('Buyer Intent').apply(lambda g: g.sample(frac=0.5)).index.get_level_values(-1)
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')
Or groupby.apply + np.random.choice (replace=False so a row is not picked twice):
control = np.concatenate(df.groupby('Buyer Intent').apply(lambda g: np.random.choice(g.index, len(g) // 2, replace=False)).to_numpy())
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')
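A self-contained sketch of the groupby.sample approach, reusing the example rows shown above (pandas >= 1.1 assumed):

```python
import numpy as np
import pandas as pd

# rows taken from the example output above
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Buyer Intent': ['Low Intent', 'Medium Intent', 'Medium Intent',
                     'Low Intent', 'High Intent', 'High Intent'],
    'Email': ['john@gmail.com', 'jane@gmail.com', 'tom@gmail.com',
              'sara@gmail.com', 'mich@gmail.com', 'sall@gmail.com'],
})

# half of each intent group goes to Control, the rest to Treatment
control = df.groupby('Buyer Intent').sample(frac=0.5).index
df['Group'] = np.where(df.index.isin(control), 'Control', 'Treatment')
```

With two records per intent level, each level contributes exactly one row to Control and one to Treatment.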
Pandas: select rows by random groups while keeping all of the group's variables
You need to create random ids first and then compare the original column id with Series.isin in boolean indexing:
#number of groups
N = 2
df2 = df1[df1['id'].isin(df1['id'].drop_duplicates().sample(N))]
print (df2)
id std number
0 A 1.0 1
1 A 0.0 12
5 C 134.0 90
6 C 1234.0 100
7 C 12345.0 111
Or:
N = 2
df2 = df1[df1['id'].isin(np.random.choice(df1['id'].unique(), N))]
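A runnable sketch with a made-up df1 mirroring the columns in the output above:

```python
import pandas as pd

# invented toy data mirroring the columns above
df1 = pd.DataFrame({
    'id': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'number': [1, 12, 2, 3, 4, 90, 100, 111],
})

N = 2  # number of groups to keep
# sample N distinct ids, then keep every row whose id was drawn
df2 = df1[df1['id'].isin(df1['id'].drop_duplicates().sample(N))]
```

Because the filter is on the id column, each selected group is kept whole, with all of its rows and columns.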