Sample Each Group After Pandas Groupby

Sample each group after pandas groupby

Apply a lambda and call sample with param frac:

In [2]:
df = pd.DataFrame({'a': [1,2,3,4,5,6,7],
                   'b': [1,1,1,0,0,0,0]})

grouped = df.groupby('b')
grouped.apply(lambda x: x.sample(frac=0.3))

Out[2]:
     a  b
b        
0 6  7  0
1 2  3  1

Random sampling of groups after pandas groupby

Using iterrows from Pandas you can iterate over DataFrame rows as (index, Series) pairs, and get what you want:

new_df = df.groupby(['Nationality', 'Sex'], as_index=False).size()

for _, row in new_df.iterrows():
    print(df[(df.Nationality==row.Nationality)&(df.Sex==row.Sex)].sample(20))

Select sample random groups after groupby in pandas?

You can do with shuffle and ngroup

g = df.groupby(['col1', 'col2'])

a=np.arange(g.ngroups)
np.random.shuffle(a)

df[g.ngroup().isin(a[:2])]# change 2 to what you need :-)

Pandas sample different fractions for each group after groupby

You can dynamically return a random sample dataframe with different % of samples as defined per group. You can do this with percentages below 100% (see example 1) AND above 100% (see example 2) by passing replace=True:

Using np.select, create a new column c that returns the number of rows per group to be sampled randomly according to a 20%, 40%, etc. percentage that you set.
From there, you can sample x rows per group based off these percentage conditions. From these rows, return the .index of the rows and filter for the rows with .loc as well as columns 'a','b'. The code grouped.apply(lambda x: x['c'].sample(frac=x['c'].iloc[0])) creates a multiindex series of the output you are looking for, but it requires some cleanup. This is why for me it is just easier to grab the .index and filter the original dataframe with .loc, rather than try to clean up the messy multiindex series.

grouped = df.groupby('b', group_keys=False)
df['c'] = np.select([df['b'].eq(0), df['b'].eq(1)], [0.4, 0.2])
df.loc[grouped.apply(lambda x: x['c'].sample(frac=x['c'].iloc[0])).index, ['a','b']]
Out[1]: 
   a  b
6  7  0
8  9  0
3  4  1

If you would like to return a larger random sample using duplicates of the existing cvalues, simply pass replace=True. Then, do some cleanup to get the output.

grouped = df.groupby('b', group_keys=False)
v = df['b'].value_counts()
df['c'] = np.select([df['b'].eq(0), df['b'].eq(1)],
                    [int(v.loc[0] * 1.2), int(v.loc[1] * 2)]) #frac parameter doesn't work with sample when frac > 1, so we have to calcualte the integer value for number of rows to be sampled.
(grouped.apply(lambda x: x['b'].sample(x['c'].iloc[0], replace=True))
        .reset_index()
        .rename({'index' : 'a'}, axis=1))
Out[2]: 
    a  b
0   7  0
1   8  0
2   9  0
3   7  0
4   7  0
5   8  0
6   1  1
7   3  1
8   3  1
9   1  1
10  0  1
11  0  1
12  4  1
13  2  1
14  3  1
15  0  1

Sampling after groupby on each group in python

You can use pandas.Series.sample to get a random sample of each category and you can set the number of elements to be randomly distributed in 1 ... min(4, len(category)):

import random

def random_sample(x):
    n = random.randint(1, min(4, len(x)))
    return x.sample(n)

df.groupby("accountid").transdate.apply(random_sample)
# accountid    
# 112962     13    2018-07-01
#            14    2018-09-01
#            15    2018-08-01
# 114175     10    2018-09-01
#            11    2018-08-01
# 116490     2     2018-09-01
#            0     2018-10-01
#            3     2018-08-01
# 123033     5     2018-07-01
#            4     2018-10-01
#            7     2018-08-01

How do I Sample each group from a pandas data frame at different rates

Convert sample_info to dictionary. Group population by Group ID. Pass the sample size values to DataFrame.sample using the dictionary.

mapper = sample_info.set_index('Group ID')['Sample Size'].to_dict()

population.groupby('Group ID').apply(lambda x: x.sample(n=mapper.get(x.name))).reset_index(drop = True)

Get statistics for each group (such as count, mean, etc) using pandas GroupBy?

On groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:

df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])

Sample from each group in polars dataframe?

Let start with some dummy data:

n = 100
seed = 0
df = pl.DataFrame(
    {
        "groups": (pl.arange(0, n, eager=True) % 5).shuffle(seed=seed),
        "values": pl.arange(0, n, eager=True).shuffle(seed=seed)
    }
)
df

shape: (100, 2)
┌────────┬────────┐
│ groups ┆ values │
│ ---    ┆ ---    │
│ i64    ┆ i64    │
╞════════╪════════╡
│ 0      ┆ 55     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 0      ┆ 40     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2      ┆ 57     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4      ┆ 99     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ...    ┆ ...    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2      ┆ 87     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1      ┆ 96     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3      ┆ 43     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4      ┆ 44     │
└────────┴────────┘

This gives us 100 / 5, is 5 groups of 20 elements. Let's verify that:

df.groupby("groups").agg(pl.count())

shape: (5, 2)
┌────────┬───────┐
│ groups ┆ count │
│ ---    ┆ ---   │
│ i64    ┆ u32   │
╞════════╪═══════╡
│ 1      ┆ 20    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3      ┆ 20    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4      ┆ 20    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2      ┆ 20    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0      ┆ 20    │
└────────┴───────┘

Sample our data

Now we are going to use a window function to take a sample of our data.

df.filter(
    pl.arange(0, pl.count()).shuffle().over("groups") < 10
)

shape: (50, 2)
┌────────┬────────┐
│ groups ┆ values │
│ ---    ┆ ---    │
│ i64    ┆ i64    │
╞════════╪════════╡
│ 0      ┆ 85     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 0      ┆ 0      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4      ┆ 84     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4      ┆ 19     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ...    ┆ ...    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2      ┆ 87     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1      ┆ 96     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3      ┆ 43     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4      ┆ 44     │
└────────┴────────┘

For every group in over("group") the pl.arange(0, pl.count()) expression creates an index row. We then shuffle that range so that we take a sample and not a slice. Then we only want to take the index values that are lower than 10. This creates a boolean mask that we can pass to the filter method.

Sample Each Group After Pandas Groupby