Sample each group after pandas groupby
Apply a lambda that calls sample with the frac parameter:
In [2]:
import pandas as pd

df = pd.DataFrame({'a': [1,2,3,4,5,6,7],
                   'b': [1,1,1,0,0,0,0]})
grouped = df.groupby('b')
# frac=0.3 keeps roughly 30% of the rows in each group
grouped.apply(lambda x: x.sample(frac=0.3))
Out[2]:
     a  b
b
0 6  7  0
1 2  3  1
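Recent pandas versions also expose sample directly on the groupby object, which avoids the lambda and the extra index level. A minimal sketch, assuming pandas 1.1 or later; the random_state value is an arbitrary choice made here for reproducibility and is not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

# Sample about 30% of the rows within each group of 'b'.
# random_state=0 is an arbitrary seed (an assumption, not from the answer above).
sampled = df.groupby('b').sample(frac=0.3, random_state=0)

# The result keeps the original row index, not a (group, row) MultiIndex.
print(sampled)
```

With these group sizes (4 and 3 rows), a fraction of 0.3 rounds to one row per group.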
Random sampling of groups after pandas groupby
Using iterrows from Pandas you can iterate over DataFrame rows as (index, Series) pairs, and get what you want:
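# One row per (Nationality, Sex) combination
new_df = df.groupby(['Nationality', 'Sex'], as_index=False).size()
for _, row in new_df.iterrows():
    # sample(20) raises a ValueError if a group has fewer than 20 rows
    print(df[(df.Nationality == row.Nationality) & (df.Sex == row.Sex)].sample(20))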
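The same per-group sampling can be written by iterating over the groupby object itself. The sketch below is one possible variant, not the original author's code: the data is hypothetical, n is set to 2 so the toy frame is large enough, and the sample size is capped at the group size so that small groups do not raise an error.

```python
import pandas as pd

# Hypothetical data; the column names follow the answer above.
df = pd.DataFrame({
    'Nationality': ['FR'] * 5 + ['DE'] * 3 + ['FR'] * 2,
    'Sex': ['F', 'M', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'M'],
    'Age': range(20, 30),
})

n = 2  # rows wanted per group (the answer above uses 20)

# Iterating over the groupby yields (key, sub-DataFrame) pairs.
# min(len(g), n) avoids the ValueError raised when a group is smaller than n.
samples = [g.sample(n=min(len(g), n), random_state=0)
           for _, g in df.groupby(['Nationality', 'Sex'])]
result = pd.concat(samples)
print(result)
```

Collecting the pieces with pd.concat gives a single DataFrame instead of one printed frame per group.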
Select sample random groups after groupby in pandas?
You can do this with np.random.shuffle and ngroup:
g = df.groupby(['col1', 'col2'])
a=np.arange(g.ngroups)
np.random.shuffle(a)
df[g.ngroup().isin(a[:2])]  # change 2 to the number of groups you need
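A self-contained sketch of the same idea, using hypothetical data and NumPy's Generator API to pick the group numbers. The seed and the column values are assumptions made for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical data with four (col1, col2) groups of two rows each.
df = pd.DataFrame({
    'col1': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'a'],
    'col2': ['x', 'x', 'y', 'x', 'y', 'y', 'x', 'y'],
    'val': range(8),
})

g = df.groupby(['col1', 'col2'])
rng = np.random.default_rng(0)  # arbitrary seed, chosen here for reproducibility

# Pick 2 distinct group numbers out of 0 .. g.ngroups - 1.
chosen = rng.choice(g.ngroups, size=2, replace=False)

# ngroup() labels every row with its group number; keep rows of the chosen groups.
subset = df[g.ngroup().isin(chosen)]
print(subset)
```

Every row of a chosen group is kept, so this selects whole groups rather than rows within groups.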
Pandas sample different fractions for each group after groupby
You can draw a random sample with a different percentage for each group. This works for percentages below 100% (see example 1) and, by passing replace=True, above 100% (see example 2):
- Using np.select, create a new column c that holds, for each row, how much of its group to sample: a fraction such as 20% or 40% in example 1, or a number of rows in example 2.
- From there, sample each group according to that value, take the .index of the sampled rows, and filter the original dataframe with .loc, keeping columns 'a' and 'b'.
The code grouped.apply(lambda x: x['c'].sample(frac=x['c'].iloc[0])) creates a multiindex series of the output you are looking for, but it requires some cleanup. This is why, for me, it is easier to grab the .index and filter the original dataframe with .loc rather than try to clean up the messy multiindex series.
grouped = df.groupby('b', group_keys=False)
df['c'] = np.select([df['b'].eq(0), df['b'].eq(1)], [0.4, 0.2])
df.loc[grouped.apply(lambda x: x['c'].sample(frac=x['c'].iloc[0])).index, ['a','b']]
Out[1]:
   a  b
6  7  0
8  9  0
3  4  1
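An alternative that skips the helper column is to keep the fractions in a dictionary and sample group by group. This is a sketch under assumptions: the ten-row frame below is hypothetical (five rows per group), and the fractions dictionary and seed are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical data: ten rows, five per group.
df = pd.DataFrame({'a': range(1, 11),
                   'b': [1] * 5 + [0] * 5})

# Fraction of each group to keep, keyed by the value of 'b'.
fractions = {0: 0.4, 1: 0.2}

# Sample every group with its own fraction, then stitch the pieces together.
parts = [g.sample(frac=fractions[key], random_state=0)
         for key, g in df.groupby('b')]
result = pd.concat(parts)
print(result)
```

With five rows per group this yields two rows for b == 0 and one row for b == 1, and the result already has the original index and columns.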
If you would like to return a larger random sample that contains duplicates of the existing rows, simply pass replace=True. Then, do some cleanup to get the output.
grouped = df.groupby('b', group_keys=False)
v = df['b'].value_counts()
# The frac parameter did not work with sample for frac > 1 here, so we
# calculate the integer number of rows to sample per group instead
# (120% and 200% of the group size).
df['c'] = np.select([df['b'].eq(0), df['b'].eq(1)],
                    [int(v.loc[0] * 1.2), int(v.loc[1] * 2)])
(grouped.apply(lambda x: x['b'].sample(x['c'].iloc[0], replace=True))
        .reset_index()
        .rename({'index' : 'a'}, axis=1))
Out[2]:
    a  b
0   7  0
1   8  0
2   9  0
3   7  0
4   7  0
5   8  0
6   1  1
7   3  1
8   3  1
9   1  1
10  0  1
11  0  1
12  4  1
13  2  1
14  3  1
15  0  1
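Oversampling can also be sketched without the helper column by computing the row count per group from a dictionary of multipliers. The data, the multipliers dictionary and the seed below are assumptions for illustration; the 120% and 200% targets mirror the example above.

```python
import pandas as pd

# Hypothetical data: ten rows, five per group.
df = pd.DataFrame({'a': range(1, 11),
                   'b': [1] * 5 + [0] * 5})

# Target size of each sample relative to its group size (120% and 200%).
multipliers = {0: 1.2, 1: 2.0}

# replace=True allows the same row to be drawn more than once,
# which is required whenever the sample is larger than the group.
parts = [g.sample(n=int(len(g) * multipliers[key]), replace=True, random_state=0)
         for key, g in df.groupby('b')]
result = pd.concat(parts).reset_index(drop=True)
print(result)
```

Here the sampled rows keep their own values in columns a and b, so no renaming step is needed afterwards.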
Sampling after groupby on each group in python
You can use pandas.Series.sample to get a random sample of each category, with the number of sampled elements drawn at random from 1 ... min(4, len(category)):
import random

def random_sample(x):
    n = random.randint(1, min(4, len(x)))
    return x.sample(n)

df.groupby("accountid").transdate.apply(random_sample)
# accountid
# 112962     13    2018-07-01
#            14    2018-09-01
#            15    2018-08-01
# 114175     10    2018-09-01
#            11    2018-08-01
# 116490     2     2018-09-01
#            0     2018-10-01
#            3     2018-08-01
# 123033     5     2018-07-01
#            4     2018-10-01
#            7     2018-08-01
How do I Sample each group from a pandas data frame at different rates
Convert sample_info to a dictionary, group population by Group ID, and pass the sample size for each group to DataFrame.sample using the dictionary.
mapper = sample_info.set_index('Group ID')['Sample Size'].to_dict()
population.groupby('Group ID').apply(lambda x: x.sample(n=mapper.get(x.name))).reset_index(drop = True)
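To make the shape of the inputs concrete, here is a small sketch with hypothetical population and sample_info frames. The values are invented for illustration, and the per-group loop is a variant of the apply call above rather than the original code:

```python
import pandas as pd

# Hypothetical inputs; the column names follow the answer above.
population = pd.DataFrame({
    'Group ID': ['A'] * 4 + ['B'] * 3 + ['C'] * 5,
    'value': range(12),
})
sample_info = pd.DataFrame({'Group ID': ['A', 'B', 'C'],
                            'Sample Size': [2, 1, 3]})

# Group ID -> number of rows to draw from that group.
mapper = sample_info.set_index('Group ID')['Sample Size'].to_dict()

# Sample each group with its own n, looked up by group key.
sampled = pd.concat(
    [g.sample(n=mapper[key], random_state=0)
     for key, g in population.groupby('Group ID')],
    ignore_index=True,
)
print(sampled)
```

Using mapper[key] instead of mapper.get(key) makes a missing group fail loudly with a KeyError rather than passing None to sample.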
Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
On a groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:
df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
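A runnable sketch of that call on hypothetical data, showing how the resulting columns are addressed. The values are invented for illustration:

```python
import pandas as pd

# Hypothetical data: two grouping columns and two numeric columns.
df = pd.DataFrame({
    'col1': ['a', 'a', 'b', 'b'],
    'col2': ['x', 'x', 'y', 'y'],
    'col3': [1.0, 2.0, 3.0, 5.0],
    'col4': [10, 20, 30, 50],
})

# Apply both aggregations to every selected column.
stats = df.groupby(['col1', 'col2'])[['col3', 'col4']].agg(['mean', 'count'])
print(stats)

# The columns form a MultiIndex: pick one statistic with a (column, function) tuple.
mean_col3 = stats[('col3', 'mean')]
```

Passing a list to agg produces one output column per (input column, function) pair.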
Sample from each group in polars dataframe?
Let's start with some dummy data:
import polars as pl

n = 100
seed = 0
df = pl.DataFrame(
    {
        "groups": (pl.arange(0, n, eager=True) % 5).shuffle(seed=seed),
        "values": pl.arange(0, n, eager=True).shuffle(seed=seed)
    }
)
df
shape: (100, 2)
┌────────┬────────┐
│ groups ┆ values │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════════╪════════╡
│ 0 ┆ 55 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 0 ┆ 40 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 57 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 99 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 87 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 96 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ 43 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 44 │
└────────┴────────┘
This gives us 100 / 5 = 20, that is, 5 groups of 20 elements each. Let's verify that:
df.groupby("groups").agg(pl.count())
shape: (5, 2)
┌────────┬───────┐
│ groups ┆ count │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞════════╪═══════╡
│ 1 ┆ 20 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 20 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4 ┆ 20 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 20 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0 ┆ 20 │
└────────┴───────┘
Sample our data
Now we are going to use a window function to take a sample of our data.
df.filter(
    pl.arange(0, pl.count()).shuffle().over("groups") < 10
)
shape: (50, 2)
┌────────┬────────┐
│ groups ┆ values │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════════╪════════╡
│ 0 ┆ 85 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 0 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 84 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 19 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 87 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 96 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ 43 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 44 │
└────────┴────────┘
For every group in over("groups"), the pl.arange(0, pl.count()) expression creates a row index. We then shuffle that range so that we take a sample and not a slice, and keep only the index values that are lower than 10. This creates a boolean mask that we can pass to the filter method.