Take Random Sample by Group

Try this:

library(plyr)
ddply(df, .(ID), function(x) x[sample(nrow(x), 500), ])
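For readers on the pandas side, the same fixed-n-per-group draw can be sketched with DataFrameGroupBy.sample; the frame and the smaller n here are hypothetical, chosen only for illustration:

```python
import pandas as pd

# hypothetical data: two IDs with four rows each
df = pd.DataFrame({"ID": ["a"] * 4 + ["b"] * 4, "x": range(8)})

# draw a fixed number of rows from every ID group,
# the pandas analogue of the ddply call above
out = df.groupby("ID").sample(n=2, random_state=0)
```

As with the plyr version, this raises if any group has fewer rows than the requested n.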

Random Sample by Group and Specific Probability Distribution for Groups

I believe this might work. Join the frequency tibble to your date tibble, filter for the month of interest, then compute a revised frequency from the day-of-week frequency, adjusting for how many times that weekday occurs in that month. Finally, pass the new frequency to slice_sample() as weight_by (the weights here sum to 1, though slice_sample() would standardize them to sum to 1 anyway).

library(tidyverse)

set.seed(123)

dt2021 %>%
  filter(month == 1) %>%
  left_join(dtWkDays) %>%
  group_by(wkDay) %>%
  mutate(newFreq = Freq / n()) %>%
  ungroup() %>%
  slice_sample(n = 1000, weight_by = newFreq, replace = TRUE) %>%
  count(wkDay)

Output

  wkDay         n
  <chr>     <int>
1 Friday      312
2 Monday       81
3 Saturday    320
4 Sunday       10
5 Thursday    120
6 Tuesday      62
7 Wednesday    95
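The pandas analogue of slice_sample(weight_by = ...) is DataFrame.sample with its weights argument; a minimal sketch, assuming hypothetical weekday frequencies in place of the newFreq column above:

```python
import pandas as pd

# hypothetical frequencies, standing in for the newFreq column above
days = pd.DataFrame({"wkDay": ["Mon", "Tue", "Wed"],
                     "Freq":  [0.5, 0.3, 0.2]})

# weighted sampling with replacement, like slice_sample(weight_by = newFreq)
draw = days.sample(n=1000, weights="Freq", replace=True, random_state=123)
counts = draw["wkDay"].value_counts()
```

As in slice_sample, the weights need not sum to 1; pandas normalizes them internally.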

Random sampling of groups after pandas groupby

Using iterrows from pandas you can iterate over DataFrame rows as (index, Series) pairs and get what you want:

new_df = df.groupby(['Nationality', 'Sex'], as_index=False).size()

for _, row in new_df.iterrows():
    print(df[(df.Nationality == row.Nationality) & (df.Sex == row.Sex)].sample(20))
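Note that .sample(20) raises a ValueError when a group holds fewer than 20 rows; a hedged variant of the same loop caps the draw at the group size (hypothetical small frame for illustration):

```python
import pandas as pd

# hypothetical data with groups smaller than the requested sample size
df = pd.DataFrame({"Nationality": ["US", "US", "US", "UK", "UK"],
                   "Sex":         ["M",  "F",  "M",  "F",  "F"]})

sizes = df.groupby(["Nationality", "Sex"], as_index=False).size()
samples = []
for _, row in sizes.iterrows():
    grp = df[(df.Nationality == row.Nationality) & (df.Sex == row.Sex)]
    # cap the draw at the group size so small groups don't raise
    samples.append(grp.sample(min(len(grp), 20), random_state=0))
result = pd.concat(samples)
```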

Randomly select sample from each group using weight

Borrowing from the link KoenV has posted in the comments:

library(dplyr)
library(purrr)

sample_size <- 30
groups <- c(0.7, 0.1, 0.2)
group_size <- sample_size * groups

iris %>%
  group_split(Species) %>%
  map2_dfr(group_size, ~ slice_sample(.x, n = .y))

# A tibble: 30 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>
 1          4.8         3.1          1.6         0.2 setosa
 2          4.8         3.4          1.6         0.2 setosa
 3          5.1         3.4          1.5         0.2 setosa
 4          4.4         3            1.3         0.2 setosa
 5          4.6         3.4          1.4         0.3 setosa
 6          5.5         4.2          1.4         0.2 setosa
 7          5.5         3.5          1.3         0.2 setosa
 8          4.9         3            1.4         0.2 setosa
 9          5.1         3.8          1.9         0.4 setosa
10          5.7         4.4          1.5         0.4 setosa
# ℹ 20 more rows

# A tibble: 3 × 2
  Species        n
  <fct>      <int>
1 setosa        21
2 versicolor     3
3 virginica      6
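The pandas counterpart of this weighted split is one .sample per group, sized by its proportion, then recombined; a sketch with a hypothetical frame:

```python
import pandas as pd

# hypothetical data: three groups of 50 rows each
df = pd.DataFrame({"grp": ["a"] * 50 + ["b"] * 50 + ["c"] * 50,
                   "v": range(150)})

sample_size = 30
props = {"a": 0.7, "b": 0.1, "c": 0.2}  # per-group proportions, as above

# one draw per group, sized by its proportion, then recombined
out = pd.concat(
    g.sample(round(sample_size * props[k]), random_state=0)
    for k, g in df.groupby("grp")
)
```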

Randomly sample groups

Just use sample() to choose some number of groups:

iris %>% filter(Species %in% sample(levels(Species), 2))
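In pandas the same idea is to sample group labels first and then filter; a sketch with a hypothetical frame, where random.sample stands in for R's sample():

```python
import random
import pandas as pd

# hypothetical data: three species, three rows each
df = pd.DataFrame({"Species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
                   "x": range(9)})

# pick 2 of the 3 group labels at random, then keep only their rows
keep = random.Random(0).sample(sorted(df["Species"].unique()), 2)
subset = df[df["Species"].isin(keep)]
```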

Python - Pandas random sampling per group

IIUC, the issue is that you do not want to group by the column image name, but if that column is not included in the groupby, you will lose it.

You can first create the groupby object:

gb = df.groupby(['type', 'Class'])

Now you can iterate over the groupby blocks using a list comprehension:

blocks = [data.sample(n=1) for _, data in gb]

Now you can concatenate the blocks to reconstruct your randomly sampled dataframe:

pd.concat(blocks)

Output

    Class    Value2 image name   type
7       A  0.817744    image02   long
17      B  0.199844    image01   long
4       A  0.462691    image01  short
11      B  0.831104    image02  short

OR

Alternatively, you can modify your code and add the column image name to the groupby selection, like this:

df.groupby(['type', 'Class'])[['Value2','image name']].apply(lambda s: s.sample(min(len(s),2)))

                 Value2 image name
type  Class
long  A     8  0.777962    image01
            9  0.757983    image01
      B     19 0.100702    image02
            15 0.117642    image02
short A     3  0.465239    image02
            2  0.460148    image02
      B     10 0.934829    image02
            11 0.831104    image02

EDIT: Keeping image same per group

I'm not sure you can avoid an iterative process for this problem. You can loop over the groupby blocks, filter each group down to one randomly chosen image (so the image name stays the same within the group), then randomly sample from the remaining rows like this:

import random

gb = df.groupby(['Class','type'])
ls = []

for index, frame in gb:
    ls.append(frame[frame['image name'] == random.choice(frame['image name'].unique())].sample(n=2))

pd.concat(ls)

Output

    Class    Value2 image name   type
6       A  0.850445    image02   long
7       A  0.817744    image02   long
4       A  0.462691    image01  short
0       A  0.444939    image01  short
19      B  0.100702    image02   long
15      B  0.117642    image02   long
10      B  0.934829    image02  short
14      B  0.721535    image02  short
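A self-contained sketch of that loop, assuming a hypothetical frame; the draw is capped at the group size so an image that appears on a single row doesn't raise:

```python
import random
import pandas as pd

random.seed(0)
# hypothetical data: four (Class, type) groups, each mixing two image names
df = pd.DataFrame({
    "Class": ["A"] * 6 + ["B"] * 6,
    "type": ["long", "short"] * 6,
    "image name": ["image01", "image01", "image02"] * 4,
    "Value2": range(12),
})

ls = []
for _, frame in df.groupby(["Class", "type"]):
    # fix one image per group, then sample only rows with that image
    img = random.choice(frame["image name"].unique())
    pool = frame[frame["image name"] == img]
    ls.append(pool.sample(min(len(pool), 2), random_state=0))
out = pd.concat(ls)
```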

Random sample by group: how to specify n, not weight? (using DataFrameGroupBy.sample)

You can group the dataframe on country, then .sample each group separately (taking the number of samples from the dictionary), and finally .concat all the sampled groups:

d = {'USA': 4, 'Canada': 2} # mapping dict
pd.concat([g.sample(d[k]) for k, g in df.groupby('country', sort=False)])


   id country
0   1     USA
4   5     USA
1   2     USA
2   3     USA
6   7  Canada
9  10  Canada
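A runnable version of the same idea, with a hypothetical frame standing in for the question's data:

```python
import pandas as pd

# hypothetical data: 5 USA rows and 5 Canada rows
df = pd.DataFrame({"id": range(1, 11),
                   "country": ["USA"] * 5 + ["Canada"] * 5})

d = {"USA": 4, "Canada": 2}  # per-group sample sizes
out = pd.concat(g.sample(d[k], random_state=0)
                for k, g in df.groupby("country", sort=False))
```

sort=False keeps the groups in first-appearance order, so the sampled USA rows come before the Canada rows, as in the output above.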

