Sample Random Rows in Dataframe

Random row selection in Pandas dataframe

Something like this?

import random

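def some(x, n):
    # random.sample needs a sequence, so convert the index to a list first
    return x.loc[random.sample(list(x.index), n)]

Note: As of pandas v0.20.0, ix is deprecated in favour of loc for label-based indexing, and it was removed entirely in pandas 1.0.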
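
On current pandas versions, the built-in DataFrame.sample method covers the same need without the random module. A minimal sketch, using a made-up frame (the column names x and y are only for illustration):

```python
import pandas as pd

# Hypothetical example frame; any DataFrame works the same way
df = pd.DataFrame({"x": range(10), "y": list("abcdefghij")})

# Draw 3 distinct rows; random_state makes the draw repeatable
picked = df.sample(n=3, random_state=0)

# Alternatively, draw a fraction of the rows (here 20%, i.e. 2 of 10)
picked_frac = df.sample(frac=0.2, random_state=0)

print(picked)
print(picked_frac)
```

By default the rows are drawn without replacement, so no row appears twice.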

Sample random rows in dataframe

First make some data:

> df = data.frame(matrix(rnorm(20), nrow=10))
> df
           X1         X2
1   0.7091409 -1.4061361
2  -1.1334614 -0.1973846
3   2.3343391 -0.4385071
4  -0.9040278 -0.6593677
5   0.4180331 -1.2592415
6   0.7572246 -0.5463655
7  -0.8996483  0.4231117
8  -1.0356774 -0.1640883
9  -0.3983045  0.7157506
10 -0.9060305  2.3234110

Then select some rows at random:

> df[sample(nrow(df), 3), ]
           X1         X2
9  -0.3983045  0.7157506
2  -1.1334614 -0.1973846
10 -0.9060305  2.3234110

How can I select a sequence of random rows from a pandas DataFrame?

Pick a random start position n, then take the five rows from n to n+4:

import random

n = random.randint(0, len(dataframe) - 5)  # randint includes both endpoints

five_random_consecutive_rows = dataframe.iloc[n:n+5]
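
The same idea can be wrapped in a small helper so the block length is a parameter. This is only a sketch: the function name sample_consecutive and the example frame are made up for illustration.

```python
import random

import pandas as pd


def sample_consecutive(df, k, rng=random):
    """Return k consecutive rows starting at a random position (hypothetical helper)."""
    if k > len(df):
        raise ValueError("k is larger than the number of rows")
    # randint is inclusive on both ends, so the last valid start is len(df) - k
    start = rng.randint(0, len(df) - k)
    return df.iloc[start:start + k]


# Made-up frame with 100 rows
df = pd.DataFrame({"value": range(100)})
block = sample_consecutive(df, 5, random.Random(42))
print(block)
```

Using iloc keeps the slice positional, so it works even when the index is not 0..N-1.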

Random selection of a row from a pandas DataFrame with weights

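df.sample applies weights per row, not per label, so a frequent label is drawn more often than its weight suggests. Divide each target probability by that label's relative frequency in the data so the sample matches the expected distribution: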

weights = {-1:0.1, 0:0.4, 1:0.5}

scaled_weights = (pd.Series(weights) / df.label.value_counts(normalize=True))

df.sample(n=1, weights=df.label.map(scaled_weights) )

Check the resulting distribution with 10,000 samples:

(df.sample(n=10000, replace=True, random_state=1,
           weights=df.label.map(scaled_weights))
   .label.value_counts(normalize=True)
)

Output:

 1    0.5060
 0    0.3979
-1    0.0961
Name: label, dtype: float64
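
To see why the scaling is needed, here is a self-contained sketch on made-up data in which the labels are deliberately imbalanced (60% / 30% / 10%). The frame and the target shares are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical frame with an imbalanced label column
df = pd.DataFrame({"label": [-1] * 600 + [0] * 300 + [1] * 100})

target = pd.Series({-1: 0.1, 0: 0.4, 1: 0.5})         # desired label shares
observed = df["label"].value_counts(normalize=True)   # actual label shares

# Per-row weight = desired share / observed share of that row's label
row_weights = df["label"].map(target / observed)

drawn = df.sample(n=10_000, replace=True, weights=row_weights, random_state=1)
shares = drawn["label"].value_counts(normalize=True)
print(shares)
```

pandas normalizes the weights to sum to 1, so only their ratios matter.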

Populate Pandas dataframe with random sample from another dataframe if condition is met, when columns to be assigned are not independent

Here's one approach:

(i) get the sample sizes from df2 with groupby + size.

(ii) use groupby + apply where we use a lambda function to sample items from df1 with the sample sizes obtained from (i) for each unique "B".

(iii) assign these sampled values to df2 (since "B" is not unique, we sorted df2 by "B" to make the rows align)

cols = ['C','D']
sample_sizes = df2.groupby('B')[cols].size()

df2 = df2.sort_values(by='B')
df2[cols] = (df1[df1['B'].isin(sample_sizes.index)]
             .groupby('B')[cols]
             .apply(lambda g: g.sample(sample_sizes[g.name], replace=True))
             # both sides are now ordered by 'B'; reuse df2's index so rows match by position
             .set_axis(df2.index))
df2 = df2.sort_index()

One sample:

   A  B   C    D
0  5  1   5  0.6
1  5  2  10  0.7
2  6  1  12  0.6
3  6  2  11  0.5
4  6  3   4  0.1
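
The three steps can also be written as an explicit loop, which makes the row matching easier to follow. This is a sketch on made-up df1 and df2 (the question's real data is not shown here), and it assumes every "B" value in df2 also occurs in df1.

```python
import numpy as np
import pandas as pd

# Hypothetical data: df1 is the pool of (C, D) pairs for each value of "B"
df1 = pd.DataFrame({
    "B": [1, 1, 2, 2, 3],
    "C": [5, 12, 10, 11, 4],
    "D": [0.6, 0.6, 0.7, 0.5, 0.1],
})
df2 = pd.DataFrame({"A": [5, 5, 6, 6, 6], "B": [1, 2, 1, 2, 3]})

cols = ["C", "D"]
rng = np.random.RandomState(0)

pieces = []
for b, rows in df2.groupby("B"):
    pool = df1.loc[df1["B"] == b, cols]
    # Sample whole rows, so each C stays paired with its own D
    picked = pool.sample(n=len(rows), replace=True, random_state=rng)
    picked.index = rows.index  # line the sampled rows up with df2's rows
    pieces.append(picked)

result = df2.join(pd.concat(pieces))
print(result)
```

Because whole rows are sampled, "C" and "D" always come from the same df1 row, which is the point when the two columns are not independent.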

Randomly select rows from Pandas DataFrame based on multiple criteria

I am not sure I understood the question correctly, but this answer may at least help others give you a better one.
If this is not what you are looking for, please let me know.

import pandas as pd

# Your dataframe
maindf = {'PM Owner': ['A', 'B', 'C', 'A', 'E', 'F'],
          'Risk Tier': [1, 3, 1, 1, 1, 2],
          'sam': ['A0', 'B0', 'C0', 'D0', 'E0', 'F0']}
Maindf = pd.DataFrame(data=maindf)

# The criteria you are looking for
filterdf = {'PM Owner': ['A'], 'Risk Tier': [1]}
Filterdf = pd.DataFrame(data=filterdf)

# Filtering: join the key columns into one string per row and compare.
# Caveat: joined strings can collide, e.g. 'A1' + '1' and 'A' + '11' both give 'A11'.
key_cols = ['PM Owner', 'Risk Tier']
NewMaindf = Maindf[Maindf[key_cols].astype(str).sum(axis=1).isin(
    Filterdf[key_cols].astype(str).sum(axis=1))]

# Just one random row
print(NewMaindf.sample())
# Whole dataset after filtering
print(NewMaindf)

Result:

  PM Owner  Risk Tier sam
3        A          1  D0
  PM Owner  Risk Tier sam
0        A          1  A0
3        A          1  D0
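
An inner merge on the key columns is another way to express the same filter, and it compares the column pairs directly instead of going through strings. A sketch, where main_df and criteria are renamed copies of the two frames above:

```python
import pandas as pd

main_df = pd.DataFrame({
    "PM Owner": ["A", "B", "C", "A", "E", "F"],
    "Risk Tier": [1, 3, 1, 1, 1, 2],
    "sam": ["A0", "B0", "C0", "D0", "E0", "F0"],
})
criteria = pd.DataFrame({"PM Owner": ["A"], "Risk Tier": [1]})

key_cols = ["PM Owner", "Risk Tier"]

# The inner merge keeps only rows whose key pair appears in criteria;
# reset_index / set_index carries the original row labels through the merge
matches = (main_df.reset_index()
           .merge(criteria, on=key_cols, how="inner")
           .set_index("index")
           .rename_axis(None))

one_row = matches.sample(n=1, random_state=0)
print(matches)
print(one_row)
```

Adding more rows to criteria selects several owner and tier combinations at once.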

How to randomly sample multiple consecutive rows of a dataframe in R?

df <- mtcars
df$row_nm <- seq(nrow(df))

set.seed(7)

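sample_seq <- function(n, N) {
  # pick a random start row, then take n consecutive row numbers
  i <- sample(seq(N), size = 1)

  # row numbers that run past N wrap around to the top of the data frame
  ifelse(
    test = i + (seq(n) - 1) <= N,
    yes  = i + (seq(n) - 1),
    no   = i + (seq(n) - 1) - N
  )
}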

replica <- replicate(n = 5, sample_seq(n = 10, N = nrow(df)))

# result
lapply(seq(ncol(replica)), function(x) df[replica[, x], ])
#> [[1]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 10
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 11
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 12
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 13
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 14
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 15
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 16
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 17
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 18
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 19
#>
#> [[2]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 19
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 20
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 21
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 22
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 23
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 24
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 25
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 26
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 27
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 28
#>
#> [[3]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 31
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 32
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 7
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 8
#>
#> [[4]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 28
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 29
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 30
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 31
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 32
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
#>
#> [[5]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 7
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 8
#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 9
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 10
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 11
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 12
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 13
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 14
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 15
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 16

Created on 2022-01-24 by the reprex package (v2.0.1)
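
For pandas users, the same wrap-around idea can be sketched with a modulo on the row positions. The helper name sample_wrapped_block and the stand-in frame are assumptions for illustration; this is not a port of the reprex above and will not reproduce its draws.

```python
import numpy as np
import pandas as pd


def sample_wrapped_block(df, n, rng):
    """Return n consecutive rows from a random start, wrapping past the end."""
    start = rng.integers(0, len(df))              # random 0-based start position
    positions = (start + np.arange(n)) % len(df)  # modulo wraps back to row 0
    return df.iloc[positions]


# Stand-in frame with 32 numbered rows (mtcars also has 32 rows)
df = pd.DataFrame({"row_nm": np.arange(1, 33)})
rng = np.random.default_rng(7)
blocks = [sample_wrapped_block(df, 10, rng) for _ in range(5)]

for block in blocks:
    print(block["row_nm"].tolist())
```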

Randomly sample rows based on year-month

Group the rows with DataFrame.groupby, either by year and month or by monthly period, and draw a sample from each group with DataFrame.sample inside a lambda:

df1 = (df.groupby([df['daate'].dt.year, df['daate'].dt.month], group_keys=False)
         .apply(lambda x: x.sample(n=10)))

Or:

df1 = (df.groupby(df['daate'].dt.to_period('M'), group_keys=False)
         .apply(lambda x: x.sample(n=10)))

Sample:

import numpy as np
import pandas as pd

data = {'daate': pd.date_range('2019-01-01', '2020-01-22'),
        'tweets': np.random.choice(["aaa", "bbb", "ccc", "ddd"], 387)}

df = pd.DataFrame(data)


df1 = (df.groupby([df['daate'].dt.year, df['daate'].dt.month], group_keys=False)
.apply(lambda x: x.sample(n=10)))
print (df1)
          date tweets      daate
9   2019-01-10    bbb 2019-01-10
29  2019-01-30    ddd 2019-01-30
17  2019-01-18    ccc 2019-01-18
12  2019-01-13    ccc 2019-01-13
20  2019-01-21    ddd 2019-01-21
..         ...    ...        ...
381 2020-01-17    bbb 2020-01-17
375 2020-01-11    aaa 2020-01-11
373 2020-01-09    bbb 2020-01-09
368 2020-01-04    aaa 2020-01-04
382 2020-01-18    bbb 2020-01-18

[130 rows x 3 columns]
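
Assuming pandas 1.1 or later, groupby objects have their own sample method, so the lambda can be dropped. A sketch on the same kind of made-up data, with a fixed seed so the draw is repeatable:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2019-01-01", "2020-01-22")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "daate": dates,
    "tweets": rng.choice(["aaa", "bbb", "ccc", "ddd"], size=len(dates)),
})

# One group per calendar month, 10 rows drawn from each
months = df["daate"].dt.to_period("M")
df1 = df.groupby(months).sample(n=10, random_state=0)

# Count the sampled rows per month to confirm the draw
per_month = df1["daate"].dt.to_period("M").value_counts()
print(per_month.sort_index())
```

Every group needs at least 10 rows for n=10 to work without replace=True; here the shortest month, January 2020, has 22.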

