Random row selection in Pandas dataframe
Something like this?
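import random
def some(x, n):
    # random.sample needs a sequence, so convert the index to a list first
    return x.loc[random.sample(list(x.index), n)]
Note: ix was deprecated in Pandas v0.20.0 (and removed in v1.0) in favour of loc for label-based indexing, so the function above uses loc.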
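For comparison, pandas also has a built-in DataFrame.sample method that covers the same ground without the random module. The sketch below is only an illustration: the frame, its column names, and the seed are made up, not taken from the question.

```python
import pandas as pd

# Toy frame; the column names and values are invented for illustration.
df = pd.DataFrame({"x": range(10), "y": list("abcdefghij")})

# Draw 3 distinct rows; random_state makes the draw repeatable.
picked = df.sample(n=3, random_state=42)

# Draw a fraction of the rows (30% here) instead of a fixed count.
fraction = df.sample(frac=0.3, random_state=42)

print(picked)
```

Sampling is without replacement by default, so the returned rows are distinct; pass replace=True if repeats are acceptable.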
Sample random rows in dataframe
First make some data:
> df = data.frame(matrix(rnorm(20), nrow=10))
> df
X1 X2
1 0.7091409 -1.4061361
2 -1.1334614 -0.1973846
3 2.3343391 -0.4385071
4 -0.9040278 -0.6593677
5 0.4180331 -1.2592415
6 0.7572246 -0.5463655
7 -0.8996483 0.4231117
8 -1.0356774 -0.1640883
9 -0.3983045 0.7157506
10 -0.9060305 2.3234110
Then select some rows at random:
> df[sample(nrow(df), 3), ]
X1 X2
9 -0.3983045 0.7157506
2 -1.1334614 -0.1973846
10 -0.9060305 2.3234110
How can I select a sequence of random rows from a pandas DataFrame?
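Choose a random start position n, then take the five rows from n to n+4:
import random
# randint includes both endpoints, so n + 5 never runs past the last row
n = random.randint(0, rows_in_dataframe - 5)
five_random_consecutive_rows = dataframe.iloc[n:n+5]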
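The same idea can be wrapped in a small helper that works for any block length. This is a minimal sketch: the name sample_consecutive, the rng parameter, and the toy frame are assumptions made for illustration.

```python
import random

import pandas as pd


def sample_consecutive(df, k, rng=random):
    """Return k consecutive rows starting at a random position."""
    if k > len(df):
        raise ValueError("k cannot exceed the number of rows")
    # randint is inclusive on both ends, so the last valid start is len(df) - k.
    start = rng.randint(0, len(df) - k)
    return df.iloc[start:start + k]


# Hypothetical data, just to show the call.
df = pd.DataFrame({"value": range(20)})
block = sample_consecutive(df, 5, random.Random(0))
print(block)
```

Passing a seeded random.Random instance keeps the draw reproducible without touching the global random state.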
Random selection of a row from a pandas DataFrame with weights
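You should scale each target weight by the label's observed frequency in df so that the draw matches the expected distribution; otherwise labels that occur more often in df are over-represented: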
weights = {-1:0.1, 0:0.4, 1:0.5}
scaled_weights = (pd.Series(weights) / df.label.value_counts(normalize=True))
df.sample(n=1, weights=df.label.map(scaled_weights) )
Test distribution with 10000 samples
(df.sample(n=10000, replace=True, random_state=1,
weights=df.label.map(scaled_weights))
.label.value_counts(normalize=True)
)
Output:
1 0.5060
0 0.3979
-1 0.0961
Name: label, dtype: float64
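To see the scaling on a frame you can inspect end to end, here is a self-contained sketch. The label counts (600/300/100) are made up; only the target weights come from the answer above.

```python
import pandas as pd

# Made-up data: label 1 is rare in the frame but should be drawn most often.
df = pd.DataFrame({"label": [-1] * 600 + [0] * 300 + [1] * 100})

# Desired share of each label in the sample.
target = pd.Series({-1: 0.1, 0: 0.4, 1: 0.5})

# Observed share of each label in the frame.
observed = df["label"].value_counts(normalize=True)

# Per-row weight: target share divided by observed share of that row's label.
row_weights = df["label"].map(target / observed)

draws = df.sample(n=10_000, replace=True, weights=row_weights, random_state=1)
shares = draws["label"].value_counts(normalize=True)
print(shares.sort_index())
```

The division works because every row of a given label receives the same weight, so a label's total weight is its row count times target/observed, which is proportional to the target share.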
Populate Pandas dataframe with random sample from another dataframe if condition is met, when columns to be assigned are not independent
Here's one approach:
(i) Get the sample sizes from df2 with groupby + size.
(ii) Use groupby + apply with a lambda function that samples items from df1, using the sample sizes obtained in (i) for each unique "B".
(iii) Assign the sampled values to df2 (since "B" is not unique, df2 is sorted by "B" first so that the rows align).
cols = ['C','D']
sample_sizes = df2.groupby('B')[cols].size()
df2 = df2.sort_values(by='B')
df2[cols] = (df1[df1['B'].isin(sample_sizes.index)]
.groupby('B')[cols]
.apply(lambda g: g.sample(sample_sizes[g.name], replace=True))
.droplevel(1).reset_index(drop=True))
df2 = df2.sort_index()
One sample:
A B C D
0 5 1 5 0.6
1 5 2 10 0.7
2 6 1 12 0.6
3 6 2 11 0.5
4 6 3 4 0.1
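A variation on the same steps is sketched below, assuming you would rather not rely on sorting to line the rows up: it loops over the groups of df2 and writes each group's sample back by position. The contents of df1 and df2 are invented for illustration, since the original frames are not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical source and target frames.
df1 = pd.DataFrame({"B": [1, 1, 2, 2, 3],
                    "C": [5, 12, 10, 11, 4],
                    "D": [0.6, 0.6, 0.7, 0.5, 0.1]})
df2 = pd.DataFrame({"A": [5, 5, 6, 6, 6], "B": [1, 2, 1, 2, 3]})

cols = ["C", "D"]
for c in cols:
    df2[c] = np.nan  # create the target columns up front

rng = np.random.RandomState(0)  # seeded so the result is repeatable

# .groups maps each value of "B" to the index labels of its rows in df2.
for b, idx in df2.groupby("B").groups.items():
    pool = df1.loc[df1["B"] == b, cols]
    picked = pool.sample(n=len(idx), replace=True, random_state=rng)
    # to_numpy() drops df1's index, so rows are assigned by position and
    # "C" and "D" stay paired as they were in df1.
    df2.loc[idx, cols] = picked.to_numpy()

print(df2)
```

Because whole rows of df1 are sampled, each (C, D) pair in df2 is one that actually occurs in df1 for the same "B".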
Randomly select rows from Pandas DataFrame based on multiple criteria
I am not sure I understood the question correctly, but this answer may at least help others give you a better one.
If this is not what you are looking for, please let me know.
import pandas as pd
#your dataframe
maindf = {'PM Owner': ['A', 'B','C','A','E','F'], 'Risk Tier': [1,3,1,1,1,2],'sam' :['A0','B0','C0','D0','E0','F0']}
Maindf = pd.DataFrame(data=maindf)
#what you are looking for
filterdf = {'PM Owner': ['A' ], 'Risk Tier': [ 1 ]}
Filterdf = pd.DataFrame(data=filterdf)
#Filtering
NewMaindf= (Maindf[Maindf[['PM Owner','Risk Tier']].astype(str).sum(axis = 1).isin(
Filterdf[['PM Owner','Risk Tier']].astype(str).sum(axis = 1))])
#Just one sample
print( (NewMaindf).sample())
#whole dataset after filtering
print( (NewMaindf) )
Result :
PM Owner Risk Tier sam
3 A 1 D0
PM Owner Risk Tier sam
0 A 1 A0
3 A 1 D0
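Concatenating the key columns as strings can collide (for example 'A1' + '1' and 'A' + '11' both give 'A11'). A possible alternative, sketched here with the same data, compares the key columns as tuples through a MultiIndex; the variable names are chosen for this example only.

```python
import pandas as pd

main = pd.DataFrame({"PM Owner": ["A", "B", "C", "A", "E", "F"],
                     "Risk Tier": [1, 3, 1, 1, 1, 2],
                     "sam": ["A0", "B0", "C0", "D0", "E0", "F0"]})
wanted = pd.DataFrame({"PM Owner": ["A"], "Risk Tier": [1]})

keys = ["PM Owner", "Risk Tier"]

# Compare (owner, tier) pairs as tuples instead of as joined strings.
mask = pd.MultiIndex.from_frame(main[keys]).isin(
    pd.MultiIndex.from_frame(wanted[keys]))

matches = main[mask]                      # all rows that meet the criteria
one = matches.sample(n=1, random_state=0) # one random row among them

print(matches)
print(one)
```

The original index of main is preserved, so the sampled row can be traced back to its position in the full frame.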
How to randomly sample multiple consecutive rows of a dataframe in R?
df <- mtcars
df$row_nm <- seq(nrow(df))
set.seed(7)
sample_seq <- function(n, N) {
  i <- sample(seq(N), size = 1)
  ifelse(
    test = i + (seq(n) - 1) <= N,
    yes = i + (seq(n) - 1),
    no = i + (seq(n) - 1) - N
  )
}
replica <- replicate(n = 5, sample_seq(n = 10, N = nrow(df)))
# result
lapply(seq(ncol(replica)), function(x) df[replica[, x], ])
#> [[1]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 10
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 11
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 12
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 13
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 14
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 15
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 16
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 17
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 18
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 19
#>
#> [[2]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 19
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 20
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 21
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 22
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 23
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 24
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 25
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 26
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 27
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 28
#>
#> [[3]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 31
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 32
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 7
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 8
#>
#> [[4]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 28
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 29
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 30
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 31
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 32
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
#>
#> [[5]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 7
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 8
#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 9
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 10
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 11
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 12
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 13
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 14
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 15
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 16
Created on 2022-01-24 by the reprex package (v2.0.1)
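For readers working in pandas rather than R, the wrap-around idea can be sketched with modular arithmetic. This is a rough analogue under stated assumptions, not a port of the code above: the helper name sample_wrapped, the seed, and the stand-in frame (32 rows, the same count as mtcars) are chosen for illustration.

```python
import numpy as np
import pandas as pd


def sample_wrapped(df, k, rng):
    """Return k consecutive rows from a random start, wrapping past the end."""
    start = rng.integers(0, len(df))  # upper bound is exclusive
    # The modulo sends positions past the last row back to the top.
    positions = (start + np.arange(k)) % len(df)
    return df.iloc[positions]


# Stand-in frame with a 1-based row number column.
df = pd.DataFrame({"row_nm": np.arange(1, 33)})
rng = np.random.default_rng(7)

# Five independent blocks of ten consecutive rows each.
blocks = [sample_wrapped(df, 10, rng) for _ in range(5)]
print(blocks[0])
```

The modulo plays the role of the ifelse call in the R function: indices beyond the last row continue from the first row.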
Randomly sample rows based on year-month
Use DataFrame.groupby per year and month, or per month period, with a custom lambda function that calls DataFrame.sample:
df1 = (df.groupby([df['daate'].dt.year, df['daate'].dt.month], group_keys=False)
.apply(lambda x: x.sample(n=10)))
Or:
df1 = (df.groupby(df['daate'].dt.to_period('M'), group_keys=False)
         .apply(lambda x: x.sample(n=10)))
Sample:
import numpy as np
import pandas as pd
data = {'daate':pd.date_range('2019-01-01', '2020-01-22'),
        'tweets':np.random.choice(["aaa", "bbb", "ccc", "ddd"], 387)
       }
df = pd.DataFrame(data)
df1 = (df.groupby([df['daate'].dt.year, df['daate'].dt.month], group_keys=False)
.apply(lambda x: x.sample(n=10)))
print (df1)
date tweets daate
9 2019-01-10 bbb 2019-01-10
29 2019-01-30 ddd 2019-01-30
17 2019-01-18 ccc 2019-01-18
12 2019-01-13 ccc 2019-01-13
20 2019-01-21 ddd 2019-01-21
.. ... ... ...
381 2020-01-17 bbb 2020-01-17
375 2020-01-11 aaa 2020-01-11
373 2020-01-09 bbb 2020-01-09
368 2020-01-04 aaa 2020-01-04
382 2020-01-18 bbb 2020-01-18
[130 rows x 3 columns]
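If your pandas version is 1.1 or newer, the grouped object has its own sample method, which avoids the lambda. The sketch below rebuilds similar sample data; the seeds and the use of a NumPy Generator are assumptions made for reproducibility.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2019-01-01", "2020-01-22")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "daate": dates,
    "tweets": rng.choice(["aaa", "bbb", "ccc", "ddd"], size=len(dates)),
})

# One group per calendar month; 'M' is the monthly period alias.
months = df["daate"].dt.to_period("M")

# Draw 10 rows from every month directly on the grouped object.
df1 = df.groupby(months).sample(n=10, random_state=0)

per_month = df1["daate"].dt.to_period("M").value_counts()
print(per_month.sort_index())
```

Every month needs at least 10 rows for n=10 to work without replacement; the shortest group here is January 2020 with 22 days.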