Divide the Table Data Randomly Based on Percentages

This answer is similar to Michal's, and his answer is also correct. However, NTILE is worth using as an alternative, since it splits the dataset into 100 roughly equal chunks; a plain ROW_NUMBER threshold will not work for a dataset with fewer than 100 rows:

select id, a.name,
       case when rn <= 65 then 'group 1'
            when rn <= 85 then 'group 2'
            else 'group 3'
       end
from
(
    -- newid() generates a random ordering of the records
    select ID, name, NTILE(100) OVER (ORDER BY NEWID()) AS [rn]
    from dbo.Employees
) [a]

Randomly Dividing and Storing a SQL Table by Percentage

To generate a random distribution, you can order by newid():

select top 20 percent * from mytable order by newid()

You might also want to have a look at the tablesample clause, available since SQL Server 2005. It has an option called repeatable that lets the query return the same random recordset every time you run it (as long as the given seed remains the same and the table is not modified). Note that tablesample samples data pages rather than individual rows, so the returned row count is approximate. This could be handy for your use case:

select * from mytable tablesample(20 percent) repeatable(10)
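As a loose Python analogue of the repeatable idea, a seeded random generator returns the identical "random" sample on every run over the same data (function and parameter names here are illustrative):

```python
import random

def sample_percent(rows, percent, seed):
    # Seeded shuffle: the same seed over the same data returns the
    # same subset, much like TABLESAMPLE ... REPEATABLE(seed).
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    k = len(shuffled) * percent // 100
    return shuffled[:k]

a = sample_percent(list(range(100)), 20, seed=10)
b = sample_percent(list(range(100)), 20, seed=10)
# a == b: identical seed, identical sample
```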

Distribute records based on various percentages with tsql

Passing in the percentages is the hard part. The work is done by percent_rank():

with p as (
      select ind, p,
             (sum(p) over (order by ind) - p) as cume_p
      from (values (1, 0.2), (2, 0.2), (3, 0.3), (4, 0.4)) v(ind, p)
     )
select t.*, v.grp
from (select t.*, percent_rank() over (order by ?) as pr
      from t
     ) t cross apply
     (select max(ind)
      from p
      where p.cume_p <= t.pr
     ) v(grp);

How can divide a dataset based on percentage?

First, define a variable indicating which group each row goes in, then use split:

> df$test <- ave(df$ID,df$ID,FUN=function(X) seq_along(X) %% 4 == 1  )
>
> split(df, df$test)
$`0`
        ID var value test
2  9442000   v  2.20    0
3  9442000   h  5.30    0
4  9442000   f  0.20    0
6  9442000   t  0.60    0
8   952001   g  0.44    0
9   952001   g  0.44    0
10  952001   h  0.77    0
12  652115   d  1.55    0
13  652115   s  2.55    0
14  652115   s  2.55    0

$`1`
        ID var value test
1  9442000   a  2.01    1
5  9442000   s  0.55    1
7   952001   d  0.22    1
11  652115   a  4.66    1
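The same within-group counter trick can be sketched in plain Python (a hypothetical helper, not part of the R answer): the cumulative position of each row within its ID, taken modulo 4, flags every fourth row per ID, matching R's 1-based seq_along(X) %% 4 == 1.

```python
from collections import defaultdict

def mark_every_fourth(ids):
    # Within each ID, flag 0-based positions 0, 4, 8, ... — the same
    # rows the R ave() call marks with its 1-based seq_along test.
    seen = defaultdict(int)
    flags = []
    for i in ids:
        flags.append(1 if seen[i] % 4 == 0 else 0)
        seen[i] += 1
    return flags
```

On the IDs from the example frame this reproduces the test column shown above.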

SQL Server - Divide a dataset to same size groups with random rows

Here is one way, using a Numbers table.

;WITH base_dataset
AS (SELECT *,
Row_number()OVER(ORDER BY id) AS rn
FROM (VALUES (000),
(111),
(222),
(333),
(444),
(555),
(666),
(777),
(888),
(999)) tc (ID)),
keys
AS (SELECT *
FROM (VALUES (1,20),
(2,50),
(3,30)) tc(val, per)),
num_gen
AS (SELECT 1 AS num,
Count(1) AS cnt
FROM base_dataset
UNION ALL
SELECT num + 1,
cnt
FROM num_gen
WHERE num < cnt)
SELECT Id,val
FROM (SELECT Row_number()OVER(ORDER BY newid()) rn,
val
FROM num_gen n
JOIN keys k
ON n.num <= (k.per/100.0) * cnt) a
JOIN base_dataset d
ON d.rn = a.rn

I have used a recursive CTE to generate the numbers; alternatively, you can create a numbers table in the database and use that.

SQL Splitting dataset into 3 sections 60/20/20 for testing

The simplest method is probably just using random():

select t.*,
(case when random() < 0.6 then 'group1'
when random() < 0.5 then 'group2'
else 'group3'
end)
from t;

This is only approximate in the counts, because each branch draws a fresh random() value (which is also why the second threshold is 0.5: half of the remaining 40% gives 20%). You can get more precision using window functions:

select t.*,
       (case when tile <= 6 then 'group1'
             when tile <= 8 then 'group2'
             else 'group3'
        end)
from (select t.*,
             ntile(10) over (order by random()) as tile
      from t
     ) t
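If exact 60/20/20 counts are wanted, the shuffle-and-slice equivalent can be sketched in Python (under the assumption that the whole dataset fits in memory; the seed parameter is an optional convenience for reproducibility):

```python
import random

def split_60_20_20(rows, seed=None):
    # Shuffle once, then slice at the 60% and 80% marks — the
    # exact-count counterpart of the ntile(10) query above.
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    a, b = int(n * 0.6), int(n * 0.8)
    return shuffled[:a], shuffled[a:b], shuffled[b:]
```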

How to randomly split a DataFrame into several smaller DataFrames?

Use np.array_split:

shuffled = df.sample(frac=1)
result = np.array_split(shuffled, 5)

df.sample(frac=1) shuffles the rows of df. np.array_split then splits the shuffled frame into 5 parts of (nearly) equal size.

It gives you:

for part in result:
    print(part, '\n')

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
5          6  5  0  0  0  0  0  0  5   0   0   0     10
4          5  3  0  0  0  0  0  0  0   0   0   0      3
7          8  1  0  0  0  4  5  0  0   0   4   0     14
16        17  3  0  0  4  0  0  0  0   0   0   0      7
22        23  4  0  0  0  4  3  0  0   5   0   0     16

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
13        14  5  4  0  0  5  0  0  0  0   0   0     14
14        15  5  0  0  0  3  0  0  0  0   5   5     18
21        22  4  0  0  0  3  5  5  0  5   4   0     26
1          2  3  0  0  3  0  0  0  0  0   0   0      6
20        21  1  0  0  3  3  0  0  0  0   0   0      7

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
10        11  2  0  4  0  0  3  3  0  4   2   0     18
9         10  3  2  0  0  0  4  0  0  0   0   0      9
11        12  5  0  0  0  4  5  0  0  5   2   0     21
8          9  5  0  0  0  4  5  0  0  4   5   0     23
12        13  5  4  0  0  2  0  0  0  3   0   0     14

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
18        19  5  3  0  0  4  0  0  0  0   0   0     12
3          4  3  0  0  0  0  5  0  0  4   0   5     17
0          1  5  4  0  4  4  0  0  0  4   0   0     21
23        24  3  0  0  4  0  0  0  0  0   3   0     10
6          7  4  0  0  0  2  5  3  4  4   0   0     22

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
17        18  4  0  0  0  0  0  0  0  0   0   0      4
2          3  4  0  0  0  0  0  0  0  0   0   0      4
15        16  5  0  0  0  0  0  0  0  4   0   0      9
19        20  4  0  0  0  0  0  0  0  0   0   0      4

How to randomly select a certain percent of rows in Access with another condition?

Since you are already selecting a random portion, the question is really just about selection criteria involving a "total". The key here is that you need another query, an aggregate query. The other query can be either another saved query, an embedded subquery, or a call to a function which performs the query.

Using a subquery to get the total:

SELECT TOP 10 PERCENT * 
FROM Students
WHERE StudentType='girl'
AND (Students.[Spent] / (SELECT SUM(S2.[Spent]) FROM Students As S2) = 0.30)
ORDER BY rnd(ID)

Make sure to add a different alias to the same table, since Access can get confused if the subquery has a table with the same name as the main query. The question did not mention the "amount spent" column so I just guessed. This also assumes that "both groups" is essentially the same as "all student records". If that's not the case then you could add to the subquery WHERE S2.StudentType In ('girl', 'boy').

Using a domain aggregate function:

SELECT TOP 10 PERCENT * 
FROM Students
WHERE StudentType='girl'
AND (Students.[Spent] / DSum("[Spent]", "Students", "") = 0.30)
ORDER BY rnd(ID)

Using another saved query:

First create and save the separate aggregate query as [Summed]:

SELECT SUM(S2.[Spent]) As TotalSpent FROM Students As S2

Now do a cross join so that each row is paired with the total:

SELECT TOP 10 PERCENT * 
FROM Students, Summed
WHERE StudentType='girl'
AND (Students.[Spent] / Summed.TotalSpent = 0.30)
ORDER BY rnd(ID)

The efficiency of each solution may vary. For a small table of students it might not matter. If it does become an issue, I have found that the Domain Aggregate functions are not very efficient even though they appear to be simpler to use. More powerful query engines (not Access) are often better at analyzing a query plan and automatically reducing redundant calculations, but with Access you have to plan that out yourself.

Last note: If you have more complicated grouping, any of the solutions will have additional join conditions. For instance, if the aggregate query also had a GROUP BY clause on an ID, then instead of a cross join, you'd now want an INNER JOIN matching the ID in the main table. In the case of the domain aggregate function, you'd want to specify a criteria parameter that refers to a table field value. The point is that the above examples are not a precise template for all cases.
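To make the share-of-total condition concrete, here is a hedged Python sketch (field and function names are illustrative; I use >= for the share test, whereas the query above tests exact equality):

```python
import random

def random_top_percent(rows, percent, share):
    # rows: list of dicts with a "spent" field (hypothetical schema).
    # Keep rows whose spend share of the overall total meets the
    # threshold, then take a random TOP n PERCENT of the matches,
    # as the Access query does with rnd(ID).
    total = sum(r["spent"] for r in rows)
    matches = [r for r in rows if r["spent"] / total >= share]
    random.shuffle(matches)
    k = max(1, len(matches) * percent // 100)
    return matches[:k]
```

For example, with spends of 30, 30, 20, and 20, a 0.30 share threshold keeps two rows, and asking for the top 50 percent of those returns one of them at random.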

How to split a DataFrame in pandas in predefined percentages?

Use numpy.split:

a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])

Sample:

np.random.seed(100)
df = pd.DataFrame(np.random.random((20,5)), columns=list('ABCDE'))
#print (df)

a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
print (a)
          A         B         C         D         E
0  0.543405  0.278369  0.424518  0.844776  0.004719
1  0.121569  0.670749  0.825853  0.136707  0.575093
2  0.891322  0.209202  0.185328  0.108377  0.219697
3  0.978624  0.811683  0.171941  0.816225  0.274074

print (b)
          A         B         C         D         E
4  0.431704  0.940030  0.817649  0.336112  0.175410
5  0.372832  0.005689  0.252426  0.795663  0.015255
6  0.598843  0.603805  0.105148  0.381943  0.036476
7  0.890412  0.980921  0.059942  0.890546  0.576901
8  0.742480  0.630184  0.581842  0.020439  0.210027
9  0.544685  0.769115  0.250695  0.285896  0.852395

print (c)
           A         B         C         D         E
10  0.975006  0.884853  0.359508  0.598859  0.354796
11  0.340190  0.178081  0.237694  0.044862  0.505431
12  0.376252  0.592805  0.629942  0.142600  0.933841
13  0.946380  0.602297  0.387766  0.363188  0.204345
14  0.276765  0.246536  0.173608  0.966610  0.957013
15  0.597974  0.731301  0.340385  0.092056  0.463498
16  0.508699  0.088460  0.528035  0.992158  0.395036
17  0.335596  0.805451  0.754349  0.313066  0.634037
18  0.540405  0.296794  0.110788  0.312640  0.456979
19  0.658940  0.254258  0.641101  0.200124  0.657625

