Divide the Table data randomly based on percentages
This answer is similar to Michal's, and his is also correct; however, NTILE is worth considering as an alternative, since it splits the dataset into 100 roughly equal chunks. ROW_NUMBER will not work for a dataset with fewer than 100 rows:
select id, a.name,
       case when rn <= 65 then 'group 1'
            when rn <= 85 then 'group 2'
            else 'group 3'
       end
from
(
    -- newid() generates a random order of records
    select ID, name, NTILE(100) OVER (ORDER BY NEWID()) AS rn
    from dbo.Employees
) a
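For comparison, the same 65/20/15 bucketing can be sketched in pandas; the DataFrame below is a hypothetical stand-in for dbo.Employees, and an integer 1..100 bucket computed from row position after shuffling plays the role of NTILE(100):

```python
import numpy as np
import pandas as pd

# Hypothetical employee table standing in for dbo.Employees
df = pd.DataFrame({"ID": range(1, 201), "name": [f"emp{i}" for i in range(1, 201)]})

# Shuffle (the analogue of ORDER BY NEWID()), then compute a 1..100 bucket per row
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
shuffled["rn"] = (np.arange(len(shuffled)) * 100 // len(shuffled)) + 1  # NTILE(100)-like

# Same CASE logic as the SQL: rn <= 65 -> group 1, rn <= 85 -> group 2, else group 3
shuffled["grp"] = np.select(
    [shuffled["rn"] <= 65, shuffled["rn"] <= 85],
    ["group 1", "group 2"],
    default="group 3",
)
print(shuffled["grp"].value_counts().to_dict())
```

With 200 rows this yields an exact 130/40/30 split, matching the 65/20/15 percentages.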
Randomly Dividing and Storing a SQL Table by Percentage
To generate a random distribution, you can order by newid():
select top 20 percent * from mytable order by newid()
You might also want to have a look at the tablesample clause, available since SQL Server 2005. It has an option called repeatable that lets the query return the same random recordset every time you run it (as long as the given seed stays the same and the table is not modified). This could be handy for your use case:
select * from mytable tablesample(20 percent) repeatable(10)
Note that tablesample samples data pages rather than rows, so the returned row count is only approximately 20 percent.
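The REPEATABLE idea, a fixed seed yielding the same random sample on every run, has a direct pandas analogue via random_state (a sketch, assuming a hypothetical mytable-like DataFrame):

```python
import pandas as pd

mytable = pd.DataFrame({"id": range(100)})  # hypothetical stand-in for mytable

# Same seed -> same 20 percent sample on every run, like REPEATABLE(10)
s1 = mytable.sample(frac=0.20, random_state=10)
s2 = mytable.sample(frac=0.20, random_state=10)
assert s1.equals(s2)
print(len(s1))  # 20 rows out of 100
```

Unlike tablesample, frac=0.20 here gives an exact 20 percent of the rows.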
Distribute records based on various percentages with tsql
Passing in the percentages is the hard part. The work is done by percent_rank():
with p as (
select ind, p, (sum(p) over (order by ind) - p) as cume_p
from (values (1, 0.2), (2, 0.2), (3, 0.3), (4, 0.4)) v(ind, p)
)
select t.*, v.grp
from (select t.*, percent_rank() over (order by ?) as pr
from t
) t cross apply
(select max(ind)
from p
where p.cume_p <= t.pr
) v(grp);
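The same idea, mapping each row's percent rank into cumulative percentage boundaries, can be sketched in NumPy. The weights below are illustrative (adjusted to sum to 1), and searchsorted on the exclusive cumulative sum plays the role of the CROSS APPLY's max(ind):

```python
import numpy as np

# Illustrative percentages (assumed); they should sum to 1
p = np.array([0.2, 0.2, 0.3, 0.3])
cume_p = np.cumsum(p) - p  # exclusive cumulative sum: [0.0, 0.2, 0.4, 0.7]

rng = np.random.default_rng(0)
n = 100
# percent_rank over a random order: the values 0, 1/99, ..., 1 in shuffled positions
pr = rng.permutation(n) / (n - 1)

# grp = largest index whose cume_p <= pr, mirroring "where p.cume_p <= t.pr"
grp = np.searchsorted(cume_p, pr, side="right")  # 1-based group number
counts = np.bincount(grp)[1:]
print(counts)  # 20, 20, 30, 30 rows per group
```

Because percent_rank produces an evenly spaced 0..1 grid, the group sizes come out exact regardless of the shuffle.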
How can I divide a dataset based on percentage?
First, define a variable for which group each row goes in, then use split:
> df$test <- ave(df$ID,df$ID,FUN=function(X) seq_along(X) %% 4 == 1 )
>
> split(df, df$test)
$`0`
ID var value test
2 9442000 v 2.20 0
3 9442000 h 5.30 0
4 9442000 f 0.20 0
6 9442000 t 0.60 0
8 952001 g 0.44 0
9 952001 g 0.44 0
10 952001 h 0.77 0
12 652115 d 1.55 0
13 652115 s 2.55 0
14 652115 s 2.55 0
$`1`
ID var value test
1 9442000 a 2.01 1
5 9442000 s 0.55 1
7 952001 d 0.22 1
11 652115 a 4.66 1
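An equivalent of the R approach can be sketched in pandas (the small DataFrame below is hypothetical sample data): groupby/cumcount flags every fourth row within each ID, just as seq_along(X) %% 4 == 1 does per group:

```python
import pandas as pd

# Hypothetical data mirroring the IDs in the R example
df = pd.DataFrame({
    "ID": [9442000] * 6 + [952001] * 4 + [652115] * 4,
    "value": range(14),
})

# Within each ID, flag the 1st, 5th, 9th, ... row (cumcount 0, 4, 8, ...),
# the 0-based analogue of seq_along(X) %% 4 == 1
df["test"] = (df.groupby("ID").cumcount() % 4 == 0).astype(int)

parts = {k: g for k, g in df.groupby("test")}  # analogue of split(df, df$test)
print(parts[1]["ID"].tolist())
```

Here parts[1] contains the flagged rows and parts[0] the remainder, matching the $`1` / $`0` output of split.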
SQL Server - Divide a dataset to same size groups with random rows
Here is one way, using a Numbers table.
;WITH base_dataset
AS (SELECT *,
Row_number()OVER(ORDER BY id) AS rn
FROM (VALUES (000),
(111),
(222),
(333),
(444),
(555),
(666),
(777),
(888),
(999)) tc (ID)),
keys
AS (SELECT *
FROM (VALUES (1,20),
(2,50),
(3,30)) tc(val, per)),
num_gen
AS (SELECT 1 AS num,
Count(1) AS cnt
FROM base_dataset
UNION ALL
SELECT num + 1,
cnt
FROM num_gen
WHERE num < cnt)
SELECT Id,val
FROM (SELECT Row_number()OVER(ORDER BY newid()) rn,
val
FROM num_gen n
JOIN keys k
ON n.num <= (k.per/100.0) * cnt) a
JOIN base_dataset d
ON d.rn = a.rn
I have used a recursive CTE to generate the numbers; you could instead create a permanent numbers table in the database and use that.
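The allocation itself, expand each percentage into a count of slots, shuffle the rows, and pair them up by position, can be sketched in plain Python:

```python
import random

ids = [0, 111, 222, 333, 444, 555, 666, 777, 888, 999]  # the base dataset
keys = {1: 20, 2: 50, 3: 30}                            # val -> percent

# Expand each key into per/100 * n slots, like joining the numbers table to keys
n = len(ids)
slots = [val for val, per in keys.items() for _ in range(int(per / 100 * n))]

random.seed(0)       # fixed seed so the sketch is reproducible
random.shuffle(ids)  # the ORDER BY NEWID() analogue
assignment = dict(zip(ids, slots))  # pair by position, like joining on rn
print(sorted(assignment.items()))
```

With 10 rows and 20/50/30 percentages, exactly 2, 5, and 3 rows land in groups 1, 2, and 3.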
SQL Splitting dataset into 3 sections 60/20/20 for testing
The simplest method is probably just using random():
select t.*,
       (case when random() < 0.6 then 'group1'
             -- each random() call is independent: 50% of the remaining 40% = 20% overall
             when random() < 0.5 then 'group2'
             else 'group3'
        end)
from t;
This is only approximate in the counts. You can get more precision using window functions:
select t.*,
       (case when tile <= 6 then 'group1'
             when tile <= 8 then 'group2'
             else 'group3'
        end)
from (select t.*,
             ntile(10) over (order by random()) as tile
      from t
     ) t
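For an exact 60/20/20 split without window functions, a common alternative is to shuffle once and slice by position; here is a sketch in pandas (the DataFrame is a hypothetical stand-in for table t):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(50)})  # hypothetical table t

shuffled = df.sample(frac=1, random_state=0)  # random order, like order by random()
n = len(shuffled)
labels = np.empty(n, dtype=object)
labels[: int(0.6 * n)] = "group1"                 # first 60% of rows
labels[int(0.6 * n): int(0.8 * n)] = "group2"     # next 20%
labels[int(0.8 * n):] = "group3"                  # final 20%
shuffled["grp"] = labels
print(shuffled["grp"].value_counts().to_dict())
```

With 50 rows this gives exactly 30/10/10, whereas the random() version only approximates those counts.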
How to randomly split a DataFrame into several smaller DataFrames?
Use np.array_split:
shuffled = df.sample(frac=1)
result = np.array_split(shuffled, 5)
df.sample(frac=1) shuffles the rows of df, and np.array_split then splits it into parts of (nearly) equal size.
It gives you:
for part in result:
print(part,'\n')
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
5 6 5 0 0 0 0 0 0 5 0 0 0 10
4 5 3 0 0 0 0 0 0 0 0 0 0 3
7 8 1 0 0 0 4 5 0 0 0 4 0 14
16 17 3 0 0 4 0 0 0 0 0 0 0 7
22 23 4 0 0 0 4 3 0 0 5 0 0 16
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
13 14 5 4 0 0 5 0 0 0 0 0 0 14
14 15 5 0 0 0 3 0 0 0 0 5 5 18
21 22 4 0 0 0 3 5 5 0 5 4 0 26
1 2 3 0 0 3 0 0 0 0 0 0 0 6
20 21 1 0 0 3 3 0 0 0 0 0 0 7
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
10 11 2 0 4 0 0 3 3 0 4 2 0 18
9 10 3 2 0 0 0 4 0 0 0 0 0 9
11 12 5 0 0 0 4 5 0 0 5 2 0 21
8 9 5 0 0 0 4 5 0 0 4 5 0 23
12 13 5 4 0 0 2 0 0 0 3 0 0 14
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
18 19 5 3 0 0 4 0 0 0 0 0 0 12
3 4 3 0 0 0 0 5 0 0 4 0 5 17
0 1 5 4 0 4 4 0 0 0 4 0 0 21
23 24 3 0 0 4 0 0 0 0 0 3 0 10
6 7 4 0 0 0 2 5 3 4 4 0 0 22
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
17 18 4 0 0 0 0 0 0 0 0 0 0 4
2 3 4 0 0 0 0 0 0 0 0 0 0 4
15 16 5 0 0 0 0 0 0 0 4 0 0 9
19 20 4 0 0 0 0 0 0 0 0 0 0 4
How to randomly select a certain percent of rows in Access with another condition?
Since you are already selecting a random portion, the question is really just about selection criteria involving a "total". The key here is that you need another query, an aggregate query. The other query can be either another saved query, an embedded subquery, or a call to a function which performs the query.
Using a subquery to get the total:
SELECT TOP 10 PERCENT *
FROM Students
WHERE StudentType='girl'
AND (Students.[Spent] / (SELECT SUM(S2.[Spent]) FROM Students As S2) = 0.30)
ORDER BY rnd(ID)
Make sure to add a different alias to the same table, since Access can get confused if the subquery has a table with the same name as the main query. The question did not mention the "amount spent" column so I just guessed. This also assumes that "both groups" is essentially the same as "all student records". If that's not the case then you could add to the subquery WHERE S2.StudentType In ('girl', 'boy')
.
Using a domain aggregate function:
SELECT TOP 10 PERCENT *
FROM Students
WHERE StudentType='girl'
AND (Students.[Spent] / DSum("[Spent]", "Students", "") = 0.30)
ORDER BY rnd(ID)
Using another saved query:
First create and save the separate aggregate query as [Summed]:
SELECT SUM(S2.[Spent]) As TotalSpent FROM Students As S2
Now do a cross join so that each row is paired with the total:
SELECT TOP 10 PERCENT *
FROM Students, Summed
WHERE StudentType='girl'
AND (Students.[Spent] / Summed.TotalSpent = 0.30)
ORDER BY rnd(ID)
The efficiency of each solution may vary. For a small table of students it might not matter. If it does become an issue, I have found that the Domain Aggregate functions are not very efficient even though they appear to be simpler to use. More powerful query engines (not Access) are often better at analyzing a query plan and automatically reducing redundant calculations, but with Access you have to plan that out yourself.
Last note: If you have more complicated grouping, any of the solutions will have additional join conditions. For instance, if the aggregate query also had a GROUP BY clause on an ID, then instead of a cross join, you'd now want an INNER JOIN matching the ID in the main table. In the case of the domain aggregate function, you'd want to specify a criteria parameter that refers to a table field value. The point is that the above examples are not a precise template for all cases.
How to split a DataFrame in pandas in predefined percentages?
Use numpy.split:
a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.random((20,5)), columns=list('ABCDE'))
#print (df)
a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
print (a)
A B C D E
0 0.543405 0.278369 0.424518 0.844776 0.004719
1 0.121569 0.670749 0.825853 0.136707 0.575093
2 0.891322 0.209202 0.185328 0.108377 0.219697
3 0.978624 0.811683 0.171941 0.816225 0.274074
print (b)
A B C D E
4 0.431704 0.940030 0.817649 0.336112 0.175410
5 0.372832 0.005689 0.252426 0.795663 0.015255
6 0.598843 0.603805 0.105148 0.381943 0.036476
7 0.890412 0.980921 0.059942 0.890546 0.576901
8 0.742480 0.630184 0.581842 0.020439 0.210027
9 0.544685 0.769115 0.250695 0.285896 0.852395
print (c)
A B C D E
10 0.975006 0.884853 0.359508 0.598859 0.354796
11 0.340190 0.178081 0.237694 0.044862 0.505431
12 0.376252 0.592805 0.629942 0.142600 0.933841
13 0.946380 0.602297 0.387766 0.363188 0.204345
14 0.276765 0.246536 0.173608 0.966610 0.957013
15 0.597974 0.731301 0.340385 0.092056 0.463498
16 0.508699 0.088460 0.528035 0.992158 0.395036
17 0.335596 0.805451 0.754349 0.313066 0.634037
18 0.540405 0.296794 0.110788 0.312640 0.456979
19 0.658940 0.254258 0.641101 0.200124 0.657625