Distinct Random Time Generation in the Fixed Interval

Interpretation of Original Question:

The question states:

  • Generate a random time between 8:00 AM and 8:00 PM (i.e. a 12-hour window)
  • It should be different for each row (i.e. unique across all rows)
  • The real table has around 2800 records

Now factor in the following points:

  • Sample data shows only a single date
  • There are 86,400 seconds in 24 hours, hence 43,200 seconds in 12 hours

There is some ambiguity in the following areas:

  • What exactly does "random" mean in the context of "different for every row"? Truly random values cannot be guaranteed to be different for every row; in fact, they could theoretically be the same for every row. So is the emphasis on "random" or on "different"? Or are we really talking about values that are different but not sequentially ordered (to give the appearance of randomness without actually being random)?
  • What if there are ever more than 2800 rows? What if there are 1 million rows?
  • If there can be more than 43,200 rows, how to handle "different for each row" (since it is not possible to have unique across all rows)?
  • Will the date ever vary? If so, are we really talking about "different for each row per date"?
  • If "different for each row per date":

    • Can the times for each date follow the same, non-sequential pattern? Or does the pattern need to differ per each date?
    • Will there ever be more than 43,200 rows for any particular date? If so, the times can only be unique per each set of 43,200 rows.

Given the information above, there are a few ways to interpret the request:

  1. Emphasis on "random": Dates and number of rows don't matter. Generate truly random times that are highly likely, but not guaranteed, to be unique using one of the three methods shown in the other answers:

    • @notulysses: RAND(CAST(NEWID() AS VARBINARY)) * 43200
    • @Steve Ford: ABS(CHECKSUM(NewId()) % 43201)
    • @Vladimir Baranov: CAST(43200000 * (CAST(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5) as int)
  2. Emphasis on "different for each row", always <= 43,200 rows: If the number of rows never exceeds the number of available seconds, it is easy to guarantee unique times across all rows, regardless of same or different dates, and appear to be randomly ordered.
  3. Emphasis on "different for each row", could be > 43,200 rows: If the number of rows can exceed the number of available seconds, then it is not possible to guarantee uniqueness across all rows, but it would be possible to still guarantee uniqueness across rows of any particular date, provided that no particular date has > 43,200 rows.

Hence, I based my answer on the idea that:

  • Even if the number of rows for the O.P. never exceeds 2800, it is more likely that most others who are encountering a similar need for randomness would have a larger data set to work with (i.e. there could easily be 1 million rows, for any number of dates: 1, 5000, etc.)
  • Either the sample data is overly simplistic in using the same date for all 5 rows, or even if the date is the same for all rows in this particular case, in most other cases that is less likely to happen
  • Uniqueness is to be favored over Randomness
  • If there is a pattern to the "seemingly random" ordering of the seconds for each date, there should at least be a varying offset to the start of the sequence across the dates (when the dates are ordered sequentially) to give the appearance of randomness between any small grouping of dates.

Answer:

If the situation requires unique times, that cannot be guaranteed with any method of generating truly random values. I really like the use of CRYPT_GEN_RANDOM by @Vladimir Baranov, but it is nearly impossible to get a unique set of values generated:

DECLARE @Table TABLE (Col1 BIGINT NOT NULL UNIQUE);

INSERT INTO @Table (Col1)
SELECT CONVERT(BIGINT, CRYPT_GEN_RANDOM(4))
FROM [master].sys.objects so
CROSS JOIN [master].sys.objects so2
CROSS JOIN [master].sys.objects so3;
-- 753,571 rows
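
To see why a unique set is so unlikely here, a rough birthday-problem estimate (my own back-of-the-envelope sketch, not part of the original test) is enough; it assumes CRYPT_GEN_RANDOM(4) behaves like a uniform draw from the 2^32 possible 4-byte values:

# Birthday-problem estimate for the two tests above (Python).
rows = 753_571                 # rows produced by the triple CROSS JOIN
space_4_bytes = 2 ** 32        # possible CRYPT_GEN_RANDOM(4) values
space_8_bytes = 2 ** 64        # possible CRYPT_GEN_RANDOM(8) values

# Expected number of colliding pairs is roughly rows^2 / (2 * space).
print(rows ** 2 / (2 * space_4_bytes))   # ~66      -> duplicates practically certain
print(rows ** 2 / (2 * space_8_bytes))   # ~1.5e-8  -> duplicates very unlikely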

Increasing the random value to 8 bytes does seem to work:

DECLARE @Table TABLE (Col1 BIGINT NOT NULL UNIQUE);

INSERT INTO @Table (Col1)
SELECT CONVERT(BIGINT, CRYPT_GEN_RANDOM(8))
FROM [master].sys.objects so
CROSS JOIN [master].sys.objects so2
CROSS JOIN [master].sys.objects so3;
-- 753,571 rows

Of course, if we are generating down to the second, then there are only 86,400 of those. Reducing the scope seems to help as the following does occasionally work:

DECLARE @Table TABLE (Col1 BIGINT NOT NULL UNIQUE);

INSERT INTO @Table (Col1)
SELECT TOP (86400) CONVERT(BIGINT, CRYPT_GEN_RANDOM(4))
FROM [master].sys.objects so
CROSS JOIN [master].sys.objects so2
CROSS JOIN [master].sys.objects so3;

However, things get a bit trickier if the uniqueness needs to be per day (which seems like a reasonable requirement for this type of project, as opposed to unique across all days). But a random number generator isn't going to know to reset at each new day.

If it is acceptable to merely have the appearance of being random, then we can guarantee uniqueness per each date without:

  • looping / cursor constructs
  • saving already used values in a table
  • using RAND(), NEWID(), or CRYPT_GEN_RANDOM()

The following solution uses the concept of Modular Multiplicative Inverses (MMI), which I learned about in this answer: generate seemingly random unique numeric ID in SQL Server. Of course, that question did not have a tightly-defined range of values like we have here, with only 86,400 of them per day. So I used a range of 86400 (as the "Modulo") and tried a few "coprime" values (as the "Integer") in an online calculator to get their MMIs (a quick verification sketch follows the list):

  • 13 (MMI = 39877)
  • 37 (MMI = 51373)
  • 59 (MMI = 39539)
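
As a quick sanity check (my own Python sketch, not from the linked answer), the MMI property of those pairs and the resulting one-to-one mapping over the 86,400 seconds can be verified directly:

# Verify the modular multiplicative inverses listed above (modulo 86400)
# and confirm that multiplying by one of them permutes the full range of seconds.
modulo = 86400
pairs = [(13, 39877), (37, 51373), (59, 39539)]

for integer, mmi in pairs:
    assert (integer * mmi) % modulo == 1    # definition of the MMI

# 39539 is coprime to 86400, so this mapping is a bijection: every one of the
# 86,400 seconds maps to a distinct "scrambled" second.
mapped = {(s * 39539) % modulo for s in range(modulo)}
assert len(mapped) == modulo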

I use ROW_NUMBER() in a CTE, partitioned (i.e. grouped) by CREATED_DATE, as a means of assigning each row a "second of the day" value.

But, while the values generated for seconds 0, 1, 2, and so on will appear random within a day, any particular second would map to the same value across different days. So, the second CTE (named "WhichSecond") shifts the starting point for each date by converting the date to an INT (which yields a sequential day offset from 1900-01-01) and then multiplying by 101.

DECLARE @Data TABLE
(
  ID INT NOT NULL IDENTITY(1, 1),
  CREATED_DATE DATE NOT NULL
);

INSERT INTO @Data (CREATED_DATE) VALUES ('2014-10-05');
INSERT INTO @Data (CREATED_DATE) VALUES ('2014-10-05');
INSERT INTO @Data (CREATED_DATE) VALUES ('2014-10-05');
INSERT INTO @Data (CREATED_DATE) VALUES ('2014-10-05');
INSERT INTO @Data (CREATED_DATE) VALUES ('2014-10-05');
INSERT INTO @Data (CREATED_DATE) VALUES ('2015-03-15');
INSERT INTO @Data (CREATED_DATE) VALUES ('2016-10-22');
INSERT INTO @Data (CREATED_DATE) VALUES ('2015-03-15');

;WITH cte AS
(
  SELECT tmp.ID,
         CONVERT(DATETIME, tmp.CREATED_DATE) AS [CREATED_DATE],
         ROW_NUMBER() OVER (PARTITION BY tmp.CREATED_DATE ORDER BY (SELECT NULL))
           AS [RowNum]
  FROM   @Data tmp
), WhichSecond AS
(
  SELECT cte.ID,
         cte.CREATED_DATE,
         ((CONVERT(INT, cte.[CREATED_DATE]) - 29219) * 101) + cte.[RowNum]
           AS [ThisSecond]
  FROM   cte
)
SELECT parts.*,
       (parts.ThisSecond % 86400) AS [NormalizedSecond], -- wrap around to 0 when
                                                         -- value goes above 86,400
       ((parts.ThisSecond % 86400) * 39539) % 86400 AS [ActualSecond],
       DATEADD(
         SECOND,
         (((parts.ThisSecond % 86400) * 39539) % 86400),
         parts.CREATED_DATE
       ) AS [DateWithUniqueTime]
FROM   WhichSecond parts
ORDER BY parts.ID;

Returns:

ID  CREATED_DATE  ThisSecond  NormalizedSecond  ActualSecond  DateWithUniqueTime
1   2014-10-05    1282297     72697             11483         2014-10-05 03:11:23.000
2   2014-10-05    1282298     72698             51022         2014-10-05 14:10:22.000
3   2014-10-05    1282299     72699             4161          2014-10-05 01:09:21.000
4   2014-10-05    1282300     72700             43700         2014-10-05 12:08:20.000
5   2014-10-05    1282301     72701             83239         2014-10-05 23:07:19.000
6   2015-03-15    1298558     2558              52762         2015-03-15 14:39:22.000
7   2016-10-22    1357845     61845             83055         2016-10-22 23:04:15.000
8   2015-03-15    1298559     2559              5901          2015-03-15 01:38:21.000

If we want to generate times only between 8:00 AM and 8:00 PM, we need to make just a few minor adjustments:

  1. Change the range (as "Modulo") from 86400 to half of it: 43200
  2. Recalculate the MMI (can use the same "coprime" values as "Integer"): 39539 (same as before)
  3. Add 28800 to the second parameter of the DATEADD as an 8 hour offset

The result will be a change to just one line (since the others are diagnostic):

-- second parameter of the DATEADD() call
28800 + (((parts.ThisSecond % 43200) * 39539) % 43200)
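
A small Python check (my own sketch, same spirit as the one above) confirms both adjustments: 39539 is still the inverse of 59 modulo 43200, and the adjusted expression only produces seconds between 8:00 AM and 8:00 PM:

# Same approach as before, but with a 43,200-second (12-hour) range
# and an 8-hour (28,800-second) offset.
modulo = 43200
assert (59 * 39539) % modulo == 1          # the MMI of 59 is still 39539

seconds = {28800 + ((s % modulo) * 39539) % modulo for s in range(modulo)}
assert len(seconds) == modulo              # still one distinct value per second
assert min(seconds) == 28800               # 08:00:00
assert max(seconds) == 28800 + 43199       # 19:59:59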

Another means of shifting each day in a less predictable fashion would be to make use of RAND() by passing in the INT form of CREATED_DATE in the "WhichSecond" CTE. This gives a stable offset per date, since RAND(x) returns the same value y for the same value of x passed in, but a different value y for a different value of x. Meaning:

RAND(1) = y1

RAND(2) = y2

RAND(3) = y3

RAND(2) = y2

The second time RAND(2) was called, it still returned the same value of y2 that it returned the first time it was called.

Hence, the "WhichSecond" CTE could be:

(
  SELECT cte.ID,
         cte.CREATED_DATE,
         (RAND(CONVERT(INT, cte.[CREATED_DATE])) * {some number}) + cte.[RowNum]
           AS [ThisSecond]
  FROM   cte
)

Generate 'n' unique random numbers within a range

If you just need sampling without replacement:

>>> import random
>>> random.sample(range(1, 100), 3)
[77, 52, 45]

random.sample takes a population and a sample size k and returns k random members of the population.

If you have to control for the case where k is larger than len(population), you need to be prepared to catch a ValueError:

>>> try:
...     random.sample(range(1, 2), 3)
... except ValueError:
...     print('Sample size exceeded population size.')
...
Sample size exceeded population size.

Generate random SQL Server 2008 time test data

There are 86,400,000 milliseconds in a day, so you can get a random time value by doing this:

select dateadd(millisecond, cast(86400000 * RAND() as int), convert(time, '00:00'))

For your example where you want times between 8:00 and 9:00, there are 3,600,000 milliseconds in an hour, so modify the query like this.

select dateadd(millisecond, cast(3600000 * RAND() as int), convert(time, '08:00'))

In order to put it into your new table, you might either do a T-SQL loop with updates (s...l...o...w...), or do a SELECT INTO from your original table into a new table.

Generating m distinct random numbers in the range [0..n-1]

Pure mathematics:

Let's calculate the quantity of rand() function calls in both cases and compare the results:

Case 1:
Let's see the mathematical expectation of calls on step i = k, when you already have k numbers chosen. The probability of getting a new (not yet chosen) number with one rand() call is p = (n - k)/n. We need the mathematical expectation of how many calls it takes to obtain a number we don't have yet.

The probability of getting it with the 1st call is p. With exactly 2 calls it is q * p, where q = 1 - p. In general, the probability of getting it exactly on the j-th call is q^(j-1) * p. Thus, the mathematical expectation is

Sum[ j * q^(j-1) * p ], j = 1 --> INF. This sum is equal to 1/p (verified with Wolfram Alpha).

So, on the step i = k you will perform, on average, 1/p = n/(n - k) calls of the rand() function.

Now let's sum it over all steps:

Sum[ n/(n - k) ], k = 0 --> m - 1 = n * T, which is the expected number of rand() calls in method 1.

Here T = Sum[ 1/(n - k) ], k = 0 --> m - 1.

Case 2:

Here rand() is called inside random_shuffle n - 1 times (in most implementations).

Now, to choose the method, we have to compare these two values: n * T ? n - 1.

So, to choose the appropriate method, calculate T as described above. If T < (n - 1)/n it's better to use the first method. Use the second method otherwise.
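
For concreteness, here is a small Python sketch (my addition, following the formulas above) that computes T and applies the decision rule for given n and m:

# Compare the expected number of rand() calls for the two methods described above.
def expected_calls_method_1(n, m):
    # n * T, where T = Sum[ 1/(n - k) ], k = 0 --> m - 1
    return n * sum(1.0 / (n - k) for k in range(m))

def better_method(n, m):
    t = sum(1.0 / (n - k) for k in range(m))
    # Method 1 wins when n * T < n - 1, i.e. when T < (n - 1) / n.
    return "method 1 (retry on duplicates)" if t < (n - 1) / n else "method 2 (shuffle)"

print(expected_calls_method_1(100, 5))   # few numbers from a big range: ~5.1 calls
print(better_method(100, 5))             # method 1
print(better_method(100, 90))            # method 2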

How to generate unique random numbers (that don't repeat)?

One straightforward way to get non-repeating 'random' (pseudorandom) whole numbers in a modest range is to create a list using range(1, n), then random.shuffle() the list, and then take as many numbers as you want from the list using pop() or a slice.

import random

max = 11
l = list(range(1, max)) # the cast to list is optional in Python 2
random.shuffle(l)

Now every time you want a random number, just l.pop().

Another is to use random.sample() -- see https://docs.python.org/3/library/random.html

How do I generate a random number for each row in a T-SQL select?

Take a look at SQL Server - Set based random numbers which has a very detailed explanation.

To summarize, the following code generates a random number between 0 and 13 inclusive with a uniform distribution:

ABS(CHECKSUM(NewId())) % 14

To change your range, just change the number at the end of the expression. Be extra careful if you need a range that includes both positive and negative numbers. If you do it wrong, it's possible to double-count the number 0.

A small warning for the math nuts in the room: there is a very slight bias in this code. CHECKSUM() results in numbers that are uniform across the entire range of the sql Int datatype, or at least as near so as my (the editor) testing can show. However, there will be some bias when CHECKSUM() produces a number at the very top end of that range. Any time you get a number between the maximum possible integer and the last exact multiple of the size of your desired range (14 in this case) before that maximum integer, those results are favored over the remaining portion of your range that cannot be produced from that last multiple of 14.

As an example, imagine the entire range of the Int type is only 19. 19 is the largest possible integer you can hold. When CHECKSUM() results in 14-19, these correspond to results 0-5. Those numbers would be heavily favored over 6-13, because CHECKSUM() is twice as likely to generate them. It's easier to demonstrate this visually. Below is the entire possible set of results for our imaginary integer range:


Checksum Integer: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19
Range Result:     0  1  2  3  4  5  6  7  8  9  10 11 12 13 0  1  2  3  4  5

You can see here that there are more chances to produce some numbers than others: bias. Thankfully, the actual range of the Int type is much larger... so much so that in most cases the bias is nearly undetectable. However, it is something to be aware of if you ever find yourself doing this for serious security code.
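
The toy example above can be reproduced with a few lines of Python (my own illustration, not part of the original answer):

from collections import Counter

# Toy integer type whose whole range is 0..19, mapped with % 14 as described above.
counts = Counter(value % 14 for value in range(20))

print(counts)
# 0-5 each occur twice (two checksum values map to them), 6-13 only once: the bias.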

Unique random number generation in an integer array

There are several ways to solve your problem, each has its own advantages and disadvantages.

First I'd like to note that you already got quite a few responses that do the following: they generate a random number, then somehow check whether it was already used in the array, and if it was already used, they just generate another number until they find an unused one.
This is a naive and, truth be told, seriously flawed approach. The problem is with the cyclic trial-and-error nature of the number generation ("if already used, try again"). If the numeric range (say, [1..N]) is close to the length of the desired array (say, M), then towards the end the algorithm might spend a huge amount of time trying to find the next number. If the random number generator is even a little bit broken (say, never generates some number, or does so very rarely), then with N == M the algorithm is guaranteed to loop forever (or for a very long time). Generally this trial-and-error approach is a useless one, or a flawed one at best.

Another approach already presented here is generating a random permutation in an array of size N. The idea of random permutation is a promising one, but doing it on an array of size N (when M << N) will certainly generate more heat than light, speaking figuratively.

Good solutions to this problem can be found, for example, in Bentley's "Programming Pearls" (and some of them are taken from Knuth).


  • The Knuth algorithm. This is a very simple algorithm with a complexity of O(N) (i.e. proportional to the numeric range), meaning that it is most usable when M is close to N. However, this algorithm doesn't require any extra memory in addition to your vektor array, as opposed to the already offered permutation-based variants (meaning that it takes O(M) memory, not O(N) as the other permutation-based algorithms suggested here do). The latter makes it a viable algorithm even for M << N cases.

The algorithm works as follows: iterate through all numbers from 1 to N and select the current number with probability rm / rn, where rm is how many numbers we still need to find, and rn is how many numbers we still need to iterate through. Here's a possible implementation for your case:

#define M 10
#define N 100

int in, im;

im = 0;

for (in = 0; in < N && im < M; ++in) {
    int rn = N - in;
    int rm = M - im;
    if (rand() % rn < rm)
        /* Take it */
        vektor[im++] = in + 1; /* +1 since your range begins from 1 */
}

assert(im == M);

After this cycle we get an array vektor filled with randomly chosen numbers in ascending order. The "ascending order" bit is what we don't need here. So, in order to "fix" that, we just make a random permutation of the elements of vektor and we are done. Note that this is an O(M) permutation requiring no extra memory. (I leave out the implementation of the permutation algorithm. Plenty of links were given here already.)

If you look carefully at the permutation-based algorithms proposed here that operate on an array of length N, you'll see that most of them are pretty much this very same Knuth algorithm, but re-formulated for M == N. In that case the above selection cycle will choose each and every number in the [1..N] range with probability 1, effectively turning into the initialization of an N-array with the numbers 1 to N. Taking this into account, I think it becomes rather obvious that running this algorithm for M == N and then truncating the result (possibly discarding most of it) makes much less sense than just running it in its original form for the original value of M and getting the result right away, without any truncation.


  • The Floyd algorithm (see here). This approach has the complexity of about O(M) (depends on the search structure used), so it is better suitable when M << N. This approach keeps track of already generated random numbers, so it requires extra memory. However, the beauty of it is that it does not make any of those abominable trial-and-error iterations, trying to find an unused random number. This algorithm is guaranteed to generate one unique random number after each call to the random number generator.

Here's a possible implementation of it for your case. (There are different ways to keep track of already used numbers. I'll just use an array of flags, assuming that N is not prohibitively large.)

#define M 10
#define N 100

unsigned char is_used[N] = { 0 }; /* flags */
int in, im;

im = 0;

for (in = N - M; in < N && im < M; ++in) {
    int r = rand() % (in + 1); /* generate a random number 'r' */

    if (is_used[r])
        /* we already have 'r' */
        r = in; /* use 'in' instead of the generated number */

    assert(!is_used[r]);
    vektor[im++] = r + 1; /* +1 since your range begins from 1 */
    is_used[r] = 1;
}

assert(im == M);

Why the above works is not immediately obvious. But it works. Exactly M numbers from [1..N] range will be picked with uniform distribution.

Note, that for large N you can use a search-based structure to store "already used" numbers, thus getting a nice O(M log M) algorithm with O(M) memory requirement.
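
For reference, here is a rough sketch (my addition, in Python rather than C) of that variant, using a set as the search structure and keeping the same [1..N] range as the question:

import random

def floyd_sample(n, m):
    # Floyd's algorithm with a set as the "already used" structure:
    # O(M) expected time with a hash set, O(M) memory, and no retry loops.
    used = set()
    result = []
    for j in range(n - m + 1, n + 1):   # 1-based equivalent of the C loop above
        r = random.randint(1, j)        # random number in [1..j]
        if r in used:                   # same substitution trick as the C version
            r = j
        used.add(r)
        result.append(r)
    return result                       # the ordering caveat in the note below still applies

print(floyd_sample(100, 10))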

(There's one thing about this algorithm though: while the resultant array will not be ordered, a certain "influence" of the original 1..N ordering will still be present in the result. For example, it is obvious that number N, if selected, can only be the very last member of the resultant array. If this "contamination" of the result by the unintended ordering is not acceptable, the resultant vektor array can be random-shuffled, just like in the Knuth algorithm.)


Note the very critical point observed in the design of these two algorithms: they never loop, trying to find a new unused random number. Any algorithm that makes trial-and-error iterations with random numbers is flawed from a practical point of view. Also, the memory consumption of these algorithms is tied to M, not to N.

To the OP I would recommend Floyd's algorithm, since in his application M seems to be considerably less than N and it doesn't (or may not) require an extra pass for permutation. However, for such small values of N the difference might be negligible.

How do I create a list of random numbers without duplicates

This will return a list of 10 numbers selected from the range 0 to 99, without duplicates.

import random
random.sample(range(100), 10)

With reference to your specific code example, you probably want to read all the lines from the file once and then select random lines from the saved list in memory. For example:

all_lines = f1.readlines()
for i in range(50):
    lines = random.sample(all_lines, 40)

This way, you only need to actually read from the file once, before your loop. It's much more efficient to do this than to seek back to the start of the file and call f1.readlines() again for each loop iteration.

Generating unique random numbers

Following are two possible approaches :-

Method 1 :-

  1. Have all numbers in an array of size n.
  2. Select an index at random: i = rand(0, size).
  3. Print arr[i].
  4. Swap arr[i] and arr[size-1].
  5. size = size - 1
  6. Repeat steps 2 to 5 till the list is exhausted (a sketch follows the complexity notes below).

Time Complexity: O(N)

Space Complexity: O(N)
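
A minimal Python sketch of Method 1 (my own illustration of the steps above, using n = 10):

import random

# Method 1: pick a random index, print it, then swap it to the shrinking end of
# the array so it cannot be picked again.
arr = list(range(10))                   # step 1: all numbers in an array of size n
size = len(arr)

while size > 0:                         # repeat till the list is exhausted
    i = random.randrange(size)          # step 2: random index in [0, size)
    print(arr[i])                       # step 3
    arr[i], arr[size - 1] = arr[size - 1], arr[i]   # step 4
    size -= 1                           # step 5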

Method 2:-

  1. Select a pool size of K.
  2. Generate K random integers.
  3. Show them as the first K results.
  4. Add them to a hashset.
  5. For each of those integers a, add a + 1 to the pool.
  6. Add it only if it is not already in the hashset.
  7. Pick an integer r at random from the pool and show it.
  8. Add r + 1 to the pool if it is not in the hashset.
  9. Repeat steps 7 to 8 till the pool is exhausted.

Time complexity: O(N)

Space complexity: O(K)

Pro & Cons :-

Method 1: Use this method for small integer ranges, as it requires more space but is very fast and random.

Method 2: Use this method for larger ranges, as it takes O(K) memory, where K is of your choice. The higher the K, the higher the randomness of the numbers generated. So you can achieve a nice trade-off between space and randomness while maintaining good speed.


