Efficiently generate a random sample of times and dates between two dates
Ahh, another date/time problem we can reduce to working in floats :)
Try this function
R> latemail <- function(N, st="2012/01/01", et="2012/12/31") {
+ st <- as.POSIXct(as.Date(st))
+ et <- as.POSIXct(as.Date(et))
+ dt <- as.numeric(difftime(et, st, units="secs"))
+ ev <- sort(runif(N, 0, dt))
+ rt <- st + ev
+ }
R>
We compute the difftime in seconds, and then "merely" draw uniforms over it, sorting the result. Add that to the start and you're done:
R> set.seed(42); print(latemail(5)) ## round to date, or hour, or ...
[1] "2012-04-14 05:34:56.369022 CDT" "2012-08-22 00:41:26.683809 CDT"
[3] "2012-10-29 21:43:16.335659 CDT" "2012-11-29 15:42:03.387701 CST"
[5] "2012-12-07 18:46:50.233761 CST"
R> system.time(latemail(100000))
user system elapsed
0.024 0.000 0.021
R> system.time(latemail(200000))
user system elapsed
0.044 0.000 0.045
R> system.time(latemail(10000000)) ## a few more than in your example :)
user system elapsed
3.240 0.172 3.428
R>
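For comparison, here is a minimal Python sketch of the same idea: measure the interval in seconds, draw uniform offsets, sort them, and add them to the start. The function name `random_datetimes` and the `seed` parameter are made up for this illustration and are not part of the R answer.

```python
import random
from datetime import datetime, timedelta

def random_datetimes(n, start, end, seed=None):
    """Return n sorted datetimes drawn uniformly between start and end."""
    rng = random.Random(seed)                 # hypothetical seed argument, for repeatability
    span = (end - start).total_seconds()      # interval length as a float
    offsets = sorted(rng.uniform(0, span) for _ in range(n))
    return [start + timedelta(seconds=s) for s in offsets]

sample = random_datetimes(5, datetime(2012, 1, 1), datetime(2012, 12, 31), seed=42)
for t in sample:
    print(t)
```

As in the R version, rounding to the date or the hour can be done afterwards if sub-second precision is not wanted.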
Sample time between time range 08:00:00 and 15:00:00
Create a sequence of times within the specified range and sample from it. The times would carry today's date, so to get only the time component we use format.
all_times <- format(seq(as.POSIXct('08:00:00', format = "%T"),
as.POSIXct('15:00:00', format = "%T"), by = "sec"), "%T")
sample(all_times, 3)
#[1] "11:51:16" "09:50:10" "13:09:21"
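A rough Python equivalent, assuming one-second resolution as above: instead of formatting every second of the range first, it samples second offsets and formats only the ones drawn. The name `random_times_of_day` and its parameters are hypothetical.

```python
import random
from datetime import datetime, timedelta

def random_times_of_day(k, start="08:00:00", end="15:00:00", seed=None):
    """Draw k distinct HH:MM:SS strings between start and end (both inclusive)."""
    rng = random.Random(seed)
    t0 = datetime.strptime(start, "%H:%M:%S")
    t1 = datetime.strptime(end, "%H:%M:%S")
    n_seconds = int((t1 - t0).total_seconds()) + 1   # +1 so the end time is eligible
    picks = rng.sample(range(n_seconds), k)          # without replacement, like sample()
    return [(t0 + timedelta(seconds=s)).strftime("%H:%M:%S") for s in picks]

times = random_times_of_day(3, seed=1)
print(times)
```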
Generate dates between two dates in a dataframe
Assuming that the 'min_date'/'max_date' columns are of Date class, we use Map to get the sequence from each 'min_date' to the corresponding 'max_date' in a list, replicate the row indices of 'df1' by the lengths of the list elements, expand the dataset into a data.frame based on 'i1', and create 'dates' by concatenating the 'lst' elements.
lst <- Map(function(x, y) seq(x,y, by = "1 day"), df1$min_date, df1$max_date)
i1 <- rep(1:nrow(df1), lengths(lst))
data.frame(df1[i1,-3], dates = do.call("c", lst))
Or if we are using dplyr
library(dplyr)
df1 %>%
rowwise() %>%
do(data.frame(.[1:2], date = seq(.$min_date, .$max_date, by = "1 day")))
Or, using data.table, we can do this in a single line of code:
library(data.table)
setDT(df1)[,.(date = seq(min_date, max_date, by = "1 day")) ,.(id1, id2)]
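The same expansion can be sketched in plain Python, assuming each row is a dict with the 'id1', 'id2', 'min_date' and 'max_date' keys (the sample rows below are invented for illustration). Each input row becomes one output row per day in its range, both ends included.

```python
from datetime import date, timedelta

def expand_date_ranges(rows):
    """Turn each (min_date, max_date) row into one row per calendar day."""
    out = []
    for row in rows:
        n_days = (row["max_date"] - row["min_date"]).days
        for offset in range(n_days + 1):          # +1 keeps max_date itself
            out.append({"id1": row["id1"], "id2": row["id2"],
                        "date": row["min_date"] + timedelta(days=offset)})
    return out

# Hypothetical input, standing in for df1
rows = [
    {"id1": 1, "id2": "a", "min_date": date(2020, 1, 1), "max_date": date(2020, 1, 3)},
    {"id1": 2, "id2": "b", "min_date": date(2020, 2, 28), "max_date": date(2020, 3, 1)},
]
expanded = expand_date_ranges(rows)
print(len(expanded))
```

Note that a row whose 'max_date' precedes its 'min_date' silently yields no rows in this sketch, whereas seq() in R raises an error in that case.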
What is a good way to select random dates over a given interval using R?
I would use sample.int:
Start <- as.Date("2013-01-01")
End <- as.Date("2013-01-31")
Samp <- Start + sample.int(as.integer(End - Start) + 1L, 5) - 1L  # offsets 0..30, so Start itself can be drawn
How to generate a random date and time between two dates?
Time.at((date2.to_f - date1.to_f)*rand + date1.to_f)
You'll get a time object that is between two given datetimes.
Generate random list of timestamps within multiple time intervals in python
Here is a way to do it. The idea: subtract the total duration of the periods from the time available, generate start delays within what is left, and then shift each start by the cumulative duration of the periods before it. That way we are sure that the intervals won't overlap.
from datetime import datetime, timedelta
import random

def generate_periods(start, end, durations):
    durations = [timedelta(minutes=m) for m in durations]
    total_duration = sum(durations, timedelta())
    nb_periods = len(durations)
    # Time left over once all the periods themselves are accounted for
    open_duration = (end - start) - total_duration
    # Distinct, sorted delays (in whole seconds) within the leftover time
    delays = sorted(timedelta(seconds=s)
                    for s in random.sample(range(0, int(open_duration.total_seconds())), nb_periods))
    periods = []
    periods_before = timedelta()
    for delay, duration in zip(delays, durations):
        # Shift each start by the total length of the periods placed before it
        periods.append((start + delay + periods_before,
                        start + delay + periods_before + duration))
        periods_before += duration
    return periods
Sample run:
durations = [32, 24, 4, 20, 40, 8, 27, 18, 3, 4]
start_time = datetime(2019, 9, 2, 9, 0, 0)
end_time = datetime(2019, 9, 2, 17, 0, 0)
generate_periods(start_time, end_time, durations)
# [(datetime.datetime(2019, 9, 2, 9, 16, 1),
# datetime.datetime(2019, 9, 2, 9, 48, 1)),
# (datetime.datetime(2019, 9, 2, 9, 58, 57),
# datetime.datetime(2019, 9, 2, 10, 22, 57)),
# (datetime.datetime(2019, 9, 2, 10, 56, 41),
# datetime.datetime(2019, 9, 2, 11, 0, 41)),
# (datetime.datetime(2019, 9, 2, 11, 2, 37),
# datetime.datetime(2019, 9, 2, 11, 22, 37)),
# (datetime.datetime(2019, 9, 2, 11, 48, 17),
# datetime.datetime(2019, 9, 2, 12, 28, 17)),
# (datetime.datetime(2019, 9, 2, 13, 4, 28),
# datetime.datetime(2019, 9, 2, 13, 12, 28)),
# (datetime.datetime(2019, 9, 2, 15, 13, 3),
# datetime.datetime(2019, 9, 2, 15, 40, 3)),
# (datetime.datetime(2019, 9, 2, 16, 6, 44),
# datetime.datetime(2019, 9, 2, 16, 24, 44)),
# (datetime.datetime(2019, 9, 2, 16, 37, 42),
# datetime.datetime(2019, 9, 2, 16, 40, 42)),
# (datetime.datetime(2019, 9, 2, 16, 42, 50),
# datetime.datetime(2019, 9, 2, 16, 46, 50))]
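One way to convince yourself that the claim holds is to check the result: every period should lie inside the window, and each one should end before the next begins. The helper below is a hypothetical sketch (it is not part of the answer above); it is tried here on the first three periods of the sample run and on a deliberately overlapping pair.

```python
from datetime import datetime

def periods_are_valid(periods, start, end):
    """True if periods are well-formed, ordered, non-overlapping and inside [start, end]."""
    if not periods:
        return True
    well_formed = all(s <= e for s, e in periods)
    inside = periods[0][0] >= start and periods[-1][1] <= end
    # Each period must end no later than the next one starts
    disjoint = all(prev[1] <= nxt[0] for prev, nxt in zip(periods, periods[1:]))
    return well_formed and inside and disjoint

start_time = datetime(2019, 9, 2, 9, 0, 0)
end_time = datetime(2019, 9, 2, 17, 0, 0)

good = [
    (datetime(2019, 9, 2, 9, 16, 1), datetime(2019, 9, 2, 9, 48, 1)),
    (datetime(2019, 9, 2, 9, 58, 57), datetime(2019, 9, 2, 10, 22, 57)),
    (datetime(2019, 9, 2, 10, 56, 41), datetime(2019, 9, 2, 11, 0, 41)),
]
bad = [
    (datetime(2019, 9, 2, 9, 0, 0), datetime(2019, 9, 2, 10, 0, 0)),
    (datetime(2019, 9, 2, 9, 30, 0), datetime(2019, 9, 2, 10, 30, 0)),  # starts too early
]
print(periods_are_valid(good, start_time, end_time))
print(periods_are_valid(bad, start_time, end_time))
```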
Generate random dates within a range in NumPy
There is a much easier way to achieve this, without needing to explicitly call any libraries beyond numpy.
NumPy has a datetime datatype that is quite powerful: specifically for this case, you can add and subtract integers, and each integer is treated as a count of the value's own time unit (its resolution). For example, for a %Y-%m-%d format:
exampledatetime1 = np.datetime64('2017-01-01')
exampledatetime1 + 1
>>
2017-01-02
However, for a %Y-%m-%d %H:%M:%S format:
exampledatetime2 = np.datetime64('2017-01-01 00:00:00')
exampledatetime2 + 1
>>
2017-01-01 00:00:01
In this case, since you only have information down to day resolution, you can simply do the following:
import numpy as np
bimonthly_days = np.arange(0, 60)
base_date = np.datetime64('2017-01-01')
random_date = base_date + np.random.choice(bimonthly_days)
Or, if you want to be cleaner about it, wrap it in a function:
import numpy as np
def random_date_generator(start_date, range_in_days):
    days_to_add = np.arange(0, range_in_days)
    random_date = np.datetime64(start_date) + np.random.choice(days_to_add)
    return random_date
And then just use:
yourdate = random_date_generator('2012-01-15', 60)
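If you need many dates at once, the same integer-offset trick can be vectorized. This is only a sketch: the function name `random_dates` and the `size`/`seed` parameters are made up here, and it assumes day resolution as in the example above.

```python
import numpy as np

def random_dates(start_date, range_in_days, size, seed=None):
    """Draw `size` dates uniformly from start_date .. start_date + range_in_days - 1."""
    rng = np.random.default_rng(seed)
    offsets = rng.integers(0, range_in_days, size=size)   # upper bound is exclusive
    # Make the unit explicit: the offsets are whole days
    return np.datetime64(start_date, "D") + offsets.astype("timedelta64[D]")

dates = random_dates("2012-01-15", 60, size=1000, seed=0)
print(dates[:5])
```

Spelling out the "D" unit avoids any surprise if the start string also carries a time component, in which case a bare integer would otherwise be counted in that finer unit.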
Generate list of months between interval in python
>>> from datetime import datetime, timedelta
>>> from collections import OrderedDict
>>> dates = ["2014-10-10", "2016-01-07"]
>>> start, end = [datetime.strptime(_, "%Y-%m-%d") for _ in dates]
>>> list(OrderedDict(((start + timedelta(_)).strftime(r"%b-%y"), None) for _ in range((end - start).days)).keys())
['Oct-14', 'Nov-14', 'Dec-14', 'Jan-15', 'Feb-15', 'Mar-15', 'Apr-15', 'May-15', 'Jun-15', 'Jul-15', 'Aug-15', 'Sep-15', 'Oct-15', 'Nov-15', 'Dec-15', 'Jan-16']
Update: a bit of explanation, as requested in one comment. There are three problems here: parsing the dates into appropriate data structures (strptime); getting the date range given the two extremes and the step (one month); formatting the output dates (strftime). The datetime type overloads the subtraction operator, so that end - start makes sense. The result is a timedelta object that represents the difference between the two dates, and the .days attribute gets this difference expressed in days. There is no .months attribute, so we iterate one day at a time and convert the dates to the desired output format. This yields a lot of duplicates, which the OrderedDict removes while keeping the items in the right order.
Now this is simple and concise because it lets the datetime module do all the work, but it's also horribly inefficient. We're calling a lot of methods for each day while we only need to output months. If performance is not an issue, the above code will be just fine. Otherwise, we'll have to work a bit more. Let's compare the above implementation with a more efficient one:
from datetime import datetime, timedelta
from collections import OrderedDict

dates = ["2014-10-10", "2016-01-07"]

def monthlist_short(dates):
    start, end = [datetime.strptime(_, "%Y-%m-%d") for _ in dates]
    # One key per day; the OrderedDict drops duplicate months but keeps their order
    return list(OrderedDict(((start + timedelta(_)).strftime(r"%b-%y"), None) for _ in range((end - start).days)).keys())

def monthlist_fast(dates):
    start, end = [datetime.strptime(_, "%Y-%m-%d") for _ in dates]
    total_months = lambda dt: dt.month + 12 * dt.year
    mlist = []
    # Step through month indices directly instead of visiting every day
    for tot_m in range(total_months(start)-1, total_months(end)):
        y, m = divmod(tot_m, 12)
        mlist.append(datetime(y, m+1, 1).strftime("%b-%y"))
    return mlist

assert monthlist_fast(dates) == monthlist_short(dates)

if __name__ == "__main__":
    from timeit import Timer
    for func in "monthlist_short", "monthlist_fast":
        print(func, Timer("%s(dates)" % func, "from __main__ import dates, %s" % func).timeit(1000))
On my laptop, I get the following output:
monthlist_short 2.3209939003
monthlist_fast 0.0774540901184
The concise implementation is about 30 times slower, so I would not recommend it in time-critical applications :)