Group Records by Time

How can I group time by hour or by 10 minutes?

I finally got it done with:

GROUP BY
DATEPART(YEAR, DT.[Date]),
DATEPART(MONTH, DT.[Date]),
DATEPART(DAY, DT.[Date]),
DATEPART(HOUR, DT.[Date]),
(DATEPART(MINUTE, DT.[Date]) / 10)
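For context, here is how that clause fits into a full query. The table name (dbo.Events) and the COUNT are just placeholders for illustration; to group by hour only, drop the minute expression from both the SELECT and the GROUP BY.

SELECT
    DATEPART(YEAR,  DT.[Date]) AS Yr,
    DATEPART(MONTH, DT.[Date]) AS Mo,
    DATEPART(DAY,   DT.[Date]) AS Dy,
    DATEPART(HOUR,  DT.[Date]) AS Hr,
    (DATEPART(MINUTE, DT.[Date]) / 10) AS TenMinBucket, -- 0..5: which 10-minute slice of the hour
    COUNT(*) AS RecordCount
FROM dbo.Events AS DT          -- placeholder table
GROUP BY
    DATEPART(YEAR, DT.[Date]),
    DATEPART(MONTH, DT.[Date]),
    DATEPART(DAY, DT.[Date]),
    DATEPART(HOUR, DT.[Date]),
    (DATEPART(MINUTE, DT.[Date]) / 10);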

Group records when date is within N minutes

I believe this produces the results you want:

DECLARE @Comparisons TABLE (i DATETIME, amt INT NOT NULL DEFAULT(5));
INSERT @Comparisons (i) VALUES ('2016-01-01 10:04:00.000')
, ('2016-01-01 10:17:00.000')
, ('2016-01-01 10:25:00.000')
, ('2016-01-01 10:37:00.000')
, ('2016-01-01 10:44:00.000')
, ('2016-01-01 11:52:00.000')
, ('2016-01-01 11:59:00.000')
, ('2016-01-01 12:10:00.000')
, ('2016-01-01 12:22:00.000')
, ('2016-01-01 13:00:00.000')
, ('2016-01-01 09:00:00.000');

DECLARE @N INT = 15;

WITH T AS (
    SELECT i
         , amt
         , CASE WHEN DATEDIFF(MINUTE, previ, i) <= @N THEN 0 ELSE 1 END RN1
         , CASE WHEN DATEDIFF(MINUTE, i, nexti) > @N THEN 1 ELSE 0 END RN2
    FROM @Comparisons t
    OUTER APPLY (SELECT MAX(i) FROM @Comparisons WHERE i < t.i) x(previ)
    OUTER APPLY (SELECT MIN(i) FROM @Comparisons WHERE i > t.i) y(nexti)
)
, T2 AS (
    SELECT CASE RN1 WHEN 1 THEN i ELSE (SELECT MAX(i) FROM T WHERE RN1 = 1 AND i < T1.i) END mintime
         , CASE WHEN RN2 = 1 THEN i ELSE ISNULL((SELECT MIN(i) FROM T WHERE RN2 = 1 AND i > T1.i), i) END maxtime
         , amt
    FROM T T1
)
SELECT mintime, maxtime, SUM(amt) total
FROM T2
GROUP BY mintime, maxtime
ORDER BY mintime;

It's probably a little clunkier than it could be, but it's basically just grouping anything within an @N-minute chain.
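With the sample rows above and @N = 15, that should yield:

mintime                  maxtime                  total
2016-01-01 09:00:00.000  2016-01-01 09:00:00.000  5
2016-01-01 10:04:00.000  2016-01-01 10:44:00.000  25
2016-01-01 11:52:00.000  2016-01-01 12:22:00.000  20
2016-01-01 13:00:00.000  2016-01-01 13:00:00.000  5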

Grouping data based on time interval

You can group by tablet first and then take a cumulative sum to get the participant column the way you want. Make sure the time column is in datetime format, and sort by time before doing this.

import pandas as pd
import numpy as np

df['time'] = pd.to_datetime(df['time'])
df['time_diff'] = df.groupby(['tablet'])['time'].diff().dt.seconds / 60
df['participant'] = np.where((df['time_diff'].isnull()) | (df['time_diff'] > 10), 1, 0).cumsum()
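As a quick self-contained check of the idea (the tablet/time column names come from the question; the rows here are made up):

import pandas as pd
import numpy as np

# Made-up sample data to illustrate the technique above.
df = pd.DataFrame({
    'tablet': ['A', 'A', 'A', 'B'],
    'time': ['2020-01-01 10:00', '2020-01-01 10:05', '2020-01-01 10:30', '2020-01-01 10:02'],
})
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values(['tablet', 'time'])

# Minutes since the previous row for the same tablet (NaN for the first row of each tablet).
df['time_diff'] = df.groupby(['tablet'])['time'].diff().dt.seconds / 60
# Start a new participant whenever there is no previous row or the gap exceeds 10 minutes.
df['participant'] = np.where((df['time_diff'].isnull()) | (df['time_diff'] > 10), 1, 0).cumsum()
print(df)  # participant becomes 1, 1, 2, 3 for the four rows above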

SQL number of rows valid in a time range grouped by time

Try this,

DECLARE @start_date datetime = '2019-01-01',
        @end_date   datetime = '2019-01-02',
        @i_minutes  int = 60

DECLARE @t TABLE
(
    id int IDENTITY(1,1),
    time_start datetime,
    time_end datetime
)

INSERT INTO @t (time_start, time_end) VALUES
('2019-01-01 08:30:00', '2019-01-01 09:40:00'),
('2019-01-01 09:10:24', '2019-01-01 15:14:19'),
('2019-01-01 09:21:15', '2019-01-01 09:21:19'),
('2019-01-01 10:39:45', '2019-01-01 10:58:12'),
('2019-01-01 11:39:45', '2019-01-01 11:40:10')

--SELECT @start_date = MIN(time_start), @end_date = MAX(time_end)
--FROM @t

;WITH CTE_time_Interval AS
(
    SELECT @start_date AS time_int, @i_minutes AS i_minutes
    UNION ALL
    SELECT DATEADD(minute, @i_minutes, time_int), i_minutes + @i_minutes
    FROM CTE_time_Interval
    WHERE time_int <= @end_date
)
, CTE1 AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY time_int) AS r_no, time_int
    FROM CTE_time_Interval
)
, CTE2 AS
(
    SELECT a.time_int AS Int_start_time, b.time_int AS Int_end_time
    FROM CTE1 a
    INNER JOIN CTE1 b ON a.r_no + 1 = b.r_no
)
SELECT a.Int_start_time, a.Int_end_time, SUM(IIF(b.time_start IS NOT NULL, 1, 0)) AS cnt
FROM CTE2 a
LEFT JOIN @t b ON
(
    b.time_start BETWEEN a.Int_start_time AND a.Int_end_time
    OR b.time_end BETWEEN a.Int_start_time AND a.Int_end_time
    OR a.Int_start_time BETWEEN b.time_start AND b.time_end
    OR a.Int_end_time BETWEEN b.time_start AND b.time_end
)
GROUP BY a.Int_start_time, a.Int_end_time
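One caveat worth noting: the interval list is built with a recursive CTE, and SQL Server caps recursion at 100 levels by default. If the span between @start_date and @end_date divided by @i_minutes can exceed 100 intervals, append a MAXRECURSION hint to the final query:

OPTION (MAXRECURSION 0)  -- 0 removes the recursion limit; any value up to 32767 also works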

SQL Group records by a custom weekly time range

You can use the following condition:

WHERE update_date >= '2018-01-01'  -- from the start of January 2018
AND update_date < '2018-04-01'     -- through the end of March 2018
AND (
    (DATENAME(dw, update_date) = 'Sunday' AND CAST(update_date AS TIME) >= '12:00') OR
    DATENAME(dw, update_date) IN ('Monday', 'Tuesday', 'Wednesday', 'Thursday') OR
    (DATENAME(dw, update_date) = 'Friday' AND CAST(update_date AS TIME) < '13:00')
)
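If you also need to aggregate the filtered rows week by week (as the question's title suggests), one sketch is to bucket on the ISO week; the table name is a placeholder, and you can adjust the bucket expression if your "week" should instead run from Sunday noon:

SELECT DATEPART(YEAR, update_date)     AS yr,
       DATEPART(ISO_WEEK, update_date) AS wk,
       COUNT(*)                        AS records
FROM your_table                         -- placeholder table name
WHERE update_date >= '2018-01-01'
  AND update_date < '2018-04-01'
  -- plus the weekday/time-of-day condition shown above
GROUP BY DATEPART(YEAR, update_date), DATEPART(ISO_WEEK, update_date);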

Group records by time intervals based on other entries (Gaps and islands)

This is a "gaps-and-islands" problem. A simple solution uses row_number() and aggregation:

select user, ip, min(timestamp), max(timestamp)
from (select mtt.*,
             row_number() over (partition by ip order by timestamp) as seqnum_t,
             row_number() over (partition by ip, user order by timestamp) as seqnum_ut
      from #mtt mtt
     ) mtt
group by ip, user, (seqnum_t - seqnum_ut);

Why this works is a little hard to explain. But, if you run the subquery and stare at the results, you'll see that the difference between the two sequence numbers identifies the groups of adjacent records.
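For instance, with hypothetical rows for a single ip the subquery might return:

timestamp  user  seqnum_t  seqnum_ut  (seqnum_t - seqnum_ut)
10:00      a     1         1          0
10:05      a     2         2          0
10:10      b     3         1          2
10:15      b     4         2          2
10:20      a     5         3          2

Within one ip the difference stays constant for each consecutive run of the same user, so grouping by ip, user and that difference yields three islands here: a's first run (difference 0), b's run (difference 2), and a's later return (also difference 2, but a separate group because the user differs).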

How can I group data by hour and retain the field with time and date (%Y-%m-%d %H:%M:%S)?

I'd suggest lubridate::floor_date for this. It rounds each timestamp down to the start of its hour, giving you a datetime you can group on.

library(dplyr)

summary_df <- long_df %>%
  group_by(hour = lubridate::floor_date(time, "1 hour"), discrete_variable) %>%
  summarise(max_continuous_variable = max(continuous_variable))

How do I group records that are within a specific time interval using Spark Scala or SQL?

org.apache.spark.sql.functions provides overloaded window functions as below.

1. window(timeColumn: Column, windowDuration: String) : Generates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05).

The windows will look like:

09:00:00-09:01:00
09:01:00-09:02:00
09:02:00-09:03:00 ...

2. window(timeColumn: Column, windowDuration: String, slideDuration: String):
Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05).
slideDuration specifies the sliding interval of the window, e.g. 1 minute. A new window is generated every slideDuration, which must be less than or equal to windowDuration.

The windows will look like:

09:00:00-09:01:00
09:00:10-09:01:10
09:00:20-09:01:20 ...

3. window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to start the window intervals.

The windows will look like:

09:00:05-09:01:05
09:00:15-09:01:15
09:00:25-09:01:25 ...

For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15..., provide startTime as 15 minutes. This is the overloaded window function that suits your requirement.

You can find working code below.

import org.apache.spark.sql.SparkSession

object SparkWindowTest extends App {

  val spark = SparkSession
    .builder()
    .master("local")
    .appName("File_Streaming")
    .getOrCreate()

  import spark.implicits._
  import org.apache.spark.sql.functions._

  //Prepare Test Data
  val df = Seq((1, 10, "2019-02-17 12:00:49"), (2, 20, "2019-02-17 11:10:46"),
    (3, 30, "2019-02-17 13:23:34"), (2, 50, "2019-02-17 11:10:30"),
    (1, 40, "2019-02-17 12:01:02"), (1, 60, "2019-02-17 12:01:57"))
    .toDF("ID", "Volume", "TimeString")

  df.show()
  df.printSchema()

+---+------+-------------------+
| ID|Volume| TimeString|
+---+------+-------------------+
| 1| 10|2019-02-17 12:00:49|
| 2| 20|2019-02-17 11:10:46|
| 3| 30|2019-02-17 13:23:34|
| 2| 50|2019-02-17 11:10:30|
| 1| 40|2019-02-17 12:01:02|
| 1| 60|2019-02-17 12:01:57|
+---+------+-------------------+

root
|-- ID: integer (nullable = false)
|-- Volume: integer (nullable = false)
|-- TimeString: string (nullable = true)

  //Converted String Timestamp into Timestamp
  val modifiedDF = df.withColumn("Time", to_timestamp($"TimeString"))

  //Dropped String Timestamp from DF
  val modifiedDF1 = modifiedDF.drop("TimeString")

  modifiedDF.show(false)
  modifiedDF.printSchema()

+---+------+-------------------+-------------------+
|ID |Volume|TimeString |Time |
+---+------+-------------------+-------------------+
|1 |10 |2019-02-17 12:00:49|2019-02-17 12:00:49|
|2 |20 |2019-02-17 11:10:46|2019-02-17 11:10:46|
|3 |30 |2019-02-17 13:23:34|2019-02-17 13:23:34|
|2 |50 |2019-02-17 11:10:30|2019-02-17 11:10:30|
|1 |40 |2019-02-17 12:01:02|2019-02-17 12:01:02|
|1 |60 |2019-02-17 12:01:57|2019-02-17 12:01:57|
+---+------+-------------------+-------------------+

root
|-- ID: integer (nullable = false)
|-- Volume: integer (nullable = false)
|-- TimeString: string (nullable = true)
|-- Time: timestamp (nullable = true)

  modifiedDF1.show(false)
  modifiedDF1.printSchema()

+---+------+-------------------+
|ID |Volume|Time |
+---+------+-------------------+
|1 |10 |2019-02-17 12:00:49|
|2 |20 |2019-02-17 11:10:46|
|3 |30 |2019-02-17 13:23:34|
|2 |50 |2019-02-17 11:10:30|
|1 |40 |2019-02-17 12:01:02|
|1 |60 |2019-02-17 12:01:57|
+---+------+-------------------+

root
|-- ID: integer (nullable = false)
|-- Volume: integer (nullable = false)
|-- Time: timestamp (nullable = true)

  //Main logic
  val modifiedDF2 = modifiedDF1
    .groupBy($"ID", window($"Time", "1 minutes", "1 minutes", "45 seconds"))
    .sum("Volume")

  //Renamed all columns of DF.
  val newNames = Seq("ID", "WINDOW", "VOLUME")
  val finalDF = modifiedDF2.toDF(newNames: _*)

  finalDF.show(false)

+---+---------------------------------------------+------+
|ID |WINDOW |VOLUME|
+---+---------------------------------------------+------+
|2 |[2019-02-17 11:09:45.0,2019-02-17 11:10:45.0]|50 |
|1 |[2019-02-17 12:01:45.0,2019-02-17 12:02:45.0]|60 |
|1 |[2019-02-17 12:00:45.0,2019-02-17 12:01:45.0]|50 |
|3 |[2019-02-17 13:22:45.0,2019-02-17 13:23:45.0]|30 |
|2 |[2019-02-17 11:10:45.0,2019-02-17 11:11:45.0]|20 |
+---+---------------------------------------------+------+

}
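For the hourly windows starting 15 minutes past the hour mentioned above, only the grouping line would change; as a sketch (same DataFrame and column names as in the example, not re-run here):

// Hourly tumbling windows offset by 15 minutes, e.g. 12:15-13:15, 13:15-14:15, ...
val hourlyDF = modifiedDF1
  .groupBy($"ID", window($"Time", "1 hour", "1 hour", "15 minutes"))
  .sum("Volume")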

Query Records and Group by a block of time

If you're certain these runs are contiguous and don't overlap, you should be able to use the Id field to break up your groups. Look for Id fields that are only 1 apart AND datecreated fields that are greater than some threshold apart. From your data, it looks like records within a run are entered within at most a minute of each other, so a safe threshold could be a minute or more.

This would get you your start times:

SELECT mrtB.Id, mrtB.DateCreated
FROM MyReportTable AS mrtA
INNER JOIN MyReportTable AS mrtB
ON (mrtA.Id + 1) = mrtB.Id
WHERE DateDiff(mi, mrtA.DateCreated, mrtB.DateCreated) >= 1

I'll call that DataRunStarts.

Now you can use that to get info about where the groups started and ended:

SELECT drsA.Id AS StartID, drsA.DateCreated, MIN(drsB.Id) AS ExcludedEndId
FROM DataRunStarts AS drsA, DataRunStarts AS drsB
WHERE drsB.Id > drsA.Id
GROUP BY drsA.Id, drsA.DateCreated

I'll call that DataRunGroups. I called that last field "Excluded" because the id it holds is just going to be used to define the end boundary for the set of ids that will be pulled.

Now we can use DataRunGroups and MyReportTable to get the counts:

SELECT DataRunGroups.StartID, COUNT(MyReportTable.Id) AS CountOfRecords
FROM DataRunGroups, MyReportTable
WHERE MyReportTable.Id >= DataRunGroups.StartID
  AND MyReportTable.Id < DataRunGroups.ExcludedEndId
GROUP BY DataRunGroups.StartID;

I'll call that DataRunCounts.

Now we can put DataRunGroups and DataRunCounts together to get start times and counts.

SELECT DataRunGroups.DateCreated, DataRunCounts.CountOfRecords
FROM DataRunGroups
INNER JOIN DataRunCounts
ON DataRunGroups.StartID = DataRunCounts.StartID;

Depending on your setup, you may need to do all of this in one query, but you get the idea. Also, the very first and very last runs wouldn't be included, because there'd be no start id for the very first run and no end id for the very last run. To include those, you would write queries for just those two ranges and union them together with the old DataRunGroups query to create a new DataRunGroups, as sketched below. The other queries that use DataRunGroups would then work just as described above.
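A rough, untested sketch of that union (same table and saved-query names as above; it assumes Id and DateCreated both increase together):

-- Hypothetical sketch only.
-- First run: from the table's first row up to (but excluding) the first detected run start.
SELECT MIN(mrt.Id) AS StartID,
       MIN(mrt.DateCreated) AS DateCreated,
       (SELECT MIN(Id) FROM DataRunStarts) AS ExcludedEndId
FROM MyReportTable AS mrt

UNION ALL

-- Last run: from the last detected run start through the end of the table.
SELECT drs.Id,
       drs.DateCreated,
       (SELECT MAX(Id) + 1 FROM MyReportTable)
FROM DataRunStarts AS drs
WHERE drs.Id = (SELECT MAX(Id) FROM DataRunStarts);

Union those two result sets with the original DataRunGroups query and use the combined result wherever DataRunGroups appears above.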


