Select first row in each GROUP BY group?
On databases that support CTE and windowing functions:
WITH summary AS (
SELECT p.id,
p.customer,
p.total,
ROW_NUMBER() OVER(PARTITION BY p.customer
ORDER BY p.total DESC) AS rank
FROM PURCHASES p)
SELECT *
FROM summary
WHERE rank = 1
Supported by any database:
But you need to add logic to break ties:
SELECT MIN(x.id), -- change to MAX if you want the highest
x.customer,
x.total
FROM PURCHASES x
JOIN (SELECT p.customer,
MAX(total) AS max_total
FROM PURCHASES p
GROUP BY p.customer) y ON y.customer = x.customer
AND y.max_total = x.total
GROUP BY x.customer, x.total
Get top 1 row of each group
;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1
If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead
As for normalised or not, it depends if you want to:
- maintain status in 2 places
- preserve status history
- ...
As it stands, you preserve status history. If you want latest status in the parent table too (which is denormalisation) you'd need a trigger to maintain "status" in the parent. or drop this status history table.
Selecting first row per group
SELECT a, b, c
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY a ORDER BY b, c) rn
FROM mytable
) q
WHERE rn = 1
ORDER BY
a
or
SELECT mi.*
FROM (
SELECT DISTINCT a
FROM mytable
) md
CROSS APPLY
(
SELECT TOP 1 *
FROM mytable mi
WHERE mi.a = md.a
ORDER BY
b, c
) mi
ORDER BY
a
Create a composite index on (a, b, c)
for the queries to work faster.
Which one is more efficient depends on your data distribution.
If you have few distinct values of a
but lots of records within each a
, the second query would be better.
You could improve it even more by creating an indexed view:
CREATE VIEW v_mytable_da
WITH SCHEMABINDING
AS
SELECT a, COUNT_BIG(*) cnt
FROM dbo.mytable
GROUP BY
a
GO
CREATE UNIQUE CLUSTERED INDEX
pk_vmytableda_a
ON v_mytable_da (a)
GO
SELECT mi.*
FROM v_mytable_da md
CROSS APPLY
(
SELECT TOP 1 *
FROM mytable mi
WHERE mi.a = md.a
ORDER BY
b, c
) mi
ORDER BY
a
How do I select the first row per group in an SQL Query?
declare @sometable table ( foo int, bar int, value int )
insert into @sometable values (47, 1, 100)
insert into @sometable values (47, 0, 10)
insert into @sometable values (47, 2, 10)
insert into @sometable values (46, 0, 100)
insert into @sometable values (46, 1, 10)
insert into @sometable values (46, 2, 10)
insert into @sometable values (44, 0, 2)
WITH cte AS
(
SELECT Foo, Bar, SUM(value) AS SumValue, ROW_NUMBER() OVER(PARTITION BY Foo ORDER BY FOO DESC, SUM(value) DESC) AS RowNumber
FROM @SomeTable
GROUP BY Foo, Bar
)
SELECT *
FROM cte
WHERE RowNumber = 1
sql server select first row from a group
If as you indicated, order doesn't matter, any aggregate function on b
would be sufficient.
Example Using MIN
SELECT a, b = MIN(b)
FROM YourTable
GROUP BY
a
How to get the first row per group?
if your MySQL version support ROW_NUMBER
+ window function, you can try to use ROW_NUMBER
to get the biggest num
by category_id
Query #1
SELECT num,business_id,category_id
FROM (
SELECT *,ROW_NUMBER() OVER(PARTITION BY category_id ORDER BY num desc) rn
FROM (
select count(1) num, business_id, category_id
from mytable
group by business_id, category_id
) t1
) t1
WHERE rn = 1
num | business_id | category_id |
---|---|---|
22 | 5543 | 8 |
13 | 3242 | 11 |
SQL selecting first record per group
GROUP BY u.d
(without also listing u1
, u2
, u3
) would only work if u.d
was the PRIMARY KEY
(which it is not, and also wouldn't make sense in your scenario). See:
- Is it possible to have an SQL query that uses AGG functions in this way?
I suggest DISTINCT ON
in a subquery on UTable
instead:
SELECT o.d, u.u1, u.u2, u.u3, o.n
FROM (
SELECT DISTINCT ON (u.d)
u.d, u.u1, u.u2, u.u3
FROM UTable u
WHERE u.gid = 3
AND u.gt = 'dog night'
ORDER BY u.d, u.timestamp
) u
JOIN OTable o USING (gid, gt, d);
See:
- Select first row in each GROUP BY group?
If UTable
is big, at least a multicolumn index on (gid, gt)
is advisable. Same for OTable
.
Maybe even on (gid, gt, d)
. Depends on data types, cardinalities, ...
How to select the first row of each group?
Window functions:
Something like this should do the trick:
import org.apache.spark.sql.functions.{row_number, max, broadcast}
import org.apache.spark.sql.expressions.Window
val df = sc.parallelize(Seq(
(0,"cat26",30.9), (0,"cat13",22.1), (0,"cat95",19.6), (0,"cat105",1.3),
(1,"cat67",28.5), (1,"cat4",26.8), (1,"cat13",12.6), (1,"cat23",5.3),
(2,"cat56",39.6), (2,"cat40",29.7), (2,"cat187",27.9), (2,"cat68",9.8),
(3,"cat8",35.6))).toDF("Hour", "Category", "TotalValue")
val w = Window.partitionBy($"hour").orderBy($"TotalValue".desc)
val dfTop = df.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn")
dfTop.show
// +----+--------+----------+
// |Hour|Category|TotalValue|
// +----+--------+----------+
// | 0| cat26| 30.9|
// | 1| cat67| 28.5|
// | 2| cat56| 39.6|
// | 3| cat8| 35.6|
// +----+--------+----------+
This method will be inefficient in case of significant data skew. This problem is tracked by SPARK-34775 and might be resolved in the future (SPARK-37099).
Plain SQL aggregation followed by join
:
Alternatively you can join with aggregated data frame:
val dfMax = df.groupBy($"hour".as("max_hour")).agg(max($"TotalValue").as("max_value"))
val dfTopByJoin = df.join(broadcast(dfMax),
($"hour" === $"max_hour") && ($"TotalValue" === $"max_value"))
.drop("max_hour")
.drop("max_value")
dfTopByJoin.show
// +----+--------+----------+
// |Hour|Category|TotalValue|
// +----+--------+----------+
// | 0| cat26| 30.9|
// | 1| cat67| 28.5|
// | 2| cat56| 39.6|
// | 3| cat8| 35.6|
// +----+--------+----------+
It will keep duplicate values (if there is more than one category per hour with the same total value). You can remove these as follows:
dfTopByJoin
.groupBy($"hour")
.agg(
first("category").alias("category"),
first("TotalValue").alias("TotalValue"))
Using ordering over structs
:
Neat, although not very well tested, trick which doesn't require joins or window functions:
val dfTop = df.select($"Hour", struct($"TotalValue", $"Category").alias("vs"))
.groupBy($"hour")
.agg(max("vs").alias("vs"))
.select($"Hour", $"vs.Category", $"vs.TotalValue")
dfTop.show
// +----+--------+----------+
// |Hour|Category|TotalValue|
// +----+--------+----------+
// | 0| cat26| 30.9|
// | 1| cat67| 28.5|
// | 2| cat56| 39.6|
// | 3| cat8| 35.6|
// +----+--------+----------+
With DataSet API (Spark 1.6+, 2.0+):
Spark 1.6:
case class Record(Hour: Integer, Category: String, TotalValue: Double)
df.as[Record]
.groupBy($"hour")
.reduce((x, y) => if (x.TotalValue > y.TotalValue) x else y)
.show
// +---+--------------+
// | _1| _2|
// +---+--------------+
// |[0]|[0,cat26,30.9]|
// |[1]|[1,cat67,28.5]|
// |[2]|[2,cat56,39.6]|
// |[3]| [3,cat8,35.6]|
// +---+--------------+
Spark 2.0 or later:
df.as[Record]
.groupByKey(_.Hour)
.reduceGroups((x, y) => if (x.TotalValue > y.TotalValue) x else y)
The last two methods can leverage map side combine and don't require full shuffle so most of the time should exhibit a better performance compared to window functions and joins. These cane be also used with Structured Streaming in completed
output mode.
Don't use:
df.orderBy(...).groupBy(...).agg(first(...), ...)
It may seem to work (especially in the local
mode) but it is unreliable (see SPARK-16207, credits to Tzach Zohar for linking relevant JIRA issue, and SPARK-30335).
The same note applies to
df.orderBy(...).dropDuplicates(...)
which internally uses equivalent execution plan.
SQL Server Group By Query Select first row each group
You can GROUP BY StudyID, Year
and then in an outer query select the first row from each StudyID, Year
group:
SELECT StudyID, Year, minAccess1, minAccess2, minAccess3
FROM (
SELECT StudyID, Year, min(Access1) minAccess1, min(Access2) minAccess2,
min(Access3) minAccess3,
ROW_NUMBER() OVER (PARTITION BY StudyID ORDER BY Year DESC) AS rn
FROM mytable
GROUP BY StudyID, Year ) t
WHERE t.rn = 1
ROW_NUMBER
is used to assign an ordering number to each StudyID
group according to Year
values. The row with the maximum Year
value is assigned a rn = 1
.
Select the first row for each group in MySQL?
You can GROUP BY and pick the MAX position.
SELECT ri.*
FROM (
SELECT ri.release_id, MAX(ri.position) AS position
FROM release_image ri
GROUP BY ri.release_id
) ri_max
INNER JOIN release_image ri ON ri_max.release_id = ri.release_id
AND ri_max.position = ri.position
Related Topics
SQL - Combining Multiple Like Queries
SQL Join Where to Place the Where Condition
SQL Row_Number() Function in Where Clause Without Order By
Postgres: Define a Default Value for Cast Failures
How to Select the Most Frequently Appearing Values
Three Table Join with Joins Other Than Inner Join
Combining Results of Two Select Statements
Ssdt Failing to Publish: "Unable to Connect to Master or Target Server"
How to Use Regular Expression in SQL Server
How to Clear Oracle Execution Plan Cache for Benchmarking
How to Reset an MySQL Autoincrement Using a Max Value from Another Table
Regular Expressions Inside SQL Server
Get the Nearest Longitude and Latitude from Mssql Database Table