Select first row in each GROUP BY group?
On databases that support CTE and windowing functions:
WITH summary AS (
SELECT p.id,
p.customer,
p.total,
ROW_NUMBER() OVER(PARTITION BY p.customer
ORDER BY p.total DESC) AS rank
FROM PURCHASES p)
SELECT *
FROM summary
WHERE rank = 1
Supported by any database:
But you need to add logic to break ties:
SELECT MIN(x.id), -- change to MAX if you want the highest
x.customer,
x.total
FROM PURCHASES x
JOIN (SELECT p.customer,
MAX(total) AS max_total
FROM PURCHASES p
GROUP BY p.customer) y ON y.customer = x.customer
AND y.max_total = x.total
GROUP BY x.customer, x.total
Get top 1 row of each group
;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1
If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead
As for normalised or not, it depends if you want to:
- maintain status in 2 places
- preserve status history
- ...
As it stands, you preserve status history. If you want latest status in the parent table too (which is denormalisation) you'd need a trigger to maintain "status" in the parent. or drop this status history table.
Select the first row by group
You can use duplicated
to do this very quickly.
test[!duplicated(test$id),]
Benchmarks, for the speed freaks:
ju <- function() test[!duplicated(test$id),]
gs1 <- function() do.call(rbind, lapply(split(test, test$id), head, 1))
gs2 <- function() do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
jply <- function() ddply(test,.(id),function(x) head(x,1))
jdt <- function() {
testd <- as.data.table(test)
setkey(testd,id)
# Initial solution (slow)
# testd[,lapply(.SD,function(x) head(x,1)),by = key(testd)]
# Faster options :
testd[!duplicated(id)] # (1)
# testd[, .SD[1L], by=key(testd)] # (2)
# testd[J(unique(id)),mult="first"] # (3)
# testd[ testd[,.I[1L],by=id] ] # (4) needs v1.8.3. Allows 2nd, 3rd etc
}
library(plyr)
library(data.table)
library(rbenchmark)
# sample data
set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]
benchmark(ju(), gs1(), gs2(), jply(), jdt(),
replications=5, order="relative")[,1:6]
# test replications elapsed relative user.self sys.self
# 1 ju() 5 0.03 1.000 0.03 0.00
# 5 jdt() 5 0.03 1.000 0.03 0.00
# 3 gs2() 5 3.49 116.333 2.87 0.58
# 2 gs1() 5 3.58 119.333 3.00 0.58
# 4 jply() 5 3.69 123.000 3.11 0.51
Let's try that again, but with just the contenders from the first heat and with more data and more replications.
set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
benchmark(ju(), jdt(), order="relative")[,1:6]
# test replications elapsed relative user.self sys.self
# 1 ju() 100 5.48 1.000 4.44 1.00
# 2 jdt() 100 6.92 1.263 5.70 1.15
How to select the first row for each group in MySQL?
When I write
SELECT AnotherColumn
FROM Table
GROUP BY SomeColumn
;
It works. IIRC in other RDBMS such statement is impossible, because a column that doesn't belongs to the grouping key is being referenced without any sort of aggregation.
This "quirk" behaves very closely to what I want. So I used it to get the result I wanted:
SELECT * FROM
(
SELECT * FROM `table`
ORDER BY AnotherColumn
) t1
GROUP BY SomeColumn
;
BigQuery/SQL: Select first row of each group
I believe you are looking for the function [FIRST_VALUE][1]
?
SELECT
landing_page,
FIRST_VALUE(URL)
OVER ( PARTITION BY landing_page ORDER BY Page_Type DESC) AS first_url
FROM `xxxx.TEST.draft`
How to select the first row of each group?
Window functions:
Something like this should do the trick:
import org.apache.spark.sql.functions.{row_number, max, broadcast}
import org.apache.spark.sql.expressions.Window
val df = sc.parallelize(Seq(
(0,"cat26",30.9), (0,"cat13",22.1), (0,"cat95",19.6), (0,"cat105",1.3),
(1,"cat67",28.5), (1,"cat4",26.8), (1,"cat13",12.6), (1,"cat23",5.3),
(2,"cat56",39.6), (2,"cat40",29.7), (2,"cat187",27.9), (2,"cat68",9.8),
(3,"cat8",35.6))).toDF("Hour", "Category", "TotalValue")
val w = Window.partitionBy($"hour").orderBy($"TotalValue".desc)
val dfTop = df.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn")
dfTop.show
// +----+--------+----------+
// |Hour|Category|TotalValue|
// +----+--------+----------+
// | 0| cat26| 30.9|
// | 1| cat67| 28.5|
// | 2| cat56| 39.6|
// | 3| cat8| 35.6|
// +----+--------+----------+
This method will be inefficient in case of significant data skew. This problem is tracked by SPARK-34775 and might be resolved in the future (SPARK-37099).
Plain SQL aggregation followed by join
:
Alternatively you can join with aggregated data frame:
val dfMax = df.groupBy($"hour".as("max_hour")).agg(max($"TotalValue").as("max_value"))
val dfTopByJoin = df.join(broadcast(dfMax),
($"hour" === $"max_hour") && ($"TotalValue" === $"max_value"))
.drop("max_hour")
.drop("max_value")
dfTopByJoin.show
// +----+--------+----------+
// |Hour|Category|TotalValue|
// +----+--------+----------+
// | 0| cat26| 30.9|
// | 1| cat67| 28.5|
// | 2| cat56| 39.6|
// | 3| cat8| 35.6|
// +----+--------+----------+
It will keep duplicate values (if there is more than one category per hour with the same total value). You can remove these as follows:
dfTopByJoin
.groupBy($"hour")
.agg(
first("category").alias("category"),
first("TotalValue").alias("TotalValue"))
Using ordering over structs
:
Neat, although not very well tested, trick which doesn't require joins or window functions:
val dfTop = df.select($"Hour", struct($"TotalValue", $"Category").alias("vs"))
.groupBy($"hour")
.agg(max("vs").alias("vs"))
.select($"Hour", $"vs.Category", $"vs.TotalValue")
dfTop.show
// +----+--------+----------+
// |Hour|Category|TotalValue|
// +----+--------+----------+
// | 0| cat26| 30.9|
// | 1| cat67| 28.5|
// | 2| cat56| 39.6|
// | 3| cat8| 35.6|
// +----+--------+----------+
With DataSet API (Spark 1.6+, 2.0+):
Spark 1.6:
case class Record(Hour: Integer, Category: String, TotalValue: Double)
df.as[Record]
.groupBy($"hour")
.reduce((x, y) => if (x.TotalValue > y.TotalValue) x else y)
.show
// +---+--------------+
// | _1| _2|
// +---+--------------+
// |[0]|[0,cat26,30.9]|
// |[1]|[1,cat67,28.5]|
// |[2]|[2,cat56,39.6]|
// |[3]| [3,cat8,35.6]|
// +---+--------------+
Spark 2.0 or later:
df.as[Record]
.groupByKey(_.Hour)
.reduceGroups((x, y) => if (x.TotalValue > y.TotalValue) x else y)
The last two methods can leverage map side combine and don't require full shuffle so most of the time should exhibit a better performance compared to window functions and joins. These cane be also used with Structured Streaming in completed
output mode.
Don't use:
df.orderBy(...).groupBy(...).agg(first(...), ...)
It may seem to work (especially in the local
mode) but it is unreliable (see SPARK-16207, credits to Tzach Zohar for linking relevant JIRA issue, and SPARK-30335).
The same note applies to
df.orderBy(...).dropDuplicates(...)
which internally uses equivalent execution plan.
SQL selecting first record per group
GROUP BY u.d
(without also listing u1
, u2
, u3
) would only work if u.d
was the PRIMARY KEY
(which it is not, and also wouldn't make sense in your scenario). See:
- Is it possible to have an SQL query that uses AGG functions in this way?
I suggest DISTINCT ON
in a subquery on UTable
instead:
SELECT o.d, u.u1, u.u2, u.u3, o.n
FROM (
SELECT DISTINCT ON (u.d)
u.d, u.u1, u.u2, u.u3
FROM UTable u
WHERE u.gid = 3
AND u.gt = 'dog night'
ORDER BY u.d, u.timestamp
) u
JOIN OTable o USING (gid, gt, d);
See:
- Select first row in each GROUP BY group?
If UTable
is big, at least a multicolumn index on (gid, gt)
is advisable. Same for OTable
.
Maybe even on (gid, gt, d)
. Depends on data types, cardinalities, ...
How to get the first row per group?
if your MySQL version support ROW_NUMBER
+ window function, you can try to use ROW_NUMBER
to get the biggest num
by category_id
Query #1
SELECT num,business_id,category_id
FROM (
SELECT *,ROW_NUMBER() OVER(PARTITION BY category_id ORDER BY num desc) rn
FROM (
select count(1) num, business_id, category_id
from mytable
group by business_id, category_id
) t1
) t1
WHERE rn = 1
num | business_id | category_id |
---|---|---|
22 | 5543 | 8 |
13 | 3242 | 11 |
Select the first row of the last group
WITH
gaps AS
(
SELECT
*,
CASE WHEN from_date = LEAD(to_date) OVER (ORDER BY from_date DESC) THEN 0 ELSE 1 END AS is_first_row_of_group
FROM
your_table
)
SELECT TOP (1)
from_date
FROM
gaps
WHERE
is_first_row_of_group = 1
ORDER BY
from_date DESC
Select first row in each group in sql
You can use distinct on
directly with group by
:
select distinct on ("Country") Sum("Price"), "In-app Product", "Country"
from cleandatase
group by "Country", "In-app Product"
order by "Country", Sum("Price") desc;
Note: As Thorsten points out, if there are ties and you want all the ties, then distinct on
is not the simplest solution.
Related Topics
Convert Categorical Variables to Numeric in R
How to Select Variables in an R Dataframe Whose Names Contain a Particular String
Faster Ways to Calculate Frequencies and Cast from Long to Wide
Remove Rows With All or Some Nas (Missing Values) in Data.Frame
Change the Class from Factor to Numeric of Many Columns in a Data Frame
Convert Dataframe Column to 1 or 0 for "True"/"False" Values and Assign to Dataframe
Loop Through Data Frame and Variable Names
Using Ifelse Statement on the Whole Dataset Instead of a Single Column
Remove Unwanted Symbols from Expression Function - R
R - Getting Characters After Symbol
Calculate Row Means on Subset of Columns
Duplicate Columns in Spark Dataframe
Combine Two Lists in a Dataframe in R
Easier Way to Use Grepl and Ifelse Across Multiple Columns
Update Data Frame Via Function Doesn't Work
Extract Row Corresponding to Minimum Value of a Variable by Group