Selecting First Row Per Group

Select first row in each GROUP BY group?


On databases that support CTEs and window functions:

WITH summary AS (
    SELECT p.id,
           p.customer,
           p.total,
           ROW_NUMBER() OVER (PARTITION BY p.customer
                              ORDER BY p.total DESC) AS rank
    FROM PURCHASES p)
SELECT *
FROM summary
WHERE rank = 1
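
If you want to keep all rows that tie for a customer's highest total, a minimal sketch swapping ROW_NUMBER for RANK (same PURCHASES table as above):

WITH summary AS (
    SELECT p.id,
           p.customer,
           p.total,
           RANK() OVER (PARTITION BY p.customer
                        ORDER BY p.total DESC) AS rnk
    FROM PURCHASES p)
SELECT *
FROM summary
WHERE rnk = 1  -- returns every purchase tied for the customer's maximum total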

Supported by any database:

But you need to add logic to break ties:

SELECT MIN(x.id),  -- change to MAX if you want the highest
       x.customer,
       x.total
FROM PURCHASES x
JOIN (SELECT p.customer,
             MAX(total) AS max_total
      FROM PURCHASES p
      GROUP BY p.customer) y ON y.customer  = x.customer
                            AND y.max_total = x.total
GROUP BY x.customer, x.total

Get top 1 row of each group


;WITH cte AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
    FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect two entries per day, this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead, as sketched below.
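
A minimal sketch of the DENSE_RANK variant (same DocumentStatusLogs table as above):

;WITH cte AS
(
    SELECT *,
           DENSE_RANK() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rnk
    FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rnk = 1  -- keeps every row tied for the latest DateCreated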

As for normalised or not, it depends if you want to:

  • maintain status in 2 places
  • preserve status history
  • ...

As it stands, you preserve status history. If you also want the latest status in the parent table (which is denormalisation), you'd need a trigger to maintain "status" in the parent, or drop this status history table.
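
For illustration, a rough T-SQL sketch of such a trigger, assuming a hypothetical parent table Documents(ID, Status); the table and column names are made up:

CREATE TRIGGER trg_DocumentStatusLogs_latest
ON DocumentStatusLogs
AFTER INSERT
AS
BEGIN
    -- copy each newly logged status up to the (hypothetical) parent table;
    -- assumes one inserted row per DocumentID per statement
    UPDATE d
    SET    d.Status = i.Status
    FROM   Documents d
    JOIN   inserted i ON i.DocumentID = d.ID;
END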

How to select the first row of each group?

Window functions:

Something like this should do the trick:

import org.apache.spark.sql.functions.{row_number, max, first, struct, broadcast}
import org.apache.spark.sql.expressions.Window

val df = sc.parallelize(Seq(
  (0, "cat26", 30.9), (0, "cat13", 22.1), (0, "cat95", 19.6), (0, "cat105", 1.3),
  (1, "cat67", 28.5), (1, "cat4", 26.8),  (1, "cat13", 12.6), (1, "cat23", 5.3),
  (2, "cat56", 39.6), (2, "cat40", 29.7), (2, "cat187", 27.9), (2, "cat68", 9.8),
  (3, "cat8", 35.6))).toDF("Hour", "Category", "TotalValue")

val w = Window.partitionBy($"hour").orderBy($"TotalValue".desc)

val dfTop = df.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn")

dfTop.show
// +----+--------+----------+
// |Hour|Category|TotalValue|
// +----+--------+----------+
// |   0|   cat26|      30.9|
// |   1|   cat67|      28.5|
// |   2|   cat56|      39.6|
// |   3|    cat8|      35.6|
// +----+--------+----------+

This method will be inefficient in the case of significant data skew. The problem is tracked by SPARK-34775 and might be resolved in the future (SPARK-37099).
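
For reference, the same window approach can also be expressed in Spark SQL; a minimal sketch, assuming the data frame has been registered with df.createOrReplaceTempView("by_hour") (the view name is made up):

SELECT Hour, Category, TotalValue
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Hour ORDER BY TotalValue DESC) AS rn
    FROM by_hour
) t
WHERE rn = 1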

Plain SQL aggregation followed by join:

Alternatively, you can join with an aggregated data frame:

val dfMax = df.groupBy($"hour".as("max_hour")).agg(max($"TotalValue").as("max_value"))

val dfTopByJoin = df.join(broadcast(dfMax),
    ($"hour" === $"max_hour") && ($"TotalValue" === $"max_value"))
  .drop("max_hour")
  .drop("max_value")

dfTopByJoin.show

// +----+--------+----------+
// |Hour|Category|TotalValue|
// +----+--------+----------+
// |   0|   cat26|      30.9|
// |   1|   cat67|      28.5|
// |   2|   cat56|      39.6|
// |   3|    cat8|      35.6|
// +----+--------+----------+

It will keep duplicate values (if there is more than one category per hour with the same total value). You can remove these as follows:

dfTopByJoin
  .groupBy($"hour")
  .agg(
    first("category").alias("category"),
    first("TotalValue").alias("TotalValue"))

Using ordering over structs:

A neat, although not very well tested, trick which doesn't require joins or window functions. It works because ordering over structs compares fields left to right, so max picks the struct with the highest TotalValue:

val dfTop = df.select($"Hour", struct($"TotalValue", $"Category").alias("vs"))
  .groupBy($"hour")
  .agg(max("vs").alias("vs"))
  .select($"Hour", $"vs.Category", $"vs.TotalValue")

dfTop.show
// +----+--------+----------+
// |Hour|Category|TotalValue|
// +----+--------+----------+
// |   0|   cat26|      30.9|
// |   1|   cat67|      28.5|
// |   2|   cat56|      39.6|
// |   3|    cat8|      35.6|
// +----+--------+----------+

With the Dataset API (Spark 1.6+, 2.0+):

Spark 1.6:

case class Record(Hour: Integer, Category: String, TotalValue: Double)

df.as[Record]
  .groupBy($"hour")
  .reduce((x, y) => if (x.TotalValue > y.TotalValue) x else y)
  .show

// +---+--------------+
// | _1|            _2|
// +---+--------------+
// |[0]|[0,cat26,30.9]|
// |[1]|[1,cat67,28.5]|
// |[2]|[2,cat56,39.6]|
// |[3]| [3,cat8,35.6]|
// +---+--------------+

Spark 2.0 or later:

df.as[Record]
  .groupByKey(_.Hour)
  .reduceGroups((x, y) => if (x.TotalValue > y.TotalValue) x else y)

The last two methods can leverage map-side combine and don't require a full shuffle, so most of the time they should exhibit better performance than window functions and joins. They can also be used with Structured Streaming in complete output mode.

Don't use:

df.orderBy(...).groupBy(...).agg(first(...), ...)

It may seem to work (especially in local mode), but it is unreliable (see SPARK-16207, credit to Tzach Zohar for linking the relevant JIRA issue, and SPARK-30335).

The same note applies to

df.orderBy(...).dropDuplicates(...)

which internally uses an equivalent execution plan.

Select the first row by group

You can use duplicated to do this very quickly.

test[!duplicated(test$id),]

Benchmarks, for the speed freaks:

ju <- function() test[!duplicated(test$id),]
gs1 <- function() do.call(rbind, lapply(split(test, test$id), head, 1))
gs2 <- function() do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
jply <- function() ddply(test, .(id), function(x) head(x, 1))
jdt <- function() {
  testd <- as.data.table(test)
  setkey(testd, id)
  # Initial solution (slow)
  # testd[,lapply(.SD,function(x) head(x,1)),by = key(testd)]
  # Faster options:
  testd[!duplicated(id)]                # (1)
  # testd[, .SD[1L], by=key(testd)]     # (2)
  # testd[J(unique(id)),mult="first"]   # (3)
  # testd[ testd[,.I[1L],by=id] ]       # (4) needs v1.8.3. Allows 2nd, 3rd etc
}

library(plyr)
library(data.table)
library(rbenchmark)

# sample data
set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]

benchmark(ju(), gs1(), gs2(), jply(), jdt(),
          replications=5, order="relative")[,1:6]
#     test replications elapsed relative user.self sys.self
# 1   ju()            5    0.03    1.000      0.03     0.00
# 5  jdt()            5    0.03    1.000      0.03     0.00
# 3  gs2()            5    3.49  116.333      2.87     0.58
# 2  gs1()            5    3.58  119.333      3.00     0.58
# 4 jply()            5    3.69  123.000      3.11     0.51

Let's try that again, but with just the contenders from the first heat and with more data and more replications.

set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
benchmark(ju(), jdt(), order="relative")[,1:6]
#    test replications elapsed relative user.self sys.self
# 1  ju()          100    5.48    1.000      4.44     1.00
# 2 jdt()          100    6.92    1.263      5.70     1.15

SQL selecting first record per group

GROUP BY u.d (without also listing u1, u2, u3) would only work if u.d were the PRIMARY KEY (which it is not, and which also wouldn't make sense in your scenario). See:

  • Is it possible to have an SQL query that uses AGG functions in this way?
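
For illustration, a minimal sketch of the functional-dependence rule in PostgreSQL, using a hypothetical table t with primary key id:

-- works: the other columns of t are functionally dependent on the primary key
SELECT id, u1, u2
FROM   t
GROUP  BY id;

-- fails: d is not a key, so u1 is ambiguous within each group
-- SELECT d, u1 FROM t GROUP BY d;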

I suggest DISTINCT ON in a subquery on UTable instead:

SELECT o.d, u.u1, u.u2, u.u3, o.n
FROM (
    SELECT DISTINCT ON (u.d)
           u.gid, u.gt, u.d, u.u1, u.u2, u.u3  -- gid and gt must be projected for the join below
    FROM   UTable u
    WHERE  u.gid = 3
    AND    u.gt = 'dog night'
    ORDER  BY u.d, u.timestamp
    ) u
JOIN OTable o USING (gid, gt, d);

See:

  • Select first row in each GROUP BY group?

If UTable is big, at least a multicolumn index on (gid, gt) is advisable; the same goes for OTable. Maybe even on (gid, gt, d), depending on data types, cardinalities, ...
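
A minimal sketch of such indexes (the index names are made up):

CREATE INDEX utable_gid_gt_d_idx ON UTable (gid, gt, d);
CREATE INDEX otable_gid_gt_d_idx ON OTable (gid, gt, d);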

How to get the first row per group?

If your MySQL version supports ROW_NUMBER and window functions (MySQL 8.0 or later), you can use ROW_NUMBER to get the biggest num per category_id:

Query #1

SELECT num, business_id, category_id
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY num DESC) rn
    FROM (
        SELECT COUNT(1) num, business_id, category_id
        FROM mytable
        GROUP BY business_id, category_id
    ) t1
) t1
WHERE rn = 1

Result: one row per category_id with the largest num (columns num, business_id, category_id).

data.table - keep first row per group OR based on condition

Try this.

Using mpg >= 50, we should get one row per carb:

x[ rowid(carb) == 1 | mpg >= 50,]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:  21.0     6 160.0   110  3.90  2.62 16.46     0     1     4     4
# 2:  22.8     4 108.0    93  3.85  2.32 18.61     1     1     4     1
# 3:  18.7     8 360.0   175  3.15  3.44 17.02     0     0     3     2
# 4:  16.4     8 275.8   180  3.07  4.07 17.40     0     0     3     3
# 5:  19.7     6 145.0   175  3.62  2.77 15.50     0     1     5     6
# 6:  15.0     8 301.0   335  3.54  3.57 14.60     0     1     5     8

Using mpg >= 30 (since all(mpg > 10)), we should get all of the above plus a few more:

x[ rowid(carb) == 1 | mpg >= 30,]
#       mpg   cyl  disp    hp  drat     wt  qsec    vs    am  gear  carb
#     <num> <num> <num> <num> <num>  <num> <num> <num> <num> <num> <num>
#  1:  21.0     6 160.0   110  3.90  2.620 16.46     0     1     4     4
#  2:  22.8     4 108.0    93  3.85  2.320 18.61     1     1     4     1
#  3:  18.7     8 360.0   175  3.15  3.440 17.02     0     0     3     2
#  4:  16.4     8 275.8   180  3.07  4.070 17.40     0     0     3     3
#  5:  32.4     4  78.7    66  4.08  2.200 19.47     1     1     4     1
#  6:  30.4     4  75.7    52  4.93  1.615 18.52     1     1     4     2
#  7:  33.9     4  71.1    65  4.22  1.835 19.90     1     1     4     1
#  8:  30.4     4  95.1   113  3.77  1.513 16.90     1     1     5     2
#  9:  19.7     6 145.0   175  3.62  2.770 15.50     0     1     5     6
# 10:  15.0     8 301.0   335  3.54  3.570 14.60     0     1     5     8

An alternative, in case you need more grouping variables:

x[, .SD[seq_len(.N) == 1L | mpg >= 30,], by = carb]

though I've been informed that rowid(...) is more efficient than seq_len(.N).

How to select the first row for each group in MySQL?

rtribaldos mentioned that in more recent database versions, window functions can be used.

Here is code which worked for me and was as fast as Martin Zwarík's substring_index solution (in MariaDB 10.5.16):

SELECT group_col, order_col
FROM (
    SELECT group_col, order_col,
           ROW_NUMBER() OVER (PARTITION BY group_col ORDER BY order_col) rnr
    FROM some_table
    WHERE <some_condition>
) i
WHERE rnr = 1;
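
For comparison, a sketch of the substring_index pattern referenced above (same hypothetical columns); it concatenates each group's values in order and keeps the first:

SELECT group_col,
       -- note: GROUP_CONCAT output is capped by group_concat_max_len
       SUBSTRING_INDEX(GROUP_CONCAT(order_col ORDER BY order_col), ',', 1) AS first_order_col
FROM some_table
GROUP BY group_col;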

Selecting first row per group


SELECT a, b, c
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY a ORDER BY b, c) rn
    FROM mytable
) q
WHERE rn = 1
ORDER BY a

or

SELECT mi.*
FROM (
    SELECT DISTINCT a
    FROM mytable
) md
CROSS APPLY
(
    SELECT TOP 1 *
    FROM mytable mi
    WHERE mi.a = md.a
    ORDER BY b, c
) mi
ORDER BY a

Create a composite index on (a, b, c) for the queries to work faster.
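
For example (the index name is made up):

CREATE INDEX ix_mytable_a_b_c ON mytable (a, b, c);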

Which one is more efficient depends on your data distribution.

If you have few distinct values of a but lots of records within each a, the second query would be better.

You could improve it even more by creating an indexed view:

CREATE VIEW v_mytable_da
WITH SCHEMABINDING
AS
SELECT a, COUNT_BIG(*) cnt
FROM dbo.mytable
GROUP BY a

GO

CREATE UNIQUE CLUSTERED INDEX pk_vmytableda_a
    ON v_mytable_da (a)

GO

SELECT mi.*
FROM v_mytable_da md
CROSS APPLY
(
    SELECT TOP 1 *
    FROM mytable mi
    WHERE mi.a = md.a
    ORDER BY b, c
) mi
ORDER BY a

BigQuery/SQL: Select first row of each group

I believe you are looking for the function FIRST_VALUE?

SELECT
    landing_page,
    FIRST_VALUE(URL)
        OVER (PARTITION BY landing_page ORDER BY Page_Type DESC) AS first_url
FROM `xxxx.TEST.draft`
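
Note that FIRST_VALUE is computed for every input row, so the query above returns one row per row of the table. A minimal sketch to collapse the result to one row per group, assuming the same table:

SELECT DISTINCT
    landing_page,
    FIRST_VALUE(URL)
        OVER (PARTITION BY landing_page ORDER BY Page_Type DESC) AS first_url
FROM `xxxx.TEST.draft`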

