SQL groupby argmax
This can work, if the `group_id, member_id, value` combination is unique:
SELECT
    x.group_id,
    x.member_id,
    x.value
FROM table x
JOIN (
    SELECT
        group_id,
        max(t.value) AS max_v
    FROM table t
    GROUP BY group_id
) y
    ON y.max_v = x.value
    AND y.group_id = x.group_id
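To sanity-check the join approach, here is a minimal sketch using Python's built-in sqlite3 module; the table name `t` and the sample rows are made up for illustration:

```python
import sqlite3

# In-memory database with a small hypothetical table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (group_id INT, member_id INT, value INT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 10, 5), (1, 11, 9), (2, 20, 3), (2, 21, 7)],
)

# Join each row against its group's max to keep only the argmax rows
rows = conn.execute("""
    SELECT x.group_id, x.member_id, x.value
    FROM t x
    JOIN (SELECT group_id, max(value) AS max_v
          FROM t GROUP BY group_id) y
      ON y.max_v = x.value AND y.group_id = x.group_id
    ORDER BY x.group_id
""").fetchall()
print(rows)  # [(1, 11, 9), (2, 21, 7)]
```

Note that if a group's max value appears on several rows, the join returns all of them, which is why the uniqueness caveat above matters.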
how to get moving window argmax in PostgreSQL
Although a bit hard to read, you could do the following:
- Put all values for price inside this window into an array;
- Use array_position to find the position of the rolling max price within that array;
- Convert that position back to an absolute row number by adding row_number() - 11 (the window is the current row plus the 10 preceding rows, so it holds 11 rows);
- Wrap that offset in GREATEST(row_number() - 11, 0) to prevent negative numbers near the start, where the window is still filling up:
WITH sample_table(s_id, s_date, price) AS (
VALUES ('ABC', '2020-06-10'::date, 322.390),
('ABC', '2020-06-11'::date, 312.150),
('ABC', '2020-06-12'::date, 309.080),
('ABC', '2020-06-15'::date, 308.280),
('ABC', '2020-06-16'::date, 315.640),
('ABC', '2020-06-17'::date, 314.390),
('ABC', '2020-06-18'::date, 312.300),
('ABC', '2020-06-19'::date, 314.380),
('ABC', '2020-06-22'::date, 311.050),
('ABC', '2020-06-23'::date, 314.500),
('ABC', '2020-06-24'::date, 310.510),
('ABC', '2020-06-25'::date, 307.640),
('ABC', '2020-06-26'::date, 306.390),
('ABC', '2020-06-29'::date, 304.610),
('ABC', '2020-06-30'::date, 310.200),
('ABC', '2020-07-01'::date, 311.890),
('ABC', '2020-07-02'::date, 315.700),
('ABC', '2020-07-06'::date, 317.680)
)
SELECT s_id,
s_date,
price,
row_number() over (PARTITION BY s_id ORDER BY s_date),
max(price) over (partition by s_id order by s_date rows 10 preceding) as roll_max,
GREATEST(row_number() over (PARTITION BY s_id ORDER BY s_date) - 11, 0)
+ array_position(
array_agg(price) over (partition by s_id order by s_date rows 10 preceding),
max(price) over (partition by s_id order by s_date rows 10 preceding)
) as argmax
FROM sample_table
or, with a subquery, which is easier to read:
WITH sample_table(s_id, s_date, price) AS (
VALUES ('ABC', '2020-06-10'::date, 322.390),
('ABC', '2020-06-11'::date, 312.150),
('ABC', '2020-06-12'::date, 309.080),
('ABC', '2020-06-15'::date, 308.280),
('ABC', '2020-06-16'::date, 315.640),
('ABC', '2020-06-17'::date, 314.390),
('ABC', '2020-06-18'::date, 312.300),
('ABC', '2020-06-19'::date, 314.380),
('ABC', '2020-06-22'::date, 311.050),
('ABC', '2020-06-23'::date, 314.500),
('ABC', '2020-06-24'::date, 310.510),
('ABC', '2020-06-25'::date, 307.640),
('ABC', '2020-06-26'::date, 306.390),
('ABC', '2020-06-29'::date, 304.610),
('ABC', '2020-06-30'::date, 310.200),
('ABC', '2020-07-01'::date, 311.890),
('ABC', '2020-07-02'::date, 315.700),
('ABC', '2020-07-06'::date, 317.680)
)
SELECT s_id, s_date, price, row_number, roll_max,
GREATEST(row_number - 11, 0)
+ array_position(
prices,
roll_max
) as argmax
FROM (
SELECT s_id,
s_date,
price,
row_number() over (PARTITION BY s_id ORDER BY s_date),
max(price) over (partition by s_id order by s_date rows 10 preceding) as roll_max,
array_agg(price)
over (partition by s_id order by s_date rows 10 preceding) as prices
FROM sample_table
) as s
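The window arithmetic above can be checked with a plain-Python sketch: the window is the current row plus `preceding` earlier rows, and positions are 1-based like array_position (this uses a small made-up price list, not the sample data):

```python
def rolling_argmax(prices, preceding=10):
    """1-based absolute position of the max within each trailing window
    of the current row plus `preceding` rows (ties resolve to the
    earliest row, as array_position returns the first match)."""
    result = []
    for rn in range(1, len(prices) + 1):    # rn = row_number(), 1-based
        start = max(rn - preceding, 1)      # first row inside the window
        window = prices[start - 1:rn]
        p = window.index(max(window)) + 1   # array_position analogue
        result.append(start - 1 + p)        # back to an absolute row number
    return result

# Window of 3 rows (current + 2 preceding) on a toy series
print(rolling_argmax([3, 1, 4, 1, 5], preceding=2))  # [1, 1, 3, 3, 5]
```

Walking through the last element: at row 5 the window covers rows 3..5 with values [4, 1, 5], the max sits at window position 3, and start - 1 + 3 = 5.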
Getting row with MAX value together with SUM
This can be easily achieved using window functions:
SELECT a, b, c, s
FROM (
SELECT a, b, c,
ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) AS rn,
SUM(b) OVER (PARTITION BY a) AS s
FROM example) AS t
WHERE t.rn = 1
ROW_NUMBER enumerates records within each a partition: the record with the highest b value is assigned 1, the next record 2, and so on. SUM(b) OVER (PARTITION BY a) returns the sum of all b values within each a partition.
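The same pattern runs unchanged on SQLite (3.25 or newer, for window function support); here is a minimal sketch with a hypothetical `example` table:

```python
import sqlite3

# Requires SQLite >= 3.25 for window functions (bundled with Python 3.8+)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example (a TEXT, b INT, c TEXT)")
conn.executemany("INSERT INTO example VALUES (?, ?, ?)",
                 [("x", 1, "p"), ("x", 5, "q"), ("y", 2, "r"), ("y", 3, "s")])

# Keep the top-b row per partition, alongside the partition-wide sum
rows = conn.execute("""
    SELECT a, b, c, s
    FROM (SELECT a, b, c,
                 ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) AS rn,
                 SUM(b) OVER (PARTITION BY a) AS s
          FROM example) AS t
    WHERE t.rn = 1
    ORDER BY a
""").fetchall()
print(rows)  # [('x', 5, 'q', 6), ('y', 3, 's', 5)]
```

Each output row carries both the argmax row's columns and the partition sum, which is exactly what a plain GROUP BY cannot give you in one pass.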
Rows with max value of each group
You need a subselect:
SELECT yourtable.*
FROM yourtable
LEFT JOIN (
SELECT grp_id, MAX(created) AS max
FROM yourtable
GROUP BY grp_id
) AS maxgroup ON (
(yourtable.grp_id = maxgroup.grp_id) AND (yourtable.created = maxgroup.max)
)
The subselect gets the ID/max value for each group, and the parent/outer query joins against the subselect results to get the rest of the fields for the row(s) on which the max value appears.
BigQuery argmax: Is array order maintained when doing CROSS JOIN UNNEST
Short answer: no, order is not guaranteed to be maintained.
Long answer: in practice, you'll most likely see that order is maintained, but you should not depend on it. The example that you provided is similar to this type of query:
SELECT *
FROM (
SELECT 3 AS x UNION ALL
SELECT 2 UNION ALL
SELECT 1
ORDER BY x
)
What is the expected order of the output? The ORDER BY
is in the subquery, and the outer query doesn't impose any ordering, so BigQuery (or whatever engine you run this in) is free to reorder the rows in the output as it sees fit. You may end up getting back 1, 2, 3
, or you may receive 3, 2, 1
or any other ordering. The more general principle is that projections are not order-preserving.
While arrays have a well-defined order of their elements, when you use the UNNEST
function, you're converting the array into a relation, which doesn't have a well-defined order unless you use ORDER BY
. For example, consider this query:
SELECT ARRAY(SELECT x + 1 FROM UNNEST(arr) AS x) AS new_arr
FROM (SELECT [1, 2, 3] AS arr)
The new_arr
array isn't actually guaranteed to have the elements [2, 3, 4]
in that order, since the query inside the ARRAY
function doesn't use ORDER BY
. You can address this non-determinism by ordering based on the element offsets, however:
SELECT ARRAY(SELECT x + 1 FROM UNNEST(arr) AS x WITH OFFSET ORDER BY OFFSET) AS new_arr
FROM (SELECT [1, 2, 3] AS arr)
Now the output is guaranteed to be [2, 3, 4]
.
Going back to your original question, you can ensure that you get deterministic output by imposing an ordering in the subquery that computes the row numbers:
ranked_predictions AS (
SELECT
id,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY OFFSET) AS rownum,
DENSE_RANK() OVER (PARTITION BY id ORDER BY flattened_prediction DESC) AS array_rank
FROM
predictions P
CROSS JOIN
UNNEST(P.prediction) AS flattened_prediction WITH OFFSET
)
I added the WITH OFFSET
after the UNNEST
, and ORDER BY OFFSET
inside the ROW_NUMBER
window in order to ensure that the row numbers are computed based on the original ordering of the array elements.
How do I do argmax in KDB?
There are a few different ways, but here is one using the find operator ?:
1 2 3?max 1 2 3
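For comparison, the same idiom in Python: find the first index of the maximum value, just as q's ? operator returns the first matching index (both are zero-indexed):

```python
vals = [1, 2, 3]
# list.index returns the first match, like q's find operator ?
argmax = vals.index(max(vals))
print(argmax)  # 2
```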
argmax in Spark DataFrames: how to retrieve the row with the maximum value
If the schema is Orderable
(it contains only atomics, arrays of atomics, or recursively orderable structs), you can use simple aggregations:
Python:
df.select(F.max(
F.struct("values", *(x for x in df.columns if x != "values"))
)).first()
Scala:
df.select(max(struct(
$"values" +: df.columns.collect {case x if x!= "values" => col(x)}: _*
))).first
Otherwise you can reduce over Dataset
(Scala only) but it requires additional deserialization:
type T = ???
df.reduce((a, b) => if (a.getAs[T]("values") > b.getAs[T]("values")) a else b)
You can also use orderBy
and limit(1)
/ take(1)
:
Scala:
df.orderBy(desc("values")).limit(1)
// or
df.orderBy(desc("values")).take(1)
Python:
df.orderBy(F.desc('values')).limit(1)
# or
df.orderBy(F.desc("values")).take(1)
Select only rows with max date
Your query already returns what you need: only one row for each _id, where column _status_set_at has its max value.
You do not need to change anything in your original query.
count(_id) shows how many rows there are for each _id in the original table, not in the query result.
The query result has only one row for each _id because you group by _id.
This query shows that in your query result there is only one row for each _id:
SELECT _id, max_status_set_at, count(_id) FROM (
SELECT _id, max(_status_set_at) max_status_set_at
FROM pikta.candidates_states
GROUP BY _id) t
GROUP BY _id
If you need to apply a condition on max(_status_set_at), you can use HAVING.
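As a sketch of that HAVING filter, here is a runnable sqlite3 example with a made-up candidates_states table (column names follow the query above):

```python
import sqlite3

# Hypothetical candidates_states table with a few state-change timestamps
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candidates_states (_id INT, _status_set_at TEXT)")
conn.executemany("INSERT INTO candidates_states VALUES (?, ?)",
                 [(1, "2020-01-01"), (1, "2020-02-01"),
                  (2, "2020-01-15"), (2, "2020-03-01"),
                  (3, "2020-01-10")])

# HAVING filters on the aggregated max; _id 3 drops out because its
# latest _status_set_at is before the cutoff
rows = conn.execute("""
    SELECT _id, max(_status_set_at) AS max_status_set_at
    FROM candidates_states
    GROUP BY _id
    HAVING max(_status_set_at) >= '2020-02-01'
    ORDER BY _id
""").fetchall()
print(rows)  # [(1, '2020-02-01'), (2, '2020-03-01')]
```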