Returning First Row of Group

Pandas dataframe get first row of each group


>>> df.groupby('id').first()
     value
id
1    first
2    first
3    first
4   second
5    first
6    first
7   fourth

If you need id as a column:

>>> df.groupby('id').first().reset_index()
   id   value
0   1   first
1   2   first
2   3   first
3   4  second
4   5   first
5   6   first
6   7  fourth

To get the first n records of each group, use head(n):

>>> df.groupby('id').head(2).reset_index(drop=True)
    id   value
0    1   first
1    1  second
2    2   first
3    2  second
4    3   first
5    3   third
6    4  second
7    4   fifth
8    5   first
9    6   first
10   6  second
11   7  fourth
12   7   fifth
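
One difference between the two calls is worth spelling out: first() returns the first non-null value in each column, while head(1) returns the first physical row of each group. The sketch below uses a small made-up frame (not the data from the question) in which group 1 starts with a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first row of group 1 has a missing value.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "value": [np.nan, "second", "first", "second", "first"],
})

# first() skips nulls, so group 1 yields "second" rather than NaN.
first_non_null = df.groupby("id").first()

# head(1) keeps the first row of each group as-is, NaN included,
# and leaves "id" as an ordinary column.
first_rows = df.groupby("id").head(1).reset_index(drop=True)

print(first_non_null)
print(first_rows)
```

If the literal first row is what you need, head(1) is the safer choice when the data may contain missing values.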

Returning first row of group

By reproducing the example data frame and testing it I found a way of getting the needed result:

  1. Order data by relevant columns (ID, Start)

    ordered_data <- data[order(data$ID, data$Start),]

  2. Find the first row for each new ID

    final <- ordered_data[!duplicated(ordered_data$ID),]
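
For readers who do not use R, the same two steps (sort, then keep the first row seen for each ID) can be sketched in plain Python. The records below are invented for illustration; only the column names ID and Start come from the answer above.

```python
# Hypothetical records standing in for the data frame's ID and Start columns.
records = [
    {"ID": "b", "Start": 5},
    {"ID": "a", "Start": 3},
    {"ID": "a", "Start": 1},
    {"ID": "b", "Start": 2},
]

# Step 1: order by ID, then by Start.
ordered = sorted(records, key=lambda r: (r["ID"], r["Start"]))

# Step 2: keep only the first row encountered for each ID,
# which mirrors ordered_data[!duplicated(ordered_data$ID), ].
seen = set()
final = []
for row in ordered:
    if row["ID"] not in seen:
        seen.add(row["ID"])
        final.append(row)

print(final)
```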

Why does SELECT * only return the first record in a group?


But select only returns the first record in the group, not all records in the group. Why?

Because one row per group is what GROUP BY does.

To do what you want, first work out which countries have more than 5 customers, then return all records from those countries:

SELECT * FROM Customers WHERE Country IN
(SELECT Country FROM Customers GROUP BY Country HAVING COUNT(CustomerID) > 5)
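
A quick way to see the difference is to run both forms against a throwaway table. The sketch below uses Python's built-in sqlite3 module with invented rows; the table and column names follow the query above, everything else is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Country TEXT)"
)
# Hypothetical data: six customers in Germany, two in Norway.
rows = [(i, "Germany") for i in range(1, 7)] + [(7, "Norway"), (8, "Norway")]
conn.executemany("INSERT INTO Customers VALUES (?, ?)", rows)

# GROUP BY collapses each country to a single summary row.
grouped = conn.execute(
    "SELECT Country, COUNT(CustomerID) FROM Customers "
    "GROUP BY Country HAVING COUNT(CustomerID) > 5"
).fetchall()

# The IN (subquery) form returns every customer from the qualifying countries.
all_rows = conn.execute(
    "SELECT * FROM Customers WHERE Country IN "
    "(SELECT Country FROM Customers GROUP BY Country "
    "HAVING COUNT(CustomerID) > 5)"
).fetchall()

print(grouped)
print(all_rows)
```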

Select first row in each GROUP BY group?


On databases that support CTE and windowing functions:

WITH summary AS (
    SELECT p.id,
           p.customer,
           p.total,
           -- aliased rk rather than rank, which is a reserved word in some databases (e.g. MySQL 8)
           ROW_NUMBER() OVER (PARTITION BY p.customer
                              ORDER BY p.total DESC) AS rk
    FROM PURCHASES p)
SELECT *
FROM summary
WHERE rk = 1

Supported by any database, but you need to add logic to break ties:

SELECT MIN(x.id),  -- change to MAX if you want the highest
       x.customer,
       x.total
FROM PURCHASES x
JOIN (SELECT p.customer,
             MAX(total) AS max_total
      FROM PURCHASES p
      GROUP BY p.customer) y ON y.customer = x.customer
                            AND y.max_total = x.total
GROUP BY x.customer, x.total
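
To check that the two approaches agree, both queries can be run against a small in-memory table. This is a sketch using Python's sqlite3 module with invented purchases; it assumes the bundled SQLite is version 3.25 or later, since older builds lack window functions. One deliberate change from the queries above: p.id is added to the window's ORDER BY so that ties resolve the same way as MIN(x.id) in the join version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER)"
)
# Hypothetical data: Joe has two purchases tied for his highest total.
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [(1, "Joe", 5), (2, "Sally", 3), (3, "Joe", 2), (4, "Sally", 1), (5, "Joe", 5)],
)

# Window-function version, with id as an explicit tie-breaker (an addition).
window_sql = """
WITH summary AS (
    SELECT p.id, p.customer, p.total,
           ROW_NUMBER() OVER (PARTITION BY p.customer
                              ORDER BY p.total DESC, p.id) AS rn
    FROM purchases p)
SELECT id, customer, total FROM summary WHERE rn = 1 ORDER BY customer
"""

# Portable version: join each row to its customer's maximum total,
# then collapse ties by taking the lowest id.
join_sql = """
SELECT MIN(x.id), x.customer, x.total
FROM purchases x
JOIN (SELECT p.customer, MAX(p.total) AS max_total
      FROM purchases p
      GROUP BY p.customer) y
  ON y.customer = x.customer AND y.max_total = x.total
GROUP BY x.customer, x.total
ORDER BY x.customer
"""

window_result = conn.execute(window_sql).fetchall()
join_result = conn.execute(join_sql).fetchall()
print(window_result)
print(join_result)
```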

Get top 1 row of each group


;WITH cte AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
    FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect 2 entries per day, this will arbitrarily pick one of them. To get both entries for a day, use DENSE_RANK instead of ROW_NUMBER.
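
The effect of swapping ROW_NUMBER for DENSE_RANK shows up only when two rows in a group share the same DateCreated value. The sketch below illustrates this with Python's sqlite3 module and made-up log rows; the Status values and dates are assumptions, and window functions require SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs "
    "(ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
# Hypothetical data: document 1 has two entries on its latest date.
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (1, 1, "S1", "2011-07-28"),
        (2, 1, "S2", "2011-07-29"),
        (3, 1, "S3", "2011-07-29"),
        (4, 2, "S1", "2011-07-30"),
    ],
)

def latest(ranking_function):
    # ranking_function is interpolated from a fixed set of names, never user input.
    sql = f"""
    WITH cte AS (
        SELECT *,
               {ranking_function}() OVER (PARTITION BY DocumentID
                                          ORDER BY DateCreated DESC) AS rn
        FROM DocumentStatusLogs
    )
    SELECT ID, DocumentID FROM cte WHERE rn = 1 ORDER BY ID
    """
    return conn.execute(sql).fetchall()

one_per_document = latest("ROW_NUMBER")  # exactly one row per document
with_ties = latest("DENSE_RANK")         # every row tied for the latest date

print(one_per_document)
print(with_ties)
```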

As for whether to normalise or not, it depends on whether you want to:

  • maintain status in 2 places
  • preserve status history
  • ...

As it stands, you preserve status history. If you want the latest status in the parent table too (which is denormalisation), you'd need a trigger to maintain "status" in the parent, or you could drop this status history table.

MySQL GROUP BY returns only first row

Thank you everyone for pointing out the obvious mistake I was too blind to see. I finally replaced GROUP BY with ORDER BY and included a WHERE clause to get my desired result. That is what I was intending to use all along. Silly me.

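My final query is this (GROUP is a reserved word in MySQL, so the column name must be quoted with backticks; in single quotes it would be treated as a string literal):

SELECT * FROM forms WHERE `GROUP` = 'SomeGroup' ORDER BY `GROUP`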

