Pandas dataframe get first row of each group
>>> df.groupby('id').first()
     value
id
1    first
2    first
3    first
4   second
5    first
6    first
7   fourth
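Note that first() returns the first non-null value in each column, so use head(1) or nth(0) if you need the literal first row of each group. If you need id as a column: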
>>> df.groupby('id').first().reset_index()
   id   value
0   1   first
1   2   first
2   3   first
3   4  second
4   5   first
5   6   first
6   7  fourth
To get the first n records of each group, use head():
>>> df.groupby('id').head(2).reset_index(drop=True)
    id   value
0    1   first
1    1  second
2    2   first
3    2  second
4    3   first
5    3   third
6    4  second
7    4   fifth
8    5   first
9    6   first
10   6  second
11   7  fourth
12   7   fifth
Returning first row of group
By reproducing the example data frame and testing it I found a way of getting the needed result:
Order data by relevant columns (ID, Start)
ordered_data <- data[order(data$ID, data$Start),]
Find the first row for each new ID
final <- ordered_data[!duplicated(ordered_data$ID),]
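The same sort-then-deduplicate idea can be sketched in plain Python. This is a minimal illustration, not a translation of any particular library call: the sample rows and the helper name first_row_per_id are hypothetical, and only the ID and Start column names come from the R snippet above.

```python
# Hypothetical sample rows; only the ID and Start names mirror the R snippet.
rows = [
    {"ID": 2, "Start": 5},
    {"ID": 1, "Start": 9},
    {"ID": 1, "Start": 3},
    {"ID": 2, "Start": 1},
    {"ID": 3, "Start": 7},
]


def first_row_per_id(rows):
    """Sort by (ID, Start), then keep the first row seen for each ID."""
    ordered = sorted(rows, key=lambda r: (r["ID"], r["Start"]))
    seen = set()
    result = []
    for row in ordered:
        # Plays the same role as !duplicated(ordered_data$ID) in R.
        if row["ID"] not in seen:
            seen.add(row["ID"])
            result.append(row)
    return result


final = first_row_per_id(rows)
```

Because the rows are sorted before deduplication, the row kept for each ID is the one with the smallest Start.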
Why Select * Only return the first record in a group?
But select only returns the first record in the group, not all records in the group. Why?
Because one row per group is what GROUP BY does.
The way to do what you want is to work out which countries have more than 5 customers, then return all records from those countries:
SELECT * FROM Customers WHERE Country IN
(SELECT Country FROM Customers GROUP BY Country HAVING COUNT(CustomerID) > 5)
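To try the subquery approach without a database server, here is a small sketch using Python's built-in sqlite3 module. The table contents and the helper name customers_in_big_countries are made up for illustration, and the threshold is lowered to 2 so the sample data can stay small.

```python
import sqlite3


def customers_in_big_countries(conn, min_customers):
    """Return every row from countries with more than min_customers customers."""
    query = """
        SELECT * FROM Customers
        WHERE Country IN (
            SELECT Country FROM Customers
            GROUP BY Country
            HAVING COUNT(CustomerID) > ?
        )
        ORDER BY CustomerID
    """
    return conn.execute(query, (min_customers,)).fetchall()


# Hypothetical sample data: three customers in Germany, one each elsewhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Country TEXT)")
conn.executemany(
    "INSERT INTO Customers (CustomerID, Country) VALUES (?, ?)",
    [(1, "Germany"), (2, "Germany"), (3, "Germany"), (4, "France"), (5, "Mexico")],
)

big = customers_in_big_countries(conn, 2)
```

The inner query still collapses each country to one row, but it is only used to pick the qualifying countries; the outer query then returns every customer row for them.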
Select first row in each GROUP BY group?
On databases that support CTE and windowing functions:
WITH summary AS (
    SELECT p.id,
           p.customer,
           p.total,
           ROW_NUMBER() OVER(PARTITION BY p.customer
                             ORDER BY p.total DESC) AS rn
    FROM PURCHASES p)
SELECT *
FROM summary
WHERE rn = 1
Supported by any database, but you need to add logic to break ties:
SELECT MIN(x.id), -- change to MAX if you want the highest
       x.customer,
       x.total
FROM PURCHASES x
JOIN (SELECT p.customer,
             MAX(total) AS max_total
      FROM PURCHASES p
      GROUP BY p.customer) y ON y.customer = x.customer
                            AND y.max_total = x.total
GROUP BY x.customer, x.total
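The tie-breaking behaviour of the join version is easier to see on data where a customer has two purchases with the same top total. The sketch below runs the query through Python's sqlite3 module; the sample rows are invented, and an ORDER BY is added only to make the result order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PURCHASES (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER)"
)
# Hypothetical data: Joe has two purchases tied at 5, Sally two tied at 3.
conn.executemany(
    "INSERT INTO PURCHASES (id, customer, total) VALUES (?, ?, ?)",
    [(1, "Joe", 5), (2, "Sally", 3), (3, "Joe", 2), (4, "Sally", 3), (5, "Joe", 5)],
)

query = """
    SELECT MIN(x.id), x.customer, x.total
    FROM PURCHASES x
    JOIN (SELECT p.customer, MAX(p.total) AS max_total
          FROM PURCHASES p
          GROUP BY p.customer) y
      ON y.customer = x.customer AND y.max_total = x.total
    GROUP BY x.customer, x.total
    ORDER BY x.customer
"""
top = conn.execute(query).fetchall()
```

Without MIN(x.id) and the outer GROUP BY, the join alone would return both tied rows for each customer; the aggregate is what reduces each tie to a single id.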
Get top 1 row of each group
;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1
If you expect two entries per day, this will arbitrarily pick one of them. To get both entries for a day, use DENSE_RANK instead.
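The difference between the two ranking functions on tied dates can be checked with a small sketch in Python's sqlite3 module. It assumes the bundled SQLite is 3.25 or newer (needed for window functions); the ID and Status columns, the sample rows, and the helper name latest_ids are hypothetical, since the source only names DocumentStatusLogs, DocumentID and DateCreated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs ("
    "ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
# Hypothetical data: document 1 has two entries on its latest day.
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (1, 1, "S1", "2011-07-29"),
        (2, 1, "S2", "2011-07-30"),
        (3, 1, "S3", "2011-07-30"),
        (4, 2, "S1", "2011-08-01"),
    ],
)


def latest_ids(conn, ranking_function):
    """Return the IDs ranked first per document by the given window function."""
    # The name is interpolated into the SQL text, so only allow known values.
    if ranking_function not in ("ROW_NUMBER", "DENSE_RANK"):
        raise ValueError("unsupported ranking function")
    query = f"""
        WITH cte AS (
            SELECT ID,
                   {ranking_function}() OVER (
                       PARTITION BY DocumentID ORDER BY DateCreated DESC
                   ) AS rn
            FROM DocumentStatusLogs
        )
        SELECT ID FROM cte WHERE rn = 1 ORDER BY ID
    """
    return [row[0] for row in conn.execute(query)]


row_number_ids = latest_ids(conn, "ROW_NUMBER")  # one row per document
dense_rank_ids = latest_ids(conn, "DENSE_RANK")  # keeps every tied row
```

With ROW_NUMBER, which of the two tied rows survives for document 1 is not defined; adding a second ORDER BY column (for example the ID) would make the choice deterministic.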
As for normalised or not, it depends if you want to:
- maintain status in 2 places
- preserve status history
- ...
As it stands, you preserve status history. If you also want the latest status in the parent table (which is denormalisation), you'd need a trigger to maintain "status" in the parent, or you could drop this status history table.
MySQL GROUP BY returns only first row
Thank you everyone for pointing out the obvious mistake I was too blind to see. I finally replaced GROUP BY with ORDER BY and included a WHERE clause to get my desired result. That is what I was intending to use all along. Silly me.
My final query is:
SELECT * FROM forms WHERE `GROUP`='SomeGroup' ORDER BY `GROUP`