Select First and Last Row from Grouped Data

There is probably a faster way:

df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  filter(row_number() == 1 | row_number() == n())
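For readers working in pandas instead, the same sort-then-keep-first-and-last idea can be sketched as follows (the data is made up to mirror the question's id/stopSequence columns):

```python
import pandas as pd

# Hypothetical data mirroring the question's columns
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "stopSequence": [3, 1, 2, 2, 1],
})

# Sort first, then keep each group's first and last row
g = df.sort_values(["id", "stopSequence"]).groupby("id")
out = (pd.concat([g.head(1), g.tail(1)])
         .sort_values(["id", "stopSequence"])
         .reset_index(drop=True))
```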

Select first and last row for each group and take the column value difference in MySQL?

Using a MySQL 8.0 / MariaDB 10.2+ window function:

SELECT symbol,
       LAST - FIRST AS price_change
FROM (
    SELECT DISTINCT symbol,
           first_value(price) OVER w AS FIRST,
           last_value(price) OVER w AS LAST
    FROM ticks
    WINDOW w AS (PARTITION BY symbol
                 ORDER BY date
                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
) AS p


Get last row of each group in R

You might try:

a %>%
  group_by(ID) %>%
  arrange(NUM) %>%
  slice(n())

Get only the first and last rows of each group with pandas

Use groupby, find the head and tail for each group, and concat the two.

g = df.groupby('ID')

(pd.concat([g.head(1), g.tail(1)])
.drop_duplicates()
.sort_values('ID')
.reset_index(drop=True))

    Time ID   X    Y
0   8:00  A  23  100
1  20:00  A  35  220
2   9:00  B  24  110
3  23:00  B  38  250
4  11:00  C  26  130
5  22:00  C  37  240
6  15:00  D  30  170

If you can guarantee each ID group has at least two rows, the drop_duplicates call is not needed.


Details

g.head(1)

    Time ID   X    Y
0   8:00  A  23  100
1   9:00  B  24  110
3  11:00  C  26  130
7  15:00  D  30  170

g.tail(1)

     Time ID   X    Y
7   15:00  D  30  170
12  20:00  A  35  220
14  22:00  C  37  240
15  23:00  B  38  250

pd.concat([g.head(1), g.tail(1)])

     Time ID   X    Y
0    8:00  A  23  100
1    9:00  B  24  110
3   11:00  C  26  130
7   15:00  D  30  170
7   15:00  D  30  170
12  20:00  A  35  220
14  22:00  C  37  240
15  23:00  B  38  250
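The role of drop_duplicates can be checked with a minimal sketch (made-up data with a deliberate one-row group): the singleton appears in both head(1) and tail(1), and deduplication removes the repeat.

```python
import pandas as pd

# Made-up data; group "D" deliberately has a single row
df = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "D"],
    "X": [23, 35, 24, 38, 30],
})

g = df.groupby("ID")
out = (pd.concat([g.head(1), g.tail(1)])
         .drop_duplicates()
         .sort_values("ID")
         .reset_index(drop=True))
```

Note that drop_duplicates compares row values, so a two-row group whose rows are completely identical would also be collapsed; filtering on the duplicated index instead (out[~out.index.duplicated()] before the reset) avoids that edge case.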

Use dplyr to take first and last row in a sequence by group

I think that slice should get you close:

df %>%
  group_by(A, C) %>%
  slice(c(1, n()))

gives

      A     B     C
  <chr> <int> <dbl>
1     a     2     0
2     a     6     0
3     a     1     1
4     a    10     1
5     b     1     0
6     b     9     0
7     b     4     1
8     b    10     1

though this doesn't quite match your expected outcome. Note that n() returns the number of rows in the group, so slice(c(1, n())) selects the first and last rows.

After your edit, it is clear that you are not looking for the first and last values within each established group (which is what my previous version did); you want to group by those runs of 1's or 0's. For that, create a column that flags where the run of 1's/0's changes, then take a cumulative sum of those flags to identify the runs. slice then works as described before. However, because some of your runs are only one row long, n() should only be included when it is greater than 1 (otherwise the single row shows up twice).

df %>%
  mutate(groupChanged = (C != lag(C, default = C[1])),
         toCutBy = cumsum(groupChanged)) %>%
  group_by(toCutBy) %>%
  slice(c(1, ifelse(n() == 1, NA, n())))

Gives

       A     B     C groupChanged toCutBy
   <chr> <int> <dbl>        <lgl>   <int>
 1     a     1     1        FALSE       0
 2     a     2     0         TRUE       1
 3     a     6     0        FALSE       1
 4     a     7     1         TRUE       2
 5     a    10     1        FALSE       2
 6     b     1     0         TRUE       3
 7     b     3     0        FALSE       3
 8     b     4     1         TRUE       4
 9     b     5     0         TRUE       5
10     b     9     0        FALSE       5
11     b    10     1         TRUE       6

If the runs of 1 or 0 must stay within a level of column A, you also need to add a check for a change in column A. In this example it has no effect (exactly the same rows are returned), but it may be desirable in other instances.

df %>%
  mutate(groupChanged = (C != lag(C, default = C[1]) |
                         A != lag(A, default = A[1])),
         toCutBy = cumsum(groupChanged)) %>%
  group_by(toCutBy) %>%
  slice(c(1, ifelse(n() == 1, NA, n())))
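The same run-labelling trick (flag changes, cumulative-sum the flags) translates directly to pandas; a sketch with made-up data in the shape of the question:

```python
import pandas as pd

# Made-up data shaped like the question's A/B/C columns
df = pd.DataFrame({
    "A": ["a", "a", "a", "a", "a"],
    "B": [1, 2, 6, 7, 10],
    "C": [1, 0, 0, 0, 1],
})

# Label each run of consecutive identical C values
run_id = (df["C"] != df["C"].shift()).cumsum()
g = df.groupby(run_id)

# First and last row of each run; the duplicated-index filter
# plays the role of the ifelse() guard for one-row runs
out = pd.concat([g.head(1), g.tail(1)]).sort_index()
out = out[~out.index.duplicated()]
```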

Select start from first row and end from last row by group in R

Here is my approach with dplyr and rle():

library(dplyr)

my.data %>%
  mutate(rlid = rep(seq_along(rle(value)$values), rle(value)$lengths)) %>%
  group_by(rlid, ID) %>%
  summarize(value = first(value), start = min(start), end = max(end)) %>%
  ungroup() %>%
  select(-rlid)

Returns:

# A tibble: 5 x 4
  ID    value start      end
  <chr> <dbl> <date>     <date>
1 A         1 2020-01-01 2020-05-01
2 A         2 2020-06-03 2020-07-04
3 A         1 2020-07-04 NA
4 B         2 2020-02-02 2020-04-04
5 B         3 2020-04-04 NA

(Data used)

my.data <- structure(list(
  ID = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  value = c(1, 1, 1, 2, 1, 2, 2, 3, 3),
  start = structure(c(18262, 18322, 18353, 18416, 18447,
                      18294, 18324, 18356, 18387), class = "Date"),
  end = structure(c(18322, 18353, 18383, 18447, NA,
                    18324, 18356, 18387, NA), class = "Date")),
  class = "data.frame", row.names = c(NA, -9L))
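For comparison, a pandas sketch of the same rle idea: flag a new run whenever ID or value changes, then aggregate start/end per run (made-up dates, not the answer's data):

```python
import pandas as pd

# Made-up intervals shaped like my.data
df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B"],
    "value": [1, 1, 2, 3, 3],
    "start": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01",
                             "2020-04-01", "2020-05-01"]),
    "end": pd.to_datetime(["2020-02-01", "2020-03-01", "2020-04-01",
                           "2020-05-01", "2020-06-01"]),
})

# New run whenever ID or value differs from the previous row
changed = (df["ID"] != df["ID"].shift()) | (df["value"] != df["value"].shift())

out = (df.assign(rlid=changed.cumsum())
         .groupby(["rlid", "ID"], sort=False)
         .agg(value=("value", "first"),
              start=("start", "min"),
              end=("end", "max"))
         .reset_index()
         .drop(columns="rlid"))
```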

Comparing first and last row of groupby and creating new value

Not sure if it is efficient enough, but it works well.

def check_status(group):
    # Keep only the first and last row of the group
    selected = [False] * len(group)
    selected[0] = selected[-1] = True
    new_group = group[selected].copy()  # copy to avoid SettingWithCopyWarning
    # If the two kept categories differ, every kept value is unique
    new_group['status'] = 'change' if new_group.category.is_unique else 'no change'
    return new_group

print(df.groupby('email').apply(check_status).reset_index(drop=True))
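A self-contained run of the same idea with a tiny made-up frame (the question's email/category columns are assumed; group_keys=False keeps a flat index):

```python
import pandas as pd

# Made-up data: one user changes category, one does not
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "a@x.com", "b@x.com", "b@x.com"],
    "category": ["basic", "basic", "premium", "basic", "basic"],
})

def check_status(group):
    # Keep the first and last row of the group
    kept = group.iloc[[0, -1]].copy()
    # Two distinct categories => the category changed
    kept["status"] = "change" if kept["category"].is_unique else "no change"
    return kept

out = (df.groupby("email", group_keys=False)
         .apply(check_status)
         .reset_index(drop=True))
```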

Select first and last record of each group

You cannot use FIRST_VALUE directly here because it is a window function, not an aggregate function.

You need to embed it in a subquery/derived table and use aggregation over it.

Also, it needs a PARTITION BY clause, as well as ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING (without the ROWS clause the default frame ends at the current row, so LAST_VALUE would simply return the current row's value).

SELECT 
cih.Code
,cih.Date
,MIN(cih.OpenPerDay) AS [Open]
,MAX(cih.[Close]) AS High
,MIN(cih.[Close]) AS Low
,MIN(cih.ClosePerDay) AS [Close]
FROM (
SELECT
*
,CAST(cih.Time AS date) AS Date
,FIRST_VALUE(cih.[Close]) OVER (PARTITION BY cih.Code, CAST(cih.Time AS date)
ORDER BY cih.Time ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS OpenPerDay
,LAST_VALUE(cih.[Close]) OVER (PARTITION BY cih.Code, CAST(cih.Time AS date)
ORDER BY cih.Time ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ClosePerDay
FROM CurrencyIntradayHistories cih
) cih
GROUP BY
cih.Code,
cih.Date
ORDER BY
cih.Date;

A slightly more efficient version uses ROW_NUMBER and LEAD to find the starting and ending rows.

SELECT 
cih.Code
,cih.Date
,MIN(CASE WHEN cih.rn = 1 THEN cih.[Close] END) AS [Open]
,MAX(cih.[Close]) AS High
,MIN(cih.[Close]) AS Low
,MIN(CASE WHEN cih.NextClose IS NULL THEN cih.[Close] END) AS [Close]
FROM (
SELECT
*
,CAST(cih.Time AS date) AS Date
,ROW_NUMBER() OVER (PARTITION BY cih.Code, CAST(cih.Time AS date) ORDER BY cih.Time) AS rn
,LEAD(cih.[Close]) OVER (PARTITION BY cih.Code, CAST(cih.Time AS date)
ORDER BY cih.Time) AS NextClose
FROM CurrencyIntradayHistories cih
) cih
GROUP BY
cih.Code,
cih.Date
ORDER BY
cih.Date;
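Outside SQL, the same per-day open/high/low/close roll-up is short in pandas, since the "first" and "last" aggregations respect row order after a sort (a sketch; the Code/Time/Close column names follow the SQL above, the data is made up):

```python
import pandas as pd

# Made-up intraday ticks
ticks = pd.DataFrame({
    "Code": ["EUR"] * 4,
    "Time": pd.to_datetime(["2024-01-02 09:00", "2024-01-02 12:00",
                            "2024-01-02 15:00", "2024-01-03 09:00"]),
    "Close": [1.10, 1.15, 1.08, 1.12],
})

# Sort by time so "first"/"last" mean open/close of the day
out = (ticks.sort_values("Time")
            .assign(Date=lambda d: d["Time"].dt.date)
            .groupby(["Code", "Date"])
            .agg(Open=("Close", "first"), High=("Close", "max"),
                 Low=("Close", "min"), Close=("Close", "last"))
            .reset_index())
```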


How to get the first and the last record per group in SQL Server 2008?

How about using ROW_NUMBER:


WITH Cte AS(
SELECT *,
RnAsc = ROW_NUMBER() OVER(PARTITION BY [group] ORDER BY val),
RnDesc = ROW_NUMBER() OVER(PARTITION BY [group] ORDER BY val DESC)
FROM tbl
)
SELECT
id, [group], val, start, [end]
FROM Cte
WHERE
RnAsc = 1 OR RnDesc = 1
ORDER BY [group], val

Calculating the difference between first and last row in each group

(Assuming dplyr. This does not assume that date is guaranteed to be in order; if it is, one could also use first(.)/last(.) for the same results. I tend to prefer not trusting row order.)

If your discount is always 0/1 and you are looking to group by contiguous same-values, then

dat %>%
  group_by(discountgrp = cumsum(discount != lag(discount, default = discount[1]))) %>%
  summarize(change = price[which.max(date)] - price[which.min(date)])
# # A tibble: 2 x 2
#   discountgrp change
#         <int>  <dbl>
# 1           0 -0.871
# 2           1 -0.481

If your discount is instead a categorical value and can exceed 1, then

dat %>%
  group_by(discount) %>%
  summarize(change = price[which.max(date)] - price[which.min(date)])
# # A tibble: 2 x 2
#   discount change
#      <dbl>  <dbl>
# 1        0 -0.871
# 2        1 -0.481

They happen to be the same here, but if the row order were changed such that some of the 1s occurred in the middle of 0s (for instance), then the groups would be different.
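The same order-agnostic approach carries over to pandas: index by the argmin/argmax of the date rather than positional first/last (a sketch with made-up numbers, not the answer's data):

```python
import pandas as pd

# Made-up data; rows deliberately not in date order
dat = pd.DataFrame({
    "discount": [0, 0, 1, 1],
    "date": pd.to_datetime(["2024-01-02", "2024-01-01",
                            "2024-01-03", "2024-01-04"]),
    "price": [10.0, 12.0, 9.0, 7.5],
})

def change(g):
    # Price at the latest date minus price at the earliest date
    return g.loc[g["date"].idxmax(), "price"] - g.loc[g["date"].idxmin(), "price"]

out = dat.groupby("discount").apply(change)
```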


