Select first and last row from grouped data
There is probably a faster way:
df %>%
group_by(id) %>%
arrange(stopSequence) %>%
filter(row_number()==1 | row_number()==n())
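For comparison, the same first-and-last filter can be sketched in pandas; the frame below is made-up sample data with the same `id`/`stopSequence` columns:

```python
import pandas as pd

df = pd.DataFrame({
    "id":           [1, 1, 1, 2, 2],
    "stopSequence": [3, 1, 2, 2, 1],
})

# Sort, then keep each group's first and last row, mirroring
# filter(row_number() == 1 | row_number() == n()) after arrange().
out = (df.sort_values("stopSequence")
         .groupby("id")
         .nth([0, -1]))
```

`nth([0, -1])` selects the first and last row per group in one pass; a one-row group is returned once rather than duplicated.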
Select first and last row for each group and take the column value difference in MySQL?
Using a MySQL 8.0 / MariaDB 10.2+ window function:
SELECT symbol,
       LAST - FIRST AS price_change
FROM (
    SELECT DISTINCT symbol,
           first_value(price) OVER w AS FIRST,
           last_value(price) OVER w AS LAST
    FROM ticks
    WINDOW w AS (PARTITION BY symbol
                 ORDER BY date
                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
) AS p;
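The same first-value/last-value difference can be sketched in pandas; the `ticks` frame below is invented data with the columns the SQL assumes:

```python
import pandas as pd

ticks = pd.DataFrame({
    "symbol": ["AAA", "AAA", "BBB", "BBB"],
    "date":   pd.to_datetime(["2024-01-01", "2024-01-02",
                              "2024-01-01", "2024-01-02"]),
    "price":  [10.0, 12.5, 100.0, 95.0],
})

# first_value/last_value over an ordered partition translate to the
# first and last element after sorting by date within each symbol.
change = (ticks.sort_values("date")
               .groupby("symbol")["price"]
               .agg(lambda s: s.iloc[-1] - s.iloc[0]))
```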
Get last row of each group in R
You might try:
a %>%
group_by(ID) %>%
arrange(NUM) %>%
slice(n())
Get only the first and last rows of each group with pandas
Use groupby, find the head and tail of each group, and concat the two.
g = df.groupby('ID')
(pd.concat([g.head(1), g.tail(1)])
.drop_duplicates()
.sort_values('ID')
.reset_index(drop=True))
Time ID X Y
0 8:00 A 23 100
1 20:00 A 35 220
2 9:00 B 24 110
3 23:00 B 38 250
4 11:00 C 26 130
5 22:00 C 37 240
6 15:00 D 30 170
If you can guarantee each ID group has at least two rows, the drop_duplicates call is not needed.
Details
g.head(1)
Time ID X Y
0 8:00 A 23 100
1 9:00 B 24 110
3 11:00 C 26 130
7 15:00 D 30 170
g.tail(1)
Time ID X Y
7 15:00 D 30 170
12 20:00 A 35 220
14 22:00 C 37 240
15 23:00 B 38 250
pd.concat([g.head(1), g.tail(1)])
Time ID X Y
0 8:00 A 23 100
1 9:00 B 24 110
3 11:00 C 26 130
7 15:00 D 30 170
7 15:00 D 30 170
12 20:00 A 35 220
14 22:00 C 37 240
15 23:00 B 38 250
Use dplyr to take first and last row in a sequence by group
I think that slice should get you close:
df %>%
group_by(A,C) %>%
slice(c(1, n()))
gives
A B C
<chr> <int> <dbl>
1 a 2 0
2 a 6 0
3 a 1 1
4 a 10 1
5 b 1 0
6 b 9 0
7 b 4 1
8 b 10 1
though this doesn't quite match your expected outcome. n() gives the last row in the group.
After your edit it is clear that you are not looking for the values within any group that is established (which is what my previous version did). You want to group by those runs of 1's or 0's. For that, you will need to create a column that checks whether the run of 1's/0's has changed, and then another to identify the groups. Then slice will work as described before. However, because some of your runs are only 1 row long, we need to include n() only when it is more than 1 (otherwise the 1-row group shows up twice).
df %>%
mutate(groupChanged = (C != lag(C, default = C[1]))
, toCutBy = cumsum(groupChanged)
) %>%
group_by(toCutBy) %>%
slice(c(1, ifelse(n() == 1, NA, n())))
Gives
A B C groupChanged toCutBy
<chr> <int> <dbl> <lgl> <int>
1 a 1 1 FALSE 0
2 a 2 0 TRUE 1
3 a 6 0 FALSE 1
4 a 7 1 TRUE 2
5 a 10 1 FALSE 2
6 b 1 0 TRUE 3
7 b 3 0 FALSE 3
8 b 4 1 TRUE 4
9 b 5 0 TRUE 5
10 b 9 0 FALSE 5
11 b 10 1 TRUE 6
If the runs of 1 or 0 must stay within the level in column A, you also need to add a check for a change in column A to the call. In this example, it has no effect (it returns exactly the same values), but it may be desirable in other instances.
df %>%
mutate(groupChanged = (C != lag(C, default = C[1]) |
A != lag(A, default = A[1]))
, toCutBy = cumsum(groupChanged)
) %>%
group_by(toCutBy) %>%
slice(c(1, ifelse(n() == 1, NA, n())))
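The same run-detection idea can be sketched in pandas on the data shown above. Note the run ids come out one higher than toCutBy, because shift() yields NaN for the first row so the first comparison is True; the grouping itself is identical:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["a"] * 5 + ["b"] * 6,
    "B": [1, 2, 6, 7, 10, 1, 3, 4, 5, 9, 10],
    "C": [1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1],
})

# A run starts whenever C (or A) differs from the previous row;
# cumsum over that boolean yields a run id, like toCutBy above.
changed = (df["C"] != df["C"].shift()) | (df["A"] != df["A"].shift())
df["run"] = changed.cumsum()

# First and last row of each run; drop_duplicates handles one-row runs,
# which would otherwise appear twice.
g = df.groupby("run")
ends = pd.concat([g.head(1), g.tail(1)]).drop_duplicates().sort_index()
```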
Select start from first row and end from last row by group in R
Here is my approach with dplyr and rle():
library(dplyr)
my.data %>%
mutate(rlid = rep(seq_along(rle(value)$values), rle(value)$lengths)) %>%
group_by(rlid, ID) %>%
summarize(value = first(value), start = min(start), end = max(end)) %>%
ungroup() %>%
select(-rlid)
Returns:
# A tibble: 5 x 4
ID value start end
<chr> <dbl> <date> <date>
1 A 1 2020-01-01 2020-05-01
2 A 2 2020-06-03 2020-07-04
3 A 1 2020-07-04 NA
4 B 2 2020-02-02 2020-04-04
5 B 3 2020-04-04 NA
(Data used)
my.data <- structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B", "B"), value = c(1, 1, 1, 2, 1, 2, 2, 3, 3), start = structure(c(18262, 18322, 18353, 18416, 18447, 18294, 18324, 18356, 18387), class = "Date"), end = structure(c(18322, 18353, 18383, 18447, NA, 18324, 18356, 18387, NA), class = "Date")), class = "data.frame", row.names = c(NA, -9L))
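The rle() trick translates directly to pandas: a new run starts when ID or value changes. The frame below reproduces the R data; `s.max(skipna=False)` mirrors R's max() propagating NA, so open-ended runs keep their NA end date:

```python
import pandas as pd

my_data = pd.DataFrame({
    "ID":    ["A"] * 5 + ["B"] * 4,
    "value": [1, 1, 1, 2, 1, 2, 2, 3, 3],
    "start": pd.to_datetime(["2020-01-01", "2020-03-01", "2020-04-01",
                             "2020-06-03", "2020-07-04",
                             "2020-02-02", "2020-03-03", "2020-04-04",
                             "2020-05-05"]),
    "end":   pd.to_datetime(["2020-03-01", "2020-04-01", "2020-05-01",
                             "2020-07-04", None,
                             "2020-03-03", "2020-04-04", "2020-05-05",
                             None]),
})

# rle() equivalent: a run id that increments when ID or value changes.
rlid = ((my_data["ID"] != my_data["ID"].shift()) |
        (my_data["value"] != my_data["value"].shift())).cumsum()

out = (my_data.groupby(rlid)
              .agg(ID=("ID", "first"), value=("value", "first"),
                   start=("start", "min"),
                   end=("end", lambda s: s.max(skipna=False)))
              .reset_index(drop=True))
```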
Comparing first and last row of groupby and creating new value
Not sure if it is efficient enough, but it works well.
def check_status(group):
    # Keep only the first and last row of the group.
    selected = [False] * len(group)
    selected[0] = selected[-1] = True
    new_group = group[selected].copy()  # copy to avoid SettingWithCopyWarning
    # If the two categories differ, the group's status changed.
    new_group['status'] = 'change' if new_group.category.is_unique else 'no change'
    return new_group
print(df.groupby('email').apply(check_status).reset_index(drop=True))
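The per-group apply can also be expressed with vectorised transforms, which avoids calling a Python function per group; the email/category values below are purely illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "email":    ["x@x", "x@x", "y@y", "y@y"],
    "category": ["a", "b", "c", "c"],
})

# Compare each group's first and last category directly.
g = df.groupby("email")["category"]
status = g.transform("first") != g.transform("last")
df["status"] = status.map({True: "change", False: "no change"})
```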
Select first and last record of each group
You cannot use FIRST_VALUE directly here because it is a window function, not an aggregate function. You need to embed it in a subquery/derived table and aggregate over it. It also needs a PARTITION BY clause, as well as ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
SELECT
cih.Code
,cih.Date
,MIN(cih.OpenPerDay) AS [Open]
,MAX(cih.[Close]) AS High
,MIN(cih.[Close]) AS Low
,MIN(cih.ClosePerDay) AS [Close]
FROM (
SELECT
*
,CAST(cih.Time AS date) AS Date
,FIRST_VALUE(cih.[Close]) OVER (PARTITION BY cih.Code, CAST(cih.Time AS date)
ORDER BY cih.Time ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS OpenPerDay
,LAST_VALUE(cih.[Close]) OVER (PARTITION BY cih.Code, CAST(cih.Time AS date)
ORDER BY cih.Time ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ClosePerDay
FROM CurrencyIntradayHistories cih
) cih
GROUP BY
cih.Code,
cih.Date
ORDER BY
cih.Date;
A slightly more efficient version uses ROW_NUMBER and LEAD to find the starting and ending rows.
SELECT
cih.Code
,cih.Date
,MIN(CASE WHEN cih.rn = 1 THEN cih.[Close] END) AS [Open]
,MAX(cih.[Close]) AS High
,MIN(cih.[Close]) AS Low
,MIN(CASE WHEN cih.NextClose IS NULL THEN cih.[Close] END) AS [Close]
FROM (
SELECT
*
,CAST(cih.Time AS date) AS Date
,ROW_NUMBER() OVER (PARTITION BY cih.Code, CAST(cih.Time AS date) ORDER BY cih.Time) AS rn
,LEAD(cih.[Close]) OVER (PARTITION BY cih.Code, CAST(cih.Time AS date)
ORDER BY cih.Time) AS NextClose
FROM CurrencyIntradayHistories cih
) cih
GROUP BY
cih.Code,
cih.Date
ORDER BY
cih.Date;
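The derived-table logic above can be mirrored in pandas. The column names follow the SQL; the data is invented:

```python
import pandas as pd

ticks = pd.DataFrame({
    "Code":  ["EUR", "EUR", "EUR", "USD", "USD"],
    "Time":  pd.to_datetime(["2024-01-01 09:00", "2024-01-01 12:00",
                             "2024-01-01 17:00", "2024-01-01 09:00",
                             "2024-01-01 17:00"]),
    "Close": [1.10, 1.15, 1.12, 0.90, 0.95],
})

# Per (Code, day): Open = first Close, Close = last Close,
# High/Low = max/min -- the aggregation the SQL performs.
daily = (ticks.sort_values("Time")
              .assign(Date=lambda d: d["Time"].dt.date)
              .groupby(["Code", "Date"])["Close"]
              .agg(Open="first", High="max", Low="min", Close="last"))
```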
How to get the first and the last record per group in SQL Server 2008?
How about using ROW_NUMBER:
WITH Cte AS(
SELECT *,
RnAsc = ROW_NUMBER() OVER(PARTITION BY [group] ORDER BY val),
RnDesc = ROW_NUMBER() OVER(PARTITION BY [group] ORDER BY val DESC)
FROM tbl
)
SELECT
id, [group], val, start, [end]
FROM Cte
WHERE
RnAsc = 1 OR RnDesc = 1
ORDER BY [group], val
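The ascending and descending ROW_NUMBER columns translate to two group-wise ranks in pandas; a sketch on toy data:

```python
import pandas as pd

tbl = pd.DataFrame({
    "group": ["g1", "g1", "g1", "g2", "g2"],
    "val":   [5, 1, 3, 7, 2],
})

# RnAsc / RnDesc: row numbers per group by val, ascending and descending.
rn_asc = tbl.groupby("group")["val"].rank(method="first")
rn_desc = tbl.groupby("group")["val"].rank(method="first", ascending=False)

# Keep rows where either rank is 1, i.e. the group min and max.
out = tbl[(rn_asc == 1) | (rn_desc == 1)].sort_values(["group", "val"])
```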
Calculating the difference between first and last row in each group
(Assuming dplyr.) Not assuming that date is guaranteed to be in order; if it is, then one could also use first(.)/last(.) for the same results. (I tend to prefer not trusting order.)
If your discount is always 0/1 and you are looking to group by contiguous same-values, then
dat %>%
group_by(discountgrp = cumsum(discount != lag(discount, default = discount[1]))) %>%
summarize(change = price[which.max(date)] - price[which.min(date)])
# # A tibble: 2 x 2
# discountgrp change
# <int> <dbl>
# 1 0 -0.871
# 2 1 -0.481
If your discount is instead a categorical value and can exceed 1, then
dat %>%
group_by(discount) %>%
summarize(change = price[which.max(date)] - price[which.min(date)])
# # A tibble: 2 x 2
# discount change
# <dbl> <dbl>
# 1 0 -0.871
# 2 1 -0.481
They happen to be the same here, but if the row order were changed such that some of the 1s occurred in the middle of 0s (for instance), then the groups would be different.
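The contiguous-group variant can be sketched in pandas under the same no-trusted-order assumption; `dat` below is toy data with the columns the dplyr code expects:

```python
import pandas as pd

dat = pd.DataFrame({
    "date":     pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03",
                                "2024-01-04", "2024-01-05"]),
    "price":    [10.0, 9.5, 9.0, 8.8, 8.5],
    "discount": [0, 0, 1, 1, 0],
})

# Contiguous same-value groups, like cumsum(discount != lag(discount)).
grp = (dat["discount"] != dat["discount"].shift()).cumsum()

# Price change from the earliest to the latest date within each run.
change = (dat.sort_values("date")
             .groupby(grp)["price"]
             .agg(lambda s: s.iloc[-1] - s.iloc[0]))
```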