Select first and last row from grouped data
There is probably a faster way:
df %>%
group_by(id) %>%
arrange(stopSequence) %>%
filter(row_number() == 1 | row_number() == n())
Use dplyr to take first and last row in a sequence by group
I think that slice should get you close:
df %>%
group_by(A,C) %>%
slice(c(1, n()))
gives
A B C
<chr> <int> <dbl>
1 a 2 0
2 a 6 0
3 a 1 1
4 a 10 1
5 b 1 0
6 b 9 0
7 b 4 1
8 b 10 1
though this doesn't quite match your expected outcome; n() gives the last row in the group.
After your edit it is clear that you are not looking for the first and last values within each established group (which is what my previous version did). You want to group by the runs of 1's or 0's. For that, you will need to create a column that checks whether or not the run of 1's/0's has changed, and then one to identify the groups. Then slice will work as described before. However, because some of your runs are only 1 row long, we need to only include n() if the group has more than 1 row (otherwise the 1 row shows up twice).
df %>%
mutate(groupChanged = (C != lag(C, default = C[1]))
, toCutBy = cumsum(groupChanged)
) %>%
group_by(toCutBy) %>%
slice(c(1, ifelse(n() == 1, NA, n())))
Gives
A B C groupChanged toCutBy
<chr> <int> <dbl> <lgl> <int>
1 a 1 1 FALSE 0
2 a 2 0 TRUE 1
3 a 6 0 FALSE 1
4 a 7 1 TRUE 2
5 a 10 1 FALSE 2
6 b 1 0 TRUE 3
7 b 3 0 FALSE 3
8 b 4 1 TRUE 4
9 b 5 0 TRUE 5
10 b 9 0 FALSE 5
11 b 10 1 TRUE 6
If the runs of 1 or 0 must stay within a level of column A, you also need to add a check for a change in column A to the mutate call. In this example it has no effect (exactly the same values are returned), but it may be desirable in other instances.
df %>%
mutate(groupChanged = (C != lag(C, default = C[1]) |
A != lag(A, default = A[1]))
, toCutBy = cumsum(groupChanged)
) %>%
group_by(toCutBy) %>%
slice(c(1, ifelse(n() == 1, NA, n())))
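A slightly simpler way to avoid duplicating a one-row group, shown here on the simpler grouping and assuming the same df as above, is to wrap the indices in unique() instead of mapping the second index to NA; a minimal sketch:

```r
df %>%
  mutate(groupChanged = (C != lag(C, default = C[1]))
         , toCutBy = cumsum(groupChanged)
  ) %>%
  group_by(toCutBy) %>%
  # unique() collapses c(1, 1) to a single index for one-row groups
  slice(unique(c(1, n())))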
How to select last N observation from each group in dplyr dataframe?
As it is a specific question based on dplyr, several options:
1) After the group_by, use slice on the row_number():
library(tidyverse)
df %>%
group_by(a) %>%
slice(tail(row_number(), 2))
# A tibble: 8 x 2
# Groups: a [4]
# a b
# <dbl> <dbl>
#1 1 343
#2 1 54
#3 2 55
#4 2 62
#5 3 59
#6 3 -9
#7 4 0
#8 4 -0.5
2) Or use filter from dplyr:
df %>%
group_by(a) %>%
filter(row_number() >= (n() - 1))
3) Or with do and tail:
df %>%
group_by(a) %>%
do(tail(., 2))
4) In addition to the tidyverse methods, we can also use the compact data.table syntax:
library(data.table)
setDT(df)[df[, .I[tail(seq_len(.N), 2)], a]$V1]
5) Or by from base R:
by(df, df$a, FUN = tail, 2)
6) Or with aggregate from base R:
df[aggregate(c ~ a, transform(df, c = seq_len(nrow(df))), FUN = tail, 2)$c,]
7) Or with split from base R:
do.call(rbind, lapply(split(df, df$a), tail, 2))
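8) Or, assuming a newer dplyr (1.0.0 or later), slice_tail expresses "last N rows per group" directly:

```r
library(dplyr)

# slice_tail() keeps the last n rows of each group
df %>%
  group_by(a) %>%
  slice_tail(n = 2) %>%
  ungroup()
```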
paste together first and last value by group
You could do:
library(dplyr)
df %>%
group_by(group) %>%
mutate(
Key = paste(link[1], link[n()], sep = ", ")
)
Though that wouldn't match your desired output. In your example data frame you have e.g. group 91, where there's only 1 value. The above code would give you 9|10 repeated as both beginning and end.
If you'd like to only display one value in such cases, you can do:
df %>%
group_by(group) %>%
mutate(
Key = case_when(
n() > 1 ~ paste(link[1], link[n()], sep = ", "),
TRUE ~ as.character(link)
)
)
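If you want a single row per group instead of repeating Key on every row, summarise can replace mutate; a sketch along the same lines, using the same df:

```r
df %>%
  group_by(group) %>%
  summarise(
    # one row per group; fall back to the lone value for single-row groups
    Key = if (n() > 1) paste(link[1], link[n()], sep = ", ") else as.character(link[1])
  )
```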
Calculating the difference between first and last row in each group
(Assuming dplyr.) Not assuming that date is guaranteed to be in order; if it is, then one could also use first(.)/last(.) for the same results. (I tend to prefer not trusting order.)
If your discount is always 0/1 and you are looking to group by contiguous same-values, then
dat %>%
group_by(discountgrp = cumsum(discount != lag(discount, default = discount[1]))) %>%
summarize(change = price[which.max(date)] - price[which.min(date)])
# # A tibble: 2 x 2
# discountgrp change
# <int> <dbl>
# 1 0 -0.871
# 2 1 -0.481
If your discount is instead a categorical value and can exceed 1, then
dat %>%
group_by(discount) %>%
summarize(change = price[which.max(date)] - price[which.min(date)])
# # A tibble: 2 x 2
# discount change
# <dbl> <dbl>
# 1 0 -0.871
# 2 1 -0.481
They happen to be the same here, but if the row order were changed such that some of the 1s occurred in the middle of the 0s (for instance), then the groups would be different.
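For reference, the first(.)/last(.) alternative mentioned above would look like this for the categorical case, after an explicit arrange to make the order trustworthy:

```r
dat %>%
  arrange(date) %>%
  group_by(discount) %>%
  # rows are now sorted by date, so first()/last() are safe
  summarize(change = last(price) - first(price))
```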
Get last row of each group in R
You might try:
a %>%
group_by(ID) %>%
arrange(NUM) %>%
slice(n())
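If a newer dplyr (1.0.0 or later) is available, slice_max can take the row with the largest NUM directly, without the separate arrange step; a sketch under that assumption:

```r
a %>%
  group_by(ID) %>%
  # with_ties = FALSE guarantees exactly one row per group
  slice_max(NUM, n = 1, with_ties = FALSE)
```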
Select the first and last row by group in a data frame
A plyr solution (tmp is your data frame):
library("plyr")
ddply(tmp, .(id), function(x) x[c(1, nrow(x)), ])
# id d gr mm area
# 1 15 1 2 3.4 1
# 2 15 1 1 5.5 2
# 3 21 1 1 4.0 2
# 4 21 1 2 3.8 2
# 5 22 1 1 4.0 2
# 6 22 1 2 4.6 2
# 7 23 1 1 2.7 2
# 8 23 1 2 3.0 2
# 9 24 1 1 3.0 2
# 10 24 1 2 2.0 3
Or with dplyr (see also here):
library("dplyr")
tmp %>%
group_by(id) %>%
slice(c(1, n())) %>%
ungroup()
# # A tibble: 10 × 5
# id d gr mm area
# <int> <int> <int> <dbl> <int>
# 1 15 1 2 3.4 1
# 2 15 1 1 5.5 2
# 3 21 1 1 4.0 2
# 4 21 1 2 3.8 2
# 5 22 1 1 4.0 2
# 6 22 1 2 4.6 2
# 7 23 1 1 2.7 2
# 8 23 1 2 3.0 2
# 9 24 1 1 3.0 2
# 10 24 1 2 2.0 3
Extract the first and last observation of each group, every time that group appears within a large dataset?
Here is a data.table approach. As mentioned by @Henrik in the comments, you can use rleid to create a new column to group by, instead of grouping on station directly, because station values repeat. Then, for each group, the first and last (.N) rows are kept. Note that unique is added to cover situations where only one row of data is present for a given group. I hope this may be a fast solution for you.
library(data.table)
setDT(df)
df[ , id := rleid(station)][ , .SD[unique(c(1, .N))], by = id]
Output
id tagID date station temp depth
1: 1 8272 2020-07-12 4 10 6.14
2: 1 8272 2020-07-13 4 11 21
3: 2 8272 2020-07-13 5 12 23.5
4: 2 8272 2020-07-16 5 10 15.4
5: 3 8272 2020-07-17 6 12 54
6: 3 8272 2020-07-29 6 12 23
7: 4 8272 2020-07-30 4 12 33.3
8: 4 8272 2020-08-04 4 9 32.7
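A roughly equivalent dplyr sketch, assuming dplyr 1.1.0 or later for consecutive_id() (the dplyr analogue of rleid):

```r
library(dplyr)

df %>%
  mutate(id = consecutive_id(station)) %>%  # new run id each time station changes
  group_by(id) %>%
  slice(unique(c(1, n()))) %>%              # unique() handles one-row runs
  ungroup()
```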