unique.data.table select last row in place of the first
Create a data.table that contains the unique combinations of the key variables, then join using mult = 'last'. Using .SD is convenient but slow; you could use .I instead if you wished.
dtu <- unique(dt)[,key(dt), with = FALSE]
dt[dtu, mult = 'last']
Or
dt[ dt[, .I[.N], by = key(dt)]$V1]
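As a minimal sketch of the .I approach (the table dt and its key are hypothetical, for illustration only):

```r
library(data.table)

# hypothetical keyed table
dt <- data.table(id  = c(1, 1, 2, 2, 2),
                 val = c(10, 20, 30, 40, 50),
                 key = "id")

# .I[.N] returns the global row number of the last row in each key group;
# indexing dt by those row numbers selects the last row per group
last_rows <- dt[dt[, .I[.N], by = key(dt)]$V1]
last_rows
#    id val
# 1:  1  20
# 2:  2  50
```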
Select Last Rows with Distinct Field
If I understand correctly, you would like to count the number of distinct item ids up to each row (ordered by date) and return all rows where the count is at most three.
If Postgres supported COUNT(DISTINCT) as a window function, you could simply use:
select t.*
from (select t.*,
count(distinct itemid) over (order by date desc) as cnt_itemid
from t
) t
where cnt_itemid <= 3;
Alas, Postgres does not support COUNT(DISTINCT)
as a window function. But you can emulate it: use MIN(id)
to flag one canonical row per itemid, then count those flagged rows in the running window:
select t.*
from (select t.*,
count(*) filter (where id = min_id) over (order by date desc) as cnt_itemid
from (select t.*,
min(id) over (partition by itemid order by date desc) as min_id
from t
) t
) t
where cnt_itemid <= 3;
However, this returns all the most recent rows before the 4th distinct item appears -- so it can include extra rows.
To get the expected four rows, you want to stop at the first row where cnt_itemid reaches 3. One method is:
select t.*
from (select t.*, min(id) filter (where cnt_itemid = 3) over () as min_cnt_itemid_3
from (select t.*,
count(*) filter (where id = min_id) over (order by date desc) as cnt_itemid
from (select t.*,
min(id) over (partition by itemid order by date desc) as min_id
from t
) t
) t
) t
where id <= min_cnt_itemid_3;
You can also do this by identifying the first occurrence of the "third item" and then choosing all rows up to that row:
select t.*
from t join
(select itemid, min(max_date) over () as min_max_date
from (select t.itemid, max(date) as max_date
from t
group by t.itemid
order by max(t.date) desc
limit 3
) t
) tt
on t.itemid = tt.itemid and t.date >= tt.min_max_date;
This fiddle shows each of these.
Select the first and last row by group in a data frame
A plyr solution (tmp is your data frame):
library("plyr")
ddply(tmp, .(id), function(x) x[c(1, nrow(x)), ])
# id d gr mm area
# 1 15 1 2 3.4 1
# 2 15 1 1 5.5 2
# 3 21 1 1 4.0 2
# 4 21 1 2 3.8 2
# 5 22 1 1 4.0 2
# 6 22 1 2 4.6 2
# 7 23 1 1 2.7 2
# 8 23 1 2 3.0 2
# 9 24 1 1 3.0 2
# 10 24 1 2 2.0 3
Or with dplyr (see also here):
library("dplyr")
tmp %>%
group_by(id) %>%
slice(c(1, n())) %>%
ungroup()
# # A tibble: 10 × 5
# id d gr mm area
# <int> <int> <int> <dbl> <int>
# 1 15 1 2 3.4 1
# 2 15 1 1 5.5 2
# 3 21 1 1 4.0 2
# 4 21 1 2 3.8 2
# 5 22 1 1 4.0 2
# 6 22 1 2 4.6 2
# 7 23 1 1 2.7 2
# 8 23 1 2 3.0 2
# 9 24 1 1 3.0 2
# 10 24 1 2 2.0 3
Select first and last row from grouped data
There is probably a faster way:
df %>%
group_by(id) %>%
arrange(stopSequence) %>%
filter(row_number()==1 | row_number()==n())
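A hypothetical sketch, assuming an id/stopSequence data frame like the one in the question:

```r
library(dplyr)

# hypothetical data: stop sequences for two trip ids
df <- data.frame(id           = c(1, 1, 1, 2, 2, 2),
                 stopSequence = c(2, 1, 3, 3, 1, 2))

# sort by stopSequence, then keep the first and last row within each id group
res <- df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  filter(row_number() == 1 | row_number() == n()) %>%
  ungroup()

res  # the lowest and highest stopSequence per id
```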
How to select distinct rows in a datatable and store into an array
DataView view = new DataView(table);
DataTable distinctValues = view.ToTable(true, "Column1", "Column2" ...);
Get the last row of a previous group in data.table
You could do
dt[, newcol := shift(dt[, last(Product), by = Group]$V1)[.GRP], by = Group]
This results in the following updated dt, where newcol matches your desired column with the unnecessarily long name. ;)
Product Group LastProductOfPriorGroup newcol
1: A 1 NA NA
2: B 1 NA NA
3: C 2 B B
4: D 2 B B
5: E 2 B B
6: F 3 E E
7: G 3 E E
Let's break the code down from the inside out. I will use ...
to denote the accumulated code:
dt[, last(Product), by = Group]$V1 gets the last value of Product from each group as a character vector.
shift(...) shifts that character vector by one position.
dt[, newcol := ...[.GRP], by = Group] groups by Group and uses the internal .GRP values for indexing into the shifted vector.
Update: Frank brings up a good point about my code above calculating the shift for every group over and over again. To avoid that, we can use either
shifted <- shift(dt[, last(Product), Group]$V1)
dt[, newcol := shifted[.GRP], by = Group]
so that we don't calculate the shift for every group. Or, we can take Frank's nice suggestion in the comments and do the following.
dt[dt[, last(Product), by = Group][, v := shift(V1)], on="Group", newcol := i.v]
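A small reproducible sketch of the two-step version (the sample dt mirrors the table above):

```r
library(data.table)

dt <- data.table(Product = LETTERS[1:7],
                 Group   = c(1, 1, 2, 2, 2, 3, 3))

# last Product of each group, shifted so each group sees the prior group's value
shifted <- shift(dt[, last(Product), by = Group]$V1)  # NA, "B", "E"
dt[, newcol := shifted[.GRP], by = Group]
dt$newcol
# [1] NA  NA  "B" "B" "B" "E" "E"
```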
Extracting unique rows from a data table in R
Before data.table v1.9.8, the default behavior of unique.data.table
method was to use the keys in order to determine the columns by which the unique combinations should be returned. If the key
was NULL
(the default), one would get the original data set back (as in OPs situation).
As of data.table 1.9.8+, unique.data.table
method uses all columns by default which is consistent with the unique.data.frame
in base R. To have it use the key columns, explicitly pass by = key(DT)
into unique
(replacing DT
in the call to key with the name of the data.table).
Hence, old behavior would be something like
library(data.table) # v1.9.7-
set.seed(123)
a <- as.data.frame(matrix(sample(2, 120, replace = TRUE), ncol = 3))
b <- data.table(a, key = names(a))
## key(b)
## [1] "V1" "V2" "V3"
dim(unique(b))
## [1] 8 3
While for data.table v1.9.8+, just
b <- data.table(a)
dim(unique(b))
## [1] 8 3
## or dim(unique(b, by = key(b))) # in case you have keys and want to use them
Or without a copy
setDT(a)
dim(unique(a))
## [1] 8 3
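To see the difference between the two defaults side by side, a minimal sketch (the column names are illustrative):

```r
library(data.table)

DT <- data.table(a = c(1, 1, 2), b = c(1, 2, 2))

# v1.9.8+ default: uniqueness over all columns
nrow(unique(DT))            # 3
# restrict to chosen columns, like the old keyed behavior
nrow(unique(DT, by = "a"))  # 2
```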
Select last row by group for all columns data.table
Last row by group :
DT[, .SD[.N], by="TRADER_ID,EXEC_IDATE"] # (1)
or, faster (avoid use of .SD where possible, for speed):
w = DT[, .I[.N], by="TRADER_ID,EXEC_IDATE"][[3]] # (2)
DT[w]
Note that the following feature request will make approach (1) as fast as approach (2) :
FR#2330: Optimize .SD[i] query to keep the elegance but make it faster.
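The [[3]] in approach (2) selects the third column of the grouped result: the two by columns come first, then V1 holds the row numbers. A hypothetical sketch:

```r
library(data.table)

DT <- data.table(TRADER_ID  = c(1, 1, 2),
                 EXEC_IDATE = c("d1", "d1", "d2"),
                 qty        = c(5, 7, 9))

idx <- DT[, .I[.N], by = "TRADER_ID,EXEC_IDATE"]
# idx has columns TRADER_ID, EXEC_IDATE, V1 -- hence [[3]] to extract V1
w <- idx[[3]]
DT[w]  # last row of each (TRADER_ID, EXEC_IDATE) group
```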
Get last row that matches conditions in a data.table in R
You want to group by both variables, then use .N to select the last line within each subset of the data, i.e. within .SD:
R> d[, .SD[.N], by=.(SEQ,TYPE)]
SEQ TYPE CLASS
1: 1 A 2.1
2: 2 B 4.5
3: 1 C 1.3
4: 2 C 3.4
5: 3 C 4.6
6: 1 D 2.5
7: 2 D 3.6
8: 3 D 1.4
9: 4 D 2.7
10: 5 D 4.5
R>
using data.table to flag the first (or last) record in a group
Here are couple of solutions using data.table
:
## Option 1 (cleaner solution, added 2016-11-29)
uDT <- unique(DT)
DT[, c("first","last"):=0L]
DT[uDT, first:=1L, mult="first"]
DT[uDT, last:=1L, mult="last"]
## Option 2 (original answer, retained for posterity)
DT <- cbind(DT, first=0L, last=0L)
DT[DT[unique(DT),,mult="first", which=TRUE], first:=1L]
DT[DT[unique(DT),,mult="last", which=TRUE], last:=1L]
head(DT)
# x y first last
# [1,] a A 1 1
# [2,] a B 1 1
# [3,] a C 1 0
# [4,] a C 0 1
# [5,] b A 1 1
# [6,] b B 1 1
There's obviously a lot packed into each of those lines. The key construct, though, is the following, which returns the row index of the first record in each group:
DT[unique(DT),,mult="first", which=TRUE]
# [1] 1 2 3 5 6 7 11 13 15
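A minimal sketch of Option 1 on a small keyed table (the x/y data here is hypothetical):

```r
library(data.table)

DT <- data.table(x = c("a", "a", "a", "b"),
                 y = c("A", "C", "C", "A"),
                 key = c("x", "y"))

uDT <- unique(DT)                     # one row per distinct (x, y)
DT[, c("first", "last") := 0L]
DT[uDT, first := 1L, mult = "first"]  # flag first matching row of each group
DT[uDT, last  := 1L, mult = "last"]   # flag last matching row of each group
DT
```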