Index unique values in data.table
I have a few ideas. You can use a nested group counter:
in.data[, w := setDT(list(v = vendor))[, g := .GRP, by=v]$g, by=fruits]
Alternatively, make a run ID with rleid(). This only gives the intended index if the data are sorted by vendor within each fruit (thanks @eddi), and it seems wasteful:
in.data[, w := rleid(vendor), by=fruits]
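Using base R's match() inside data.table, the approach would probably be:
in.data[, w := match(vendor, unique(vendor)), by=fruits]
# or entirely in base R; ave() returns the type of its first argument,
# so pass row indices rather than vendor to get an integer result
in.data$w = with(in.data, ave(seq_along(vendor), fruits, FUN = function(i) match(vendor[i], unique(vendor[i]))))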
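As a quick check of how these options differ, here is a minimal sketch on made-up data (the contents of in.data below are an assumption, not the OP's data). With vendor left unsorted within a fruit, match() reuses the index of a vendor seen earlier, while rleid() starts a new run every time the vendor changes:

```r
library(data.table)

# Hypothetical data: "x" appears twice for apple, but not consecutively
in.data <- data.table(
  fruits = c("apple", "apple", "apple", "pear", "pear"),
  vendor = c("x", "y", "x", "y", "y")
)

# Index of each distinct vendor within its fruit, in order of first appearance
in.data[, w_match := match(vendor, unique(vendor)), by = fruits]

# Run ID within each fruit: increments whenever vendor changes
in.data[, w_rleid := rleid(vendor), by = fruits]

in.data$w_match  # 1 2 1 1 1
in.data$w_rleid  # 1 2 3 1 1
```

The two columns agree only when equal vendors sit next to each other within a fruit, which is why the run-ID variant depends on sorted data.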
How to keep unique list-column values using data.table in R?
We may use duplicated() with unnest():
library(data.table)
library(tidyr)
dt[, .(data = list(.SD)), by = id][!duplicated(data)] %>%
unnest(data)
Output:
# A tibble: 4 × 3
id value1 value2
<chr> <dbl> <dbl>
1 a 1 0
2 a 1 3
3 b 1 0
4 b 2 3
Finding the index or unique values from a dataframe column
Here is a dplyr solution where we create a variable with row_number() and use that as our index, i.e.
df %>%
mutate(new = row_number()) %>%
group_by(TableName) %>%
summarise(Index = toString(new))
which gives,
# A tibble: 3 x 2
TableName Index
<fct> <chr>
1 A 1, 3
2 B 2, 4
3 C 5
You can also save them as lists rather than strings, which will make future operations easier, i.e.
df %>%
mutate(new = row_number()) %>%
group_by(TableName) %>%
summarise(Index = list(new))
which gives,
# A tibble: 3 x 2
TableName Index
<fct> <list>
1 A <int [2]>
2 B <int [2]>
3 C <int [1]>
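For comparison, the same per-group row indices can be sketched in base R with split(). The df below is an assumed stand-in chosen to reproduce the output shown above; the original data frame is not part of this answer.

```r
# Assumed example data with the same grouping as the output above
df <- data.frame(TableName = factor(c("A", "B", "A", "B", "C")))

# Split the row numbers by group; the result is a named list of integer vectors
idx <- split(seq_len(nrow(df)), df$TableName)

idx$A  # 1 3
idx$B  # 2 4
idx$C  # 5
```

Like the list-column version, this keeps the indices as integers, so they can be used directly for subsetting, e.g. df[idx$A, , drop = FALSE].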
Matching unique and non-unique values between data.tables and update data.table
For each group of AREA_CD and TYPE, the OP wants to match the rows one to one in the order they appear in both data.tables. E.g., the first row with AREA_CD == "A1" & TYPE == 1 in dt1 shall match the first row with AREA_CD == "A1" & TYPE == 1 in dt2, then the second rows, and so forth.
This can be done in a join operation if a row index I (or running count) within each group is added:
# add row indices
dt1[, I := seq_len(.N), by = .(AREA_CD, TYPE)]
dt2[, I := seq_len(.N), by = .(AREA_CD, TYPE)]
# alternative code: same result as above but more concise
dt1[, I := rowid(AREA_CD, TYPE)]
dt2[, I := rowid(AREA_CD, TYPE)]
# right join (all rows of dt1 are used)
dt0 <- dt2[dt1, on = .(AREA_CD, TYPE, I)]
# show result for one group
dt0[AREA_CD == "A1" & TYPE == 1, ]
# U_ID AREA_CD TYPE ASSIGNED I ALLOCATED i.U_ID ID_CD
# 1: U1 A1 1 0 1 0 0 ID1
# 2: U11 A1 1 0 2 0 0 ID11
# 3: U21 A1 1 0 3 0 0 ID21
#...
#19: U181 A1 1 0 19 0 0 ID181
#20: U191 A1 1 0 20 0 0 ID191
#21: NA A1 1 NA 21 0 0 ID201
Note that the last row has NA in some columns. This is due to the different number of rows for this group in dt1 and dt2: dt1 has 21 rows while dt2 has only 20 rows. So, the last row of dt1 has no match in dt2.
Alternatively, an inner join will only return rows with a match in both dt1 and dt2:
# inner join
dt0 <- dt2[dt1, on = .(AREA_CD, TYPE, I), nomatch = 0]
# show result for one group
dt0[AREA_CD == "A1" & TYPE == 1, ]
# U_ID AREA_CD TYPE ASSIGNED I ALLOCATED i.U_ID ID_CD
# 1: U1 A1 1 0 1 0 0 ID1
# 2: U11 A1 1 0 2 0 0 ID11
# 3: U21 A1 1 0 3 0 0 ID21
#...
#18: U171 A1 1 0 18 0 0 ID171
#19: U181 A1 1 0 19 0 0 ID181
#20: U191 A1 1 0 20 0 0 ID191
Now, only 20 rows are returned for this group.
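The same idea on a much smaller scale may make the mechanics easier to follow. This is only a sketch: the tables a and b and their columns k, va and vb are hypothetical and not part of the OP's data.

```r
library(data.table)

# Hypothetical tables: group "x" has 2 rows in a but 1 in b;
# group "y" has 1 row in a but 2 in b
a <- data.table(k = c("x", "x", "y"), va = 1:3)
b <- data.table(k = c("x", "y", "y"), vb = c(10, 20, 30))

# Running count within each group, used as an extra join key
a[, I := rowid(k)]
b[, I := rowid(k)]

# Right join: every row of a is kept; the second "x" row has no partner in b
right <- b[a, on = .(k, I)]

# Inner join: only pairs present in both tables
# (nomatch = NULL is the documented equivalent of nomatch = 0)
inner <- b[a, on = .(k, I), nomatch = NULL]

right$vb     # 10 NA 20
nrow(inner)  # 2
```

The unmatched second "y" row of b is dropped in both cases, because b[a, ...] is driven by the rows of a.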
Data
dt1 <- data.table(
AREA_CD = factor(c(rep("A1", 205), rep("A2", 145), rep("A3", 250), rep("A4", 100), rep("A5", 300))),
TYPE = rep(1:10),
ALLOCATED = 0,
U_ID = 0,
ID_CD = factor(paste0("ID", 1:1000)))
dt2 <- data.table(
U_ID = factor(paste0("U", 1:1000)),
AREA_CD = factor(c(rep("A1", 200), rep("A2", 155), rep("A3", 245), rep("A4", 90), rep("A5", 310))),
TYPE = rep(1:10),
ASSIGNED = 0)
Note that ID_CD and U_ID are created using paste0() instead of interaction(), which turned out to be rather slow.
table index for DISTINCT values
An index on these specific columns could improve performance a bit, but only because SQL Server would have to scan less data (just these columns, nothing else). Other than that, a scan will always be done. One option is to create an indexed view if you need distinct values from that table.
CREATE VIEW Test
WITH SCHEMABINDING
AS
SELECT Column1, COUNT_BIG(*) AS UselessColumn
FROM Table1
GROUP BY Column1;
GO
CREATE UNIQUE CLUSTERED INDEX PK_Test ON Test (Column1);
GO
You can then query it like this:
SELECT *
FROM Test WITH (NOEXPAND);
NOEXPAND is a hint that tells SQL Server not to expand the view's underlying query and to treat the view as a table. Note: this is needed for non-Enterprise editions of SQL Server only.
Rolling index for semi-unique values in two columns R
Method-1 baseR way
df <- data.frame("participant" = c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c"),
"item" = c("X", "X", "Y", "X", "X", "X", "Y", "Z", "Z", "Z"))
transform(df, item_id = with(rle(paste(participant, item)), rep(seq_len(length(lengths)), lengths)))
#> participant item item_id
#> 1 a X 1
#> 2 a X 1
#> 3 a Y 2
#> 4 a X 3
#> 5 b X 4
#> 6 b X 4
#> 7 b Y 5
#> 8 b Z 6
#> 9 c Z 7
#> 10 c Z 7
Created on 2021-05-21 by the reprex package (v2.0.0)
Method-2 data.table::rleid()
df <- data.frame("participant" = c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c"),
"item" = c("X", "X", "Y", "X", "X", "X", "Y", "Z", "Z", "Z"))
library(data.table)
library(tidyverse)
df %>% mutate(item_id = rleid(participant, item))
#> participant item item_id
#> 1 a X 1
#> 2 a X 1
#> 3 a Y 2
#> 4 a X 3
#> 5 b X 4
#> 6 b X 4
#> 7 b Y 5
#> 8 b Z 6
#> 9 c Z 7
#> 10 c Z 7
How to select distinct rows in a datatable and store into an array
DataView view = new DataView(table);
DataTable distinctValues = view.ToTable(true, "Column1", "Column2" ...);
R data.table column that counts along unique values in other column
Another way is using data.table::rowid.
df[,fid := rowid(class)]
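A minimal sketch of what this produces, assuming df is a data.table with a class column (the values below are made up for illustration):

```r
library(data.table)

# Hypothetical data: the class values are an assumption for illustration
df <- data.table(class = c("a", "b", "a", "a", "b"))

# rowid() numbers the rows 1, 2, 3, ... within each distinct value of class
df[, fid := rowid(class)]

df$fid  # 1 1 2 3 2
```

The rows do not need to be sorted: the count for each class simply continues wherever that class shows up next.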
How can I extract all Unique / Distinct Rows from a Datatable and save these rows in a new Datatable with same Columns?
Here are a couple of options for filtering the DataRows of a DataTable, based on the value of a specific Column, to generate a new DataTable from the resulting DataRows.
Since you mentioned that it is not important which DataRow is selected, any one of the duplicate DataRows will do.
(If the DataRow to select becomes important at some point, you could also OrderBy() the grouping using the value of another Column, then pick the first, or the last, DataRow from the ordered collection.)
Group DataRows by the value of a Column:
- Group the DataRows of the source DataTable using the value of a Column
- Select the first DataRow of each grouping
- Call the CopyToDataTable() method to generate a new DataTable
Resulting in:
Dim newDt = [DataTable].AsEnumerable().
GroupBy(Function(r) r("[Column Name]")).
Select(Function(g) g.First()).
CopyToDataTable()
Use a custom EqualityComparer:
- Build a simple EqualityComparer class that compares the values of the same Column of two DataRows objects
- Use the Distinct() method and pass the custom EqualityComparer, initialized with the name of the Column used as comparer
- Call the CopyToDataTable() method
This method has the advantage that it is reusable (i.e., you don't need to rebuild a query, just initialize the comparer with the name of the Column to compare).
Resulting in:
Dim newDt = [DataTable].AsEnumerable().
Distinct(New DataRowColumnComparer("[Column Name]")).
CopyToDataTable()
Custom EqualityComparer:
It's kind of a basic comparer. You can of course extend it to use different indexers (an integer representing the index of a Column, or a DataColumn reference).
Public Class DataRowColumnComparer
    Implements IEqualityComparer(Of DataRow)
    ' Name of the Column whose values are compared
    Private ReadOnly t As String = String.Empty
    Public Sub New(key As String)
        If String.IsNullOrEmpty(key) Then Throw New ArgumentException("Empty key")
        t = key
    End Sub
    ' Two DataRows are equal when their values in Column t are equal
    Public Overloads Function Equals(dr1 As DataRow, dr2 As DataRow) As Boolean Implements IEqualityComparer(Of DataRow).Equals
        If dr1 Is Nothing AndAlso dr2 Is Nothing Then Return True
        If dr1 Is Nothing OrElse dr2 Is Nothing Then Return False
        Return dr1(t).Equals(dr2(t))
    End Function
    ' Hash on the same Column value, so that equal rows get equal hash codes
    Public Overloads Function GetHashCode(dr As DataRow) As Integer Implements IEqualityComparer(Of DataRow).GetHashCode
        If dr(t) Is Nothing OrElse dr(t) Is DBNull.Value Then Return 0
        Return dr(t).GetHashCode()
    End Function
End Class