Index Unique Values in Data.Table

I have a few ideas. You can use a nested group counter:

in.data[, w := setDT(list(v = vendor))[, g := .GRP, by=v]$g, by=fruits]

Alternatively, make a run ID. This only gives the same result if the data are sorted so that equal vendors are adjacent within each fruit (thanks @eddi), and it seems wasteful:

in.data[, w := rleid(vendor), by=fruits]

The base-R idiom would probably be match() against the unique values, here still used inside data.table:

in.data[, w := match(vendor, unique(vendor)), by=fruits]

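# or entirely in base R; ave() runs over row positions so that w is an
# integer whatever the type of vendor (ave() on a character or factor
# vendor would coerce the result back to that type)

in.data$w = with(in.data, ave(seq_along(vendor), fruits, FUN = function(i) match(vendor[i], unique(vendor[i]))))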
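To make the difference between the match() and rleid() approaches concrete, here is a language-neutral sketch in plain Python. It is not data.table code; the function names and the sample vendors are made up for illustration. `first_seen_index` is meant to mirror `match(x, unique(x))` and `run_id` to mirror `rleid(x)` for a single group.

```python
def first_seen_index(values):
    """1-based index of each value's first appearance, like match(x, unique(x))."""
    seen = {}
    out = []
    for v in values:
        if v not in seen:
            seen[v] = len(seen) + 1
        out.append(seen[v])
    return out


def run_id(values):
    """Counter that increments whenever the value changes, like rleid(x)."""
    out = []
    previous = None
    for v in values:
        if not out:
            out.append(1)
        elif v != previous:
            out.append(out[-1] + 1)
        else:
            out.append(out[-1])
        previous = v
    return out


# Hypothetical vendors within one fruit group, not sorted by vendor.
vendors = ["x", "y", "x", "z"]
print(first_seen_index(vendors))  # [1, 2, 1, 3]
print(run_id(vendors))            # [1, 2, 3, 4]

# Once equal vendors are adjacent, the two results agree.
print(run_id(sorted(vendors)) == first_seen_index(sorted(vendors)))  # True
```

The unsorted case shows why the run-ID variant needs sorted data: the second "x" starts a new run instead of reusing index 1.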

How to keep unique list-column values using data.table in R?

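We can nest each group's rows into a list column, drop the duplicated list elements with duplicated(), and then unnest():

library(data.table)
library(tidyr)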
dt[, .(data = list(.SD)), by = id][!duplicated(data)] %>%
unnest(data)

Output:

# A tibble: 4 × 3
id value1 value2
<chr> <dbl> <dbl>
1 a 1 0
2 a 1 3
3 b 1 0
4 b 2 3
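The idea behind nesting and then calling duplicated() can be sketched in plain Python. This is an illustration only: the rows below are hypothetical (the original input is not shown here), and `unique_groups` is a made-up helper, not part of data.table or tidyr. Each id's rows are collected into one block, and a whole group is dropped if an earlier id had exactly the same block.

```python
def unique_groups(rows):
    """Keep only groups whose nested rows differ from every earlier group."""
    # Nest: id -> list of value tuples, in order of appearance.
    nested = {}
    for id_, *values in rows:
        nested.setdefault(id_, []).append(tuple(values))

    seen = set()
    kept = []
    for id_, block in nested.items():
        key = tuple(block)
        if key in seen:
            continue  # same content as an earlier group: drop the whole group
        seen.add(key)
        # Unnest: expand the kept block back into flat rows.
        kept.extend((id_, *values) for values in block)
    return kept


# Hypothetical (id, value1, value2) rows; group "c" repeats group "a".
rows = [
    ("a", 1, 0), ("a", 1, 3),
    ("b", 1, 0), ("b", 2, 3),
    ("c", 1, 0), ("c", 1, 3),
]
for row in unique_groups(rows):
    print(row)
```

With this made-up input, only the rows of groups "a" and "b" survive, because the block for "c" is identical to the block for "a".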

Finding the index or unique values from a dataframe column

Here is a dplyr solution: create a variable with row_number() and use it as the index, i.e.

df %>% 
mutate(new = row_number()) %>%
group_by(TableName) %>%
summarise(Index = toString(new))

which gives,

# A tibble: 3 x 2
TableName Index
<fct> <chr>
1 A 1, 3
2 B 2, 4
3 C 5

You can also save them as lists rather than strings, which will make future operations easier, i.e.

df %>% 
mutate(new = row_number()) %>%
group_by(TableName) %>%
summarise(Index = list(new))

which gives,

# A tibble: 3 x 2
TableName Index
<fct> <list>
1 A <int [2]>
2 B <int [2]>
3 C <int [1]>
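For readers who want to see the logic without dplyr, the same grouping can be sketched in plain Python. The column content below is an assumption chosen to be consistent with the output shown above, and `indices_by_value` is a hypothetical helper name.

```python
from collections import defaultdict


def indices_by_value(column):
    """Map each distinct value to the 1-based row numbers where it occurs."""
    index = defaultdict(list)
    for row_number, value in enumerate(column, start=1):
        index[value].append(row_number)
    return dict(index)


# Assumed TableName column (consistent with the output shown above).
table_name = ["A", "B", "A", "B", "C"]

as_lists = indices_by_value(table_name)
print(as_lists)  # {'A': [1, 3], 'B': [2, 4], 'C': [5]}

# String form, comparable to toString(new).
as_strings = {k: ", ".join(map(str, v)) for k, v in as_lists.items()}
print(as_strings)  # {'A': '1, 3', 'B': '2, 4', 'C': '5'}
```

Keeping the indices as lists, as in the second dplyr variant, avoids having to parse the strings again later.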

Matching unique and non-unique values between data.tables and update data.table

For each group of AREA_CD and TYPE, the OP wants to match the rows one to one in the order they appear in both data.tables. E.g., the first row with AREA_CD == "A1" & TYPE == 1 in dt1 shall match the first row with AREA_CD == "A1" & TYPE == 1 in dt2, then the second rows and so forth.

This can be done in a join operation if a row index I (or running count) within each group is added:

# add row indices
dt1[, I := seq_len(.N), by = .(AREA_CD, TYPE)]
dt2[, I := seq_len(.N), by = .(AREA_CD, TYPE)]

# alternative code: same result as above but more concise
dt1[, I := rowid(AREA_CD, TYPE)]
dt2[, I := rowid(AREA_CD, TYPE)]

# right join (all rows of dt1 are used)
dt0 <- dt2[dt1, on = .(AREA_CD, TYPE, I)]

# show result for one group
dt0[AREA_CD == "A1" & TYPE == 1, ]
# U_ID AREA_CD TYPE ASSIGNED I ALLOCATED i.U_ID ID_CD
# 1: U1 A1 1 0 1 0 0 ID1
# 2: U11 A1 1 0 2 0 0 ID11
# 3: U21 A1 1 0 3 0 0 ID21
#...
#19: U181 A1 1 0 19 0 0 ID181
#20: U191 A1 1 0 20 0 0 ID191
#21: NA A1 1 NA 21 0 0 ID201

Note that the last row has NA in the columns that come from dt2. This is due to the different number of rows for this group in dt1 and dt2: dt1 has 21 rows while dt2 has only 20, so the last row of dt1 has no match in dt2.

Alternatively, an inner join will only return rows with a match in both dt1 and dt2:

# inner join
dt0 <- dt2[dt1, on = .(AREA_CD, TYPE, I), nomatch = 0]

# show result for one group
dt0[AREA_CD == "A1" & TYPE == 1, ]
# U_ID AREA_CD TYPE ASSIGNED I ALLOCATED i.U_ID ID_CD
# 1: U1 A1 1 0 1 0 0 ID1
# 2: U11 A1 1 0 2 0 0 ID11
# 3: U21 A1 1 0 3 0 0 ID21
#...
#18: U171 A1 1 0 18 0 0 ID171
#19: U181 A1 1 0 19 0 0 ID181
#20: U191 A1 1 0 20 0 0 ID191

Now, only 20 rows are returned for this group.
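The mechanics of joining on a within-group row index can be sketched in plain Python. This is only an illustration under stated assumptions: the two miniature tables are hypothetical stand-ins for dt1 and dt2, and `rowid`, `keyed` and `join_on_rowid` are made-up helper names (only `rowid` is modelled on the data.table function of the same name).

```python
from collections import defaultdict


def rowid(keys):
    """Running count within each distinct key, starting at 1."""
    counts = defaultdict(int)
    out = []
    for k in keys:
        counts[k] += 1
        out.append(counts[k])
    return out


def keyed(table):
    """Index rows by (AREA_CD, TYPE, I), where I is the within-group row index."""
    ids = rowid([(area, type_) for area, type_, _ in table])
    return {(area, type_, i): payload
            for (area, type_, payload), i in zip(table, ids)}


def join_on_rowid(left, right, inner=False):
    """For every left row, look up the right row with the same key and index."""
    right_lookup = keyed(right)
    result = []
    for key, payload in keyed(left).items():
        match = right_lookup.get(key)  # None plays the role of NA
        if inner and match is None:
            continue
        result.append((*key, payload, match))
    return result


# Hypothetical miniature tables: (AREA_CD, TYPE, ID_CD) and (AREA_CD, TYPE, U_ID).
dt1 = [("A1", 1, "ID1"), ("A1", 1, "ID11"), ("A1", 1, "ID21")]
dt2 = [("A1", 1, "U1"), ("A1", 1, "U11")]

print(join_on_rowid(dt1, dt2))
# [('A1', 1, 1, 'ID1', 'U1'), ('A1', 1, 2, 'ID11', 'U11'), ('A1', 1, 3, 'ID21', None)]
print(join_on_rowid(dt1, dt2, inner=True))
# [('A1', 1, 1, 'ID1', 'U1'), ('A1', 1, 2, 'ID11', 'U11')]
```

As in the data.table output, the surplus row of the larger group is kept with a missing value in the first variant and dropped in the inner-join variant.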


Data

dt1 <- data.table(
AREA_CD = factor(c(rep("A1", 205), rep("A2", 145), rep("A3", 250), rep("A4", 100), rep("A5", 300))),
TYPE = rep(1:10),
ALLOCATED = 0,
U_ID = 0,
ID_CD = factor(paste0("ID", 1:1000)))
dt2 <- data.table(
U_ID = factor(paste0("U", 1:1000)),
AREA_CD = factor(c(rep("A1", 200), rep("A2", 155), rep("A3", 245), rep("A4", 90), rep("A5", 310))),
TYPE = rep(1:10),
ASSIGNED = 0)

Note that ID_CD and U_ID are created using paste0() instead of interaction(), which turned out to be rather slow.

table index for DISTINCT values

An index on these specific columns could improve performance a bit, but only because SQL Server would have to scan less data (just these columns and nothing else). Apart from that, a scan will always be done. One option is to create an indexed view if you need distinct values from that table.

CREATE VIEW Test
WITH SCHEMABINDING
AS
SELECT Column1, COUNT_BIG(*) AS UselessColumn
FROM Table1
GROUP BY Column1;
GO
CREATE UNIQUE CLUSTERED INDEX PK_Test ON Test (Column1);
GO

You can then query it like this:

SELECT *
FROM Test WITH (NOEXPAND);

NOEXPAND is a hint that tells SQL Server not to expand the view into its underlying query but to treat it like a table. Note: the hint is needed only on non-Enterprise editions of SQL Server.

Rolling index for semi-unique values in two columns R

Method 1: base R

df <- data.frame("participant" = c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c"), 
"item" = c("X", "X", "Y", "X", "X", "X", "Y", "Z", "Z", "Z"))

transform(df, item_id = with(rle(paste(participant, item)), rep(seq_len(length(lengths)), lengths)))
#> participant item item_id
#> 1 a X 1
#> 2 a X 1
#> 3 a Y 2
#> 4 a X 3
#> 5 b X 4
#> 6 b X 4
#> 7 b Y 5
#> 8 b Z 6
#> 9 c Z 7
#> 10 c Z 7

Created on 2021-05-21 by the reprex package (v2.0.0)


Method 2: data.table::rleid()

df <- data.frame("participant" = c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c"), 
"item" = c("X", "X", "Y", "X", "X", "X", "Y", "Z", "Z", "Z"))
library(data.table)
library(tidyverse)
df %>% mutate(item_id = rleid(participant, item))
#> participant item item_id
#> 1 a X 1
#> 2 a X 1
#> 3 a Y 2
#> 4 a X 3
#> 5 b X 4
#> 6 b X 4
#> 7 b Y 5
#> 8 b Z 6
#> 9 c Z 7
#> 10 c Z 7
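What both methods compute is a run counter over the (participant, item) pairs. A minimal plain-Python sketch using itertools.groupby reproduces the item_id column shown above for the same data; the function here is a stand-in written for illustration and only borrows its name from data.table::rleid().

```python
from itertools import groupby


def rleid(*columns):
    """Run-length ID over one or more parallel columns.

    A new ID starts whenever the combination of values differs
    from the combination in the previous row.
    """
    rows = zip(*columns)
    ids = []
    # groupby() groups consecutive equal rows, which is exactly one "run".
    for run_number, (_, run) in enumerate(groupby(rows), start=1):
        ids.extend(run_number for _ in run)
    return ids


participant = ["a", "a", "a", "a", "b", "b", "b", "b", "c", "c"]
item = ["X", "X", "Y", "X", "X", "X", "Y", "Z", "Z", "Z"]

print(rleid(participant, item))  # [1, 1, 2, 3, 4, 4, 5, 6, 7, 7]
```

Because only adjacent rows are compared, participant a's second run of X in row 4 gets a new ID (3) rather than reusing 1, which is the "rolling" behaviour asked for.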


How to select distinct rows in a datatable and store into an array

DataView view = new DataView(table);
DataTable distinctValues = view.ToTable(true, "Column1", "Column2" ...);
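As a rough illustration of what a distinct projection onto selected columns does, here is a plain-Python sketch that collects the result into a list. It is not ADO.NET code: the table contents, the third column name and the `distinct_rows` helper are assumptions made up for the example.

```python
def distinct_rows(rows, *columns):
    """Project each row onto the given columns and keep the first of each
    distinct combination, preserving the original order."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Hypothetical table; Column3 exists only to show that it is projected away.
table = [
    {"Column1": 1, "Column2": "a", "Column3": 10},
    {"Column1": 1, "Column2": "a", "Column3": 20},
    {"Column1": 2, "Column2": "b", "Column3": 30},
]

distinct_values = distinct_rows(table, "Column1", "Column2")
print(distinct_values)  # [(1, 'a'), (2, 'b')]
```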

R data.table column that counts along unique values in other column

Another way is using data.table::rowid.

df[,fid := rowid(class)]

How can I extract all Unique / Distinct Rows from a Datatable and save these rows in a new Datatable with same Columns?

Here are a couple of options for filtering the DataRows of a DataTable by the value of a specific Column and generating a new DataTable from the resulting DataRows.

Since you mentioned that it is not important which DataRow is selected, any of the duplicate DataRows will do.

(If the DataRow to select becomes important at some point, you could also OrderBy() the grouping using the value of another Column, then pick the first, or the last, DataRow from the ordered collection.)

Group DataRows by the value of a Column:

  • Group the DataRows of the source DataTable using the value of a Column
  • Select the first DataRow of each grouping
  • Call the CopyToDataTable() method to generate a new DataTable

Resulting in:

Dim newDt = [DataTable].AsEnumerable().
GroupBy(Function(r) r("[Column Name]")).
Select(Function(g) g.First()).
CopyToDataTable()

Use a custom EqualityComparer:

  • Build a simple EqualityComparer class that compares the values of the same Column of two DataRows objects
  • Use the Distinct() method and pass the custom EqualityComparer, initialized with the name of the Column used as comparer
  • Call the CopyToDataTable() method

This method has the advantage that it is reusable (i.e., you don't need to rebuild a query, just initialize the comparer with the name of the Column to compare).

Resulting in:

Dim newDt = [DataTable].AsEnumerable().
Distinct(New DataRowColumnComparer("[Column Name]")).
CopyToDataTable()

Custom EqualityComparer:

It's kind of a basic comparer. You can of course extend it to use different indexers (an integer representing the index of a Column, or a DataColumn reference).

Public Class DataRowColumnComparer
Implements IEqualityComparer(Of DataRow)

Private ReadOnly t As String = String.Empty

Public Sub New(key As String)
If String.IsNullOrEmpty(key) Then Throw New ArgumentException("Empty key")
t = key
End Sub

Public Overloads Function Equals(dr1 As DataRow, dr2 As DataRow) As Boolean Implements IEqualityComparer(Of DataRow).Equals
If dr1 Is Nothing AndAlso dr2 Is Nothing Then Return True
If dr1 Is Nothing OrElse dr2 Is Nothing Then Return False
Return dr1(t).Equals(dr2(t))
End Function

Public Overloads Function GetHashCode(dr As DataRow) As Integer Implements IEqualityComparer(Of DataRow).GetHashCode
If dr(t) Is Nothing OrElse dr(t) Is DBNull.Value Then Return 0
Return dr(t).GetHashCode()
End Function
End Class
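The comparer-based approach boils down to "keep the first row for each distinct value of one column". A plain-Python sketch of that idea follows; it is not a translation of the .NET API, and `ColumnKey`, `distinct_by` and the sample rows are hypothetical names and data chosen for the example.

```python
class ColumnKey:
    """Reusable key extractor, initialized with the name of the column to compare."""

    def __init__(self, column):
        if not column:
            raise ValueError("Empty key")
        self.column = column

    def __call__(self, row):
        return row[self.column]


def distinct_by(rows, key):
    """Keep the first full row for each distinct key value, in original order."""
    seen = set()
    kept = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept


# Hypothetical rows; the column names are placeholders.
rows = [
    {"Name": "apple", "Qty": 1},
    {"Name": "pear", "Qty": 2},
    {"Name": "apple", "Qty": 3},
]

print(distinct_by(rows, ColumnKey("Name")))
# [{'Name': 'apple', 'Qty': 1}, {'Name': 'pear', 'Qty': 2}]
```

As with the comparer class, the extractor is created once per column name and can be reused for any table that has that column.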

