Add Missing Rows to a Data Table

Add missing rows within a table

The short answer: use a join.

One approach is to select the key pairs you expect and then left join the original table onto them. Pay attention to how missing values are handled, since your question does not say what should happen to the newly created entries.

Test Data

CREATE TABLE test (id INTEGER, doc INTEGER, posi INTEGER, total INTEGER);
INSERT INTO test VALUES (1, 123, 1, 100);
INSERT INTO test VALUES (1, 123, 2, 600);
INSERT INTO test VALUES (1, 123, 3, 200);
INSERT INTO test VALUES (2, 123, 1, 100);
INSERT INTO test VALUES (2, 123, 2, 600);
INSERT INTO test VALUES (2, 123, 3, 200);
INSERT INTO test VALUES (3, 123, 1, 100);
INSERT INTO test VALUES (3, 123, 3, 200);

The possible key combinations can be generated with a cross join:

SELECT DISTINCT a.id, b.posi 
FROM test a, test b

And now join the original table:

WITH expected_lines AS (
SELECT DISTINCT a.id, b.posi
FROM test a, test b
)
SELECT el.id, el.posi, t.doc, t.total
FROM expected_lines el
LEFT JOIN test t ON el.id = t.id AND el.posi = t.posi

You did not describe what should happen with the now-empty columns. Note that DOC and TOTAL are NULL for the added rows.

My educated guess is that you want to make DOC part of the key and assume a TOTAL of 0. If that's the case, you can go with the following:

WITH expected_lines AS (
SELECT DISTINCT a.id, b.posi, c.doc
FROM test a, test b, test c
)
SELECT el.id, el.posi, el.doc, ifnull(t.total, 0) total
FROM expected_lines el
LEFT JOIN test t ON el.id = t.id AND el.posi = t.posi AND el.doc = t.doc

Add missing rows to data.table

As indicated in @Roland's comment, instead of value = value in CJ(), use:

value = seq_len(max(value))

Or specify the range you would like in your value column.

So you only need to change your attempt from:

b = a[CJ(group = group, value = value, unique = TRUE), on = .(group,value)]

to:

b = a[CJ(group = group, value = seq_len(max(value)), unique = TRUE),
on = .(group,value)]
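
To see the effect, here is a minimal sketch on made-up data. The table `a` and its column `x` below are assumptions for illustration, not the asker's data. Group "a" lacks value 2 and group "b" lacks values 1 and 2; the join against the full range adds those rows with NA in `x`.

```r
library(data.table)

# Hypothetical input: group "a" has values 1 and 3, group "b" only has value 3
a <- data.table(group = c("a", "a", "b"),
                value = c(1L, 3L, 3L),
                x     = c(10, 30, 300))

# Cross join every group with the full range 1..max(value), then join a onto it
b <- a[CJ(group = group, value = seq_len(max(value)), unique = TRUE),
       on = .(group, value)]

print(b)
```

Rows that did not exist in `a` keep their key values and get NA in every other column.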

Add missing rows to data.table according to multiple keyed columns

A couple of possibilities are discussed here: https://github.com/Rdatatable/data.table/pull/814

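CJ.dt = function(...) {
  # cross join the row indices of each argument (rows for data frames, elements for vectors)
  rows = do.call(CJ, lapply(list(...), function(x) if (is.data.frame(x)) seq_len(nrow(x)) else seq_along(x)))
  # subset each argument by its index column and bind the pieces into one data.table
  do.call(data.table, Map(function(x, y) x[y], list(...), rows))
}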

setkey(mydata, name, job, sex, from)

mydata[CJ.dt(unique(data.table(name, job, sex)), unique(from))]
# name job sex from score
# 1: chris doctor male NYT 0.7383247
# 2: chris doctor male BG NA
# 3: chris doctor male TIME NA
# 4: chris doctor male USAT NA
# 5: chris lawyer female NYT NA
# 6: chris lawyer female BG -0.8204684
# 7: chris lawyer female TIME NA
# 8: chris lawyer female USAT NA
# 9: chris lawyer male NYT 0.4874291
#10: chris lawyer male BG NA
#11: chris lawyer male TIME NA
#12: chris lawyer male USAT NA
#13: john teacher male NYT -0.6264538
#14: john teacher male BG -0.8356286
#15: john teacher male TIME 1.5952808
#16: john teacher male USAT 0.1836433
#17: mary police female NYT NA
#18: mary police female BG NA
#19: mary police female TIME NA
#20: mary police female USAT 0.3295078

Add missing rows to a data table

I'd get the unique values in id1 and id2 and do a join using data.table's cross join function CJ as follows:

# if you've already set the key:
ans <- f[CJ(id1, id2, unique=TRUE)][is.na(v), v := 0L][]

# or, if f is not keyed:
ans <- f[CJ(id1 = id1, id2 = id2, unique=TRUE), on=.(id1, id2)][is.na(v), v := 0L][]

ans
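
For a quick check, the unkeyed variant can be run on a small made-up table. The contents of `f` below are an assumption for illustration; only the column names `id1`, `id2` and `v` come from the answer above. The pair (id1 = 2, id2 = "b") is missing and ends up with `v` set to 0.

```r
library(data.table)

# Hypothetical data: the combination (2, "b") is absent
f <- data.table(id1 = c(1L, 1L, 2L),
                id2 = c("a", "b", "a"),
                v   = c(5L, 7L, 9L))

# Join f onto all id1 x id2 combinations, then replace the NA produced by the join
ans <- f[CJ(id1 = id1, id2 = id2, unique = TRUE), on = .(id1, id2)][is.na(v), v := 0L][]

print(ans)
```

The replacement value is `0L` because `v` is an integer column here; for a numeric column, `0` would be the matching type.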

Adding missing observations in data.table

I believe the issue is that CJ(l, l, 1994:1995) has duplicate names. This is hinted at by verbose=TRUE:

DT[CJ(l,l,1994:1995), verbose=TRUE]
# forder.c received a vector type 'character' length 3
# forder.c received a vector type 'character' length 3
# forder.c received a vector type 'integer' length 2
# i.l has same type (character) as x.from. No coercion needed.
# i.l has same type (character) as x.to. No coercion needed.
# i.V3 has same type (integer) as x.year. No coercion needed.
# on= matches existing key, using key
# Starting bmerge ...
# bmerge done in 0.000s elapsed (0.000s cpu)
# Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)

This is in a gray area between a bug and intended behavior. Arguably it would be better to raise an error than to proceed with potentially wrong results.

Anyway, you can get around this by naming the CJ arguments:

DT[CJ(from = l, to = l, year = 1994:1995)]
# from to year g
# 1: a a 1994 0.64364200
# 2: a a 1995 NA
# 3: a b 1994 0.69746294
# 4: a b 1995 0.56863539
# 5: a c 1994 0.64369566
# 6: a c 1995 NA
# 7: b a 1994 0.62198311
# 8: b a 1995 0.71919139
# 9: b b 1994 0.76170866
# 10: b b 1995 0.84792449
# 11: b c 1994 0.15793127
# 12: b c 1995 0.26623733
# 13: c a 1994 0.89921463
# 14: c a 1995 0.55417635
# 15: c b 1994 0.38938166
# 16: c b 1995 0.03778206
# 17: c c 1994 0.48918988
# 18: c c 1995 0.75206221

Note that we could also accomplish this without keys:

setkey(DT, NULL)
# for those more familiar with SQL syntax, this is a NATURAL JOIN;
# it's equivalent to `on = c("from", "to", "year")`
DT[CJ(from = l, to = l, year = 1994:1995), on = .NATURAL]
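
The same idea with the join columns spelled out explicitly can be sketched as follows. The contents of `DT` and `l` here are made up for illustration; only the column names follow the example above. Naming the CJ() arguments after the columns of `DT` avoids the duplicate-name problem, and an explicit `on=` makes the match independent of any key.

```r
library(data.table)

# Hypothetical data: only 3 of the 18 possible (from, to, year) combinations exist
l  <- c("a", "b", "c")
DT <- data.table(from = c("a", "a", "b"),
                 to   = c("b", "c", "a"),
                 year = c(1994L, 1995L, 1994L),
                 g    = c(0.1, 0.2, 0.3))

# Named CJ() arguments match the join columns one to one
full <- DT[CJ(from = l, to = l, year = 1994:1995), on = .(from, to, year)]

print(full)
```

Combinations that were not in `DT` appear with NA in `g`.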

How to add missing rows to a data frame

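You can get the range of cohort years and use summarize() to expand the dataset, then left join back onto the original. (In dplyr 1.1.0 and later, reframe() is the recommended replacement for summarize() calls that return more than one row per group.)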

df<-ungroup(df)

yrs = range(as.numeric(levels(df$cohort)))
unique(df[,c(1,3)]) %>%
group_by(var2kreuz,var2use) %>%
summarize(cohort = factor(yrs[1]:yrs[2])) %>%
left_join(df)

Alternatively, you can use complete() like this:

df %>% mutate(across(c(var2kreuz, var2use),as.character)) %>% 
complete(var2kreuz, var2use,cohort)

Output:

   var2kreuz var2use cohort  n proportion
1        KKK     yes   2010 10  0.5555556
2        KKK     yes   2011 19  0.5937500
3        KKK     yes   2012 24  0.4615385
4        KKK     yes   2013 19  0.4750000
5        KKK     yes   2014 21  0.5675676
6        KKK     yes   2015 NA         NA
7        KKK     yes   2016 NA         NA
8        KKK     yes   2017 NA         NA
9        KKK     yes   2018 23  0.6388889
10       KKK     yes   2019 38  0.6031746
11       KKK     yes   2020 24  0.4615385
12       KKK      no   2010  8  0.4444444
13       KKK      no   2011 13  0.4062500
14       KKK      no   2012 28  0.5384615
15       KKK      no   2013 21  0.5250000
16       KKK      no   2014 16  0.4324324
17       KKK      no   2015 NA         NA
18       KKK      no   2016 NA         NA
19       KKK      no   2017 NA         NA
20       KKK      no   2018 13  0.3611111
21       KKK      no   2019 25  0.3968254
22       KKK      no   2020 28  0.5384615
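
As a smaller illustration of the complete() approach, here is a sketch on made-up data (the data frame `df` and its columns `grp`, `year` and `n` are assumptions, not the asker's data). It passes the full year range explicitly and uses the `fill` argument of complete(), which the answer above does not use, to put 0 instead of NA into the added rows.

```r
library(tidyr)

# Hypothetical data: group "x" lacks 2011, group "y" lacks 2010 and 2012
df <- data.frame(grp  = c("x", "x", "y"),
                 year = c(2010L, 2012L, 2011L),
                 n    = c(1, 2, 3))

# Expand to every grp x year combination; missing n values become 0
out <- complete(df, grp, year = 2010:2012, fill = list(n = 0))

print(out)
```

Dropping the `fill` argument leaves NA in the added rows, as in the output above.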

Adding row for missing value in data.table

Do the same thing as in your linked question, but within each ida group:

setkey(dt, idb, date)

dt[, .SD[CJ(unique(idb), unique(date))], by = ida][is.na(value), value := 0][]
# ida idb value date
#1: A 2 26600 2004-12-31
#2: A 2 0 2005-03-31
#3: A 3 0 2004-12-31
#4: A 3 19600 2005-03-31
#5: C 2 8700 2005-12-31
#6: B 3 18200 2005-06-30
#7: B 3 0 2005-09-30
#8: B 4 0 2005-06-30
#9: B 4 1230 2005-09-30
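
A variant of the per-group cross join that does not rely on a key can be sketched like this. The data below are made up for illustration (only the column names follow the example above), and the explicit `on=` is a substitution for the keyed join, not the answer's exact code.

```r
library(data.table)

# Hypothetical data: within ida "A", two of the four idb x date pairs are missing
dt <- data.table(ida   = c("A", "A", "B"),
                 idb   = c(2L, 3L, 3L),
                 date  = c("2004-12-31", "2005-03-31", "2005-06-30"),
                 value = c(26600, 19600, 18200))

# For each ida, join its rows onto all idb x date combinations seen in that group
res <- dt[, .SD[CJ(idb = idb, date = date, unique = TRUE), on = .(idb, date)],
          by = ida][is.na(value), value := 0][]

print(res)
```

Because the cross join is built inside each group, combinations that only occur under another ida are not added.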

Insert all missing rows into data table for a range of values for 2 columns

Instead of using only the values already present in column 'a', pass the full range of values you want for 'a' to CJ (this assumes dt1 is keyed on a and b):

dt1[CJ(a = 1:7, b, unique = TRUE)]

