How to Insert Missing Observations on a Data Frame

how to insert missing observations on a data frame

This largely depends on how general you wish your solution to be. But if you want a non-general solution, you can do #1 fairly simply. Here I assume that you're using T as your time variable.

insert_miss <- function(df, time_val = "T", by = 1) {
  # pull out the time variable and work out which values are missing
  val <- get(time_val, envir = as.environment(df))
  val_range <- range(val)
  comp <- seq(val_range[1], val_range[2], by = by)
  which_miss <- comp[!comp %in% val]
  # generating a sample row depends a lot on your particular problem;
  # also, specifically how to impute the missing values depends on your
  # specific problem / domain
  ## here's one simple solution which is not generic
  row_samp <- df[1, ]
  df2 <- do.call("rbind", replicate(length(which_miss), row_samp, simplify = FALSE))
  df2[[time_val]] <- which_miss
  # blank out every column except the time variable
  others <- which(names(df2) != time_val)
  df2[, others] <- NA
  return(df2)
}

Run:

R> insert_miss(<your_df>)
   A cond   T Vlog
1 NA   NA 421   NA
2 NA   NA 422   NA
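
Note that the function returns only the newly created placeholder rows, not the padded data frame. Here is a minimal sketch of stitching them back in, assuming your original data frame is called df and its time column really is named T:

# insert_miss() returns just the rows for the missing time points
miss_rows <- insert_miss(df, time_val = "T", by = 1)

# append them to the original data and restore chronological order
df_full <- rbind(df, miss_rows)
df_full <- df_full[order(df_full[["T"]]), ]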

how to insert missing data in the dataframe?

You can use json_normalize directly to turn the list of JSON records into a DataFrame, then call set_index to make id the index. Finally, reindex the new DataFrame against np.arange over the full id range to insert the missing ids.

import numpy as np
import pandas as pd

data = [{"id": 77, "value": "hello"}, {"id": 5, "value": "HI"},
        {"id": 1, "value": "whats up"}, {"id": 2, "value": "what"},
        {"id": 120, "value": "hello"}, {"id": 170, "value": "hello"},
        {"id": 190, "value": "hello"}]

# flatten the JSON records into a DataFrame and index it by id
df = pd.json_normalize(data)  # json_normalize now lives in the top-level pandas namespace
new_df = df.set_index('id')
# reindex over the full id range; newly inserted rows get 'space' as a placeholder
new_df.reindex(np.arange(df.id.min(), df.id.max() + 1)).fillna('space')

Inserting rows into data frame when values missing in category

Option 1

Thanks to @Frank for the better solution, using tidyr:

library(tidyr)
complete(df, day, product, fill = list(sales = 0))

Using this approach, you no longer need to worry about selecting product names, etc.

Which gives you:

  day product      sales
1   a       1 0.52042809
2   b       1 0.00000000
3   c       1 0.46373882
4   a       2 0.11155348
5   b       2 0.04937618
6   c       2 0.26433153
7   a       3 0.69100939
8   b       3 0.90596172
9   c       3 0.00000000


Option 2

You can do this using the tidyr package (and dplyr):

df %>%
  spread(product, sales, fill = 0) %>%
  gather(`1`:`3`, key = "product", value = "sales")

Which gives the same result.

This works by using spread to create a wide data frame, with each product as its own column. The argument fill = 0 will cause all empty cells to be filled with a 0 (the default is NA).

Next, gather converts the 'wide' data frame back into the original 'long' format. Its first argument specifies the product columns (in this case `1`:`3`), and we set key and value to the original column names.

I would suggest option 1, but option 2 might still prove useful in certain circumstances.


Both options should work as long as every day has at least one sale recorded. If entire days can be missing, I suggest looking into the padr package and then using the tidyr approach above to do the rest; a tidyr-only sketch of that case follows.
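
For example, a minimal tidyr-only sketch of that situation; it assumes day is an actual Date (or numeric) column rather than the letter labels used above, so that full_seq() can generate the complete run of days:

library(tidyr)
# build every day between the first and last observed date, crossed with
# every product, and fill the sales of the newly created rows with 0
complete(df,
         day = full_seq(day, period = 1),
         product,
         fill = list(sales = 0))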

adding missing observations in data.table

I believe the issue is that CJ(l, l, 1994:1995) has duplicate names. This is hinted at by verbose=TRUE:

DT[CJ(l,l,1994:1995), verbose=TRUE]
# forder.c received a vector type 'character' length 3
# forder.c received a vector type 'character' length 3
# forder.c received a vector type 'integer' length 2
# i.l has same type (character) as x.from. No coercion needed.
# i.l has same type (character) as x.to. No coercion needed.
# i.V3 has same type (integer) as x.year. No coercion needed.
# on= matches existing key, using key
# Starting bmerge ...
# bmerge done in 0.000s elapsed (0.000s cpu)
# Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)

This is in a gray area between being a bug and expected behavior; better behavior might be to throw an error instead of proceeding with potentially wrong results.

Anyway, you can get around this by naming the CJ arguments:

DT[CJ(from = l, to = l, year = 1994:1995)]
#     from to year          g
#  1:    a  a 1994 0.64364200
#  2:    a  a 1995         NA
#  3:    a  b 1994 0.69746294
#  4:    a  b 1995 0.56863539
#  5:    a  c 1994 0.64369566
#  6:    a  c 1995         NA
#  7:    b  a 1994 0.62198311
#  8:    b  a 1995 0.71919139
#  9:    b  b 1994 0.76170866
# 10:    b  b 1995 0.84792449
# 11:    b  c 1994 0.15793127
# 12:    b  c 1995 0.26623733
# 13:    c  a 1994 0.89921463
# 14:    c  a 1995 0.55417635
# 15:    c  b 1994 0.38938166
# 16:    c  b 1995 0.03778206
# 17:    c  c 1994 0.48918988
# 18:    c  c 1995 0.75206221

Note that we could also accomplish this without keys:

setkey(DT, NULL)
# for those more familiar with SQL syntax, this is a NATURAL JOIN;
# it's equivalent to `on = c("from", "to", "year")`
DT[CJ(from = l, to = l, year = 1994:1995), on = .NATURAL]

How to merge two data frames with missing values?

You may try to impute the missing values in df1 with the corresponding non-missing values of df2, then just merge; the "main", "main_cost", and "rating" columns will automatically be used as merge keys. Merging on "main" alone would be insufficient, because there are ties.

# for the "main_cost" and "rating" columns, fill each position with the first
# non-NA value from the corresponding df1/df2 pair
df1[3:4] <- lapply(names(df2)[3:4], \(z)
  mapply(\(x, y) na.omit(c(x, y))[1], df1[[z]], df2[[z]]))

(res <- merge(df1, df2))
#     main main_cost  rating          combo have_it distance_mi
# 1 burger         7    fine   burger_fries   FALSE          56
# 2 burger         8   great    burger_coke    TRUE          20
# 3  pizza        11   great      pizza_veg   FALSE          40
# 4  pizza        13     bad   pizza_bagels    TRUE          14
# 5  pizza         3    fine    pizza_rolls   FALSE          12
# 6  salad        10  decent salad_dressing    TRUE          78
# 7  salad         5   great    salad_fruit   FALSE          66
# 8  steak         4    okay   steak_cheese    TRUE          30
# 9  steak         7 awesome     steak_mash   FALSE          19

Note that this probably only works if the data frames have the same number of rows in the same order, and the imputation succeeds so that the merging columns become identical. If NAs are left over, say in the "rating" column, explicitly specify the merging columns, e.g. by = c("main", "main_cost"); you will then end up with separate "rating.x" and "rating.y" columns, though.
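
For instance, a minimal sketch of that fallback, using the data below:

# merge on "main" and "main_cost" only; the two unmatched "rating" columns
# are then kept side by side as rating.x (from df1) and rating.y (from df2)
res2 <- merge(df1, df2, by = c("main", "main_cost"))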


Data:

df1 <- structure(list(combo = c("burger_coke", "burger_fries", "steak_cheese", 
"steak_mash", "salad_dressing", "salad_fruit", "pizza_rolls",
"pizza_bagels", "pizza_veg"), main = c("burger", "burger", "steak",
"steak", "salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L,
7L, NA, NA, NA, 5L, 3L, 13L, NA), rating = c("great", "fine",
"okay", "awesome", NA, "great", "fine", NA, "great")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))

df2 <- structure(list(have_it = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,
FALSE, TRUE, FALSE), main = c("burger", "burger", "steak", "steak",
"salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L,
7L, 4L, 7L, 10L, 5L, 3L, 13L, 11L), rating = c("great", "fine",
"okay", "awesome", "decent", "great", "fine", "bad", "great"),
distance_mi = c(20L, 56L, 30L, 19L, 78L, 66L, 12L, 14L, 40L
)), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9"))

Insert missing time rows into a dataframe

You can try merge/expand.grid:

res <- merge(
  expand.grid(group = unique(df$group), time = unique(df$time)),
  df, all = TRUE)
res$data[is.na(res$data)] <- 0
res
#  group time data
#1     A    1    5
#2     A    2    6
#3     A    3    0
#4     A    4    7
#5     B    1    8
#6     B    2    9
#7     B    3   10
#8     B    4    0

Or using data.table

library(data.table)
setkey(setDT(df), group, time)[CJ(group = unique(group), time = unique(time))
  ][is.na(data), data := 0L]
#   group time data
#1:     A    1    5
#2:     A    2    6
#3:     A    3    0
#4:     A    4    7
#5:     B    1    8
#6:     B    2    9
#7:     B    3   10
#8:     B    4    0

Update

As @thelatemail mentioned in the comments, the above method would fail if a particular 'time' value is not present in any of the groups. Maybe this would be more general:

res <- merge(
  expand.grid(group = unique(df$group),
              time = min(df$time):max(df$time)),
  df, all = TRUE)
res$data[is.na(res$data)] <- 0

and, similarly, replace time = unique(time) with time = min(time):max(time) in the data.table solution; a sketch of that variant follows.
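
For completeness, a minimal sketch of that data.table variant (assuming, as in the example, that time is integer-valued):

library(data.table)
setkey(setDT(df), group, time)[CJ(group = unique(group), time = min(time):max(time))
  ][is.na(data), data := 0L]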


