How to Insert Missing Observations on a Data Frame

how to insert missing observations on a data frame

This largely depends on how general you wish your solution to be. But if you want a non-general solution, you can do #1 fairly simply. Here I assume that you're using T as your time variable.

insert_miss <- function(df, time_val = "T", by = 1) {
  # pull out the time variable and work out which values are missing
  val <- get(time_val, envir = as.environment(df))
  val_range <- range(val)
  comp <- seq(val_range[1], val_range[2], by = by)
  which_miss <- comp[!comp %in% val]
  # generating a sample row depends a lot on your particular problem;
  # also, specifically how to impute the missing values depends on your
  # specific problem / domain
  ## here's one simple solution which is not generic
  row_samp <- df[1, ]
  df2 <- do.call("rbind", replicate(length(which_miss), row_samp, simplify = FALSE))
  df2[[time_val]] <- which_miss
  # blank out every column except the time variable
  others <- which(names(df2) != time_val)
  df2[, others] <- NA
  return(df2)
}

Run:

R> insert_miss(<your_df>)
   A cond   T Vlog
1 NA   NA 421   NA
2 NA   NA 422   NA
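
Note that the function returns only the newly created placeholder rows, not the padded data frame. Here is a minimal sketch of stitching them back in, assuming your original data frame is called df and its time column really is named T:

# insert_miss() returns just the rows for the missing time points
miss_rows <- insert_miss(df, time_val = "T", by = 1)

# append them to the original data and restore chronological order
df_full <- rbind(df, miss_rows)
df_full <- df_full[order(df_full[["T"]]), ]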

how to insert missing data in the dataframe?

You can use json_normalize directly to turn the list of JSON records into a DataFrame, then call set_index to make id the index. Finally, reindex the new DataFrame against np.arange over the full id range to insert the missing ids.

import numpy as np
import pandas as pd

data = [{"id": 77, "value": "hello"}, {"id": 5, "value": "HI"},
        {"id": 1, "value": "whats up"}, {"id": 2, "value": "what"},
        {"id": 120, "value": "hello"}, {"id": 170, "value": "hello"},
        {"id": 190, "value": "hello"}]

# flatten the JSON records into a DataFrame and index it by id
df = pd.json_normalize(data)  # json_normalize now lives in the top-level pandas namespace
new_df = df.set_index('id')
# reindex over the full id range; newly inserted rows get 'space' as a placeholder
new_df.reindex(np.arange(df.id.min(), df.id.max() + 1)).fillna('space')

Inserting rows into data frame when values missing in category

Option 1

Thanks to @Frank for the better solution, using tidyr:

library(tidyr)
complete(df, day, product, fill = list(sales = 0))

Using this approach, you no longer need to worry about selecting product names, etc.

Which gives you:

  day product      sales
1   a       1 0.52042809
2   b       1 0.00000000
3   c       1 0.46373882
4   a       2 0.11155348
5   b       2 0.04937618
6   c       2 0.26433153
7   a       3 0.69100939
8   b       3 0.90596172
9   c       3 0.00000000


Option 2

You can do this using the tidyr package (and dplyr):

df %>%
  spread(product, sales, fill = 0) %>%
  gather(`1`:`3`, key = "product", value = "sales")

Which gives the same result.

This works by using spread to create a wide data frame, with each product as its own column. The argument fill = 0 will cause all empty cells to be filled with a 0 (the default is NA).

Next, gather converts the 'wide' data frame back into the original 'long' format. Its first argument specifies the product columns (in this case `1`:`3`), and we set key and value to the original column names.

I would suggest option 1, but option 2 might still prove useful in certain circumstances.


Both options should work as long as every day has at least one sale recorded. If entire days can be missing, I suggest looking into the padr package and then using the tidyr approach above to do the rest; a tidyr-only sketch of that case follows.
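
For example, a minimal tidyr-only sketch of that situation; it assumes day is an actual Date (or numeric) column rather than the letter labels used above, so that full_seq() can generate the complete run of days:

library(tidyr)
# build every day between the first and last observed date, crossed with
# every product, and fill the sales of the newly created rows with 0
complete(df,
         day = full_seq(day, period = 1),
         product,
         fill = list(sales = 0))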

adding missing observations in data.table

I believe the issue is that CJ(l, l, 1994:1995) has duplicate names. This is hinted at by verbose=TRUE:

DT[CJ(l,l,1994:1995), verbose=TRUE]
# forder.c received a vector type 'character' length 3
# forder.c received a vector type 'character' length 3
# forder.c received a vector type 'integer' length 2
# i.l has same type (character) as x.from. No coercion needed.
# i.l has same type (character) as x.to. No coercion needed.
# i.V3 has same type (integer) as x.year. No coercion needed.
# on= matches existing key, using key
# Starting bmerge ...
# bmerge done in 0.000s elapsed (0.000s cpu)
# Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)

This is in a gray area between being a bug and expected behavior; better behavior might be to throw an error instead of proceeding with potentially wrong results.

Anyway, you can get around this by naming the CJ arguments:

DT[CJ(from = l, to = l, year = 1994:1995)]
#     from to year          g
#  1:    a  a 1994 0.64364200
#  2:    a  a 1995         NA
#  3:    a  b 1994 0.69746294
#  4:    a  b 1995 0.56863539
#  5:    a  c 1994 0.64369566
#  6:    a  c 1995         NA
#  7:    b  a 1994 0.62198311
#  8:    b  a 1995 0.71919139
#  9:    b  b 1994 0.76170866
# 10:    b  b 1995 0.84792449
# 11:    b  c 1994 0.15793127
# 12:    b  c 1995 0.26623733
# 13:    c  a 1994 0.89921463
# 14:    c  a 1995 0.55417635
# 15:    c  b 1994 0.38938166
# 16:    c  b 1995 0.03778206
# 17:    c  c 1994 0.48918988
# 18:    c  c 1995 0.75206221

Note that we could also accomplish this without keys:

setkey(DT, NULL)
# for those more familiar with SQL syntax, this is a NATURAL JOIN;
# it's equivalent to `on = c("from", "to", "year")`
DT[CJ(from = l, to = l, year = 1994:1995), on = .NATURAL]

How to merge two data frames with missing values?

You may try to impute the missing values in df1 with the corresponding non-missing values of df2, then just merge; the "main", "main_cost", and "rating" columns will automatically be used as merge keys. Merging on "main" alone would be insufficient, because there are ties.

# for the "main_cost" and "rating" columns, fill each position with the first
# non-NA value from the corresponding df1/df2 pair
df1[3:4] <- lapply(names(df2)[3:4], \(z)
  mapply(\(x, y) na.omit(c(x, y))[1], df1[[z]], df2[[z]]))

(res <- merge(df1, df2))
#     main main_cost  rating          combo have_it distance_mi
# 1 burger         7    fine   burger_fries   FALSE          56
# 2 burger         8   great    burger_coke    TRUE          20
# 3  pizza        11   great      pizza_veg   FALSE          40
# 4  pizza        13     bad   pizza_bagels    TRUE          14
# 5  pizza         3    fine    pizza_rolls   FALSE          12
# 6  salad        10  decent salad_dressing    TRUE          78
# 7  salad         5   great    salad_fruit   FALSE          66
# 8  steak         4    okay   steak_cheese    TRUE          30
# 9  steak         7 awesome     steak_mash   FALSE          19

Note that this probably only works if the data frames have the same number of rows in the same order, and the imputation succeeds so that the merging columns become identical. If NAs are left over, say in the "rating" column, explicitly specify the merging columns, e.g. by = c("main", "main_cost"); you will then end up with separate "rating.x" and "rating.y" columns, though.
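
For instance, a minimal sketch of that fallback, using the data below:

# merge on "main" and "main_cost" only; the two unmatched "rating" columns
# are then kept side by side as rating.x (from df1) and rating.y (from df2)
res2 <- merge(df1, df2, by = c("main", "main_cost"))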


Data:

df1 <- structure(list(combo = c("burger_coke", "burger_fries", "steak_cheese", 
"steak_mash", "salad_dressing", "salad_fruit", "pizza_rolls",
"pizza_bagels", "pizza_veg"), main = c("burger", "burger", "steak",
"steak", "salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L,
7L, NA, NA, NA, 5L, 3L, 13L, NA), rating = c("great", "fine",
"okay", "awesome", NA, "great", "fine", NA, "great")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))

df2 <- structure(list(have_it = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,
FALSE, TRUE, FALSE), main = c("burger", "burger", "steak", "steak",
"salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L,
7L, 4L, 7L, 10L, 5L, 3L, 13L, 11L), rating = c("great", "fine",
"okay", "awesome", "decent", "great", "fine", "bad", "great"),
distance_mi = c(20L, 56L, 30L, 19L, 78L, 66L, 12L, 14L, 40L
)), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9"))

Insert missing time rows into a dataframe

You can try merge/expand.grid:

res <- merge(
  expand.grid(group = unique(df$group), time = unique(df$time)),
  df, all = TRUE)
res$data[is.na(res$data)] <- 0
res
#  group time data
#1     A    1    5
#2     A    2    6
#3     A    3    0
#4     A    4    7
#5     B    1    8
#6     B    2    9
#7     B    3   10
#8     B    4    0

Or using data.table

library(data.table)
setkey(setDT(df), group, time)[CJ(group = unique(group), time = unique(time))
  ][is.na(data), data := 0L]
#   group time data
#1:     A    1    5
#2:     A    2    6
#3:     A    3    0
#4:     A    4    7
#5:     B    1    8
#6:     B    2    9
#7:     B    3   10
#8:     B    4    0

Update

As @thelatemail mentioned in the comments, the above method would fail if a particular 'time' value is not present in any of the groups. Maybe this would be more general:

res <- merge(
  expand.grid(group = unique(df$group),
              time = min(df$time):max(df$time)),
  df, all = TRUE)
res$data[is.na(res$data)] <- 0

and, similarly, replace time = unique(time) with time = min(time):max(time) in the data.table solution; a sketch of that variant follows.
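
For completeness, a minimal sketch of that data.table variant (assuming, as in the example, that time is integer-valued):

library(data.table)
setkey(setDT(df), group, time)[CJ(group = unique(group), time = min(time):max(time))
  ][is.na(data), data := 0L]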


