how to insert missing observations on a data frame
This largely depends on how general you want the solution to be. If a non-general solution is enough, #1 is fairly simple. Here I assume that T is your time variable.
insert_miss <- function(df, time_val = "T", by = 1) {
  val <- df[[time_val]]
  val_range <- range(val)
  comp <- seq(val_range[1], val_range[2], by = by)
  which_miss <- comp[!comp %in% val]
  # nothing to insert: return an empty frame with the same columns
  if (length(which_miss) == 0) return(df[0, , drop = FALSE])
  # generating a sample row depends a lot on your particular problem;
  # how to impute the missing values likewise depends on your
  # specific problem / domain
  ## here's one simple solution which is not generic
  row_samp <- df[1, ]
  df2 <- do.call("rbind", replicate(length(which_miss), row_samp, simplify = FALSE))
  df2[[time_val]] <- which_miss
  others <- which(names(df2) != time_val)
  df2[, others] <- NA
  # returns only the new (missing) rows; rbind() them onto df to complete it
  return(df2)
}
Run it on your data frame:
insert_miss(<your_df>)
R> A cond T Vlog
1 NA NA 421 NA
2 NA NA 422 NA
how to insert missing data in the dataframe?
For a list of JSON records you can use json_normalize directly, then call set_index on the result to make id the index. On the new DataFrame, call reindex (a DataFrame method, not a NumPy function) with np.arange spanning the full id range.
import numpy as np
import pandas as pd
data = [{"id":77,"value":"hello"},{"id":5,"value":"HI"},{"id":1,"value":"whats up"},{"id":2,"value":"what"},{"id":120,"value":"hello"},{"id":170,"value":"hello"},{"id":190,"value":"hello"}]
# pandas.io.json.json_normalize is deprecated since pandas 1.0; use pd.json_normalize
df = pd.json_normalize(data)
new_df = df.set_index('id')
new_df.reindex(np.arange(df.id.min(), df.id.max() + 1)).fillna('space')
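The steps above can be wrapped in a small helper. This is only a sketch: the function name fill_missing_ids and its parameters are made up for illustration, and it passes fill_value to reindex instead of chaining fillna.

```python
import numpy as np
import pandas as pd


def fill_missing_ids(records, id_col="id", fill_value="space"):
    """Hypothetical helper: one row for every integer id between min and max."""
    df = pd.json_normalize(records).set_index(id_col)
    full_range = np.arange(df.index.min(), df.index.max() + 1)
    # fill_value is used only for the rows that reindex creates;
    # NaNs that were already in the data are left untouched
    return df.reindex(full_range, fill_value=fill_value).rename_axis(id_col)


# small made-up sample in the same shape as the data above
data = [{"id": 5, "value": "HI"}, {"id": 1, "value": "whats up"}, {"id": 2, "value": "what"}]
filled = fill_missing_ids(data)
print(filled)
```

The difference from fillna('space') is that fillna would also overwrite NaNs that were present before reindexing, which may or may not be what you want.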
Inserting rows into data frame when values missing in category
Option 1
Thanks to @Frank for the better solution, using tidyr:
library(tidyr)
complete(df, day, product, fill = list(sales = 0))
Using this approach, you no longer need to worry about selecting product names, etc.
This gives you:
  day product      sales
1   a       1 0.52042809
2   b       1 0.00000000
3   c       1 0.46373882
4   a       2 0.11155348
5   b       2 0.04937618
6   c       2 0.26433153
7   a       3 0.69100939
8   b       3 0.90596172
9   c       3 0.00000000
Option 2
You can do this using the tidyr package (and dplyr):
df %>%
  spread(product, sales, fill = 0) %>%
  gather(`1`:`3`, key = "product", value = "sales")
This gives the same result.
It works by using spread to create a wide data frame, with each product as its own column. The argument fill = 0 causes all empty cells to be filled with 0 (the default is NA).
Next, gather converts the 'wide' data frame back into the original 'long' data frame. The first argument selects the product columns (in this case `1`:`3`). We then set key and value to the original column names.
I would suggest option 1, but option 2 might still prove useful in certain circumstances.
Both options work for all days that have at least one sale recorded. If entire days are missing, I suggest you look into the package padr and then use tidyr as above to do the rest.
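For readers who work in pandas rather than tidyr, the same idea (one row per combination of the observed key values) can be sketched as below. The helper name complete is borrowed from tidyr for illustration only, it is not a pandas function, and the sample values are made up.

```python
import pandas as pd


def complete(df, keys, fill):
    """Hypothetical helper: add a row for every combination of the unique key values.

    Assumes each key combination appears at most once in df.
    """
    levels = [df[k].unique() for k in keys]
    full = pd.MultiIndex.from_product(levels, names=keys)
    out = df.set_index(keys).reindex(full).reset_index()
    return out.fillna(fill)  # fill maps column name -> replacement value


# made-up sample: (b, 1) and (c, 3) have no recorded sale
df = pd.DataFrame({
    "day": ["a", "c", "a", "b", "c", "a", "b"],
    "product": [1, 1, 2, 2, 2, 3, 3],
    "sales": [0.5, 0.4, 0.1, 0.05, 0.2, 0.7, 0.9],
})
completed = complete(df, ["day", "product"], {"sales": 0})
print(completed)
```

As with the tidyr options, this only fills combinations of values that occur somewhere in the data; a day with no rows at all is not created.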
adding missing observations in data.table
I believe the issue is that CJ(l, l, 1994:1995) has duplicate names. This is hinted at by verbose=TRUE:
DT[CJ(l,l,1994:1995), verbose=TRUE]
# forder.c received a vector type 'character' length 3
# forder.c received a vector type 'character' length 3
# forder.c received a vector type 'integer' length 2
# i.l has same type (character) as x.from. No coercion needed.
# i.l has same type (character) as x.to. No coercion needed.
# i.V3 has same type (integer) as x.year. No coercion needed.
# on= matches existing key, using key
# Starting bmerge ...
# bmerge done in 0.000s elapsed (0.000s cpu)
# Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
This is in a gray area between being a bug or not... better behavior might be to error instead of proceeding with potentially wrong results.
Anyway, you can get around this by naming the CJ arguments:
DT[CJ(from = l, to = l, year = 1994:1995)]
# from to year g
# 1: a a 1994 0.64364200
# 2: a a 1995 NA
# 3: a b 1994 0.69746294
# 4: a b 1995 0.56863539
# 5: a c 1994 0.64369566
# 6: a c 1995 NA
# 7: b a 1994 0.62198311
# 8: b a 1995 0.71919139
# 9: b b 1994 0.76170866
# 10: b b 1995 0.84792449
# 11: b c 1994 0.15793127
# 12: b c 1995 0.26623733
# 13: c a 1994 0.89921463
# 14: c a 1995 0.55417635
# 15: c b 1994 0.38938166
# 16: c b 1995 0.03778206
# 17: c c 1994 0.48918988
# 18: c c 1995 0.75206221
Note that we could also accomplish this without keys:
setkey(DT, NULL)
# for those more familiar with SQL syntax, this is a NATURAL JOIN;
# it's equivalent to `on = c("from", "to", "year")`
DT[CJ(from = l, to = l, year = 1994:1995), on = .NATURAL]
How to merge two data frames with missing values?
You may try to impute the missing values in df1 with the corresponding non-missing values of df2. Then just merge, where the "main", "main_cost", and "rating" columns will automatically be selected as keys. Just "main" would be insufficient, because there are ties.
# for each shared column, take the first non-NA of the df1 / df2 values row by row
df1[3:4] <- lapply(names(df2)[3:4], \(z)
  mapply(\(x, y) el(na.omit(c(x, y))), df1[[z]], df2[[z]]))
(res <- merge(df1, df2))
# main main_cost rating combo have_it distance_mi
# 1 burger 7 fine burger_fries FALSE 56
# 2 burger 8 great burger_coke TRUE 20
# 3 pizza 11 great pizza_veg FALSE 40
# 4 pizza 13 bad pizza_bagels TRUE 14
# 5 pizza 3 fine pizza_rolls FALSE 12
# 6 salad 10 decent salad_dressing TRUE 78
# 7 salad 5 great salad_fruit FALSE 66
# 8 steak 4 okay steak_cheese TRUE 30
# 9 steak 7 awesome steak_mash FALSE 19
Note that this probably only works if the data frames have the same size and row order, and the values are successfully imputed so that the merging columns become identical. If NAs are left, say in the "rating" column, try to specify the merging columns explicitly, e.g. by=c("main", "main_cost"). You will end up with "rating.x" and "rating.y" in that case, though.
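A rough pandas counterpart of the impute-then-merge approach, as a sketch only. It uses rows 3 to 6 of the data below and makes the same assumption as the R code: row i of df1 describes the same item as row i of df2.

```python
import numpy as np
import pandas as pd

# rows 3 to 6 of the example data
df1 = pd.DataFrame({
    "combo": ["steak_cheese", "steak_mash", "salad_dressing", "salad_fruit"],
    "main": ["steak", "steak", "salad", "salad"],
    "main_cost": [np.nan, np.nan, np.nan, 5],
    "rating": ["okay", "awesome", None, "great"],
})
df2 = pd.DataFrame({
    "have_it": [True, False, True, False],
    "main": ["steak", "steak", "salad", "salad"],
    "main_cost": [4, 7, 10, 5],
    "rating": ["okay", "awesome", "decent", "great"],
    "distance_mi": [30, 19, 78, 66],
})

# Assumption: both frames have the same row order, so index alignment pairs
# each df1 row with the matching df2 row.
for col in ["main_cost", "rating"]:
    df1[col] = df1[col].fillna(df2[col]).astype(df2[col].dtype)

# with no `on=` given, merge joins on all column names the frames share
res = df1.merge(df2)
print(res)
```

If the row orders differ, the fillna step pairs up the wrong rows, so this carries the same caveat as the note above.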
Data:
df1 <- structure(list(combo = c("burger_coke", "burger_fries", "steak_cheese",
"steak_mash", "salad_dressing", "salad_fruit", "pizza_rolls",
"pizza_bagels", "pizza_veg"), main = c("burger", "burger", "steak",
"steak", "salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L,
7L, NA, NA, NA, 5L, 3L, 13L, NA), rating = c("great", "fine",
"okay", "awesome", NA, "great", "fine", NA, "great")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))
df2 <- structure(list(have_it = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,
FALSE, TRUE, FALSE), main = c("burger", "burger", "steak", "steak",
"salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L,
7L, 4L, 7L, 10L, 5L, 3L, 13L, 11L), rating = c("great", "fine",
"okay", "awesome", "decent", "great", "fine", "bad", "great"),
distance_mi = c(20L, 56L, 30L, 19L, 78L, 66L, 12L, 14L, 40L
)), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9"))
Insert missing time rows into a dataframe
You can try merge/expand.grid:
res <- merge(
  expand.grid(group = unique(df$group), time = unique(df$time)),
  df, all = TRUE)
res$data[is.na(res$data)] <- 0
res
# group time data
#1 A 1 5
#2 A 2 6
#3 A 3 0
#4 A 4 7
#5 B 1 8
#6 B 2 9
#7 B 3 10
#8 B 4 0
Or using data.table:
library(data.table)
setkey(setDT(df), group, time)[
  CJ(group = unique(group), time = unique(time))
][is.na(data), data := 0L]
# group time data
#1: A 1 5
#2: A 2 6
#3: A 3 0
#4: A 4 7
#5: B 1 8
#6: B 2 9
#7: B 3 10
#8: B 4 0
Update
As @thelatemail mentioned in the comments, the above method fails if a particular 'time' value does not appear in any of the groups. This may be more general:
res <- merge(
  expand.grid(group = unique(df$group),
              time = min(df$time):max(df$time)),
  df, all = TRUE)
res$data[is.na(res$data)] <- 0
Similarly, replace time=unique(time) with time=min(time):max(time) in the data.table solution.
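The same grid-then-merge idea can be sketched in pandas. The input frame here is reconstructed from the printed result above (group A lacks time 3, group B lacks time 4), so treat it as an assumption about the original data.

```python
import pandas as pd

# assumed input, inferred from the output shown earlier
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "time": [1, 2, 4, 1, 2, 3],
    "data": [5, 6, 7, 8, 9, 10],
})

# every group crossed with the full time range, like expand.grid with min:max
grid = pd.MultiIndex.from_product(
    [df["group"].unique(), range(df["time"].min(), df["time"].max() + 1)],
    names=["group", "time"],
).to_frame(index=False)

# a left merge keeps every grid row; unmatched rows get NaN in `data`
res = grid.merge(df, how="left", on=["group", "time"])
res["data"] = res["data"].fillna(0).astype(int)
print(res)
```

Building the grid from the min-to-max range rather than from the observed time values is what covers a time point that is missing from every group.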