Python insert rows into a data-frame when values missing in field
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'seq': [0, 1, 2, 3, 4, 5],
                   'location': ['cal', 'cal', 'cal', 'il', 'il', 'il'],
                   'lat': [29, 29.1, 28.2, 15.2, 15.6, 14],
                   'lon': [-95, -98, -95.6, -88, -87.5, -88.9],
                   'name': ['mike', 'john', 'tyler', 'rob', 'ashley', 'john']})
df_new1 = pd.DataFrame({'location' : ['warehouse'], 'lat': [22], 'lon': [-50]}) # sample data row1
df = pd.concat([df_new1, df], sort=False).reset_index(drop = True)
print(df)
df_new2 = pd.DataFrame({'location' : ['abc'], 'lat': [28], 'name': ['abcd']}) # sample data row2
df = pd.concat([df_new2, df], sort=False).reset_index(drop = True)
print(df)
output:
lat location lon name seq
0 22.0 warehouse -50.0 NaN NaN
0 29.0 cal -95.0 mike 0.0
1 29.1 cal -98.0 john 1.0
2 28.2 cal -95.6 tyler 2.0
3 15.2 il -88.0 rob 3.0
4 15.6 il -87.5 ashley 4.0
5 14.0 il -88.9 john 5.0
lat location name lon seq
0 28.0 abc abcd NaN NaN
1 22.0 warehouse NaN -50.0 NaN
2 29.0 cal mike -95.0 0.0
3 29.1 cal john -98.0 1.0
4 28.2 cal tyler -95.6 2.0
5 15.2 il rob -88.0 3.0
6 15.6 il ashley -87.5 4.0
7 14.0 il john -88.9 5.0
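If you also want the new rows' missing fields filled with a default instead of NaN, and the original column order preserved, one way is to reindex the columns and pass a dict to fillna. A minimal sketch on a cut-down version of the sample frame (the default values -1 and 'unknown' are arbitrary choices, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'seq': [0, 1], 'location': ['cal', 'il'],
                   'lat': [29.0, 15.2], 'lon': [-95.0, -88.0],
                   'name': ['mike', 'rob']})
new_row = pd.DataFrame({'location': ['warehouse'], 'lat': [22], 'lon': [-50]})

# concat introduces NaN for the columns new_row lacks
df = pd.concat([new_row, df], sort=False).reset_index(drop=True)

# restore the original column order and fill the gaps with defaults
df = df[['seq', 'location', 'lat', 'lon', 'name']]
df = df.fillna({'seq': -1, 'name': 'unknown'})
```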
Inserting rows into data frame when values missing in category
Option 1
Thanks to @Frank for the better solution, using tidyr:
library(tidyr)
complete(df, day, product, fill = list(sales = 0))
Using this approach, you no longer need to worry about selecting product names, etc.
Which gives you:
day product sales
1 a 1 0.52042809
2 b 1 0.00000000
3 c 1 0.46373882
4 a 2 0.11155348
5 b 2 0.04937618
6 c 2 0.26433153
7 a 3 0.69100939
8 b 3 0.90596172
9 c 3 0.00000000
Option 2
You can do this using the tidyr package (and dplyr):
df %>%
spread(product, sales, fill = 0) %>%
gather(`1`:`3`, key = "product", value = "sales")
Which gives the same result.
This works by using spread to create a wide data frame, with each product as its own column. The argument fill = 0 will cause all empty cells to be filled with a 0 (the default is NA).
Next, gather works to convert the 'wide' data frame back into the original 'long' data frame. The first argument is the columns of the products (in this case `1`:`3`). We then set the key and value to the original column names.
I would suggest option 1, but option 2 might still prove useful in certain circumstances.
Both options should work for all days where you have at least one sale recorded. If there are missing days, I suggest you look into the padr package and then use the above tidyr approach to do the rest.
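For readers on the pandas side, tidyr::complete can be approximated by reindexing against the full day × product cross product. A sketch with made-up sales figures (not the values from the R output above):

```python
import pandas as pd

# toy data with (day 1, product 'b') and (day 3, product 'c') missing
df = pd.DataFrame({'day': [1, 1, 2, 2, 2, 3, 3],
                   'product': ['a', 'c', 'a', 'b', 'c', 'a', 'b'],
                   'sales': [0.52, 0.46, 0.11, 0.05, 0.26, 0.69, 0.91]})

full_idx = pd.MultiIndex.from_product(
    [sorted(df['day'].unique()), sorted(df['product'].unique())],
    names=['day', 'product'])

# missing combinations appear as new rows; fill their sales with 0
complete = (df.set_index(['day', 'product'])
              .reindex(full_idx, fill_value=0)
              .reset_index())
```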
Add row for each group with missing value
Convert your data.frame to wide format, filling it with 0s instead of NAs, then convert it back to tall format:
count <- c(5, 5, 7, 3, 2, 6, 4)  # should be integers, not strings
data <- data.frame(Basket, Fruit, count)  # Basket and Fruit as defined in the question
d1 <- tidyr::spread( data, Fruit, count, fill = 0 )
d2 <- tidyr::gather( d1, Fruit, count, -Basket )
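The same wide-then-tall trick translates to pandas as pivot_table(fill_value=0) followed by melt. A sketch with hypothetical Basket/Fruit values (the question's actual vectors are not shown above):

```python
import pandas as pd

data = pd.DataFrame({'Basket': ['B1', 'B1', 'B2', 'B2', 'B3'],
                     'Fruit': ['apple', 'pear', 'apple', 'plum', 'pear'],
                     'count': [5, 5, 7, 3, 2]})

# spread: one column per Fruit, absent combinations filled with 0
wide = data.pivot_table(index='Basket', columns='Fruit',
                        values='count', fill_value=0)

# gather: back to long format, now with an explicit 0 row per missing pair
tall = wide.reset_index().melt(id_vars='Basket',
                               var_name='Fruit', value_name='count')
```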
Inserting missing categories and dates in pandas dataframe
Your solution can be modified by adding the unique values of the date column; this approach works even if the (date, group, score) triples in the input data are not unique:
from itertools import product

cats = ['high', 'mid', 'low']
x_re = pd.DataFrame(list(product(x['date'].unique(),
                                 x['group'].unique(),
                                 cats)), columns=['date', 'group', 'score'])
x = x_re.merge(x, how='left').fillna(0)
A solution with reindex by a 3-level MultiIndex is similar:
cats = ['high', 'mid','low']
x_re = pd.MultiIndex.from_product([x['date'].unique(),
x['group'].unique(),
cats],names=['date','group', 'score'])
x = x.set_index(['date','group','score']).reindex(x_re).reset_index()
print (x)
date group score count
0 2020-06-01 a high 12.0
1 2020-06-01 a mid NaN
2 2020-06-01 a low 13.0
3 2020-06-01 b high NaN
4 2020-06-01 b mid NaN
5 2020-06-01 b low 19.0
6 2020-06-01 c high 3.0
7 2020-06-01 c mid NaN
8 2020-06-01 c low NaN
9 2020-06-01 d high NaN
10 2020-06-01 d mid NaN
11 2020-06-01 d low NaN
12 2020-06-02 a high NaN
13 2020-06-02 a mid 2.0
14 2020-06-02 a low NaN
15 2020-06-02 b high 22.0
16 2020-06-02 b mid NaN
17 2020-06-02 b low NaN
18 2020-06-02 c high 4.0
19 2020-06-02 c mid 49.0
20 2020-06-02 c low NaN
21 2020-06-02 d high 12.0
22 2020-06-02 d mid NaN
23 2020-06-02 d low NaN
A single call to unstack plus one call to stack also works, but all the unique cats values have to exist in the input data:
x = (x.set_index(['date', 'group', 'score'])
.unstack(['group','score'])
.stack([1, 2], dropna=False)
.reset_index())
print (x)
date group score count
0 2020-06-01 a high 12.0
1 2020-06-01 a low 13.0
2 2020-06-01 a mid NaN
3 2020-06-01 b high NaN
4 2020-06-01 b low 19.0
5 2020-06-01 b mid NaN
6 2020-06-01 c high 3.0
7 2020-06-01 c low NaN
8 2020-06-01 c mid NaN
9 2020-06-01 d high NaN
10 2020-06-01 d low NaN
11 2020-06-01 d mid NaN
12 2020-06-02 a high NaN
13 2020-06-02 a low NaN
14 2020-06-02 a mid 2.0
15 2020-06-02 b high 22.0
16 2020-06-02 b low NaN
17 2020-06-02 b mid NaN
18 2020-06-02 c high 4.0
19 2020-06-02 c low NaN
20 2020-06-02 c mid 49.0
21 2020-06-02 d high 12.0
22 2020-06-02 d low NaN
23 2020-06-02 d mid NaN
Inserting NA rows when missing data
Not very elegant, but this is how I would do it:
Seq<-c(1,2,3,4,6,7,10,11,12,18,19,20)
Data<-c(3,4,5,4,3,2,1,2,3,5,4,3)
DF<-data.frame(Seq, Data)
first <- DF$Seq
second <- DF$Data
for(i in length(first):2) {
gap <- first[i] - first[i - 1]
if(gap > 2) {
steps <- ifelse(gap %% 2 == 1, gap %/% 2, (gap %/% 2) -1)
new_values_gap <- gap / (steps + 1)
new_values <- vector('numeric')
for(j in 1:steps) {
new_values <- c(new_values, first[i - 1] + j * new_values_gap)
}
first <- c(first[1:(i - 1)], new_values, first[i:length(first)])
second <- c(second[1:(i - 1)], rep(NA, length(new_values)), second[i:length(second)])
}
}
NewDF <- data.frame(NewSeq = first, NewData = second)
> NewDF
## NewSeq NewData
## 1 1.0 3
## 2 2.0 4
## 3 3.0 5
## 4 4.0 4
## 5 6.0 3
## 6 7.0 2
## 7 8.5 NA
## 8 10.0 1
## 9 11.0 2
## 10 12.0 3
## 11 14.0 NA
## 12 16.0 NA
## 13 18.0 5
## 14 19.0 4
## 15 20.0 3
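The same gap-filling idea, sketched in Python for comparison (same Seq/Data values as the R example; the step rule mirrors the R loop):

```python
seq = [1, 2, 3, 4, 6, 7, 10, 11, 12, 18, 19, 20]
data = [3, 4, 5, 4, 3, 2, 1, 2, 3, 5, 4, 3]

new_seq, new_data = [seq[0]], [data[0]]
for prev, cur, val in zip(seq, seq[1:], data[1:]):
    gap = cur - prev
    if gap > 2:
        # same step rule as the R loop: a midpoint for odd gaps,
        # evenly spaced integer positions for even gaps
        steps = gap // 2 if gap % 2 == 1 else gap // 2 - 1
        spacing = gap / (steps + 1)
        for j in range(1, steps + 1):
            new_seq.append(prev + j * spacing)
            new_data.append(None)   # NA marker
    new_seq.append(cur)
    new_data.append(val)
```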
How to add the missing rows from one dataframe to another based on condition in Pandas?
- you can concat df1 with the records in df2 that are not in df1:
df2[~df2.isin(df1)].dropna()
- you then sort the values and reset the index
Long story short, you could do it in one line:
pd.concat([df1, df2[~df2.isin(df1)].dropna()]).sort_values(['index','type','class']).reset_index(drop=True)
Will give the following output:
index type class
0 001 red A
1 001 red A
2 001 red A
3 002 yellow A
4 002 red A
5 003 green A
6 003 green B
7 004 blue A
8 004 blue A
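A more robust way to find the rows of df2 absent from df1 (element-wise isin can misfire on duplicates and row alignment) is an anti-join via merge with indicator=True. A sketch with made-up data resembling the example, not the question's exact frames:

```python
import pandas as pd

df1 = pd.DataFrame({'index': ['001', '002'], 'type': ['red', 'yellow'],
                    'class': ['A', 'A']})
df2 = pd.DataFrame({'index': ['001', '003'], 'type': ['red', 'green'],
                    'class': ['A', 'B']})

# anti-join: keep only df2 rows with no exact match in df1
marked = df2.merge(df1.drop_duplicates(), how='left', indicator=True)
missing = marked[marked['_merge'] == 'left_only'].drop(columns='_merge')

out = (pd.concat([df1, missing])
         .sort_values(['index', 'type', 'class'])
         .reset_index(drop=True))
```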
Pandas - insert rows where data is missing
Create a MultiIndex and reindex + reset_index:
import pandas as pd
import numpy as np

idx = pd.MultiIndex.from_product([df['Team'].unique(),
np.arange(5, df['Seconds_left'].max()+1, 5)],
names=['Team', 'Seconds_left'])
df.set_index(['Team', 'Seconds_left']).reindex(idx).reset_index()
Out:
Team Seconds_left Fouls
0 ATL 5 1.0
1 ATL 10 2.0
2 ATL 15 3.0
3 ATL 20 NaN
4 ATL 25 3.0
5 ATL 30 4.0
6 ATL 35 5.0
7 SAS 5 5.0
8 SAS 10 4.0
9 SAS 15 1.0
10 SAS 20 NaN
11 SAS 25 NaN
12 SAS 30 1.0
13 SAS 35 NaN
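After the reindex, the NaN gaps can be filled per team, for example by carrying each team's last observed foul count forward with a grouped ffill. A sketch on a cut-down, made-up version of the frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Team': ['ATL', 'ATL', 'SAS', 'SAS'],
                   'Seconds_left': [5, 15, 5, 20],
                   'Fouls': [1, 3, 5, 1]})

idx = pd.MultiIndex.from_product(
    [df['Team'].unique(), np.arange(5, df['Seconds_left'].max() + 1, 5)],
    names=['Team', 'Seconds_left'])

out = df.set_index(['Team', 'Seconds_left']).reindex(idx).reset_index()

# fill each team's gaps with the last known value for that team
out['Fouls'] = out.groupby('Team')['Fouls'].ffill()
```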
Interpolate and insert missing rows into dataframe R
One approach; adapt it to your case as appropriate:
library(dplyr)
library(lubridate) ## facilitates date-time manipulations
## example data:
patchy_data <- data.frame(date = as.Date('2021-11-01') + sample(1:10, 6),
value = rnorm(12)) %>%
arrange(date)
## create vector of -only!- missing dates:
missing_dates <-
setdiff(
seq.Date(from = min(patchy_data$date),
to = max(patchy_data$date),
by = '1 day'
),
patchy_data$date
) %>% as.Date(origin = '1970-01-01')
## extend initial dataframe with rows per missing date:
full_data <-
patchy_data %>%
bind_rows(data.frame(date = missing_dates,
value = NA)
) %>%
arrange(date)
## group by month and impute missing data from monthwise statistic:
full_data %>%
mutate(month = lubridate::month(date)) %>%
group_by(month) %>%
## coalesce conveniently replaces ifelse-constructs to replace NAs
mutate(imputed = coalesce(value, mean(value, na.rm = TRUE)))
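The same month-wise imputation in pandas: group by calendar month and replace each NA with the group mean. A sketch with made-up dates and values:

```python
import pandas as pd
import numpy as np

full_data = pd.DataFrame({
    'date': pd.to_datetime(['2021-11-02', '2021-11-05', '2021-11-09',
                            '2021-12-01', '2021-12-03']),
    'value': [1.0, np.nan, 3.0, np.nan, 4.0]})

month = full_data['date'].dt.month
# replace each NA with the mean of the observed values in the same month
full_data['imputed'] = full_data['value'].fillna(
    full_data.groupby(month)['value'].transform('mean'))
```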
edit
To granulate the generated data (missing dates) with additional parameters (e.g. measuring depths), you can use expand.grid as follows. Assuming object names from the previous code:
## depths of daily measurements:
observation_depths <- c(0.5, 1.1, 1.5) ## example
## generate dataframe with missing dates x depths:
missing_dates_and_depths <-
setNames(expand.grid(missing_dates, observation_depths),
c('date','depthR')
)
## stack both dataframes as above:
full_data <-
patchy_data %>%
bind_rows(missing_dates_and_depths) %>%
arrange(date)
Insert missing rows by factor level
Use expand.grid to make a master list and then merge:
alllevs <- do.call(expand.grid, lapply(dat[c("Type","Category")], levels))
merge(dat, alllevs, all.y=TRUE)
# Category Type Number Count
#1 X A 1 10
#2 X B 2 14
#3 Y A NA NA
#4 Y B 3 3
#5 Z A 4 14
#6 Z B NA NA
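The expand.grid-then-merge pattern maps to pandas as a cross join of the level tables followed by a left merge. A sketch mirroring the Category/Type example (made-up Count values; how='cross' requires pandas >= 1.2):

```python
import pandas as pd

dat = pd.DataFrame({'Category': ['X', 'X', 'Y', 'Z'],
                    'Type': ['A', 'B', 'B', 'A'],
                    'Count': [10, 14, 3, 14]})

# all Category x Type combinations (the expand.grid step)
alllevs = dat[['Category']].drop_duplicates().merge(
    dat[['Type']].drop_duplicates(), how='cross')

# left merge: combinations absent from dat get NaN counts
out = alllevs.merge(dat, on=['Category', 'Type'], how='left')
```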