Inserting Rows into Data Frame When Values Missing in Category

Python: insert rows into a DataFrame when values are missing in a field

Try this:

import pandas as pd
import numpy as np
df = pd.DataFrame({'seq': [0, 1, 2, 3, 4, 5],
                   'location': ['cal', 'cal', 'cal', 'il', 'il', 'il'],
                   'lat': [29, 29.1, 28.2, 15.2, 15.6, 14],
                   'lon': [-95, -98, -95.6, -88, -87.5, -88.9],
                   'name': ['mike', 'john', 'tyler', 'rob', 'ashley', 'john']})

df_new1 = pd.DataFrame({'location' : ['warehouse'], 'lat': [22], 'lon': [-50]}) # sample data row1
df = pd.concat([df_new1, df], sort=False).reset_index(drop = True)
print(df)

df_new2 = pd.DataFrame({'location' : ['abc'], 'lat': [28], 'name': ['abcd']}) # sample data row2
df = pd.concat([df_new2, df], sort=False).reset_index(drop = True)
print(df)

output:

    lat   location    lon    name  seq
0  22.0  warehouse  -50.0     NaN  NaN
1  29.0        cal  -95.0    mike  0.0
2  29.1        cal  -98.0    john  1.0
3  28.2        cal  -95.6   tyler  2.0
4  15.2         il  -88.0     rob  3.0
5  15.6         il  -87.5  ashley  4.0
6  14.0         il  -88.9    john  5.0

    lat   location    name    lon  seq
0  28.0        abc    abcd    NaN  NaN
1  22.0  warehouse     NaN  -50.0  NaN
2  29.0        cal    mike  -95.0  0.0
3  29.1        cal    john  -98.0  1.0
4  28.2        cal   tyler  -95.6  2.0
5  15.2         il     rob  -88.0  3.0
6  15.6         il  ashley  -87.5  4.0
7  14.0         il    john  -88.9  5.0

Inserting rows into data frame when values missing in category

Option 1

Thanks to @Frank for the better solution, using tidyr:

library(tidyr)
complete(df, day, product, fill = list(sales = 0))

Using this approach, you no longer need to worry about selecting product names, etc.

Which gives you:

  day product      sales
1   a       1 0.52042809
2   b       1 0.00000000
3   c       1 0.46373882
4   a       2 0.11155348
5   b       2 0.04937618
6   c       2 0.26433153
7   a       3 0.69100939
8   b       3 0.90596172
9   c       3 0.00000000
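For readers coming from pandas, a `complete`-style fill can be sketched with a MultiIndex reindex. The toy day/product frame below is an assumption for illustration, not the question's data:

```python
import pandas as pd

# Hypothetical long frame with one (day, product) pair absent
df = pd.DataFrame({'day': ['a', 'b', 'c', 'a', 'c'],
                   'product': [1, 1, 1, 2, 2],
                   'sales': [0.52, 0.31, 0.46, 0.11, 0.26]})

# Build every day x product combination; absent pairs get sales = 0
full = (df.set_index(['day', 'product'])
          .reindex(pd.MultiIndex.from_product([df['day'].unique(),
                                               df['product'].unique()],
                                              names=['day', 'product']),
                   fill_value=0)
          .reset_index())
print(full)
```

The `fill_value=0` argument plays the role of `fill = list(sales = 0)` in `complete`.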


Option 2

You can do this using the tidyr package (together with dplyr):

df %>%
  spread(product, sales, fill = 0) %>%
  gather(`1`:`3`, key = "product", value = "sales")

This gives the same result.

This works by using spread to create a wide data frame, with each product as its own column. The argument fill = 0 will cause all empty cells to be filled with a 0 (the default is NA).

Next, gather works to convert the 'wide' data frame back into the original 'long' data frame. The first argument is the columns of the products (in this case '1':'3'). We then set the key and value to the original column names.
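The same wide-then-long round trip can be sketched in pandas with pivot and melt (the sample frame below is an assumption):

```python
import pandas as pd

# Hypothetical long frame missing the (b, 2) pair
df = pd.DataFrame({'day': ['a', 'b', 'a'],
                   'product': [1, 1, 2],
                   'sales': [0.5, 0.3, 0.1]})

# pivot plays the role of spread: one column per product, gaps become NaN,
# which fillna(0) then replaces, mirroring fill = 0
wide = df.pivot(index='day', columns='product', values='sales').fillna(0)

# melt plays the role of gather: back to long form, every pair now present
long_df = wide.reset_index().melt(id_vars='day', var_name='product',
                                  value_name='sales')
print(long_df)
```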

I would suggest option 1, but option 2 might still prove useful in certain circumstances.


Both options work as long as every day has at least one sale recorded. If entire days are missing, look into the padr package, then use tidyr as above for the rest.

Add row for each group with missing value

Convert your data.frame to wide format, filling it with 0s instead of NAs, then convert it back to tall format:

# Basket and Fruit are the factor vectors from the question
count <- c(5, 5, 7, 3, 2, 6, 4)   # should be integers, not strings
data <- data.frame(Basket, Fruit, count)

d1 <- tidyr::spread(data, Fruit, count, fill = 0)
d2 <- tidyr::gather(d1, Fruit, count, -Basket)

Inserting missing categories and dates in pandas dataframe

Your solution can be modified to include the unique date values as well; this merge-based approach works even when the (date, group, score) triples are not unique in the input data:

from itertools import product

cats = ['high', 'mid', 'low']
x_re = pd.DataFrame(list(product(x['date'].unique(),
                                 x['group'].unique(),
                                 cats)),
                    columns=['date', 'group', 'score'])
x = x_re.merge(x, how='left').fillna(0)

A similar solution uses reindex with a three-level MultiIndex (reindex requires the resulting index to be unique):

cats = ['high', 'mid', 'low']
x_re = pd.MultiIndex.from_product([x['date'].unique(),
                                   x['group'].unique(),
                                   cats],
                                  names=['date', 'group', 'score'])

x = x.set_index(['date', 'group', 'score']).reindex(x_re).reset_index()
print(x)
date group score count
0 2020-06-01 a high 12.0
1 2020-06-01 a mid NaN
2 2020-06-01 a low 13.0
3 2020-06-01 b high NaN
4 2020-06-01 b mid NaN
5 2020-06-01 b low 19.0
6 2020-06-01 c high 3.0
7 2020-06-01 c mid NaN
8 2020-06-01 c low NaN
9 2020-06-01 d high NaN
10 2020-06-01 d mid NaN
11 2020-06-01 d low NaN
12 2020-06-02 a high NaN
13 2020-06-02 a mid 2.0
14 2020-06-02 a low NaN
15 2020-06-02 b high 22.0
16 2020-06-02 b mid NaN
17 2020-06-02 b low NaN
18 2020-06-02 c high 4.0
19 2020-06-02 c mid 49.0
20 2020-06-02 c low NaN
21 2020-06-02 d high 12.0
22 2020-06-02 d mid NaN
23 2020-06-02 d low NaN

A single unstack plus a single stack call also works, but only if every unique value in cats already exists in the input data:

x = (x.set_index(['date', 'group', 'score'])
       .unstack(['group', 'score'])
       .stack([1, 2], dropna=False)
       .reset_index())
print(x)
date group score count
0 2020-06-01 a high 12.0
1 2020-06-01 a low 13.0
2 2020-06-01 a mid NaN
3 2020-06-01 b high NaN
4 2020-06-01 b low 19.0
5 2020-06-01 b mid NaN
6 2020-06-01 c high 3.0
7 2020-06-01 c low NaN
8 2020-06-01 c mid NaN
9 2020-06-01 d high NaN
10 2020-06-01 d low NaN
11 2020-06-01 d mid NaN
12 2020-06-02 a high NaN
13 2020-06-02 a low NaN
14 2020-06-02 a mid 2.0
15 2020-06-02 b high 22.0
16 2020-06-02 b low NaN
17 2020-06-02 b mid NaN
18 2020-06-02 c high 4.0
19 2020-06-02 c low NaN
20 2020-06-02 c mid 49.0
21 2020-06-02 d high 12.0
22 2020-06-02 d low NaN
23 2020-06-02 d mid NaN

Inserting NA rows when missing data

Not very elegant, but this is how I would do it:

Seq <- c(1, 2, 3, 4, 6, 7, 10, 11, 12, 18, 19, 20)
Data <- c(3, 4, 5, 4, 3, 2, 1, 2, 3, 5, 4, 3)
DF <- data.frame(Seq, Data)

first <- DF$Seq
second <- DF$Data

for (i in length(first):2) {
  gap <- first[i] - first[i - 1]
  if (gap > 2) {
    steps <- ifelse(gap %% 2 == 1, gap %/% 2, (gap %/% 2) - 1)
    new_values_gap <- gap / (steps + 1)
    new_values <- vector('numeric')
    for (j in 1:steps) {
      new_values <- c(new_values, first[i - 1] + j * new_values_gap)
    }
    first <- c(first[1:(i - 1)], new_values, first[i:length(first)])
    second <- c(second[1:(i - 1)], rep(NA, length(new_values)), second[i:length(second)])
  }
}

NewDF <- data.frame(NewSeq = first, NewData = second)

> NewDF

## NewSeq NewData
## 1 1.0 3
## 2 2.0 4
## 3 3.0 5
## 4 4.0 4
## 5 6.0 3
## 6 7.0 2
## 7 8.5 NA
## 8 10.0 1
## 9 11.0 2
## 10 12.0 3
## 11 14.0 NA
## 12 16.0 NA
## 13 18.0 5
## 14 19.0 4
## 15 20.0 3
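For comparison, the same gap-filling logic translates to Python, building the vectors forward instead of backward:

```python
import numpy as np
import pandas as pd

seq = [1, 2, 3, 4, 6, 7, 10, 11, 12, 18, 19, 20]
data = [3, 4, 5, 4, 3, 2, 1, 2, 3, 5, 4, 3]

new_seq, new_data = [seq[0]], [data[0]]
for prev, cur, val in zip(seq, seq[1:], data[1:]):
    gap = cur - prev
    if gap > 2:
        # Number of evenly spaced filler points, as in the R loop
        steps = gap // 2 if gap % 2 == 1 else gap // 2 - 1
        spacing = gap / (steps + 1)
        for j in range(1, steps + 1):
            new_seq.append(prev + j * spacing)
            new_data.append(np.nan)
    new_seq.append(cur)
    new_data.append(val)

new_df = pd.DataFrame({'NewSeq': new_seq, 'NewData': new_data})
print(new_df)
```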

How to add the missing rows from one dataframe to another based on condition in Pandas?

  1. Concatenate df1 with the records of df2 that are not already in df1: df2[~df2.isin(df1)].dropna()
  2. Then sort the values and reset_index.

Long story short, you can do it in one line:

pd.concat([df1, df2[~df2.isin(df1)].dropna()]).sort_values(['index','type','class']).reset_index(drop=True)

Will give the following output:

  index    type class
0   001     red     A
1   001     red     A
2   001     red     A
3   002  yellow     A
4   002     red     A
5   003   green     A
6   003   green     B
7   004    blue     A
8   004    blue     A
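Note that DataFrame.isin matches element-wise, which can be fragile; an outer merge with indicator=True is a more robust way to find the rows of df2 absent from df1. The small frames below are assumptions for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'index': ['001', '002'],
                    'type': ['red', 'yellow'],
                    'class': ['A', 'A']})
df2 = pd.DataFrame({'index': ['001', '003'],
                    'type': ['red', 'green'],
                    'class': ['A', 'B']})

# Outer merge on the shared columns; _merge flags rows present only in df2
merged = df1.merge(df2, how='outer', indicator=True)
missing = merged[merged['_merge'] == 'right_only'].drop(columns='_merge')

result = (pd.concat([df1, missing])
            .sort_values(['index', 'type', 'class'])
            .reset_index(drop=True))
print(result)
```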

Pandas - insert rows where data is missing

Create a MultiIndex and reindex + reset_index:

import numpy as np

idx = pd.MultiIndex.from_product([df['Team'].unique(),
                                  np.arange(5, df['Seconds_left'].max() + 1, 5)],
                                 names=['Team', 'Seconds_left'])

df.set_index(['Team', 'Seconds_left']).reindex(idx).reset_index()
Out:
Team Seconds_left Fouls
0 ATL 5 1.0
1 ATL 10 2.0
2 ATL 15 3.0
3 ATL 20 NaN
4 ATL 25 3.0
5 ATL 30 4.0
6 ATL 35 5.0
7 SAS 5 5.0
8 SAS 10 4.0
9 SAS 15 1.0
10 SAS 20 NaN
11 SAS 25 NaN
12 SAS 30 1.0
13 SAS 35 NaN
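A self-contained sketch of the same reindex idea, with toy data standing in for the question's frame:

```python
import numpy as np
import pandas as pd

# Hypothetical per-team foul counts with some 5-second buckets missing
df = pd.DataFrame({'Team': ['ATL', 'ATL', 'SAS'],
                   'Seconds_left': [5, 15, 10],
                   'Fouls': [1, 3, 4]})

# Every team x 5-second-bucket combination up to the observed maximum
idx = pd.MultiIndex.from_product(
    [df['Team'].unique(), np.arange(5, df['Seconds_left'].max() + 1, 5)],
    names=['Team', 'Seconds_left'])

# Missing combinations come back as NaN rows
out = df.set_index(['Team', 'Seconds_left']).reindex(idx).reset_index()
print(out)
```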

Interpolate and insert missing rows into dataframe R

One approach; adapt to your case as appropriate:

library(dplyr)
library(lubridate) ## facilitates date-time manipulations

## example data:
patchy_data <- data.frame(date = as.Date('2021-11-01') + sample(1:10, 6),
                          value = rnorm(6)) %>%
  arrange(date)

## create vector of -only!- missing dates:
missing_dates <-
  setdiff(
    seq.Date(from = min(patchy_data$date),
             to = max(patchy_data$date),
             by = '1 day'),
    patchy_data$date
  ) %>% as.Date(origin = '1970-01-01')

## extend initial dataframe with rows per missing date:
full_data <-
  patchy_data %>%
  bind_rows(data.frame(date = missing_dates,
                       value = NA)) %>%
  arrange(date)

## group by month and impute missing data from monthwise statistic:
full_data %>%
  mutate(month = lubridate::month(date)) %>%
  group_by(month) %>%
  ## coalesce conveniently replaces ifelse constructs for filling NAs;
  ## bare column names (not .$value) keep the mean groupwise
  mutate(imputed = coalesce(value, mean(value, na.rm = TRUE)))
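The same two steps (insert missing dates, then impute from a month-wise mean) can be sketched in pandas; the example dates and values are assumptions:

```python
import pandas as pd

# Hypothetical patchy daily series
patchy = pd.DataFrame({'date': pd.to_datetime(['2021-11-02', '2021-11-04',
                                               '2021-11-07']),
                       'value': [1.0, 3.0, 5.0]})

# Reindex over the full daily range, inserting missing dates as NaN rows
full = (patchy.set_index('date')
              .reindex(pd.date_range(patchy['date'].min(),
                                     patchy['date'].max(), freq='D'))
              .rename_axis('date')
              .reset_index())

# Impute NaNs with the month-wise mean of the observed values
full['imputed'] = full['value'].fillna(
    full.groupby(full['date'].dt.month)['value'].transform('mean'))
print(full)
```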

edit
One way to cross the generated missing dates with additional parameters (e.g. measuring depths) is expand.grid, as follows. Assuming object names from the previous code:

## depths of daily measurements:
observation_depths <- c(0.5, 1.1, 1.5) ## example

## generate dataframe with missing dates x depths:
missing_dates_and_depths <-
  setNames(expand.grid(missing_dates, observation_depths),
           c('date', 'depthR'))

## stack both dataframes as above:
full_data <-
  patchy_data %>%
  bind_rows(missing_dates_and_depths) %>%
  arrange(date)

Insert missing rows by factor level

Use expand.grid to make a master list and then merge:

alllevs <- do.call(expand.grid, lapply(dat[c("Type","Category")], levels))
merge(dat, alllevs, all.y=TRUE)

#  Category Type Number Count
#1        X    A      1    10
#2        X    B      2    14
#3        Y    A     NA    NA
#4        Y    B      3     3
#5        Z    A      4    14
#6        Z    B     NA    NA

