Fill Missing Combinations in a Dataframe

Adding values for missing data combinations in Pandas

Build a MultiIndex of every combination with MultiIndex.from_product(), then chain set_index(), reindex() and reset_index() so that the missing combinations appear as new rows.

import io

import pandas as pd

all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]
# io.StringIO, not io.BytesIO: the sample data is a str, not bytes
df = pd.read_csv(io.StringIO("""person_id status year count
0 pass 1980 4
0 fail 1982 1
1 pass 1981 2"""), sep=r"\s+")
names = ["person_id", "status", "year"]

# every (person_id, status, year) combination: 3 * 2 * 3 = 18 rows
mind = pd.MultiIndex.from_product(
    [all_person_ids, all_statuses, all_years], names=names)
# combinations absent from df get count = 0
df.set_index(names).reindex(mind, fill_value=0).reset_index()

Fill missing combinations with ones in a groupby object

Use pivot_table to spread the groups into columns, fill the gaps with 1, then stack back to long format:

out = (df.pivot_table(index='date', columns='group', values='ret', aggfunc='mean')
         .fillna(1)
         .stack()
         .reset_index(name='value'))

date group value
0 1986-01-31 1 1.1
1 1986-01-31 2 1.5
2 1986-01-31 3 1.1
3 1986-02-28 1 1.0
4 1986-02-28 2 1.2
5 1986-02-28 3 1.0
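
The snippet above does not show its input, so here is a minimal self-contained sketch of the same pivot_table / fillna / stack chain. The sample frame is hypothetical (it was not part of the original answer) and is only meant to illustrate groups 1 and 3 missing on the second date.

```python
import pandas as pd

# Hypothetical sample: groups 1 and 3 have no return on 1986-02-28
df = pd.DataFrame({
    "date": ["1986-01-31", "1986-01-31", "1986-01-31", "1986-02-28"],
    "group": [1, 2, 3, 2],
    "ret": [1.1, 1.5, 1.1, 1.2],
})

# wide layout: one row per date, one column per group, NaN where no data
wide = df.pivot_table(index="date", columns="group", values="ret", aggfunc="mean")

# fill the gaps with 1, then go back to one row per (date, group)
out = wide.fillna(1).stack().reset_index(name="value")
print(out)
```

Note that this only works when every group shows up on at least one date; a group that never appears has no column in the pivot and so cannot be filled.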

Complete dataframe with missing combinations of values

You can use the tidyr::complete function:

complete(df, distance, years = full_seq(years, period = 1), fill = list(area = 0))

# A tibble: 14 x 3
distance years area
<fct> <dbl> <dbl>
1 100 1. 40.
2 100 2. 0.
3 100 3. 0.
4 100 4. 0.
5 100 5. 50.
6 100 6. 60.
7 100 7. 0.
8 NPR 1. 0.
9 NPR 2. 0.
10 NPR 3. 10.
11 NPR 4. 20.
12 NPR 5. 0.
13 NPR 6. 0.
14 NPR 7. 30.

or slightly shorter:

complete(df, distance, years = 1:7, fill = list(area = 0))
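
For readers working in pandas rather than R, a rough analogue of the complete() call above might look like the following. This is a sketch, not a drop-in replacement: the sample data is reconstructed from the non-zero rows of the tibble, and the range() call only stands in for full_seq(years, period = 1) on integer years.

```python
import pandas as pd

# Hypothetical data, rebuilt from the non-zero rows of the output above
df = pd.DataFrame({
    "distance": ["100", "100", "100", "NPR", "NPR", "NPR"],
    "years": [1, 5, 6, 3, 4, 7],
    "area": [40, 50, 60, 10, 20, 30],
})

# rough stand-in for full_seq(years, period = 1): every year from min to max
all_years = range(df["years"].min(), df["years"].max() + 1)

# every (distance, years) pair, whether observed or not
full = pd.MultiIndex.from_product(
    [df["distance"].unique(), all_years], names=["distance", "years"])

completed = (df.set_index(["distance", "years"])
               .reindex(full, fill_value=0)   # like fill = list(area = 0)
               .reset_index())
print(completed)
```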

Pandas: Create missing combination rows with zero values

Another way is to unstack with fill_value=0, then stack and reset_index:

df.set_index(['col1','col2']).unstack(fill_value=0).stack().reset_index()

Out[311]:
col1 col2 value
0 1 A 2
1 1 B 4
2 1 C 0
3 2 A 6
4 2 B 8
5 2 C 10
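
A self-contained version of the unstack/stack round trip is sketched below. The input frame is an assumption, inferred from the non-zero rows of the output above.

```python
import pandas as pd

# Hypothetical input: the (1, 'C') combination is missing
df = pd.DataFrame({
    "col1": [1, 1, 2, 2, 2],
    "col2": ["A", "B", "A", "B", "C"],
    "value": [2, 4, 6, 8, 10],
})

out = (df.set_index(["col1", "col2"])
         .unstack(fill_value=0)   # col2 becomes columns; gaps become 0
         .stack()                 # back to one row per (col1, col2)
         .reset_index())
print(out)
```

Two caveats: unstack raises an error if a (col1, col2) pair occurs more than once, and it can only create combinations of values that already appear somewhere in the data.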

Fill missing combinations in a dataframe

Using complete from tidyr:

library(tidyr)
as.data.frame(complete(df,REGION,CATEGORY,fill=list(VALUE1=0,VALUE2=0)))

Output:

    REGION CATEGORY VALUE1 VALUE2
1 REGION A        A      2      1
2 REGION A        B      3      2
3 REGION B        A      0      0
4 REGION B        B      4      3

If there are many variables, you could also just do as.data.frame(complete(df,REGION,CATEGORY)) and replace the NA's afterwards.

Hope this helps!

Fill a list/pandas.dataframe with all the missing data combinations (like complete() in R)

You could use a reindex.

First you'll need a list of the valid (type, food) pairs. I'll get it from the data itself, rather than writing them out.

In [88]: kinds = list(df[['Type', 'Food']].drop_duplicates().itertuples(index=False))

In [89]: kinds
Out[89]:
[('Fruit', 'Banana'),
('Fruit', 'Apple'),
('Vegetable', 'Broccoli'),
('Vegetable', 'Lettuce'),
('Vegetable', 'Peppers'),
('Vegetable', 'Corn'),
('Seasoning', 'Olive Oil'),
('Seasoning', 'Vinegar')]

Now we'll generate all the pairs for those kinds with the houses using itertools.product.

In [93]: from itertools import product

In [94]: houses = ['House-%s' % x for x in range(1, 8)]

In [95]: idx = [(x.Type, x.Food, house) for x, house in product(kinds, houses)]

In [96]: idx[:2]
Out[96]: [('Fruit', 'Banana', 'House-1'), ('Fruit', 'Banana', 'House-2')]

And now you can use set_index and reindex to get the missing observations.

In [98]: df.set_index(['Type', 'Food', 'Loc']).reindex(idx, fill_value=0)
Out[98]:
                           Num
Type      Food    Loc
Fruit     Banana  House-1   15
                  House-2    4
                  House-3    0
                  House-4    0
                  House-5    0
...                        ...
Seasoning Vinegar House-3    0
                  House-4    0
                  House-5    0
                  House-6    0
                  House-7    2

[56 rows x 1 columns]
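
The three steps above can be put together in one runnable sketch. The small frame and the three houses are hypothetical stand-ins for the original data, and the index is built with pd.MultiIndex.from_tuples so that the level names survive the reindex.

```python
from itertools import product

import pandas as pd

# Hypothetical data: only three (Type, Food, Loc) observations exist
df = pd.DataFrame({
    "Type": ["Fruit", "Fruit", "Vegetable"],
    "Food": ["Banana", "Apple", "Corn"],
    "Loc": ["House-1", "House-2", "House-1"],
    "Num": [15, 4, 7],
})
keys = ["Type", "Food", "Loc"]

# valid (Type, Food) pairs taken from the data itself, as plain tuples
kinds = list(df[["Type", "Food"]].drop_duplicates()
             .itertuples(index=False, name=None))
houses = ["House-%s" % x for x in range(1, 4)]

# cross the valid pairs with every house
idx = pd.MultiIndex.from_tuples(
    [(t, f, h) for (t, f), h in product(kinds, houses)], names=keys)

full = df.set_index(keys).reindex(idx, fill_value=0)
print(full)
```

Unlike MultiIndex.from_product over all three columns, this never produces nonsensical pairs such as ('Fruit', 'Corn').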

How to complete data frame missing combinations while accounting for the missing ones

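Here is a tidyverse solution.
First create a copy of num (num_new), then use complete() together with nesting(). The copy is expanded to every combination, while the original num stays NA wherever the combination was missing: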

library(dplyr)
library(tidyr)

df %>%
mutate(num_new = num) %>%
complete(lttrs, nesting(num_new)) %>%
data.frame()
 lttrs num_new num
1 a 1 1
2 a 2 2
3 a 3 NA
4 a 4 4
5 a 5 5
6 a 6 NA
7 a 7 7
8 a 8 NA
9 a 9 NA
10 a 10 NA
11 b 1 1
12 b 2 2
13 b 3 3
14 b 4 NA
15 b 5 NA
16 b 6 NA
17 b 7 7
18 b 8 NA
19 b 9 9
20 b 10 NA
21 c 1 NA
22 c 2 NA
23 c 3 3
24 c 4 NA
25 c 5 5
26 c 6 6
27 c 7 7
28 c 8 NA
29 c 9 NA
30 c 10 10
31 d 1 NA
32 d 2 2
33 d 3 NA
34 d 4 4
35 d 5 5
36 d 6 NA
37 d 7 NA
38 d 8 8
39 d 9 9
40 d 10 NA
41 e 1 1
42 e 2 2
43 e 3 3
44 e 4 NA
45 e 5 NA
46 e 6 NA
47 e 7 NA
48 e 8 8
49 e 9 9
50 e 10 NA

Pandas: fill missing value based on combination in dataframe


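import numpy as np

df = df.replace('missing', np.nan).sort_values(['postal code', 'district'])
# fill only the district column so the grouping column is kept in the result
df['district'] = df.groupby('postal code')['district'].ffill()
df.sort_index()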

postal code district
0 10001 North
1 10002 West
2 10001 North

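The sort matters: sort_values places np.nan last, so within each postal code the known district comes before the missing one and is ready to be forward filled. sort_index then restores the original row order.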
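
As a quick check of the sort-then-ffill idea, here is a small runnable sketch. The three-row frame is hypothetical, modelled on the output shown above, with 'missing' as the placeholder string.

```python
import numpy as np
import pandas as pd

# Hypothetical input: row 2 has a known postal code but no district
df = pd.DataFrame({
    "postal code": [10001, 10002, 10001],
    "district": ["North", "West", "missing"],
})

# NaN sorts last, so each postal code's known district comes first
filled = df.replace("missing", np.nan).sort_values(["postal code", "district"])

# carry the known district forward within each postal code
filled["district"] = filled.groupby("postal code")["district"].ffill()

# back to the original row order
filled = filled.sort_index()
print(filled)
```

If a postal code has no known district at all, its rows stay NaN, since there is nothing to fill from.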

Filling Missing Dates for a combination of columns

You can use:

from itertools import product

import pandas as pd

# get all unique combinations of columns
COLS_COMBO = df_1[['COL1', 'COL2']].drop_duplicates().values.tolist()
# remove times and create a month-start (MS) date range
dates = df_1['Date'].dt.floor('d')
months_range = pd.date_range(dates.min(), dates.max(), freq='MS')
print(COLS_COMBO)
print(months_range)

# create all combinations of values
df = pd.DataFrame([(c, a, b) for (a, b), c in product(COLS_COMBO, months_range)],
                  columns=['Date', 'COL1', 'COL2'])
print(df)
Date COL1 COL2
0 2018-01-01 A 1
1 2018-02-01 A 1
2 2018-03-01 A 1
3 2018-04-01 A 1
4 2018-05-01 A 1
5 2018-01-01 A 2
6 2018-02-01 A 2
7 2018-03-01 A 2
8 2018-04-01 A 2
9 2018-05-01 A 2
10 2018-01-01 B 1
11 2018-02-01 B 1
12 2018-03-01 B 1
13 2018-04-01 B 1
14 2018-05-01 B 1
15 2018-01-01 B 2
16 2018-02-01 B 2
17 2018-03-01 B 2
18 2018-04-01 B 2
19 2018-05-01 B 2

#add to original df_1 and remove duplicates
df_1 = pd.concat([df_1, df], ignore_index=True).drop_duplicates()
print (df_1)
Date COL1 COL2
0 2018-01-01 A 1
1 2018-02-01 A 2
2 2018-03-01 B 1
3 2018-05-01 B 2
4 2018-05-01 A 1
6 2018-02-01 A 1
7 2018-03-01 A 1
8 2018-04-01 A 1
10 2018-01-01 A 2
12 2018-03-01 A 2
13 2018-04-01 A 2
14 2018-05-01 A 2
15 2018-01-01 B 1
16 2018-02-01 B 1
18 2018-04-01 B 1
19 2018-05-01 B 1
20 2018-01-01 B 2
21 2018-02-01 B 2
22 2018-03-01 B 2
23 2018-04-01 B 2
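
When df_1 carries extra value columns, concat plus drop_duplicates no longer works, because the generated rows never match the originals exactly. One possible alternative, sketched here under assumptions, is to build the grid with a cross merge and left-join the original data onto it. The VAL column and the sample rows are hypothetical, and merge(how="cross") needs pandas 1.2 or later.

```python
import pandas as pd

# Hypothetical data; VAL is an invented value column for illustration
df_1 = pd.DataFrame({
    "Date": pd.to_datetime(["2018-01-01", "2018-02-01", "2018-03-01",
                            "2018-05-01", "2018-05-01"]),
    "COL1": ["A", "A", "B", "B", "A"],
    "COL2": [1, 2, 1, 2, 1],
    "VAL": [10, 20, 30, 40, 50],
})

# unique (COL1, COL2) pairs and every month start in the observed range
combos = df_1[["COL1", "COL2"]].drop_duplicates()
months = pd.DataFrame({
    "Date": pd.date_range(df_1["Date"].min(), df_1["Date"].max(), freq="MS")
})

# cross join: every pair combined with every month
grid = combos.merge(months, how="cross")

# attach the observed values; unobserved combinations get NaN in VAL
out = grid.merge(df_1, on=["COL1", "COL2", "Date"], how="left")
print(out)
```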

