Merge Data.Frames Based on Year and Fill in Missing Values

merge data.frames based on year and fill in missing values

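You can use match() to find, for each date in sample, the row of a whose year (column a$a) matches, and then use those positions to pull the values from a: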

matched <- match(format(sample$Date, "%Y"), a$a)
sample$y <- a$y[matched]
sample$Z <- a$Z[matched]
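
For readers working in pandas, the same lookup can be sketched with Series.map. The data below is hypothetical (the original question's frames are not shown), and the column names simply mirror the R snippet.

```python
import pandas as pd

# Hypothetical data: `sample` has one row per date, `a` has one row per year.
sample = pd.DataFrame(
    {"Date": pd.to_datetime(["2010-03-01", "2011-07-15", "2013-01-20"])}
)
a = pd.DataFrame({"a": [2010, 2011, 2012], "y": [1.5, 2.5, 3.5], "Z": ["p", "q", "r"]})

# Index `a` by year so that each column can serve as a lookup table.
lookup = a.set_index("a")
years = sample["Date"].dt.year

# Years that do not occur in `a` map to NaN, much like an NA from match() in R.
sample["y"] = years.map(lookup["y"])
sample["Z"] = years.map(lookup["Z"])
```

This assumes the years in a are unique; with duplicated years a merge would be the safer choice.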

Merging data frame and filling missing values

Put the data frames in a list and fold merge over them with Reduce; all = TRUE keeps the rows that have no match. The missing values in the resulting data frame can then be replaced with -1.

new_df <- Reduce(function(x, y) merge(x, y, all = TRUE), list(df1, df2, df3))
new_df[is.na(new_df)] <- -1

new_df
# Letter Values1 Values2 Values3
#1 A 1 0 -1
#2 B 2 -1 -1
#3 C 3 5 -1
#4 D -1 9 5

A tidyverse version of the same logic:

library(dplyr)
library(purrr)
library(tidyr)  # replace_na() comes from tidyr

list(df1, df2, df3) %>%
  reduce(full_join) %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, -1)))
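
The same fold can be sketched in pandas with functools.reduce. The three input frames below are assumptions, reconstructed so that they reproduce the output shown above.

```python
from functools import reduce

import pandas as pd

# Assumed inputs, reconstructed from the printed result.
df1 = pd.DataFrame({"Letter": ["A", "B", "C"], "Values1": [1, 2, 3]})
df2 = pd.DataFrame({"Letter": ["A", "C", "D"], "Values2": [0, 5, 9]})
df3 = pd.DataFrame({"Letter": ["D"], "Values3": [5]})

# Fold an outer merge over the list, then replace the gaps with -1.
new_df = reduce(
    lambda left, right: pd.merge(left, right, on="Letter", how="outer"),
    [df1, df2, df3],
)
new_df = new_df.sort_values("Letter").reset_index(drop=True).fillna(-1)
```

Note that the value columns become floats, because NaN forces a float dtype before the fill.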

Merging two dataframes pandas on Id and year where year is missing values

merge already returns the correct rows; we only need to order the columns with sort_index, order the rows with sort_values, and fill the remaining NaNs:

s = pd.merge(df1, df2, on=['Id', 'year'], how='outer').\
    sort_index(level=0, axis=1).sort_values(['Id', 'year']).fillna(0)
s
Out[81]:
A B C D year Id
3 100.0 0.0 1.0 0.0 2009 1
0 75.0 15.0 7.0 33.0 2010 1
1 0.0 24.0 0.0 72.0 2011 1
2 60.0 30.0 3.0 16.0 2012 1
4 42.0 0.0 4.0 0.0 2013 1
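
A self-contained sketch of the same outer merge follows. The two input frames are assumptions reconstructed from the output above, and the indicator=True argument is an addition that records where each row came from; it is not part of the original answer.

```python
import pandas as pd

# Assumed inputs: one frame carries B and D, the other A and C.
df1 = pd.DataFrame({"Id": 1, "year": [2010, 2011, 2012],
                    "B": [15, 24, 30], "D": [33, 72, 16]})
df2 = pd.DataFrame({"Id": 1, "year": [2009, 2010, 2012, 2013],
                    "A": [100, 75, 60, 42], "C": [1, 7, 3, 4]})

merged = (
    pd.merge(df1, df2, on=["Id", "year"], how="outer", indicator=True)
    .sort_values(["Id", "year"])
    .reset_index(drop=True)
)

# Fill only the value columns; the categorical `_merge` column is left alone.
value_cols = ["A", "B", "C", "D"]
merged[value_cols] = merged[value_cols].fillna(0)
```

The `_merge` column then shows which rows existed in only one input, which helps when a filled 0 must be told apart from a real 0.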

How to merge two pandas dataframes on index but fill missing values

Try concatenating along the columns (axis=1), which aligns the rows of both frames on their index, and then fill the NaNs with 0:

pd.concat([df,df1], axis=1).fillna(0)



x y
0 1 0.0
1 1 1.0
2 1 0.0
3 1 1.0
4 1 0.0
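
A runnable sketch of this, with inputs that are assumptions chosen to reproduce the output above:

```python
import pandas as pd

# Assumed inputs: df covers index 0..4, df1 only has rows 1 and 3.
df = pd.DataFrame({"x": [1, 1, 1, 1, 1]})
df1 = pd.DataFrame({"y": [1.0, 1.0]}, index=[1, 3])

# axis=1 places the frames side by side and aligns them on the index;
# rows missing from df1 get NaN in y, which fillna then turns into 0.
out = pd.concat([df, df1], axis=1).fillna(0)
```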

Merge and fill missing values based on multiple columns from another dataframe in Python

As mentioned in the question you can also use update depending on your data and needs:

df1 = df1.set_index(['year', 'city'])
df1.update(
    df2
    .set_index(['year', 'city'])
    .rename(columns={'gdp': 'gdp_value', 'rate': 'growth_rate'})
)
df1 = df1.reset_index()

One way is to use combine_first with set_index and column renaming:

(df1.set_index(['year', 'city'])
    .combine_first(df2.set_index(['year', 'city'])
                      .rename(columns={'gdp': 'gdp_value', 'rate': 'growth_rate'}))
    .reset_index())

Output:

   year city  gdp_value  growth_rate
0  2015   bj        7.0         0.01
1  2015   sh        6.0         0.04
2  2016   bj        3.0         0.03
3  2016   sh        5.0         0.07
4  2017   bj        2.0        -0.03
5  2017   sh        3.0        -0.03
6  2018   bj        5.0         0.05
7  2018   sh        6.0         0.05
8  2019   bj        4.0         0.02
9  2019   sh        4.0         0.02
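
The two approaches are not interchangeable, and a small sketch may make the difference visible. The data below is hypothetical: df1 has one gap and one value that disagrees with df2, and df2 has one row that df1 lacks.

```python
import pandas as pd

# Hypothetical data for illustration only.
df1 = pd.DataFrame({
    "year": [2015, 2015, 2016],
    "city": ["bj", "sh", "bj"],
    "gdp_value": [7.0, None, 3.0],
})
df2 = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016],
    "city": ["bj", "sh", "bj", "sh"],
    "gdp": [9.0, 6.0, 3.0, 5.0],
})

keys = ["year", "city"]
left = df1.set_index(keys)
right = df2.set_index(keys).rename(columns={"gdp": "gdp_value"})

# combine_first: left values win, NaN is filled from right,
# and rows that exist only in right are added.
filled = left.combine_first(right).sort_index().reset_index()

# update: works in place, non-NaN values from right overwrite left,
# and no rows are added.
updated = left.copy()
updated.update(right)
updated = updated.reset_index()
```

Here filled keeps 7.0 for (2015, bj) and gains the (2016, sh) row, while updated replaces 7.0 with 9.0 and stays at three rows. Which behavior is wanted depends on which frame is considered authoritative.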

Pandas: How to merge two data frames and fill NaN values using values from the second data frame

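Use a left join with the suffixes parameter, so that df1's Expected column is renamed Expected_ while df2's keeps its name. Then DataFrame.pop removes Expected_ and returns it, and Series.fillna fills its missing values from df2's Expected: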

df = df1.merge(df2, on=['No','pl.'], how='left', suffixes=('_',''))
df['Expected'] = df.pop('Expected_').fillna(df['Expected'])
print(df)
No car pl. Value Expected
0 1 Toyota HK 0.1 0.12
1 1 Toyota NY 0.2 NaN
2 2 Saab LOS 0.3 0.35
3 2 Saab UK 0.4 0.60
4 2 Saab HK 0.5 0.51
5 3 Audi NYU 0.6 0.62
6 3 Audi LOS 0.7 0.76
7 4 VW UK 0.8 NaN
8 5 Audi HK 0.9 0.91
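
A minimal, self-contained sketch of the suffix-and-pop pattern, using hypothetical data (the original inputs are not shown):

```python
import pandas as pd

# Hypothetical data: df1 has a partly filled Expected column,
# df2 supplies fallback values for some keys.
df1 = pd.DataFrame({"No": [1, 1, 2], "pl.": ["HK", "NY", "UK"],
                    "Expected": [0.12, None, None]})
df2 = pd.DataFrame({"No": [1, 2], "pl.": ["HK", "UK"],
                    "Expected": [0.99, 0.60]})

df = df1.merge(df2, on=["No", "pl."], how="left", suffixes=("_", ""))

# After the merge, "Expected_" holds df1's values and "Expected" holds df2's.
# pop() drops the helper column; fillna() falls back to df2 where df1 is NaN.
df["Expected"] = df.pop("Expected_").fillna(df["Expected"])
```

Row 0 keeps df1's 0.12 even though df2 offers 0.99, row 1 stays NaN because df2 has no matching key, and row 2 is filled with 0.60.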

Merge two data frames to fill in missing dates

One way, with dplyr:

library(dplyr)
df3 <- df1 %>% filter(year < 1920) %>%
  left_join(filter(df2, year == 1910) %>% select(-year))
df3 <- df1 %>% filter(year >= 1920) %>%
  left_join(filter(df2, year == 1920) %>% select(-year)) %>%
  bind_rows(df3) %>%
  arrange(year, state)

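This is split into two chains: the first joins the pre-1920 rows of df1 to the 1910 values of df2; the second joins the rows from 1920 onward to the 1920 values, appends the first result with bind_rows, and sorts by year and state.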


Update based on comments:

To split the years into 5-year increments and join on df2 values in those increments:

df1$year_factor <- cut(df1$year, seq(1900, 1950, 5), right = FALSE)
df2$year_factor <- cut(df2$year, seq(1900, 1950, 5), right = FALSE)
df3 <- df1 %>% left_join(select(df2, -year)) %>% select(-year_factor)

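This is simpler, although it adds and then drops a helper column (year_factor), and cut can be finicky about its breaks and about which side of each interval is closed (right = FALSE makes the intervals left-closed, e.g. [1910, 1915)). It produces: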

   year      state acre_yield          w
1  1910   colorado       15.5 0.11777361
2  1910     kansas         19 0.33202730
3  1910 new mexico         15 0.01760644
4  1910   oklahoma         16 0.49216919
5  1910      texas         22 0.04042345
6  1911   colorado         14 0.11777361
7  1911     kansas       14.5 0.33202730
8  1911 new mexico       19.5 0.01760644
9  1911   oklahoma          7 0.49216919
10 1911      texas         11 0.04042345
11 1919      texas         23         NA
12 1920   colorado       18.5 0.30557449
13 1920     kansas       26.2 0.32107132
14 1920 new mexico         20 0.05836014
15 1920   oklahoma         26 0.26414535
16 1920      texas         20 0.05084870
17 1921   colorado         12 0.30557449
18 1921     kansas       22.8 0.32107132
19 1921 new mexico       19.5 0.05836014
20 1921   oklahoma         23 0.26414535
21 1921      texas         18 0.05084870

Note the one NA value for the 1919 row; since df2 doesn't have any values between 1915 and 1919, there's nothing to insert. To go by decades, change the 5 in seq to 10, or otherwise set as you prefer.
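
The binning approach can also be sketched in pandas with pd.cut. The data is a hypothetical single-state subset with rounded weights, and year_bin is an illustrative helper name standing in for year_factor.

```python
import pandas as pd

# Hypothetical subset: yearly yields for one state, weights known for 1910 and 1920.
df1 = pd.DataFrame({"year": [1910, 1911, 1919, 1920, 1921], "state": "texas",
                    "acre_yield": [22.0, 11.0, 23.0, 20.0, 18.0]})
df2 = pd.DataFrame({"year": [1910, 1920], "state": "texas", "w": [0.04, 0.05]})

# Left-closed 5-year bins: [1900, 1905), [1905, 1910), ..., [1945, 1950).
edges = list(range(1900, 1951, 5))

def year_bin(years):
    # Label every bin by its left edge and return plain integers.
    return pd.cut(years, bins=edges, right=False, labels=edges[:-1]).astype(int)

df1["year_bin"] = year_bin(df1["year"])
df2["year_bin"] = year_bin(df2["year"])

# Join on state and bin, then drop the helper column again.
df3 = (
    df1.merge(df2.drop(columns="year"), on=["state", "year_bin"], how="left")
    .drop(columns="year_bin")
)
```

As in the R result, the 1919 row falls into the [1915, 1920) bin, for which df2 has no entry, so its w stays missing.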

merge two uneven dataframes by ID and fill in missing values

We can use {powerjoin} :

library(powerjoin)

power_full_join(df1, df2, by = "ID", conflict = coalesce_xy)
#> ID x z y
#> 1 a 5 NA NA
#> 2 b 6 NA 6
#> 3 c 7 NA 5
#> 4 d 8 4 7
#> 5 e 9 3 8
#> 6 f NA NA 9
#> 7 g NA 2 10
#> 8 h NA 1 11

Created on 2022-04-14 by the reprex package (v2.0.1)

How to merge two data frames with missing values?

You can try to impute the missing values in df1 from the non-missing values at the same positions in df2. Then a plain merge works, because it automatically joins on the shared columns "main", "main_cost", and "rating". Joining on "main" alone would be insufficient, because its values are not unique.

df1[3:4] <- lapply(names(df2)[3:4], \(z)
  mapply(\(x, y) el(na.omit(c(x, y))), df1[[z]], df2[[z]]))

(res <- merge(df1, df2))
# main main_cost rating combo have_it distance_mi
# 1 burger 7 fine burger_fries FALSE 56
# 2 burger 8 great burger_coke TRUE 20
# 3 pizza 11 great pizza_veg FALSE 40
# 4 pizza 13 bad pizza_bagels TRUE 14
# 5 pizza 3 fine pizza_rolls FALSE 12
# 6 salad 10 decent salad_dressing TRUE 78
# 7 salad 5 great salad_fruit FALSE 66
# 8 steak 4 okay steak_cheese TRUE 30
# 9 steak 7 awesome steak_mash FALSE 19

Note that this probably only works if the data frames have the same size and row order, and if the imputation succeeds so that the merging columns become identical. If NAs remain, say in the "rating" column, specify the merging columns explicitly, e.g. by = c("main", "main_cost"); you will then end up with separate "rating.x" and "rating.y" columns, though.
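
The same impute-then-merge idea can be sketched in pandas. The frames below are a shortened, hypothetical version of the data, and the sketch carries the same assumption as the R code: both frames list the same records in the same row order.

```python
import pandas as pd

# Hypothetical, shortened data with gaps in df1's key columns.
df1 = pd.DataFrame({"combo": ["burger_coke", "burger_fries", "steak_cheese"],
                    "main": ["burger", "burger", "steak"],
                    "main_cost": [8, 7, None],
                    "rating": ["great", "fine", None]})
df2 = pd.DataFrame({"have_it": [True, False, True],
                    "main": ["burger", "burger", "steak"],
                    "main_cost": [8, 7, 4],
                    "rating": ["great", "fine", "okay"],
                    "distance_mi": [20, 56, 30]})

keys = ["main", "main_cost", "rating"]

# fillna with a DataFrame aligns on row index and column name,
# so each gap is filled from the row at the same position in df2.
df1[keys] = df1[keys].fillna(df2[keys])

# With the keys completed, an ordinary inner merge lines the rows up.
res = df1.merge(df2, on=keys)
```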


Data:

df1 <- structure(list(combo = c("burger_coke", "burger_fries", "steak_cheese", 
"steak_mash", "salad_dressing", "salad_fruit", "pizza_rolls",
"pizza_bagels", "pizza_veg"), main = c("burger", "burger", "steak",
"steak", "salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L,
7L, NA, NA, NA, 5L, 3L, 13L, NA), rating = c("great", "fine",
"okay", "awesome", NA, "great", "fine", NA, "great")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))

df2 <- structure(list(have_it = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,
FALSE, TRUE, FALSE), main = c("burger", "burger", "steak", "steak",
"salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L,
7L, 4L, 7L, 10L, 5L, 3L, 13L, 11L), rating = c("great", "fine",
"okay", "awesome", "decent", "great", "fine", "bad", "great"),
distance_mi = c(20L, 56L, 30L, 19L, 78L, 66L, 12L, 14L, 40L
)), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9"))

Merge unequal dataframes and replace missing rows with 0

Take a look at the help page for merge. The all parameter lets you specify different types of merges. Here we want to set all = TRUE. This will make merge return NA for the values that don't match, which we can update to 0 with is.na():

zz <- merge(df1, df2, all = TRUE)
zz[is.na(zz)] <- 0

> zz
x y
1 a 0
2 b 1
3 c 0
4 d 0
5 e 0

Updated many years later to address a follow-up question

You need to identify the variable names in the second data table that you aren't merging on - I use setdiff() for this. Check out the following:

df1 = data.frame(x=c('a', 'b', 'c', 'd', 'e', NA))
df2 = data.frame(x=c('a', 'b', 'c'),y1 = c(0,1,0), y2 = c(0,1,0))

#merge as before
df3 <- merge(df1, df2, all = TRUE)
#columns in df2 not in df1
unique_df2_names <- setdiff(names(df2), names(df1))
df3[unique_df2_names][is.na(df3[, unique_df2_names])] <- 0

Created on 2019-01-03 by the reprex package (v0.2.1)
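
A pandas sketch of the same idea, using the data from the R example: only the columns that come from df2 are filled with 0, so the missing key in x stays missing.

```python
import pandas as pd

df1 = pd.DataFrame({"x": ["a", "b", "c", "d", "e", None]})
df2 = pd.DataFrame({"x": ["a", "b", "c"], "y1": [0, 1, 0], "y2": [0, 1, 0]})

# Outer merge keeps every row of both frames.
df3 = df1.merge(df2, on="x", how="outer")

# Columns that exist in df2 but not in df1 (the analogue of setdiff()).
unique_df2_names = list(df2.columns.difference(df1.columns))
df3[unique_df2_names] = df3[unique_df2_names].fillna(0)
```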