Merge Data Frames and Overwrite Values

Merge data frames and overwrite values

merdat <- merge(dfrm1,dfrm2, by="Date")  # seems self-documenting

# explanation for next line in text below.
merdat$Col2.y[ is.na(merdat$Col2.y) ] <- merdat$Col2.x[ is.na(merdat$Col2.y) ]

Then just rename 'merdat$Col2.y' to 'merdat$Col2' and drop 'merdat$Col2.x'.

In reply to the request for more comments: one way to update only part of a vector is to construct a logical vector for indexing and apply it with "[" to both sides of an assignment. Another way is to use the logical vector only on the LHS of the assignment and build the RHS with rep(), giving it length sum(logical.vector). The goal in both instances is for the assigned values to have the same length (and order) as the items being replaced.
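For readers working in pandas rather than R, the mask-on-both-sides idea can be sketched roughly as follows. The two frames, their values, and the `_x`/`_y` suffixes are made up for illustration; only the column names `Date` and `Col2` come from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for dfrm1 and dfrm2.
dfrm1 = pd.DataFrame({"Date": ["2020-01-01", "2020-01-02", "2020-01-03"],
                      "Col2": [1.0, 2.0, 3.0]})
dfrm2 = pd.DataFrame({"Date": ["2020-01-01", "2020-01-02", "2020-01-03"],
                      "Col2": [10.0, np.nan, 30.0]})

# Inner join on Date; the clashing Col2 columns get the suffixes _x and _y.
merdat = dfrm1.merge(dfrm2, on="Date", suffixes=("_x", "_y"))

# The same boolean mask selects the target cells and the source cells,
# so both sides of the assignment have the same length and order.
missing = merdat["Col2_y"].isna()
merdat.loc[missing, "Col2_y"] = merdat.loc[missing, "Col2_x"]

# Keep the patched column under the original name.
merdat = merdat.drop(columns="Col2_x").rename(columns={"Col2_y": "Col2"})
print(merdat)
```

Only the row where the second frame had a missing value falls back to the first frame; every other row keeps the second frame's value.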

Combine dataframes and overwrite values in table 1 with all values in table 2

You can remove the rows of x whose SN values match SN values in y, then row-bind the two data frames.

rbind(x[!x$SN %in% y$SN,], y) 

  SN Age Name
1  1  21 John
2  2  15 Dora
3  3  44 <NA>
4  4 100    B

Merge with overwrite values from left dataframe in pandas

You will have to replace the rows to override the values in place. This is different from drop duplicates as it will change the ordering of the rows.

combine_dfs takes "pkey" as an argument: the key column on which the merge should happen.

import pandas as pd

def update_df_row(row=None, col_name="", df=None, pkey=""):
    # Return the row of df whose pkey equals col_name;
    # fall back to the original row when there is no match.
    try:
        match_index = df.loc[df[pkey] == col_name].index[0]
        return df.loc[match_index]
    except IndexError:
        return row

def combine_dfs(parent_df, child_df, pkey):
    # Child rows whose key also exists in the parent.
    filtered_child_df = child_df[child_df[pkey].isin(parent_df[pkey])]

    # Overwrite matching parent rows in place, keeping the parent's row order.
    parent_df[parent_df[pkey].isin(child_df[pkey])] = parent_df[
        parent_df[pkey].isin(child_df[pkey])].apply(
        lambda row: update_df_row(row, row[pkey], filtered_child_df, pkey), axis=1)

    # Append child rows with new keys; duplicates resolve in favor of the updated parent.
    parent_df = pd.concat([parent_df, child_df]).drop_duplicates([pkey])

    return parent_df.reset_index(drop=True)

The output of the above code snippet will be:

    A  B
0  AD  1
1  CD  4
2  EF  3
3  GH  5
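The row-by-row apply can get slow on large frames. A label-based variant of the same idea might look like the sketch below. The function name `combine_dfs_by_key` and the input frames are hypothetical, and the sketch assumes that keys are unique within each frame and that both frames share the same columns.

```python
import pandas as pd

def combine_dfs_by_key(parent_df, child_df, pkey):
    """Overwrite matching parent rows with child rows, then append new child rows."""
    parent = parent_df.set_index(pkey)
    child = child_df.set_index(pkey)

    # Overwrite rows whose key appears in both frames; parent row order is kept.
    overlap = parent.index.intersection(child.index)
    parent.loc[overlap] = child.loc[overlap]

    # Append child rows whose key does not exist in the parent.
    new_rows = child.loc[~child.index.isin(parent.index)]
    return pd.concat([parent, new_rows]).reset_index()

# Hypothetical inputs, chosen for illustration only.
parent_df = pd.DataFrame({"A": ["AD", "CD", "EF"], "B": [1, 2, 3]})
child_df = pd.DataFrame({"A": ["CD", "GH"], "B": [4, 5]})

result = combine_dfs_by_key(parent_df, child_df, "A")
print(result)
```

Because `set_index` returns new frames, the inputs are left unmodified, which differs from the in-place update in the answer above.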

Merge R data frame or data table and overwrite values of multiple columns

You can do this by using dplyr::coalesce, which will return the first non-missing value from vectors.

(EDIT: you can use dplyr::coalesce directly on the data frames also, no need to create the function below. Left it there just for completeness, as a record of the original answer.)

Credit where it's due: this code is mostly from this blog post. It builds a function that takes two data frames and does what you need (taking values from the x data frame when they are present).

coalesce_join <- function(x,
                          y,
                          by,
                          suffix = c(".x", ".y"),
                          join = dplyr::full_join, ...) {
  joined <- join(x, y, by = by, suffix = suffix, ...)
  # names of desired output
  cols <- union(names(x), names(y))

  to_coalesce <- names(joined)[!names(joined) %in% cols]
  suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
  # remove suffixes and deduplicate
  to_coalesce <- unique(substr(
    to_coalesce,
    1,
    nchar(to_coalesce) - nchar(suffix_used)
  ))

  coalesced <- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
    joined[[paste0(.x, suffix[1])]],
    joined[[paste0(.x, suffix[2])]]
  ))
  names(coalesced) <- to_coalesce

  dplyr::bind_cols(joined, coalesced)[cols]
}

Pandas merging dataframes, overwriting values on key

Simply use pd.concat to combine the rows of df whose index is not in df2's index with all of df2. Let

df  = df.set_index('id')
df2 = df2.set_index('id')

Then

>>> merged = pd.concat([df[~df.index.isin(df2.index)], df2]).reset_index()

       id  value
0  100010     25
1  100011     22
2  100012     35
3  100013     36
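A self-contained version of this approach could look like the following sketch. The input values are invented (the original inputs are not shown here); only the column names `id` and `value` follow the answer.

```python
import pandas as pd

# Hypothetical frames: df2 overlaps df on ids 100011 and 100012.
df = pd.DataFrame({"id": [100010, 100011, 100012], "value": [25, 20, 30]})
df2 = pd.DataFrame({"id": [100011, 100012, 100013], "value": [22, 35, 36]})

left = df.set_index("id")
right = df2.set_index("id")

# Keep rows of left whose id is absent from right, then append all of right.
merged = pd.concat([left[~left.index.isin(right.index)], right]).reset_index()
print(merged)
```

The surviving rows of df come first, followed by every row of df2, so a `sort_values("id")` may be needed if the original key order matters.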

join or merge with overwrite in pandas

How about: df2.combine_first(df1)?

In [33]: df2
Out[33]:
A B C D
2000-01-03 0.638998 1.277361 0.193649 0.345063
2000-01-04 -0.816756 -1.711666 -1.155077 -0.678726
2000-01-05 0.435507 -0.025162 -1.112890 0.324111
2000-01-06 -0.210756 -1.027164 0.036664 0.884715
2000-01-07 -0.821631 -0.700394 -0.706505 1.193341
2000-01-10 1.015447 -0.909930 0.027548 0.258471
2000-01-11 -0.497239 -0.979071 -0.461560 0.447598

In [34]: df1
Out[34]:
A B C
2000-01-03 2.288863 0.188175 -0.040928
2000-01-04 0.159107 -0.666861 -0.551628
2000-01-05 -0.356838 -0.231036 -1.211446
2000-01-06 -0.866475 1.113018 -0.001483
2000-01-07 0.303269 0.021034 0.471715
2000-01-10 1.149815 0.686696 -1.230991
2000-01-11 -1.296118 -0.172950 -0.603887
2000-01-12 -1.034574 -0.523238 0.626968
2000-01-13 -0.193280 1.857499 -0.046383
2000-01-14 -1.043492 -0.820525 0.868685

In [35]: df2.comb
df2.combine df2.combineAdd df2.combine_first df2.combineMult

In [35]: df2.combine_first(df1)
Out[35]:
A B C D
2000-01-03 0.638998 1.277361 0.193649 0.345063
2000-01-04 -0.816756 -1.711666 -1.155077 -0.678726
2000-01-05 0.435507 -0.025162 -1.112890 0.324111
2000-01-06 -0.210756 -1.027164 0.036664 0.884715
2000-01-07 -0.821631 -0.700394 -0.706505 1.193341
2000-01-10 1.015447 -0.909930 0.027548 0.258471
2000-01-11 -0.497239 -0.979071 -0.461560 0.447598
2000-01-12 -1.034574 -0.523238 0.626968 NaN
2000-01-13 -0.193280 1.857499 -0.046383 NaN
2000-01-14 -1.043492 -0.820525 0.868685 NaN

Note that it takes the values from df1 for indices that do not overlap with df2. If this doesn't do exactly what you want I would be willing to improve this function / add options to it.
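One detail worth checking before relying on this: combine_first only falls back to df1 where df2 has no value, so a NaN inside df2 is filled from df1 rather than kept. A minimal sketch with made-up frames:

```python
import numpy as np
import pandas as pd

# Hypothetical data: df2 has a missing value at index "b" and no row "c".
df1 = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"A": [10.0, np.nan]}, index=["a", "b"])

combined = df2.combine_first(df1)
print(combined)
# "a" keeps df2's value, "b" is filled from df1, "c" comes from df1.
```

If missing values in df2 are supposed to overwrite df1 as well, this is not the right tool and a key-based concat is probably a better fit.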

Pandas merge with overwrite and union together

concat then drop_duplicates:

pd.concat((new,old)).drop_duplicates('Key1')

Or with isin check which ids are not present in the old df and then concat:

pd.concat((new,old[~old['Key1'].isin(new['Key1'])]))


  Key1       Key2
0   L1     Value1
1   L2     Value2
2   L3     Value3
3   R1  ValueOld1
4   R2  ValueOld2
5   R3  ValueOld3
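Both variants can be compared side by side in a small sketch. The frames below are invented for illustration; only the names `new`, `old`, `Key1`, and `Key2` follow the answer. Note that drop_duplicates keeps the first occurrence by default, which is why `new` has to come first in the concat.

```python
import pandas as pd

# Hypothetical frames: key "L1" exists in both, so the new value should win.
new = pd.DataFrame({"Key1": ["L1", "L2", "L3"],
                    "Key2": ["Value1", "Value2", "Value3"]})
old = pd.DataFrame({"Key1": ["L1", "R1", "R2"],
                    "Key2": ["ValueOld0", "ValueOld1", "ValueOld2"]})

# Variant 1: concat with new first, then drop later duplicates of the key.
by_dedup = pd.concat((new, old)).drop_duplicates("Key1").reset_index(drop=True)

# Variant 2: filter old down to keys missing from new, then concat.
by_filter = pd.concat((new, old[~old["Key1"].isin(new["Key1"])]),
                      ignore_index=True)

print(by_dedup)
print(by_dedup.equals(by_filter))
```

The two variants agree when the keys within each frame are unique. If `new` itself contains repeated keys, drop_duplicates also collapses those, while the isin filter leaves them alone.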

PySpark - merge two DataFrames, overwriting one with the other

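You can anti join df against df2 and then union the result with df2. That way you have everything from df2 and only the rows from df whose id is not in df2. Note that the union matches columns by position, so both DataFrames need the same column order.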

df3 = df.join(df2, on=['id'], how='anti').unionAll(df2)
print(df3.head(20))

[Row(id=0, value=0, calc=0), Row(id=7, value=7, calc=14), Row(id=6, value=6, calc=12), Row(id=5, value=5, calc=10), Row(id=1, value=1, calc=2), Row(id=3, value=3, calc=6), Row(id=8, value=8, calc=16), Row(id=2, value=2, calc=4), Row(id=4, value=4, calc=8), Row(id=9, value=9, calc=81), Row(id=10, value=10, calc=100)]

