Merge data frames and overwrite values
merdat <- merge(dfrm1,dfrm2, by="Date") # seems self-documenting
# explanation for next line in text below.
merdat$Col2.y[ is.na(merdat$Col2.y) ] <- merdat$Col2.x[ is.na(merdat$Col2.y) ]
Then just rename 'merdat$Col2.y' to 'merdat$Col2' and drop 'merdat$Col2.x'.
In reply to a request for more comments: one way to update only sections of a vector is to construct a logical vector for indexing and apply it using "[" to both sides of an assignment. Another way is to devise a logical vector that appears only on the LHS of the assignment and then build the RHS with rep() so that it has the same length as sum(logical.vector). The goal in both instances is for the assigned values to have the same length (and order) as the items being replaced.
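For readers working in pandas rather than R, the two masked-assignment forms described above might look roughly like this. It is only a sketch: the frame merdat, its Col2_x/Col2_y columns, and the flags series are hypothetical stand-ins, not data from the original question.

```python
import pandas as pd

# Hypothetical merged frame; the _x/_y suffixes mimic the result of a merge on "Date".
merdat = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "Col2_x": [1.0, 2.0, 3.0],
    "Col2_y": [10.0, None, 30.0],
})

# Form 1: the same boolean mask on both sides, so lengths and order match.
missing = merdat["Col2_y"].isna()
merdat.loc[missing, "Col2_y"] = merdat.loc[missing, "Col2_x"]

# Form 2: mask on the left only; the right side supplies exactly mask.sum() values.
flags = pd.Series(["ok", None, None])
mask = flags.isna()
flags[mask] = ["unknown"] * int(mask.sum())

print(merdat)
print(flags)
```

In both forms the number of values on the right equals the number of True entries in the mask, which is the condition the paragraph above describes.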
Combine dataframes and overwrite values in table 1 with all values in table 2
You can remove the rows of x whose SN values match SN values in y, then row-bind the two data frames.
rbind(x[!x$SN %in% y$SN,], y)
  SN Age Name
1  1  21 John
2  2  15 Dora
3  3  44 <NA>
4  4 100    B
Merge with overwrite values from left dataframe in pandas
You will have to replace the rows to override the values in place. This differs from simply dropping duplicates, which changes the ordering of the rows.
combine_dfs takes "pkey" as an argument, which is the key column on which the merge should happen.
import pandas as pd

def update_df_row(row=None, col_name="", df=pd.DataFrame(), pkey=""):
    # Return the row of df whose pkey value equals col_name;
    # fall back to the original row when there is no match.
    try:
        match_index = df.loc[df[pkey] == col_name].index[0]
        row = df.loc[match_index]
    except IndexError:
        pass
    return row

def combine_dfs(parent_df, child_df, pkey):
    # Child rows whose key already exists in the parent.
    filtered_child_df = child_df[child_df[pkey].isin(parent_df[pkey])]
    # Overwrite the matching parent rows in place (this modifies parent_df).
    overlap = parent_df[pkey].isin(child_df[pkey])
    parent_df[overlap] = parent_df[overlap].apply(
        lambda row: update_df_row(row, row[pkey], filtered_child_df, pkey), axis=1)
    # Append the child rows and drop the keys that are now duplicated.
    parent_df = pd.concat([parent_df, child_df]).drop_duplicates([pkey])
    return parent_df.reset_index(drop=True)
The output of the above code snippet will be:
    A  B
0  AD  1
1  CD  4
2  EF  3
3  GH  5
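The inputs that produce this output are not shown. As a minimal sketch, assuming a parent frame and a child frame with unique keys in column A (both hypothetical, chosen so the result matches the output above), the same order-preserving overwrite can also be written without a row-wise apply. The helper name overwrite_keep_order is made up for illustration.

```python
import pandas as pd

def overwrite_keep_order(parent_df, child_df, pkey):
    """Overwrite parent rows with child rows sharing a key; append the remaining child rows.

    Assumes the values in pkey are unique within each frame.
    """
    parent_keyed = parent_df.set_index(pkey)
    child_keyed = child_df.set_index(pkey)
    # Parent keys in their original order, then keys that exist only in the child.
    new_keys = child_keyed.index[~child_keyed.index.isin(parent_keyed.index)]
    order = parent_keyed.index.append(new_keys)
    # Child rows win; parent rows survive only where the child has no matching key.
    kept_parent = parent_keyed[~parent_keyed.index.isin(child_keyed.index)]
    combined = pd.concat([child_keyed, kept_parent])
    return combined.loc[order].rename_axis(pkey).reset_index()

# Hypothetical inputs, not taken from the original question.
parent = pd.DataFrame({"A": ["AD", "CD", "EF"], "B": [1, 2, 3]})
child = pd.DataFrame({"A": ["CD", "GH"], "B": [4, 5]})
result = overwrite_keep_order(parent, child, "A")
print(result)
```

Unlike combine_dfs, this version leaves both input frames unmodified and returns a new frame.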
Merge R data frame or data table and overwrite values of multiple columns
You can do this by using dplyr::coalesce, which returns the first non-missing value from a set of vectors.
(EDIT: you can also use dplyr::coalesce directly on the data frames, so there is no need to create the function below. I have left it here for completeness, as a record of the original answer.)
Credit where it's due: this code is mostly taken from a blog post. It builds a function that takes two data frames and does what you need (taking values from the x data frame if they are present).
coalesce_join <- function(x,
                          y,
                          by,
                          suffix = c(".x", ".y"),
                          join = dplyr::full_join, ...) {
  joined <- join(x, y, by = by, suffix = suffix, ...)
  # names of desired output
  cols <- union(names(x), names(y))
  to_coalesce <- names(joined)[!names(joined) %in% cols]
  suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
  # remove suffixes and deduplicate
  to_coalesce <- unique(substr(
    to_coalesce,
    1,
    nchar(to_coalesce) - nchar(suffix_used)
  ))
  coalesced <- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
    joined[[paste0(.x, suffix[1])]],
    joined[[paste0(.x, suffix[2])]]
  ))
  names(coalesced) <- to_coalesce
  dplyr::bind_cols(joined, coalesced)[cols]
}
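For comparison, a rough pandas analogue of the same coalescing join could look like the sketch below. It is not a port of the R function: coalesce_merge is a hypothetical helper, it assumes a single key column, and it uses Series.combine_first in place of dplyr::coalesce. The sample frames are made up.

```python
import pandas as pd

def coalesce_merge(x, y, by, suffixes=("_x", "_y")):
    """Outer-merge x and y on the single column `by`, preferring x's values over y's."""
    merged = x.merge(y, on=by, how="outer", suffixes=suffixes)
    # Columns present in both frames (other than the key) come back suffixed.
    shared = [c for c in x.columns if c in y.columns and c != by]
    for col in shared:
        left, right = col + suffixes[0], col + suffixes[1]
        # combine_first keeps the left value unless it is missing.
        merged[col] = merged[left].combine_first(merged[right])
        merged = merged.drop(columns=[left, right])
    return merged

# Hypothetical data: id 2 is missing its value in x, id 3 exists only in y.
x = pd.DataFrame({"id": [1, 2], "val": [10.0, None]})
y = pd.DataFrame({"id": [2, 3], "val": [20.0, 30.0]})
out = coalesce_merge(x, y, "id").sort_values("id").reset_index(drop=True)
print(out)
```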
Pandas merging dataframes, overwriting values on key
Simply use pd.concat with the rows of df whose index values are not in df2's index. Let
df = df.set_index('id')
df2 = df2.set_index('id')
Then
>>> merged = pd.concat([df[~df.index.isin(df2.index)], df2]).reset_index()
>>> merged
       id  value
0  100010     25
1  100011     22
2  100012     35
3  100013     36
join or merge with overwrite in pandas
How about df2.combine_first(df1)?
In [33]: df2
Out[33]:
                   A         B         C         D
2000-01-03  0.638998  1.277361  0.193649  0.345063
2000-01-04 -0.816756 -1.711666 -1.155077 -0.678726
2000-01-05  0.435507 -0.025162 -1.112890  0.324111
2000-01-06 -0.210756 -1.027164  0.036664  0.884715
2000-01-07 -0.821631 -0.700394 -0.706505  1.193341
2000-01-10  1.015447 -0.909930  0.027548  0.258471
2000-01-11 -0.497239 -0.979071 -0.461560  0.447598
In [34]: df1
Out[34]:
                   A         B         C
2000-01-03  2.288863  0.188175 -0.040928
2000-01-04  0.159107 -0.666861 -0.551628
2000-01-05 -0.356838 -0.231036 -1.211446
2000-01-06 -0.866475  1.113018 -0.001483
2000-01-07  0.303269  0.021034  0.471715
2000-01-10  1.149815  0.686696 -1.230991
2000-01-11 -1.296118 -0.172950 -0.603887
2000-01-12 -1.034574 -0.523238  0.626968
2000-01-13 -0.193280  1.857499 -0.046383
2000-01-14 -1.043492 -0.820525  0.868685
In [35]: df2.comb
df2.combine df2.combineAdd df2.combine_first df2.combineMult
In [35]: df2.combine_first(df1)
Out[35]:
                   A         B         C         D
2000-01-03  0.638998  1.277361  0.193649  0.345063
2000-01-04 -0.816756 -1.711666 -1.155077 -0.678726
2000-01-05  0.435507 -0.025162 -1.112890  0.324111
2000-01-06 -0.210756 -1.027164  0.036664  0.884715
2000-01-07 -0.821631 -0.700394 -0.706505  1.193341
2000-01-10  1.015447 -0.909930  0.027548  0.258471
2000-01-11 -0.497239 -0.979071 -0.461560  0.447598
2000-01-12 -1.034574 -0.523238  0.626968       NaN
2000-01-13 -0.193280  1.857499 -0.046383       NaN
2000-01-14 -1.043492 -0.820525  0.868685       NaN
Note that it takes the values from df1 for indices that do not overlap with df2. If this doesn't do exactly what you want I would be willing to improve this function / add options to it.
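One detail worth checking on a small case: combine_first also fills missing values inside the calling frame, not only rows whose index is absent from it. The sketch below uses two tiny hypothetical frames (not the ones printed above) to show both effects.

```python
import pandas as pd

# Hypothetical frames: df2 has a gap in column A and no row labelled "c".
df1 = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"A": [10.0, None], "D": [0.5, 0.6]}, index=["a", "b"])

combined = df2.combine_first(df1)
print(combined)

# Row "a": df2's value 10.0 wins over df1's 1.0.
# Row "b": df2's A is missing, so df1's 2.0 is used.
# Row "c": only in df1, so A comes from df1 and D stays missing.
```

So if df2 may contain NaN values that should be kept as they are, combine_first is not a pure overwrite.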
Pandas merge with overwrite and union together
Use concat, then drop_duplicates:
pd.concat((new,old)).drop_duplicates('Key1')
Or use isin to check which keys of the old df are not present in the new df, and then concat:
pd.concat((new,old[~old['Key1'].isin(new['Key1'])]))
  Key1       Key2
0   L1     Value1
1   L2     Value2
2   L3     Value3
3   R1  ValueOld1
4   R2  ValueOld2
5   R3  ValueOld3
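A small self-contained run of both variants, on hypothetical old and new frames where one key (L1) appears in both, suggests that they produce the same rows:

```python
import pandas as pd

# Hypothetical inputs; the key L1 appears in both frames.
old = pd.DataFrame({"Key1": ["L1", "R1"], "Key2": ["ValueOld0", "ValueOld1"]})
new = pd.DataFrame({"Key1": ["L1", "L2"], "Key2": ["Value1", "Value2"]})

# drop_duplicates keeps the first occurrence by default, so `new` must come first.
via_drop = pd.concat((new, old)).drop_duplicates(subset="Key1").reset_index(drop=True)

# Filtering `old` with isin removes the overlapping keys before concatenating,
# so the result does not depend on which duplicate drop_duplicates keeps.
via_isin = pd.concat((new, old[~old["Key1"].isin(new["Key1"])])).reset_index(drop=True)

print(via_drop)
print(via_isin)
```

The reset_index(drop=True) calls are only there to give both results a clean 0..n-1 index for comparison.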
PySpark - merge two DataFrames, overwriting one with the other
You can anti join df and df2 and then union the result to df2. That way you have everything from df2 and only the rows from df which are not in df2.
df3 = df.join(df2, on=['id'], how='anti').unionAll(df2)
print(df3.head(20))
[Row(id=0, value=0, calc=0), Row(id=7, value=7, calc=14), Row(id=6, value=6, calc=12), Row(id=5, value=5, calc=10), Row(id=1, value=1, calc=2), Row(id=3, value=3, calc=6), Row(id=8, value=8, calc=16), Row(id=2, value=2, calc=4), Row(id=4, value=4, calc=8), Row(id=9, value=9, calc=81), Row(id=10, value=10, calc=100)]