Merging a Lot of Data.Frames

Python: pandas merge multiple dataframes

Below is the cleanest, most comprehensible way of merging multiple dataframes when complex queries aren't involved.

Simply merge on DATE and use an outer join (to keep all the data).

import pandas as pd
from functools import reduce

# read each CSV file into its own dataframe
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
df3 = pd.read_csv('file3.csv')

Now load every file you have as a dataframe into a list, then merge them with pd.merge applied through reduce.

# compile the list of dataframes you want to merge
data_frames = [df1, df2, df3]

Note: you can add as many dataframes to the list as you like. That is the good part of this method: no complex queries involved.

To keep the values that belong to the same date on the same row, merge on DATE:

df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'],
                                                how='outer'), data_frames)

If you want to fill the values that don't exist in the rows of the merged dataframe, simply chain fillna() with whatever placeholder you need:

df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'],
                                                how='outer'), data_frames).fillna('void')
  • Now the output will have the values from the same date on the same lines.
  • You can fill the data that doesn't exist in a given frame column by column using fillna() (see the sketch below).
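If different columns need different placeholders, fillna() also accepts a dict keyed by column name. A minimal sketch, assuming the merged frame has value columns named VALUE1, VALUE2 and VALUE3 (the names are illustrative):

# per-column fill values; the column names here are hypothetical
fill_values = {'VALUE1': 0, 'VALUE2': -1, 'VALUE3': -999}
df_filled = df_merged.fillna(fill_values)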

Then write the merged data to a csv file if desired.

df_merged.to_csv('merged.txt', sep=',', na_rep='.', index=False)

This should give you

DATE VALUE1 VALUE2 VALUE3 ....
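If you want to try the whole workflow without the CSV files on disk, here is a minimal self-contained sketch with three tiny made-up frames (the dates and values are purely illustrative):

import pandas as pd
from functools import reduce

# toy frames standing in for file1.csv, file2.csv and file3.csv
df1 = pd.DataFrame({'DATE': ['2020-01-01', '2020-01-02'], 'VALUE1': [1, 2]})
df2 = pd.DataFrame({'DATE': ['2020-01-02', '2020-01-03'], 'VALUE2': [3, 4]})
df3 = pd.DataFrame({'DATE': ['2020-01-01', '2020-01-03'], 'VALUE3': [5, 6]})

data_frames = [df1, df2, df3]
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'],
                                                how='outer'), data_frames)
print(df_merged)
#          DATE  VALUE1  VALUE2  VALUE3
# 0  2020-01-01     1.0     NaN     5.0
# 1  2020-01-02     2.0     3.0     NaN
# 2  2020-01-03     NaN     4.0     6.0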

Merging a lot of data.frames

Put them into a list and use merge with Reduce

Reduce(function(x, y) merge(x, y, all=TRUE), list(df1, df2, df3))
#    id v1 v2 v3
# 1   1  1 NA NA
# 2  10  4 NA NA
# 3   2  3  4 NA
# 4  43  5 NA NA
# 5  73  2 NA NA
# 6  23 NA  2  1
# 7  57 NA  3 NA
# 8  62 NA  5  2
# 9   7 NA  1 NA
# 10 96 NA  6 NA

You can also use this more concise version:

Reduce(function(...) merge(..., all=TRUE), list(df1, df2, df3))

Efficient way to merge multiple large DataFrames

You may get some benefit from performing index-aligned concatenation using pd.concat. This should hopefully be faster and more memory efficient than an outer merge as well.

df_list = [df1, df2, ...]

# give every frame the same index so concat can align on it
for df in df_list:
    df.set_index(['name', 'id'], inplace=True)

df = pd.concat(df_list, axis=1)  # pass join='inner' to keep only common rows
df.reset_index(inplace=True)

Alternatively, you can replace the concat (the second step) with an iterative join:

from functools import reduce
df = reduce(lambda x, y: x.join(y), df_list)

This may or may not be better than the merge.
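Whether concat or the iterative join wins depends on your data, so it is worth timing both on your own frames. A rough sketch of how one might compare them (the frame sizes and column names are made up for illustration):

import timeit
from functools import reduce

import numpy as np
import pandas as pd

# build a few toy frames sharing a 'key' index; the sizes are arbitrary
rng = np.random.default_rng(0)
df_list = [
    pd.DataFrame({'key': np.arange(10_000),
                  f'col{i}': rng.random(10_000)}).set_index('key')
    for i in range(5)
]

concat_time = timeit.timeit(lambda: pd.concat(df_list, axis=1), number=10)
join_time = timeit.timeit(lambda: reduce(lambda x, y: x.join(y), df_list), number=10)
print(f"concat: {concat_time:.3f}s, iterative join: {join_time:.3f}s")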

Best strategy for merging a lot of data frames using pandas

I would think the fastest way is to set the column you want to merge on as the index, put the dataframes into a list and then pd.concat them. Something like this:

import os
import pandas as pd

directory = os.path.expanduser('~/home')
files = os.listdir(directory)

dfs = []
for filename in files:
    if filename.endswith('.tsv'):
        df = pd.read_table(os.path.join(directory, filename), sep='\t').set_index('bird')
        dfs.append(df)

master_df = pd.concat(dfs, axis=1)

Merge multiple DataFrames Pandas

Consider setting the index on each data frame and then running the horizontal merge with pd.concat:

dfs = [df.set_index(['profile', 'depth']) for df in [df1, df2, df3]]

print(pd.concat(dfs, axis=1).reset_index())
#      profile  depth       VAR1     VAR2    VAR3
# 0  profile_1    0.5  38.198002      NaN     NaN
# 1  profile_1    0.6  38.198002  0.20440     NaN
# 2  profile_1    1.1         NaN  0.20442     NaN
# 3  profile_1    1.2         NaN  0.20446  15.188
# 4  profile_1    1.3  38.200001      NaN  15.182
# 5  profile_1    1.4         NaN      NaN  15.182

How to merge (efficient way) multiple data frames in one go?

You can use merge in Reduce:

Reduce(merge, list(df, df1, df2, df3))
#   ID YEAR MONTH DAY HOUR VALUE1 VALUE2 VALUE3 VALUE4
# 1  A 2020     1  16   15      1      3      6      9
# 2  B 2020     1  16   15      2      4      7     10
# 3  C 2020     1  16   15      3      5      8     11

Merging data frame and filling missing values

Gather the data frames in a list and use merge with Reduce. Missing values in the new data frame can then be replaced with -1.

new_df <- Reduce(function(x, y) merge(x, y, all = TRUE), list(df1, df2, df3))
new_df[is.na(new_df)] <- -1

new_df
#   Letter Values1 Values2 Values3
# 1      A       1       0      -1
# 2      B       2      -1      -1
# 3      C       3       5      -1
# 4      D      -1       9       5

A tidyverse way with the same logic:

library(dplyr)
library(purrr)
library(tidyr)   # for replace_na()

list(df1, df2, df3) %>%
  reduce(full_join) %>%
  mutate(across(everything(), ~ replace_na(.x, -1)))

