Merge 2 Dataframes If Value Within Range

How to merge two dataframes based on a range of possible values in the second, but keep the values of the first?

You can first full_join the two data frames and then fill the missing values upward with tidyr::fill(everything(), .direction = "up"):

dplyr::full_join(df1, df2, by=c("Type","Month")) %>%
tidyr::fill(everything(), .direction = 'up')

# Type Month Amount Version
#1 A 1 50 1
#2 A 2 20 1
#3 A 2 40 1
#4 A 3 30 2
#5 A 3 30 2
#6 A 4 60 2
#7 A 4 60 2
#8 A 5 20 3
#9 B 1 30 1
#10 B 2 30 1
#11 B 3 20 2
#12 B 3 40 2
#13 B 4 10 2
#14 B 4 10 2
#15 B 4 30 2
#16 B 5 20 3
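
For reference, a rough pandas analogue of the same pattern: an outer merge followed by a backward fill mirrors full_join + fill(.direction = "up"). The frames below are hypothetical stand-ins, since the asker's df1/df2 are not shown above.

import pandas as pd

# Hypothetical stand-ins for the asker's frames (column names assumed)
df1 = pd.DataFrame({'Type': ['A', 'A', 'A'], 'Month': [1, 2, 3], 'Amount': [50, 20, 30]})
df2 = pd.DataFrame({'Type': ['A', 'A'], 'Month': [2, 3], 'Version': [1, 2]})

# Outer-merge on the keys, then back-fill NaNs from the row below,
# which mirrors tidyr::fill(everything(), .direction = "up")
out = df1.merge(df2, on=['Type', 'Month'], how='outer').bfill()
print(out)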

How to merge two dataframes based on a range value in one table

One option is to use tidyr::separate along with sqldf to join the two tables on a range of values.

library(sqldf)
library(tidyr)

DF2 <- separate(DF2, "SIC1",c("Start","End"), sep = "-")

sqldf("select DF1.*, DF2.AREA from DF1, DF2
WHERE DF1.SIC between DF2.Start AND DF2.End")

# SIC Value AREA
# 1 350 100 Education
# 2 460 500 Lan
# 3 140 200 Forest
# 4 290 400 Education
# 5 506 450 Lan

Data:

DF1 <- read.table(text =
"SIC Value
350 100
460 500
140 200
290 400
506 450",
header = TRUE, stringsAsFactors = FALSE)

DF2 <- read.table(text =
"SIC1 AREA
100-200 Forest
201-280 Hospital
281-350 Education
351-450 Government
451-550 Lan",
header = TRUE, stringsAsFactors = FALSE)

How to join two dataframes for which column values are within a certain range for multiple columns using pandas dataframes?
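
The question's df1 and df2 are not reproduced here; a hypothetical setup consistent with the results shown below (values inferred, not the asker's originals) could be:

import pandas as pd
import numpy as np

# df1: price/year ranges plus a score (hypothetical values)
df1 = pd.DataFrame({'price_start': [5, 60, 40],
                    'price_end':   [15, 80, 60],
                    'year_start':  [2000, 2000, 2008],
                    'year_end':    [2005, 2005, 2012],
                    'score':       [20, 50, 30]})

# df2: the points to look up (hypothetical values)
df2 = pd.DataFrame({'Price': [10, 70, 50],
                    'year':  [2001, 2002, 2010]})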

Solution 1: Simple Solution for small dataset

For a small dataset, you can cross join df1 and df2 with .merge(how='cross'), then keep only the rows where Price falls within the price range and year falls within the year range by specifying those conditions in .query(), as follows:

(df1.merge(df2, how='cross')
.query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
[['Price', 'year', 'score']]
)

If your Pandas version is older than 1.2.0 (released in December 2020) and does not support merge with how='cross', you can use:

(df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
.query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
[['Price', 'year', 'score']]
)

Result:

   Price  year  score
0     10  2001     20
4     70  2002     50
8     50  2010     30

Solution 2: Numpy Solution for large dataset

For a large dataset where performance is a concern, you can use numpy broadcasting (instead of a cross join followed by filtering) to speed up the execution time.

We look for rows where Price in df2 is within the price range in df1 and year in df2 is within the year range in df1:

# Pull the relevant columns out as plain numpy arrays
d2_P = df2.Price.values
d2_Y = df2.year.values

d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values

# Compare every df2 row against every df1 range via broadcasting;
# i indexes the matching rows of df2, j the corresponding rows of df1
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

# Stitch the matching rows of both frames back together
pd.DataFrame(
    np.column_stack([df1.values[j], df2.values[i]]),
    columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]

Result:

   Price  year  score
0     10  2001     20
1     70  2002     50
2     50  2010     30

Performance Comparison

Part 1: Compare for original datasets of 3 rows each:

Solution 1:

%%timeit
(df1.merge(df2, how='cross')
.query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
[['Price', 'year', 'score']]
)

5.91 ms ± 87.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Solution 2:

%%timeit
d2_P = df2.Price.values
d2_Y = df2.year.values

d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values

i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

pd.DataFrame(
np.column_stack([df1.values[j], df2.values[i]]),
columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]

703 µs ± 9.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Benchmark summary: 5.91 ms vs 703 µs, i.e. about 8.4x faster

Part 2: Compare for datasets with 3,000 and 30,000 rows:

Data Setup:

df1a = pd.concat([df1] * 1000, ignore_index=True)
df2a = pd.concat([df2] * 10000, ignore_index=True)

Solution 1:

%%timeit
(df1a.merge(df2a, how='cross')
.query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
[['Price', 'year', 'score']]
)

27.5 s ± 3.24 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution 2:

%%timeit
d2_P = df2a.Price.values
d2_Y = df2a.year.values

d1_PS = df1a.price_start.values
d1_PE = df1a.price_end.values
d1_YS = df1a.year_start.values
d1_YE = df1a.year_end.values

i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

pd.DataFrame(
np.column_stack([df1a.values[j], df2a.values[i]]),
columns=df1a.columns.append(df2a.columns)
)[['Price', 'year', 'score']]

3.83 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Benchmark summary: 27.5 s vs 3.83 s, i.e. about 7.2x faster

Join two data frame by considering if values of paired columns are in range of the value of paired columns in the other dataframe

I would propose two ways of doing this, depending on your preference. The first uses SQL instead of R for the task; it's a bit more straightforward for the type of join you're describing.

library(sqldf)
library(dplyr)

df1<-data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3))
df2<-data.frame("m1"=c("100010","100010","100010"),"m2"=c("100020","100020","100020"),"week"=c(1,2,3),"freq"=c(3,1,2))
df3<- data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3),"freq"=c(3,1,2,3,3,2))

df_sql <-
sqldf::sqldf("SELECT a.*, b.freq
FROM df1 a
LEFT JOIN df2 b
ON (a.week = b.week and a.m1 = b.m1 and a.m2 = b.m2) OR
(a.week = b.week and a.m1 = b.m2 and a.m2 = b.m1)")

identical(df_sql, df3)
#> [1] TRUE

I am sure there are more elegant ways to do this, but the second strategy is just to duplicate df2, rename the columns with m1 and m2 reversed, and then do the join.

df <-
df2 %>%
rename(m2 = m1, m1 = m2) %>%
bind_rows(df2, .) %>%
left_join(df1, ., by = c("week", "m1", "m2"))


identical(df, df3)
#> [1] TRUE

I imagine there are other ways that don’t involve a join, but that’s how I would do it using joins.

Created on 2022-02-17 by the reprex package (v2.0.1)

How to join two dataframes for which column values are within a certain range?

One simple solution is to create an IntervalIndex from start and end with closed='both', then use get_loc to look up the event (this assumes all the date-time columns have a timestamp dtype):

df_2.index = pd.IntervalIndex.from_arrays(df_2['start'],df_2['end'],closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x : df_2.iloc[df_2.index.get_loc(x)]['event'])

Output:

timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
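
The asker's df_1 and df_2 are not reproduced here; a self-contained sketch of the same IntervalIndex idea, with made-up timestamps and events, would be:

import pandas as pd

# Hypothetical event windows (df_2) and measurements (df_1)
df_2 = pd.DataFrame({
    'start': pd.to_datetime(['2016-05-14 10:54:30', '2016-05-14 10:54:34', '2016-05-14 10:54:38']),
    'end':   pd.to_datetime(['2016-05-14 10:54:33', '2016-05-14 10:54:37', '2016-05-14 10:54:41']),
    'event': ['E1', 'E2', 'E3'],
})
df_1 = pd.DataFrame({'timestamp': pd.to_datetime(
    ['2016-05-14 10:54:33', '2016-05-14 10:54:35', '2016-05-14 10:54:39'])})

# Index df_2 by closed intervals, then locate the interval containing each timestamp
df_2.index = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x: df_2.iloc[df_2.index.get_loc(x)]['event'])
print(df_1)   # each timestamp picks up the event whose [start, end] window contains it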

Best way to join / merge by range in pandas

Setup

Consider the dataframes A and B

A = pd.DataFrame(dict(
A_id=range(10),
A_value=range(5, 105, 10)
))
B = pd.DataFrame(dict(
B_id=range(5),
B_low=[0, 30, 30, 46, 84],
B_high=[10, 40, 50, 54, 84]
))

A

A_id A_value
0 0 5
1 1 15
2 2 25
3 3 35
4 4 45
5 5 55
6 6 65
7 7 75
8 8 85
9 9 95

B

B_high B_id B_low
0 10 0 0
1 40 1 30
2 50 2 30
3 54 3 46
4 84 4 84

numpy

The "easiest" way is to use numpy broadcasting.

We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.

a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.concat([
A.loc[i, :].reset_index(drop=True),
B.loc[j, :].reset_index(drop=True)
], axis=1)

A_id A_value B_high B_id B_low
0 0 5 10 0 0
1 3 35 40 1 30
2 3 35 50 2 30
3 4 45 50 2 30

To address the comments and give something akin to a left join, I appended the part of A that doesn't match.

pd.concat([
A.loc[i, :].reset_index(drop=True),
B.loc[j, :].reset_index(drop=True)
], axis=1).append(
A[~np.in1d(np.arange(len(A)), np.unique(i))],
ignore_index=True, sort=False
)

A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 3 35 1.0 30.0 40.0
2 3 35 2.0 30.0 50.0
3 4 45 2.0 30.0 50.0
4 1 15 NaN NaN NaN
5 2 25 NaN NaN NaN
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
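
Note that DataFrame.append was removed in pandas 2.0; on recent pandas the same left-join-style result can be built with a second pd.concat (a sketch, not part of the original answer):

matched = pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1)

# rows of A with no match in B, stacked under the matched rows
unmatched = A[~np.in1d(np.arange(len(A)), np.unique(i))]
pd.concat([matched, unmatched], ignore_index=True, sort=False)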

Merge two data frames considering a range match between key columns

You want to merge two data frames considering a range match between key columns. Here are two solutions.

using sqldf

library(sqldf)

output <- sqldf("select * from FD left join shpxt
on (FD.X >= shpxt.Xmin and FD.X <= shpxt.Xmax and
FD.Y >= shpxt.Ymin and FD.Y <= shpxt.Ymax ) ")

using data.table

library(data.table)

# convert your data frames to data.table objects
setDT(FD)
setDT(shpxt)

output <- FD[shpxt, on = .(X >= Xmin , X <= Xmax, # indicate x range
Y >= Ymin , Y <= Ymax), nomatch = NA, # indicate y range
.(Survival, X, Y, Xmin, Xmax, Ymin, Ymax, Sites )] # indicate columns in the output

There are different alternatives for solving this problem, as you will find in other SO questions on the same topic.

P.S. Keep in mind that a for loop is not necessarily the best solution here.

Combine two dataframes based on ranges which may partially overlap using Pandas and track multiple values
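
The asker's frames are not reproduced here; df1 and df2 consistent with the outputs below (values inferred from the expected result, so treat this as a reconstruction) would be:

import pandas as pd

# df2: Type with integer From/To unit ranges (A covers 0-4, B 4-5, C 5-6, D 6-8, E 8-10)
df2 = pd.DataFrame({'Type': list('ABCDE'),
                    'From': [0, 4, 5, 6, 8],
                    'To':   [4, 5, 6, 8, 10]})

# df1: query ranges with a value column
df1 = pd.DataFrame({'From': [1.0, 3.0, 5.0, 7.0],
                    'To':   [3.0, 5.0, 7.0, 10.0],
                    'val':  [0.001, 0.005, 0.002, 0.001]})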

From df2 create an auxiliary Series, marking each "starting point"
of a unit (a range of length 1):

units = df2.set_index('Type').apply(lambda row: pd.Series(
    range(row.From, row.To)), axis=1).stack()\
    .reset_index(level=1, drop=True)

The result is:

Type
A 0.0
A 1.0
A 2.0
A 3.0
B 4.0
C 5.0
D 6.0
D 7.0
E 8.0
E 9.0
dtype: float64

Then define a function that generates Type for the current row:

def getType(row):
    gr = units[units.ge(row.From) & units.lt(row.To)].groupby(level=0)
    if gr.ngroups == 1:
        return gr.ngroup().index[0]
    txt = []
    for key, grp in gr:
        siz = grp.size
        un = 'unit' if siz == 1 else 'units'
        txt.append(f'{siz} {un} {key}')
    return ','.join(txt)

And to generate the Type column, apply it to each row:

df1['Type'] = df1.apply(getType, axis=1)

The result is:

   From    To    val                 Type
0   1.0   3.0  0.001                    A
1   3.0   5.0  0.005    1 unit A,1 unit B
2   5.0   7.0  0.002    1 unit C,1 unit D
3   7.0  10.0  0.001  1 unit D,2 units E

This result is a bit different from your expected result, but I think
you constructed the expected output somewhat inconsistently.

I think that my solution is correct (or at least more consistent), because:

  1. Row 1.0 - 3.0 is entirely within the limits of 0 4 A, so the
    result is just A (like in your post).
  2. Row 3.0 - 5.0 can be "divided" into:

    • 3.0 - 4.0 is within the limits of 0 4 A (1 unit),
    • 4.0 - 5.0 is within the limits of 4 5 B (also 1 unit,
      but you want 2 units here).
  3. Row 5.0 - 7.0 can be again "divided" into:

    • 5.0 - 6.0 is within the limits of 5 6 C (1 unit),
    • 6.0 - 7.0 is within the limits of 6 8 D (1 unit, just like you did).
  4. Row 7.0 - 10.0 can be "divided" into:

    • 7.0 - 8.0 is within the limits of 6 8 D (1 unit, just like you did),
    • 8.0 - 10.0 is within the limits of 8 10 E (2 units, not 3 as you wrote).

Combining dataframe rows based on a value in a range

Is this what you want?

library(plyr)

ddply(df1, .(name), function(x) {
  df2[(x$Position - df2$start_position) < 100000 |
      (x$Position - df2$end_position) < 100000, ]
})

