How to merge two dataframes based on a range of possible values in the second, but keep the values of the first?
You can first full_join the two data frames and then fill the missing values with .direction = "up":
dplyr::full_join(df1, df2, by=c("Type","Month")) %>%
tidyr::fill(everything(), .direction = 'up')
# Type Month Amount Version
#1 A 1 50 1
#2 A 2 20 1
#3 A 2 40 1
#4 A 3 30 2
#5 A 3 30 2
#6 A 4 60 2
#7 A 4 60 2
#8 A 5 20 3
#9 B 1 30 1
#10 B 2 30 1
#11 B 3 20 2
#12 B 3 40 2
#13 B 4 10 2
#14 B 4 10 2
#15 B 4 30 2
#16 B 5 20 3
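For readers working in pandas, a rough analogue of the same idea is an outer merge followed by an upward fill. This is a sketch only; the sample data below is invented for illustration and is not the original df1/df2:
```python
import pandas as pd

# Invented sample data (the original df1/df2 are not shown in the answer).
df1 = pd.DataFrame({"Type": ["A"] * 3, "Month": [1, 2, 4], "Amount": [50, 20, 60]})
df2 = pd.DataFrame({"Type": ["A"] * 3, "Month": [1, 3, 5], "Version": [1, 2, 3]})

# Outer merge plays the role of dplyr::full_join; bfill plays the role of
# tidyr::fill(.direction = "up"), filling each gap with the next value below.
out = (df1.merge(df2, on=["Type", "Month"], how="outer")
          .sort_values(["Type", "Month"])
          .reset_index(drop=True))
out["Version"] = out["Version"].bfill()
print(out)
```
Here only the Version column is filled; the R answer fills every column with tidyr::fill(everything(), ...), which bfill on the whole frame would also do.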
How to merge two dataframes based on range value of one table
One option is to use tidyr::separate along with sqldf to join the two tables on a range of values.
library(sqldf)
library(tidyr)
# convert = TRUE makes Start/End numeric, so the SQL BETWEEN compares numbers
DF2 <- separate(DF2, "SIC1", c("Start","End"), sep = "-", convert = TRUE)
sqldf("select DF1.*, DF2.AREA from DF1, DF2
WHERE DF1.SIC between DF2.Start AND DF2.End")
# SIC Value AREA
# 1 350 100 Education
# 2 460 500 Lan
# 3 140 200 Forest
# 4 290 400 Education
# 5 506 450 Lan
Data:
DF1 <- read.table(text =
"SIC Value
350 100
460 500
140 200
290 400
506 450",
header = TRUE, stringsAsFactors = FALSE)
DF2 <- read.table(text =
"SIC1 AREA
100-200 Forest
201-280 Hospital
281-350 Education
351-450 Government
451-550 Lan",
header = TRUE, stringsAsFactors = FALSE)
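A pandas sketch of the same idea, using the data above: split the range column into numeric bounds, then join via an IntervalIndex (the pandas rendering is mine, not part of the original answer):
```python
import pandas as pd

# Same data as DF1/DF2 above.
DF1 = pd.DataFrame({"SIC": [350, 460, 140, 290, 506],
                    "Value": [100, 500, 200, 400, 450]})
DF2 = pd.DataFrame({"SIC1": ["100-200", "201-280", "281-350", "351-450", "451-550"],
                    "AREA": ["Forest", "Hospital", "Education", "Government", "Lan"]})

# Split "100-200" into numeric bounds, like tidyr::separate(..., convert = TRUE).
bounds = DF2["SIC1"].str.split("-", expand=True).astype(int)
idx = pd.IntervalIndex.from_arrays(bounds[0], bounds[1], closed="both")

# Look up each SIC in the (non-overlapping) intervals, like the SQL BETWEEN join.
DF1["AREA"] = DF2["AREA"].values[idx.get_indexer(DF1["SIC"])]
print(DF1)
```
get_indexer returns -1 for values that fall in no interval, so unmatched SICs would need separate handling.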
how to join two dataframes for which column values are within a certain range for multiple columns using pandas dataframe?
Solution 1: Simple Solution for small dataset
For a small dataset, you can cross join df1 and df2 with .merge(), then use .query() to keep only the rows where Price and year fall within their respective ranges, as follows:
(df1.merge(df2, how='cross')
.query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
[['Price', 'year', 'score']]
)
If your Pandas version is older than 1.2.0 (released in December 2020) and does not support merge with how='cross', you can use:
(df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
.query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
[['Price', 'year', 'score']]
)
Result:
Price year score
0 10 2001 20
4 70 2002 50
8 50 2010 30
Solution 2: Numpy Solution for large dataset
For a large dataset where performance is a concern, you can use numpy broadcasting (instead of a cross join plus filtering) to speed up the execution.
We look for rows where Price in df2 falls within the price range of df1 and year in df2 falls within the year range of df1:
d2_P = df2.Price.values
d2_Y = df2.year.values
d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))
pd.DataFrame(
np.column_stack([df1.values[j], df2.values[i]]),
columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]
Result:
Price year score
0 10 2001 20
1 70 2002 50
2 50 2010 30
Performance Comparison
Part 1: Compare for original datasets of 3 rows each:
Solution 1:
%%timeit
(df1.merge(df2, how='cross')
.query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
[['Price', 'year', 'score']]
)
5.91 ms ± 87.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Solution 2:
%%timeit
d2_P = df2.Price.values
d2_Y = df2.year.values
d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))
pd.DataFrame(
np.column_stack([df1.values[j], df2.values[i]]),
columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]
703 µs ± 9.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Benchmark summary: 5.91 ms vs 703 µs, about 8.4x faster
Part 2: Compare for datasets with 3,000 and 30,000 rows:
Data Setup:
df1a = pd.concat([df1] * 1000, ignore_index=True)
df2a = pd.concat([df2] * 10000, ignore_index=True)
Solution 1:
%%timeit
(df1a.merge(df2a, how='cross')
.query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
[['Price', 'year', 'score']]
)
27.5 s ± 3.24 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 2:
%%timeit
d2_P = df2a.Price.values
d2_Y = df2a.year.values
d1_PS = df1a.price_start.values
d1_PE = df1a.price_end.values
d1_YS = df1a.year_start.values
d1_YE = df1a.year_end.values
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))
pd.DataFrame(
np.column_stack([df1a.values[j], df2a.values[i]]),
columns=df1a.columns.append(df2a.columns)
)[['Price', 'year', 'score']]
3.83 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Benchmark summary: 27.5 s vs 3.83 s, about 7.2x faster
Join two data frame by considering if values of paired columns are in range of the value of paired columns in the other dataframe
I suppose I would propose two ways of doing this depending on your preference. The first would be using SQL instead of R for the task. It’s a bit more straightforward for the type of join you’re describing.
library(sqldf)
library(dplyr)
df1<-data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3))
df2<-data.frame("m1"=c("100010","100010","100010"),"m2"=c("100020","100020","100020"),"week"=c(1,2,3),"freq"=c(3,1,2))
df3<- data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3),"freq"=c(3,1,2,3,3,2))
df_sql <-
sqldf::sqldf("SELECT a.*, b.freq
FROM df1 a
LEFT JOIN df2 b
ON (a.week = b.week and a.m1 = b.m1 and a.m2 = b.m2) OR
(a.week = b.week and a.m1 = b.m2 and a.m2 = b.m1)")
identical(df_sql, df3)
#> [1] TRUE
I am sure there are more elegant ways to do this, but the second strategy is simply to duplicate df2, rename the columns with m1 and m2 reversed, and then do the join.
df <-
df2 %>%
rename(m2 = m1, m1 = m2) %>%
bind_rows(df2, .) %>%
left_join(df1, ., by = c("week", "m1", "m2"))
identical(df, df3)
#> [1] TRUE
I imagine there are other ways that don’t involve a join, but that’s how I would do it using joins.
Created on 2022-02-17 by the reprex package (v2.0.1)
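The second strategy translates directly to pandas: stack df2 with a copy whose m1/m2 columns are swapped, then do an ordinary left merge. This is a sketch of the same idea with the same data; the pandas rendering is mine, not the answer's:
```python
import pandas as pd

# Same data as the R df1/df2 above.
df1 = pd.DataFrame({"m1": ["100010"] * 3 + ["100020"] * 3,
                    "m2": ["100020"] * 3 + ["100010"] * 3,
                    "week": [1, 2, 3, 1, 1, 3]})
df2 = pd.DataFrame({"m1": ["100010"] * 3, "m2": ["100020"] * 3,
                    "week": [1, 2, 3], "freq": [3, 1, 2]})

# Append a copy of df2 with m1 and m2 swapped, then left-join on all three keys.
swapped = df2.rename(columns={"m1": "m2", "m2": "m1"})
both = pd.concat([df2, swapped], ignore_index=True)
out = df1.merge(both, on=["week", "m1", "m2"], how="left")
print(out)
```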
How to join two dataframes for which column values are within a certain range?
One simple solution is to create an IntervalIndex from start and end with closed='both', then use get_loc to look up the event for each timestamp (this assumes all the datetimes are of timestamp dtype):
df_2.index = pd.IntervalIndex.from_arrays(df_2['start'],df_2['end'],closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x : df_2.iloc[df_2.index.get_loc(x)]['event'])
Output :
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
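The question's df_1 and df_2 are not reproduced above, so here is a self-contained sketch of the same IntervalIndex technique; the timestamps, intervals, and events are invented for illustration:
```python
import pandas as pd

# Invented data: three timestamps to classify, three non-overlapping event windows.
df_1 = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2016-05-14 10:54:33", "2016-05-14 10:54:34", "2016-05-14 10:54:39"])})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:34",
                             "2016-05-14 10:54:37"]),
    "end":   pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:36",
                             "2016-05-14 10:54:40"]),
    "event": ["E1", "E2", "E3"]})

# Index df_2 by [start, end] intervals, then locate each timestamp in them.
df_2.index = pd.IntervalIndex.from_arrays(df_2["start"], df_2["end"], closed="both")
df_1["event"] = df_1["timestamp"].apply(
    lambda x: df_2.iloc[df_2.index.get_loc(x)]["event"])
print(df_1)
```
Note that get_loc raises a KeyError for a timestamp that falls in no interval, so in practice you may want to guard the lookup.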
Best way to join / merge by range in pandas
Setup
Consider the dataframes A and B:
A = pd.DataFrame(dict(
A_id=range(10),
A_value=range(5, 105, 10)
))
B = pd.DataFrame(dict(
B_id=range(5),
B_low=[0, 30, 30, 46, 84],
B_high=[10, 40, 50, 54, 84]
))
A
A_id A_value
0 0 5
1 1 15
2 2 25
3 3 35
4 4 45
5 5 55
6 6 65
7 7 75
8 8 85
9 9 95
B
B_high B_id B_low
0 10 0 0
1 40 1 30
2 50 2 30
3 54 3 46
4 84 4 84
numpy
The ✌easiest✌ way is to use numpy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while, at the same time, A_value is less than or equal to B_high.
a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
pd.concat([
A.loc[i, :].reset_index(drop=True),
B.loc[j, :].reset_index(drop=True)
], axis=1)
A_id A_value B_high B_id B_low
0 0 5 10 0 0
1 3 35 40 1 30
2 3 35 50 2 30
3 4 45 50 2 30
To address the comments and give something akin to a left join, I appended the part of A that doesn't match.
# DataFrame.append was removed in pandas 2.0; concatenate the unmatched rows instead
pd.concat([
    pd.concat([
        A.loc[i, :].reset_index(drop=True),
        B.loc[j, :].reset_index(drop=True)
    ], axis=1),
    A[~np.isin(np.arange(len(A)), np.unique(i))]
], ignore_index=True, sort=False)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 3 35 1.0 30.0 40.0
2 3 35 2.0 30.0 50.0
3 4 45 2.0 30.0 50.0
4 1 15 NaN NaN NaN
5 2 25 NaN NaN NaN
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
Merge two data frames considering a range match between key columns
You want to merge two data frames considering a range match between key columns. Here are two solutions.
using sqldf
library(sqldf)
output <- sqldf("select * from FD left join shpxt
on (FD.X >= shpxt.Xmin and FD.X <= shpxt.Xmax and
FD.Y >= shpxt.Ymin and FD.Y <= shpxt.Ymax ) ")
using data.table
library(data.table)
# convert your data frames to data.table
setDT(FD)
setDT(shpxt)
output <- FD[shpxt, on = .(X >= Xmin , X <= Xmax, # indicate x range
Y >= Ymin , Y <= Ymax), nomatch = NA, # indicate y range
.(Survival, X, Y, Xmin, Xmax, Ymin, Ymax, Sites )] # indicate columns in the output
There are different alternatives to solve this problem, as you will find in other SO questions.
P.S. Keep in mind that a for loop is not necessarily the best solution.
Combine two dataframes based on ranges which may partially overlap using Pandas and track multiple values
From df2 create an auxiliary Series, marking each "starting point" of a unit (a range of length 1):
units = df2.set_index('Type').apply(lambda row: pd.Series(
    range(row.From, row.To)), axis=1).stack()\
    .reset_index(level=1, drop=True)
The result is:
Type
A 0.0
A 1.0
A 2.0
A 3.0
B 4.0
C 5.0
D 6.0
D 7.0
E 8.0
E 9.0
dtype: float64
Then define a function generating Type for the current row:
def getType(row):
    gr = units[units.ge(row.From) & units.lt(row.To)].groupby(level=0)
    if gr.ngroups == 1:
        return gr.ngroup().index[0]
    txt = []
    for key, grp in gr:
        siz = grp.size
        un = 'unit' if siz == 1 else 'units'
        txt.append(f'{siz} {un} {key}')
    return ','.join(txt)
And to generate Type column, apply it to each row:
df1['Type'] = df1.apply(getType, axis=1)
The result is:
From To val Type
0 1.0 3.0 0.001 A
1 3.0 5.0 0.005 1 unit A,1 unit B
2 5.0 7.0 0.002 1 unit C,1 unit D
3 7.0 10.0 0.001 1 unit D,2 units E
This result differs a bit from your expected result, but I think you built the expected output inconsistently.
I believe my solution is correct (or at least more consistent), because:
- Row 1.0 - 3.0 is entirely within the limits of 0 4 A, so the result is just A (as in your post).
- Row 3.0 - 5.0 can be "divided" into:
  - 3.0 - 4.0, within the limits of 0 4 A (1 unit),
  - 4.0 - 5.0, within the limits of 4 5 B (also 1 unit, but you want 2 units here).
- Row 5.0 - 7.0 can again be "divided" into:
  - 5.0 - 6.0, within the limits of 5 6 C (1 unit),
  - 6.0 - 7.0, within the limits of 6 8 D (1 unit, just as you did).
- Row 7.0 - 10.0 can be "divided" into:
  - 7.0 - 8.0, within the limits of 6 8 D (1 unit, just as you did),
  - 8.0 - 10.0, within the limits of 8 10 E (2 units, not 3 as you wrote).
Combining dataframe rows based on a value in a range
Is this what you want?
ddply(df1, .(name), function(x) {
df2[(x$Position - df2$start_position) < 100000 |
(x$Position - df2$end_position) < 100000, ]
})
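A rough pandas rendering of the same row selection is a cross merge plus .query(). The data frames below are invented for illustration, and the condition is reproduced verbatim from the R code above, so note that it also matches whenever the difference is negative (the question may really want an absolute distance):
```python
import pandas as pd

# Invented data (the question's df1/df2 are not shown above).
df1 = pd.DataFrame({"name": ["g1", "g2"], "Position": [150_000, 400_000]})
df2 = pd.DataFrame({"start_position": [100_000, 600_000],
                    "end_position":   [200_000, 700_000]})

# Cross join every df1 row with every df2 row, then apply the same
# condition as the ddply call: (Position - start) < 100000 OR
# (Position - end) < 100000. Negative differences also pass.
out = (df1.merge(df2, how="cross")
          .query("(Position - start_position) < 100000 or "
                 "(Position - end_position) < 100000"))
print(out)
```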