How to merge two dataframes based on a range of possible values in the second, but keep the values of the first?
You can first full_join the two data frames and then fill the missing values with .direction = "up":
dplyr::full_join(df1, df2, by=c("Type","Month")) %>%
tidyr::fill(everything(), .direction = 'up')
# Type Month Amount Version
#1 A 1 50 1
#2 A 2 20 1
#3 A 2 40 1
#4 A 3 30 2
#5 A 3 30 2
#6 A 4 60 2
#7 A 4 60 2
#8 A 5 20 3
#9 B 1 30 1
#10 B 2 30 1
#11 B 3 20 2
#12 B 3 40 2
#13 B 4 10 2
#14 B 4 10 2
#15 B 4 30 2
#16 B 5 20 3
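For readers working in pandas, a rough analogue of the same idea is an outer merge followed by an upward fill. This is a sketch only; the sample data below is invented for illustration and is not the original df1/df2:
```python
import pandas as pd

# Invented sample data (the original df1/df2 are not shown in the answer).
df1 = pd.DataFrame({"Type": ["A"] * 3, "Month": [1, 2, 4], "Amount": [50, 20, 60]})
df2 = pd.DataFrame({"Type": ["A"] * 3, "Month": [1, 3, 5], "Version": [1, 2, 3]})

# Outer merge plays the role of dplyr::full_join; bfill plays the role of
# tidyr::fill(.direction = "up"), filling each gap with the next value below.
out = (df1.merge(df2, on=["Type", "Month"], how="outer")
          .sort_values(["Type", "Month"])
          .reset_index(drop=True))
out["Version"] = out["Version"].bfill()
print(out)
```
Here only the Version column is filled; the R answer fills every column with tidyr::fill(everything(), ...), which bfill on the whole frame would also do.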
How to merge two dataframes based on range value of one table
One option is to use tidyr::separate along with sqldf to join the two tables on a range of values.
library(sqldf)
library(tidyr)
# convert = TRUE makes Start/End numeric, so the SQL BETWEEN compares numbers
DF2 <- separate(DF2, "SIC1", c("Start","End"), sep = "-", convert = TRUE)
sqldf("select DF1.*, DF2.AREA from DF1, DF2
WHERE DF1.SIC between DF2.Start AND DF2.End")
# SIC Value AREA
# 1 350 100 Education
# 2 460 500 Lan
# 3 140 200 Forest
# 4 290 400 Education
# 5 506 450 Lan
Data:
DF1 <- read.table(text =
"SIC Value
350 100
460 500
140 200
290 400
506 450",
header = TRUE, stringsAsFactors = FALSE)
DF2 <- read.table(text =
"SIC1 AREA
100-200 Forest
201-280 Hospital
281-350 Education
351-450 Government
451-550 Lan",
header = TRUE, stringsAsFactors = FALSE)
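A pandas sketch of the same idea, using the data above: split the range column into numeric bounds, then join via an IntervalIndex (the pandas rendering is mine, not part of the original answer):
```python
import pandas as pd

# Same data as DF1/DF2 above.
DF1 = pd.DataFrame({"SIC": [350, 460, 140, 290, 506],
                    "Value": [100, 500, 200, 400, 450]})
DF2 = pd.DataFrame({"SIC1": ["100-200", "201-280", "281-350", "351-450", "451-550"],
                    "AREA": ["Forest", "Hospital", "Education", "Government", "Lan"]})

# Split "100-200" into numeric bounds, like tidyr::separate(..., convert = TRUE).
bounds = DF2["SIC1"].str.split("-", expand=True).astype(int)
idx = pd.IntervalIndex.from_arrays(bounds[0], bounds[1], closed="both")

# Look up each SIC in the (non-overlapping) intervals, like the SQL BETWEEN join.
DF1["AREA"] = DF2["AREA"].values[idx.get_indexer(DF1["SIC"])]
print(DF1)
```
get_indexer returns -1 for values that fall in no interval, so unmatched SICs would need separate handling.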
how to join two dataframes for which column values are within a certain range for multiple columns using pandas dataframe?
Solution 1: Simple Solution for small dataset
For a small dataset, you can cross join df1 and df2 with .merge(), then use .query() to keep only the rows where Price and year fall within their respective ranges, as follows:
(df1.merge(df2, how='cross')
.query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
[['Price', 'year', 'score']]
)
If your Pandas version is older than 1.2.0 (released in December 2020) and does not support merge with how='cross', you can use:
(df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
.query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
[['Price', 'year', 'score']]
)
Result:
Price year score
0 10 2001 20
4 70 2002 50
8 50 2010 30
Solution 2: Numpy Solution for large dataset
For a large dataset where performance is a concern, you can use numpy broadcasting (instead of a cross join plus filtering) to speed up the execution.
We look for rows where Price in df2 falls within the price range of df1 and year in df2 falls within the year range of df1:
d2_P = df2.Price.values
d2_Y = df2.year.values
d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))
pd.DataFrame(
np.column_stack([df1.values[j], df2.values[i]]),
columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]
Result:
Price year score
0 10 2001 20
1 70 2002 50
2 50 2010 30
Performance Comparison
Part 1: Compare for original datasets of 3 rows each:
Solution 1:
%%timeit
(df1.merge(df2, how='cross')
.query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
[['Price', 'year', 'score']]
)
5.91 ms ± 87.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Solution 2:
%%timeit
d2_P = df2.Price.values
d2_Y = df2.year.values
d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))
pd.DataFrame(
np.column_stack([df1.values[j], df2.values[i]]),
columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]
703 µs ± 9.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Benchmark summary: 5.91 ms vs 703 µs, about 8.4x faster
Part 2: Compare for datasets with 3,000 and 30,000 rows:
Data Setup:
df1a = pd.concat([df1] * 1000, ignore_index=True)
df2a = pd.concat([df2] * 10000, ignore_index=True)
Solution 1:
%%timeit
(df1a.merge(df2a, how='cross')
.query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
[['Price', 'year', 'score']]
)
27.5 s ± 3.24 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 2:
%%timeit
d2_P = df2a.Price.values
d2_Y = df2a.year.values
d1_PS = df1a.price_start.values
d1_PE = df1a.price_end.values
d1_YS = df1a.year_start.values
d1_YE = df1a.year_end.values
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))
pd.DataFrame(
np.column_stack([df1a.values[j], df2a.values[i]]),
columns=df1a.columns.append(df2a.columns)
)[['Price', 'year', 'score']]
3.83 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Benchmark summary: 27.5 s vs 3.83 s, about 7.2x faster
Join two data frame by considering if values of paired columns are in range of the value of paired columns in the other dataframe
I suppose I would propose two ways of doing this depending on your preference. The first would be using SQL instead of R for the task. It’s a bit more straightforward for the type of join you’re describing.
library(sqldf)
library(dplyr)
df1<-data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3))
df2<-data.frame("m1"=c("100010","100010","100010"),"m2"=c("100020","100020","100020"),"week"=c(1,2,3),"freq"=c(3,1,2))
df3<- data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3),"freq"=c(3,1,2,3,3,2))
df_sql <-
sqldf::sqldf("SELECT a.*, b.freq
FROM df1 a
LEFT JOIN df2 b
ON (a.week = b.week and a.m1 = b.m1 and a.m2 = b.m2) OR
(a.week = b.week and a.m1 = b.m2 and a.m2 = b.m1)")
identical(df_sql, df3)
#> [1] TRUE
I am sure there are more elegant ways to do this, but the second strategy is simply to duplicate df2, rename the columns with m1 and m2 reversed, and then do the join.
df <-
df2 %>%
rename(m2 = m1, m1 = m2) %>%
bind_rows(df2, .) %>%
left_join(df1, ., by = c("week", "m1", "m2"))
identical(df, df3)
#> [1] TRUE
I imagine there are other ways that don’t involve a join, but that’s how I would do it using joins.
Created on 2022-02-17 by the reprex package (v2.0.1)
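The second strategy translates directly to pandas: stack df2 with a copy whose m1/m2 columns are swapped, then do an ordinary left merge. This is a sketch of the same idea with the same data; the pandas rendering is mine, not the answer's:
```python
import pandas as pd

# Same data as the R df1/df2 above.
df1 = pd.DataFrame({"m1": ["100010"] * 3 + ["100020"] * 3,
                    "m2": ["100020"] * 3 + ["100010"] * 3,
                    "week": [1, 2, 3, 1, 1, 3]})
df2 = pd.DataFrame({"m1": ["100010"] * 3, "m2": ["100020"] * 3,
                    "week": [1, 2, 3], "freq": [3, 1, 2]})

# Append a copy of df2 with m1 and m2 swapped, then left-join on all three keys.
swapped = df2.rename(columns={"m1": "m2", "m2": "m1"})
both = pd.concat([df2, swapped], ignore_index=True)
out = df1.merge(both, on=["week", "m1", "m2"], how="left")
print(out)
```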
How to join two dataframes for which column values are within a certain range?
One simple solution is to create an IntervalIndex from start and end with closed='both', then use get_loc to look up the event for each timestamp (this assumes all the datetimes are of timestamp dtype):
df_2.index = pd.IntervalIndex.from_arrays(df_2['start'],df_2['end'],closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x : df_2.iloc[df_2.index.get_loc(x)]['event'])
Output :
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
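The question's df_1 and df_2 are not reproduced above, so here is a self-contained sketch of the same IntervalIndex technique; the timestamps, intervals, and events are invented for illustration:
```python
import pandas as pd

# Invented data: three timestamps to classify, three non-overlapping event windows.
df_1 = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2016-05-14 10:54:33", "2016-05-14 10:54:34", "2016-05-14 10:54:39"])})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:34",
                             "2016-05-14 10:54:37"]),
    "end":   pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:36",
                             "2016-05-14 10:54:40"]),
    "event": ["E1", "E2", "E3"]})

# Index df_2 by [start, end] intervals, then locate each timestamp in them.
df_2.index = pd.IntervalIndex.from_arrays(df_2["start"], df_2["end"], closed="both")
df_1["event"] = df_1["timestamp"].apply(
    lambda x: df_2.iloc[df_2.index.get_loc(x)]["event"])
print(df_1)
```
Note that get_loc raises a KeyError for a timestamp that falls in no interval, so in practice you may want to guard the lookup.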
Best way to join / merge by range in pandas
Setup
Consider the dataframes A and B:
A = pd.DataFrame(dict(
A_id=range(10),
A_value=range(5, 105, 10)
))
B = pd.DataFrame(dict(
B_id=range(5),
B_low=[0, 30, 30, 46, 84],
B_high=[10, 40, 50, 54, 84]
))
A
A_id A_value
0 0 5
1 1 15
2 2 25
3 3 35
4 4 45
5 5 55
6 6 65
7 7 75
8 8 85
9 9 95
B
B_high B_id B_low
0 10 0 0
1 40 1 30
2 50 2 30
3 54 3 46
4 84 4 84
numpy
The ✌easiest✌ way is to use numpy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while, at the same time, A_value is less than or equal to B_high.
a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
pd.concat([
A.loc[i, :].reset_index(drop=True),
B.loc[j, :].reset_index(drop=True)
], axis=1)
A_id A_value B_high B_id B_low
0 0 5 10 0 0
1 3 35 40 1 30
2 3 35 50 2 30
3 4 45 50 2 30
To address the comments and give something akin to a left join, I appended the part of A that doesn't match.
# DataFrame.append was removed in pandas 2.0; concatenate the unmatched rows instead
pd.concat([
    pd.concat([
        A.loc[i, :].reset_index(drop=True),
        B.loc[j, :].reset_index(drop=True)
    ], axis=1),
    A[~np.isin(np.arange(len(A)), np.unique(i))]
], ignore_index=True, sort=False)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 3 35 1.0 30.0 40.0
2 3 35 2.0 30.0 50.0
3 4 45 2.0 30.0 50.0
4 1 15 NaN NaN NaN
5 2 25 NaN NaN NaN
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
Merge two data frames considering a range match between key columns
You want to merge two data frames considering a range match between key columns. Here are two solutions.
using sqldf
library(sqldf)
output <- sqldf("select * from FD left join shpxt
on (FD.X >= shpxt.Xmin and FD.X <= shpxt.Xmax and
FD.Y >= shpxt.Ymin and FD.Y <= shpxt.Ymax ) ")
using data.table
library(data.table)
# convert your data frames to data.table
setDT(FD)
setDT(shpxt)
output <- FD[shpxt, on = .(X >= Xmin , X <= Xmax, # indicate x range
Y >= Ymin , Y <= Ymax), nomatch = NA, # indicate y range
.(Survival, X, Y, Xmin, Xmax, Ymin, Ymax, Sites )] # indicate columns in the output
There are different alternatives to solve this problem, as you will find in other SO questions.
P.S. Keep in mind that a for loop is not necessarily the best solution.
Combine two dataframes based on ranges which may partially overlap using Pandas and track multiple values
From df2 create an auxiliary Series, marking each "starting point" of a unit (a range of length 1):
units = df2.set_index('Type').apply(lambda row: pd.Series(
    range(row.From, row.To)), axis=1).stack()\
    .reset_index(level=1, drop=True)
The result is:
Type
A 0.0
A 1.0
A 2.0
A 3.0
B 4.0
C 5.0
D 6.0
D 7.0
E 8.0
E 9.0
dtype: float64
Then define a function generating Type for the current row:
def getType(row):
    gr = units[units.ge(row.From) & units.lt(row.To)].groupby(level=0)
    if gr.ngroups == 1:
        return gr.ngroup().index[0]
    txt = []
    for key, grp in gr:
        siz = grp.size
        un = 'unit' if siz == 1 else 'units'
        txt.append(f'{siz} {un} {key}')
    return ','.join(txt)
And to generate Type column, apply it to each row:
df1['Type'] = df1.apply(getType, axis=1)
The result is:
From To val Type
0 1.0 3.0 0.001 A
1 3.0 5.0 0.005 1 unit A,1 unit B
2 5.0 7.0 0.002 1 unit C,1 unit D
3 7.0 10.0 0.001 1 unit D,2 units E
This result differs a bit from your expected result, but I think you built the expected output inconsistently.
I believe my solution is correct (or at least more consistent), because:
- Row 1.0 - 3.0 is entirely within the limits of 0 4 A, so the result is just A (as in your post).
- Row 3.0 - 5.0 can be "divided" into:
  - 3.0 - 4.0, within the limits of 0 4 A (1 unit),
  - 4.0 - 5.0, within the limits of 4 5 B (also 1 unit, but you want 2 units here).
- Row 5.0 - 7.0 can again be "divided" into:
  - 5.0 - 6.0, within the limits of 5 6 C (1 unit),
  - 6.0 - 7.0, within the limits of 6 8 D (1 unit, just as you did).
- Row 7.0 - 10.0 can be "divided" into:
  - 7.0 - 8.0, within the limits of 6 8 D (1 unit, just as you did),
  - 8.0 - 10.0, within the limits of 8 10 E (2 units, not 3 as you wrote).
Combining dataframe rows based on a value in a range
Is this what you want?
ddply(df1, .(name), function(x) {
df2[(x$Position - df2$start_position) < 100000 |
(x$Position - df2$end_position) < 100000, ]
})
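A rough pandas rendering of the same row selection is a cross merge plus .query(). The data frames below are invented for illustration, and the condition is reproduced verbatim from the R code above, so note that it also matches whenever the difference is negative (the question may really want an absolute distance):
```python
import pandas as pd

# Invented data (the question's df1/df2 are not shown above).
df1 = pd.DataFrame({"name": ["g1", "g2"], "Position": [150_000, 400_000]})
df2 = pd.DataFrame({"start_position": [100_000, 600_000],
                    "end_position":   [200_000, 700_000]})

# Cross join every df1 row with every df2 row, then apply the same
# condition as the ddply call: (Position - start) < 100000 OR
# (Position - end) < 100000. Negative differences also pass.
out = (df1.merge(df2, how="cross")
          .query("(Position - start_position) < 100000 or "
                 "(Position - end_position) < 100000"))
print(out)
```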