fuzzy join with multiple conditions
I've simplified the structures, using Series instead of DataFrame, and the indexes start at zero.
cumsum() and searchsorted() are applied.
Load = pd.Series([300,0,0,400,50,0,0,0,150,0]) # aka 'A'
Rate = pd.Series([102,103,94,120,145,114,126,117,107,100]) # aka 'B'
# Storage for the result:
H=[] # [ (indexLoad, Load, indexRate, excess) ... ]
# Find the 1st non 0 load:
load1_idx= len(Load)
for lix in range(len(Load)):
a= Load[lix]
if a!=0:
csumser= Rate.cumsum()
rix= csumser.searchsorted(a)
excess= csumser[rix]-a
H.append( (lix,a,rix,excess) )
load1_idx=lix
break
# Processing
for lix in range(load1_idx+1,len(Load)):
a=Load[lix]
if a==0:
continue
last_rix= H[-1][-2]
csumser[last_rix:]= Rate[last_rix:]
if lix==last_rix:
csumser[lix]= H[-1][-1] # excess
csumser[last_rix:]= csumser[last_rix:].cumsum()
rix= csumser[last_rix:].searchsorted(a)
rix+= last_rix
excess= csumser[rix]-a
H.append( (lix,a,rix,excess) )
df= pd.DataFrame(H, columns=["indexLoad","Load","indexRate","rate_excess"])
print(df)
indexLoad Load indexRate rate_excess
0 0 300 3 119
1 3 400 6 104
2 4 50 6 76
3 8 150 7 93
Fuzzy matching multiple columns in BigQuery when left-joining
Consider below approach
select
array_agg(b.playerId order by d limit 1) playerId,
any_value(a).*
from (
select
`bqutil.fn.levenshtein`(a.firstName, b.firstName) / greatest(length(a.firstName), length(b.firstName)) +
`bqutil.fn.levenshtein`(a.lastName, b.lastName) / greatest(length(a.lastName), length(b.lastName)) +
`bqutil.fn.levenshtein`(a.school, b.teamMarket) / greatest(length(a.school), length(b.teamMarket)) +
`bqutil.fn.levenshtein`(a.conference, b.conferenceName) / greatest(length(a.conference), length(b.conferenceName)) d,
a, b
from lhs_table as a
cross join rhs_table as b
)
group by to_json_string(a)
if applied to sample data in your question - output is
You can play/experiment with variation of above, like below as an example
select
array_agg(b.playerId order by d limit 1) playerId,
any_value(a).*
from (
select
`bqutil.fn.levenshtein`(format('%t', a), format('%t', b)) d,
a, b
from lhs_table as a
cross join rhs_table as b
)
group by to_json_string(a)
Related Topics
R - Error When Using Geturl from Curl After Site Was Changed
Show Source Code for a Function in a Package in R
How to Create a Continuous Legend (Color Bar Style) for Scale_Alpha
Creating a Cumulative Step Graph in R
Large Matrices in Rcpparmadillo via The Arma_64Bit_Word Define
Multiplication of Large Integers
Data.Table Join (Multiple) Selected Columns with New Names
How to Force Ggplot's Geom_Tile to Fill Every Facet
In R, Merge Two Data Frames, Fill Down The Blanks
Coloring a Geom_Histogram by Gradient
Mlogit: Missing Value Where True/False Needed
Spread with Duplicate Identifiers for Rows
R Package Conflict Between Gam and Mgcv