How to Fuzzy Join Based on Multiple Columns and Conditions

fuzzy join with multiple conditions

I've simplified the structures, using Series instead of DataFrame, and the indexes start at zero.
cumsum() and searchsorted() are applied.

Load = pd.Series([300,0,0,400,50,0,0,0,150,0])  # aka 'A'
Rate = pd.Series([102,103,94,120,145,114,126,117,107,100]) # aka 'B'

# Storage for the result:
H=[] # [ (indexLoad, Load, indexRate, excess) ... ]

# Find the 1st non 0 load:
load1_idx= len(Load)

for lix in range(len(Load)):
a= Load[lix]
if a!=0:
csumser= Rate.cumsum()
rix= csumser.searchsorted(a)
excess= csumser[rix]-a
H.append( (lix,a,rix,excess) )
load1_idx=lix
break

# Processing
for lix in range(load1_idx+1,len(Load)):

a=Load[lix]
if a==0:
continue

last_rix= H[-1][-2]
csumser[last_rix:]= Rate[last_rix:]
if lix==last_rix:
csumser[lix]= H[-1][-1] # excess

csumser[last_rix:]= csumser[last_rix:].cumsum()

rix= csumser[last_rix:].searchsorted(a)
rix+= last_rix
excess= csumser[rix]-a
H.append( (lix,a,rix,excess) )

df= pd.DataFrame(H, columns=["indexLoad","Load","indexRate","rate_excess"])
print(df)

indexLoad Load indexRate rate_excess
0 0 300 3 119
1 3 400 6 104
2 4 50 6 76
3 8 150 7 93

Fuzzy matching multiple columns in BigQuery when left-joining

Consider below approach

select 
array_agg(b.playerId order by d limit 1) playerId,
any_value(a).*
from (
select
`bqutil.fn.levenshtein`(a.firstName, b.firstName) / greatest(length(a.firstName), length(b.firstName)) +
`bqutil.fn.levenshtein`(a.lastName, b.lastName) / greatest(length(a.lastName), length(b.lastName)) +
`bqutil.fn.levenshtein`(a.school, b.teamMarket) / greatest(length(a.school), length(b.teamMarket)) +
`bqutil.fn.levenshtein`(a.conference, b.conferenceName) / greatest(length(a.conference), length(b.conferenceName)) d,
a, b
from lhs_table as a
cross join rhs_table as b
)
group by to_json_string(a)

if applied to sample data in your question - output is

Sample Image

You can play/experiment with variation of above, like below as an example

select 
array_agg(b.playerId order by d limit 1) playerId,
any_value(a).*
from (
select
`bqutil.fn.levenshtein`(format('%t', a), format('%t', b)) d,
a, b
from lhs_table as a
cross join rhs_table as b
)
group by to_json_string(a)


Related Topics



Leave a reply



Submit