How to Join Tables on Regex

How to join tables on regex

A very odd way to join. Every match on one side is combined with every row from the other table ...

regexp_matches() is probably the wrong function for your purpose. You want a simple regular expression match (~). Actually, the LIKE operator will be faster:

Presumably fastest with LIKE

SELECT msg.message
, msg.src_addr
, msg.dst_addr
, mnc.name
FROM mnc
JOIN msg ON msg.src_addr LIKE ('%38' || mnc.code || '%')
OR msg.dst_addr LIKE ('%38' || mnc.code || '%')
WHERE length(mnc.code) = 3;

In addition, you only want mnc.code of exactly 3 characters.

With regexp match

You could write the same with regular expressions but it will most definitely be slower. Here is a working example close to your original:

SELECT msg.message
, msg.src_addr
, msg.dst_addr
, mnc.name
FROM mnc
JOIN msg ON (msg.src_addr || '+' || msg.dst_addr) ~ (38 || mnc.code)
AND length(mnc.code) = 3;

This also requires msg.src_addr and msg.dst_addr to be NOT NULL.

The second query demonstrates how the additional check length(mnc.code) = 3 can go into the JOIN condition or a WHERE clause. Same effect here.

With regexp_matches()

You could make this work with regexp_matches():

SELECT msg.message
, msg.src_addr
, msg.dst_addr
, mnc.name
FROM mnc
JOIN msg ON EXISTS (
SELECT *
FROM regexp_matches(msg.src_addr ||'+'|| msg.dst_addr, '38(...)', 'g') x(y)
WHERE y[1] = mnc.code
);

But it will be slow in comparison.

Explanation:

Your regexp_matches() expression just returns an array of all captured substrings of the first match. As you only capture one substring (one pair of brackets in your pattern), you will exclusively get arrays with one element.

You get all matches with the additional "globally" switch 'g' - but in multiple rows. So you need a sub-select to test them all (or aggregate). Put that in an EXISTS - semi-join and you arrive at what you wanted.

Maybe you can report back with a performance test of all three?
Use EXPLAIN ANALYZE for that.

bigquery join tables using regex

is there any more efficient way?

If by "more efficient" you mean less verbose and easier to read / maintain - try below

#standardSQL
create temp function part(host string) as (
regexp_extract(host, r'.*req-([a-z0-9]{12})')
);
select *
from `config.tested` a
join `config.active` b
on part(a.host) = part(b.host)

or its variation

REGEX join in SQL BigQuery

With your sample data I was able to create a StandardSQL query to achieve the desired output. I have uploaded in BigQuery two tables one called chain_name and the second table store_name with the 1st and 2nd chunk of data you shared, respectively.

In order to CROSS JOIN the 2 tables analysing if brand_name is inside store_name(which may contain the the store's address in addition to the brand_name), I have used the HAVING clause, which filters out the rows that do not return TRUE when they are evaluated against a bool expression. Also, REGEX_CONTAINS() which returns true if the analysed expressions are a partial match.

Below is the query,

SELECT salesperson, chain_id, chain_name, brand_name, store_name FROM `project_id.dataset.store_name` 
CROSS JOIN `project_id.dataset.chain_name`
GROUP BY 1,2,3,4,5
HAVING REGEXP_CONTAINS(lower(store_name),lower(brand_name))
ORDER BY chain_id

And the output,

Row salesperson chain_id    chain_name  brand_name  store_name
1 pepito 28 chairs4all sitnice sitnice
2 pepito 28 chairs4all AwwwChairs! AwwwChairs! Seville
3 pepito 28 chairs4all sitnice SitNice nº12
4 jaimito 34 eatdrinkeat Drink & eat Drink & eat North road 23
5 jaimito 567 last One One 34

Notice that LOWER() is used in order to match the names even though there might be some lower/caption letters difference.

Regular expression in SQL join condition

Try like this:

Select * from TableA a
Inner join TableB b
On Right(a.Column1,3)= b.Column1

Also you can use CTE to get length of the right joined table data (say 233) which will be 3 in length and reuse this length in the RIGHT function parameter where i have used 3.

EDIT:

Without CTE i think this should work:

Select * from TableA a
Inner join TableB b
On Right(a.Column1,LEN(b.Column1))= b.Column1

SQL(ite) JOIN with regex within JOIN

How about just using the lower() function?

SELECT *, upper(substr(t1.name, 1, 1) || lower(t1.name, 2) as NameWithCapAtFront
FROM table1 as t1 LEFT JOIN
table2 as t2
ON lower(t1.Name) = lower(t2.Name);

Admittedly, this lower cases the whole name, but that seems reasonable in this case.



Related Topics



Leave a reply



Submit