How to Select Similar Sets in SQL

How to select similar sets in SQL

You specify

How can I write a query that can select all orders that are at least
85% similar to a specific order?

This is an important simplification compared with 'all pairs of orders
that are at least 85% similar to each other'.

We'll use some TDQD (Test-Driven Query Design) and some analysis to help us.

Preliminaries

To be remotely similar, the two orders must have at least one item in
common. This query can be used to determine which orders have at least
one item in common with a specified order:

SELECT DISTINCT I1.OrderID AS ID
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>

This prunes the list of other orders to be examined quite a lot, though
if the specified order included one of your most popular items, it's
likely that a lot of other orders also did so.

Instead of the DISTINCT, you could use:

SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID

This gives you the number of items in an order that it has in common
with the specified order. We also need the number of items in each
order:

SELECT OrderID AS ID, COUNT(*) AS Num_Total
FROM OrderItem
GROUP BY OrderID;

Identical Orders

For 100% similarity, the two orders would have as many items in common
as each has items. This would probably not find many pairs of orders,
though. We can find the orders with exactly the same items as the
specified order easily enough:

SELECT L1.ID
FROM (SELECT OrderID AS ID, COUNT(*) AS Num_Total
FROM OrderItem
GROUP BY OrderID
) AS L1
JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID
) AS L2 ON L1.ID = L2.ID AND L1.Num_Total = L2.Num_Common;

Edit: This turns out not to be stringent enough; for the orders to be identical, the number of items in the specified order must also be the same as the number in common:

SELECT L1.ID, L1.Num_Total, L2.ID, L2.Num_Common, L3.ID, L3.Num_Total
FROM (SELECT OrderID AS ID, COUNT(*) AS Num_Total
FROM OrderItem
GROUP BY OrderID
) AS L1
JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID
) AS L2 ON L1.ID = L2.ID AND L1.Num_Total = L2.Num_Common
JOIN (SELECT OrderID AS ID, COUNT(*) AS Num_Total
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
) AS L3 ON L2.Num_Common = L3.Num_Total;

Similar Orders — Analyzing the Formula

Applying the Jaccard Similarity
as defined at Wikipedia to two orders A and B, with |A| being the count
of the number of items in order A, the Jaccard Similarity J(A,B) =
|A∩B| ÷ |A∪B|
, where |A∩B| is the number of items in common to
the two orders and |A∪B| is the total number of different items
ordered.

To meet an 85% Jaccard Similarity criterion, if the number of items in
either order is less than some threshold, the orders must be identical.
For example, if both orders A and B have 5 items, say, but there's one
item different between the two, it gives you 4 items in common (|A∩B|)
and 6 items in total (|A∪B|), so the Jaccard Similarity J(A,B) is only
66⅔%.

For 85% similarity when there are N items in each of the two orders and
1 item is different, (N-1) ÷ (N+1) ≥ 0.85, which means N > 12
(12⅓ to be precise). For a fraction F = J(A,B), one item different
means (N-1) ÷ (N+1) ≥ F which can be solved for N giving N ≥ (1
+ F) ÷ (1 - F)
. As the similarity requirement goes up, the orders
must be identical for increasingly large values of N.

Generalizing still further, let's suppose we have different size orders
with N and M items (without loss of generality, N < M). The maximum
value of |A∩B| is now N and the minimum value of |A∪B| is M (meaning
all the items in the smaller order appear in the larger order). Let's
define that M = N + ∆, and that there are ∂ items present in the
smaller order that are not present in the larger order. It follows that
there are ∆+∂ items present in the larger order that are not in the
smaller order.

By definition, then, |A∩B| = N-∂, and |A∪B| = (N-∂) + ∂ +
(N+∆-(N-∂)), where the three added terms represent (1) the number of
items in common between the two orders, (2) the number of items only in
the smaller order, and (3) the number of items only in the larger order.
This simplifies to: |A∪B| = N+∆+∂.


Key Equation

For a similarity fraction F, we're interested in pairs of orders where
J(A,B) ≥ F, so:

(N-∂) ÷ (N+∆+∂) ≥ F

F ≤ (N-∂) ÷ (N+∆+∂)


We can use a spreadsheet to graph the relationship between these. For a
given number of items in the smaller order (x-axis), and for a given
similarity, we can graph the maximum value of ∂ that gives us a
similarity of F. The formula is:

∂ = (N(1-F) - F∆) ÷ (1+F)

...plot of ∂ = (N(1-F) - F∆) ÷ (1+F)...

This is a linear equation in N and ∆ for constant F; it is non-linear
for different values of F. Clearly, ∂ has to be a non-negative
integer.

Given F = 0.85, for orders that are the same size (∆=0), for 1 ≤ N <
13, ∂ = 0; for 13 ≤ N < 25, ∂ ≤ 1; for 25 ≤ N < 37, ∂ ≤ 2,
for 37 ≤ N < 50, ∂ ≤ 3.

For orders that differ by 1 (∆=1), for 1 ≤ N < 18, ∂ = 0; for 18
≤ N < 31, ∂ ≤ 1; for 31 ≤ N < 43, ∂ ≤ 2; etc. If ∆=6, you
need N=47 before the orders are still 85% similar with ∂=1. That
means the small order has 47 items, of which 46 are in common with the
large order of 53 items.

Similar Orders — Applying the Analysis

So far, so good. How can we apply that theory to selecting the orders
similar to a specified order?

First, we observe that the specified order could be the same size as a
similar order, or larger, or smaller. This complicates things a bit.

The parameters of the equation above are:

  • N – number of items in smaller order
  • ∆ — difference between number of items in larger order and N
  • F — fixed
  • ∂ — number of items in smaller order not matched in larger order

The values available using minor variations on the queries developed at the top:

  • NC — number of items in common
  • NA — number of items in specified order
  • NB — number of items in compared order

Corresponding queries:

SELECT OrderID AS ID, COUNT(*) AS NA
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID;

SELECT OrderID AS ID, COUNT(*) AS NB
FROM OrderItem
WHERE OrderID != <specified order ID>
GROUP BY OrderID;

SELECT I1.OrderID AS ID, COUNT(*) AS NC
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID

For convenience, we want the values N and N+∆ (and hence ∆) available, so
we can use a UNION to arrange things appropriately, with:

  • NS = N — number of items in smaller order
  • NL = N + ∆ — number of items in larger order

and in the second version of the UNION query, with:

  • NC = N - ∂ — number of items in common

Both queries keep the two order ID numbers so that you can track back to
the rest of the order information later.

SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL
FROM (SELECT OrderID AS ID, COUNT(*) AS NA
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
) AS v1
JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
FROM OrderItem
WHERE OrderID != <specified order ID>
GROUP BY OrderID
) AS v2
ON v1.NA <= v2.NB
UNION
SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL
FROM (SELECT OrderID AS ID, COUNT(*) AS NA
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
) AS v1
JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
FROM OrderItem
WHERE OrderID != <specified order ID>
GROUP BY OrderID
) AS v2
ON v1.NA > v2.NB

This gives us a table expression with columns OrderID_1, NS, OrderID_2,
NL, where NS is the number of items in the 'smaller order and NL is the
number of items in the larger order. Since there is no overlap in the
order numbers generated by the v1 and v2 table expressions, there's no
need to worry about 'reflexive' entries where the OrderID values are the
same. Adding NC to this is most easily handled in the UNION query too:

SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
FROM (SELECT OrderID AS ID, COUNT(*) AS NA
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
) AS v1
JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
FROM OrderItem
WHERE OrderID != <specified order ID>
GROUP BY OrderID
) AS v2
ON v1.NA <= v2.NB
JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID
) AS v3
ON v3.ID = v2.ID
UNION
SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
FROM (SELECT OrderID AS ID, COUNT(*) AS NA
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
) AS v1
JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
FROM OrderItem
WHERE OrderID != <specified order ID>
GROUP BY OrderID
) AS v2
ON v1.NA > v2.NB
JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID
) AS v3
ON v3.ID = v1.ID

This gives us a table expression with columns OrderID_1, NS, OrderID_2,
NL, NC, where NS is the number of items in the 'smaller order and NL is
the number of items in the larger order, and NC is the number of items
in common.

Given NS, NL, NC, we are looking for orders that satisfy:

(N-∂) ÷ (N+∆+∂) ≥ F.

  • N – number of items in smaller order
  • ∆ — difference between number of items in larger order and N
  • F — fixed
  • ∂ — number of items in smaller order not matched in larger order

  • NS = N — number of items in smaller order

  • NL = N + ∆ — number of items in larger order
  • NC = N - ∂ — number of items in common

The condition, therefore, needs to be:

NC / (NL + (NS - NC)) ≥ F

The term on the LHS must be evaluated as a floating point number, not as
an integer expression. Applying that to the UNION query above, yields:

SELECT OrderID_1, NS, OrderID_2, NL, NC,
CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity
FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
FROM (SELECT OrderID AS ID, COUNT(*) AS NA
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
) AS v1
JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
FROM OrderItem
WHERE OrderID != <specified order ID>
GROUP BY OrderID
) AS v2
ON v1.NA <= v2.NB
JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID
) AS v3
ON v3.ID = v2.ID
UNION
SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
FROM (SELECT OrderID AS ID, COUNT(*) AS NA
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
) AS v1
JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
FROM OrderItem
WHERE OrderID != <specified order ID>
GROUP BY OrderID
) AS v2
ON v1.NA > v2.NB
JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID
) AS v3
ON v3.ID = v1.ID
) AS u
WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F

You might observe that this query only uses the OrderItem table; the
Order and Item tables are not needed.


Warning: partially tested SQL (caveat lector). The SQL above now seems to produce plausible answers on minuscule data sets. I adjusted the similarity requirement (0.25, then 0.55) and got plausible values and appropriate selectivity. However, my test data had but 8 items in the biggest order, and certainly wasn't covering the full scope of the described data. Since the DBMS I use most frequently does not support CTEs, the SQL below is untested. However, I am moderately confident that unless I made a big mistake, the CTE code in version 1 (with lots of repetition of the specified order ID) should be clean. I think version 2 may be OK too, but...it is untested.

There may be more compact ways of expressing the query, possibly using
the OLAP functions.

If I was going to test this, I'd create a table with a few
representative sets of order items, checking that the similarity measure
returned was sensible. I'd work the queries more or less as shown,
gradually building up the complex query. If one of the expressions was
shown to be flawed, then I'd make appropriate adjustments in that query
until the flaw was fixed.

Clearly, performance will be an issue. The innermost queries are not
dreadfully complex, but they aren't wholy trivial. However, measurement
will show whether it's a dramatic problem or just a nuisance. Studying
the query plans may help. It seems very probable that there should be
an index on OrderItem.OrderID; the queries are unlikely to perform well
if there isn't such an index. That is unlikely to be a problem since it
is a foreign key column.

You might get some benefit out of using 'WITH clauses' (Common Table Expressions). They would make explicit the repetition that is implicit in the two halves of the UNION sub-query.


Using Common Table Expressions

Using common table expressions clarifies to the optimizer when
expressions are the same, and may help it perform better. They also
help the humans reading your query. The query above does rather beg for
the use of CTEs.

Version 1: Repeating the specified order number

WITH SO AS (SELECT OrderID AS ID, COUNT(*) AS NA       -- Specified Order (SO)
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
),
OO AS (SELECT OrderID AS ID, COUNT(*) AS NB -- Other orders (OO)
FROM OrderItem
WHERE OrderID != <specified order ID>
GROUP BY OrderID
),
CI AS (SELECT I1.OrderID AS ID, COUNT(*) AS NC -- Common Items (CI)
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID
)
SELECT OrderID_1, NS, OrderID_2, NL, NC,
CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity
FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
FROM SO AS v1
JOIN OO AS v2 ON v1.NA <= v2.NB
JOIN CI AS v3 ON v3.ID = v2.ID
UNION
SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
FROM SO AS v1
JOIN OO AS v2 ON v1.NA > v2.NB
JOIN CI AS v3 ON v3.ID = v1.ID
) AS u
WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F

Version 2: Avoiding repeating the specified order number

WITH SO AS (SELECT OrderID AS ID, COUNT(*) AS NA       -- Specified Order (SO)
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
),
OO AS (SELECT OI.OrderID AS ID, COUNT(*) AS NB -- Other orders (OO)
FROM OrderItem AS OI
JOIN SO ON OI.OrderID != SO.ID
GROUP BY OI.OrderID
),
CI AS (SELECT I1.OrderID AS ID, COUNT(*) AS NC -- Common Items (CI)
FROM OrderItem AS I1
JOIN SO AS S1 ON I1.OrderID != S1.ID
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID
JOIN SO AS S2 ON I2.OrderID = S2.ID
GROUP BY I1.OrderID
)
SELECT OrderID_1, NS, OrderID_2, NL, NC,
CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity
FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
FROM SO AS v1
JOIN OO AS v2 ON v1.NA <= v2.NB
JOIN CI AS v3 ON v3.ID = v2.ID
UNION
SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
FROM SO AS v1
JOIN OO AS v2 ON v1.NA > v2.NB
JOIN CI AS v3 ON v3.ID = v1.ID
) AS u
WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F

Neither of these is an easy read; both are easier than the big SELECT with the CTEs written out in full.


Minimal test data

This is inadequate for good testing. It gives a small modicum of confidence (and it did show up the problem with the 'identical order' query.

CREATE TABLE Order (ID SERIAL NOT NULL PRIMARY KEY);
CREATE TABLE Item (ID SERIAL NOT NULL PRIMARY KEY);
CREATE TABLE OrderItem
(
OrderID INTEGER NOT NULL REFERENCES Order,
ItemID INTEGER NOT NULL REFERENCES Item,
Quantity DECIMAL(8,2) NOT NULL
);

INSERT INTO Order VALUES(1);
INSERT INTO Order VALUES(2);
INSERT INTO Order VALUES(3);
INSERT INTO Order VALUES(4);
INSERT INTO Order VALUES(5);
INSERT INTO Order VALUES(6);
INSERT INTO Order VALUES(7);

INSERT INTO Item VALUES(111);
INSERT INTO Item VALUES(222);
INSERT INTO Item VALUES(333);
INSERT INTO Item VALUES(444);
INSERT INTO Item VALUES(555);
INSERT INTO Item VALUES(666);
INSERT INTO Item VALUES(777);
INSERT INTO Item VALUES(888);
INSERT INTO Item VALUES(999);

INSERT INTO OrderItem VALUES(1, 111, 1);
INSERT INTO OrderItem VALUES(1, 222, 1);
INSERT INTO OrderItem VALUES(1, 333, 1);
INSERT INTO OrderItem VALUES(1, 555, 1);

INSERT INTO OrderItem VALUES(2, 111, 1);
INSERT INTO OrderItem VALUES(2, 222, 1);
INSERT INTO OrderItem VALUES(2, 333, 1);
INSERT INTO OrderItem VALUES(2, 555, 1);

INSERT INTO OrderItem VALUES(3, 111, 1);
INSERT INTO OrderItem VALUES(3, 222, 1);
INSERT INTO OrderItem VALUES(3, 333, 1);
INSERT INTO OrderItem VALUES(3, 444, 1);
INSERT INTO OrderItem VALUES(3, 555, 1);
INSERT INTO OrderItem VALUES(3, 666, 1);

INSERT INTO OrderItem VALUES(4, 111, 1);
INSERT INTO OrderItem VALUES(4, 222, 1);
INSERT INTO OrderItem VALUES(4, 333, 1);
INSERT INTO OrderItem VALUES(4, 444, 1);
INSERT INTO OrderItem VALUES(4, 555, 1);
INSERT INTO OrderItem VALUES(4, 777, 1);

INSERT INTO OrderItem VALUES(5, 111, 1);
INSERT INTO OrderItem VALUES(5, 222, 1);
INSERT INTO OrderItem VALUES(5, 333, 1);
INSERT INTO OrderItem VALUES(5, 444, 1);
INSERT INTO OrderItem VALUES(5, 555, 1);
INSERT INTO OrderItem VALUES(5, 777, 1);
INSERT INTO OrderItem VALUES(5, 999, 1);

INSERT INTO OrderItem VALUES(6, 111, 1);
INSERT INTO OrderItem VALUES(6, 222, 1);
INSERT INTO OrderItem VALUES(6, 333, 1);
INSERT INTO OrderItem VALUES(6, 444, 1);
INSERT INTO OrderItem VALUES(6, 555, 1);
INSERT INTO OrderItem VALUES(6, 777, 1);
INSERT INTO OrderItem VALUES(6, 888, 1);
INSERT INTO OrderItem VALUES(6, 999, 1);

INSERT INTO OrderItem VALUES(7, 111, 1);
INSERT INTO OrderItem VALUES(7, 222, 1);
INSERT INTO OrderItem VALUES(7, 333, 1);
INSERT INTO OrderItem VALUES(7, 444, 1);
INSERT INTO OrderItem VALUES(7, 555, 1);
INSERT INTO OrderItem VALUES(7, 777, 1);
INSERT INTO OrderItem VALUES(7, 888, 1);
INSERT INTO OrderItem VALUES(7, 999, 1);
INSERT INTO OrderItem VALUES(7, 666, 1);

How to select 4 equally sized result sets from database table

You should look at the NTILE window function - it partitions a set of rows into chunks (any number of them - you decide) and allows you to easily pick the one you need:

WITH ChunkedData AS
(
SELECT
Chunk = NTILE(4) OVER (ORDER BY MyField ASC),
*
FROM MyTable
)
SELECT *
FROM ChunkedData
WHERE Chunk = 1

With the NTILE(4) window function, you basically get all your rows labelled with a 1, 2, 3 or 4 - 4 almost equal chunks of data. Pick the one you need - works like a charm!

And of course, if you need to, you can use other number of chunks - NTILE(10) gives you 10 equally sized chunks - your pick.

Select rows with most similar set of attributes

Working with your current design, this uses only basic SQL features and should work for Postgres 8.3, too (untested):

SELECT *
FROM photos p
WHERE (
SELECT count(*) >= 1 -- k other photos
FROM (
SELECT 1
FROM tags t1
JOIN tags t2 USING (tag)
WHERE t1.photo_id = p.id
AND t1.user_id = p.user_id
AND (t2.photo_id <> p.id OR
t2.user_id <> p.user_id)
GROUP BY t2.photo_id, t2.user_id
HAVING count(*) >= 1 -- j common tags
) t1
);

Or:

SELECT *
FROM (
SELECT id, user_id
FROM (
SELECT t1.photo_id AS id, t1.user_id
FROM tags t1
JOIN tags t2 USING (tag)
WHERE (t2.photo_id <> t1.photo_id OR
t2.user_id <> t1.user_id)
GROUP BY t1.photo_id, t1.user_id, t2.photo_id, t2.user_id
HAVING count(*) >= 1 -- j common tags
) sub1
GROUP BY 1, 2
HAVING count(*) >= 1 -- k other photos
) sub2
JOIN photos p USING (id, user_id);

In Postgres 9.3 or later you could use a correlated subquery with a LATERAL join ...
The above are probably even faster than my first query:

SELECT *
FROM (
SELECT photo_id, user_id
FROM tags t
GROUP BY 1, 2
HAVING (
SELECT count(*) >= 1
FROM (
SELECT photo_id, user_id
FROM tags
WHERE tag = ANY(array_agg(t.tag))
AND (photo_id <> t.photo_id OR
user_id <> t.user_id)
GROUP BY 1, 2
HAVING count(*) >= 2
) t1
)
) t
JOIN photos p ON p.id = t.photo_id
AND p.user_id = t.user_id;

SQL Fiddle showing both on Postgres 9.3.

The 1st query just needs the right basic indexes.

For the 2nd, I would build a materialized view with integer arrays, install the intarray module, a GIN index on the integer array column for better performance ...

Related:

  • Order result by count of common array elements

Proper design

It would be much more efficient to have a single column serial PK for photos and only store IDs of tags per photo ...:

CREATE TABLE  photo (
photo_id serial PRIMARY KEY
, user_id int NOT NULL
);

CREATE TABLE tag (
tag_id serial PRIMARY KEY
, tag text UNIQUE NOT NULL
);

CREATE TABLE photo_tag (
photo_id int REFERENCES (photo)
, tag_id int REFERENCES (tag)
, PRIMARY KEY (photo_id, tag_id)
);

Would make the query much simpler and faster, too.

  • How to implement a many-to-many relationship in PostgreSQL?

SQL- Selecting the most similar product

Try this

SELECT 
a_product_id,
COALESCE( b_product_id, 'no_matchs_found' ) AS closest_product_match
FROM (
SELECT
*,
@row_num := IF(@prev_value=A_product_id,@row_num+1,1) AS row_num,
@prev_value := a_product_id
FROM
(SELECT @prev_value := 0) r
JOIN (
SELECT
a.product_id as a_product_id,
b.product_id as b_product_id,
count( distinct b.Attributes ),
count( distinct b2.Attributes ) as total_products
FROM
products a
LEFT JOIN products b ON ( a.Attributes = b.Attributes AND a.product_id <> b.product_id )
LEFT JOIN products b2 ON ( b2.product_id = b.product_id )
/*WHERE */
/* a.product_id = 3 */
GROUP BY
a.product_id,
b.product_id
ORDER BY
1, 3 desc, 4
) t
) t2
WHERE
row_num = 1

The above query gets the closest matches for all the products, you can include the product_id in the innermost query, to get the results for a particular product_id, I have used LEFT JOIN so that even if a product has no matches, its displayed

SQLFIDDLE

Hope this helps

Select rows with same id but different value in another column

This ought to do it:

SELECT *
FROM YourTable
WHERE ARIDNR IN (
SELECT ARIDNR
FROM YourTable
GROUP BY ARIDNR
HAVING COUNT(*) > 1
)

The idea is to use the inner query to identify the records which have a ARIDNR value that occurs 1+ times in the data, then get all columns from the same table based on that set of values.

SET versus SELECT when assigning variables?

Quote, which summarizes from this article:

  1. SET is the ANSI standard for variable assignment, SELECT is not.
  2. SET can only assign one variable at a time, SELECT can make multiple assignments at once.
  3. If assigning from a query, SET can only assign a scalar value. If the query returns multiple values/rows then SET will raise an error. SELECT will assign one of the values to the variable and hide the fact that multiple values were returned (so you'd likely never know why something was going wrong elsewhere - have fun troubleshooting that one)
  4. When assigning from a query if there is no value returned then SET will assign NULL, where SELECT will not make the assignment at all (so the variable will not be changed from its previous value)
  5. As far as speed differences - there are no direct differences between SET and SELECT. However SELECT's ability to make multiple assignments in one shot does give it a slight speed advantage over SET.

How to select only one set of rows from a group by result table?

Use DISTINCT ON:

SELECT DISTINCT ON (name) name, working, office, sum(hour) as hours
FROM table1 t1 JOIN
table2 t2
USING (a) JOIN
table3 t3
USING (x, y, z) JOIN
table4 tf
USING (b, c)
GROUP BY name, office, working
ORDER BY name, sum(hour) DESC;

DISTINCT ON is a convenient Postgres extension. It returns one row per keys in select. The one row is determined by the ORDER BY.

Is there a combination of LIKE and IN in SQL?

There is no combination of LIKE & IN in SQL, much less in TSQL (SQL Server) or PLSQL (Oracle). Part of the reason for that is because Full Text Search (FTS) is the recommended alternative.

Both Oracle and SQL Server FTS implementations support the CONTAINS keyword, but the syntax is still slightly different:

Oracle:

WHERE CONTAINS(t.something, 'bla OR foo OR batz', 1) > 0

SQL Server:

WHERE CONTAINS(t.something, '"bla*" OR "foo*" OR "batz*"')

The column you are querying must be full-text indexed.

Reference:

  • Building Full-Text Search Applications with Oracle Text
  • Understanding SQL Server Full-Text


Related Topics



Leave a reply



Submit