Group or Distinct After Join Returns Duplicates

When retrieving all or most rows from a table, the fastest approach for this type of query is typically to aggregate / disambiguate first and join later:

SELECT *
FROM products p
JOIN (
SELECT DISTINCT ON (product_id) *
FROM meta
ORDER BY product_id, id DESC
) m ON m.product_id = p.id;

The more rows in meta per row in products, the bigger the impact on performance.

Of course, you'll want to add an ORDER BY clause in the subquery to define which row to pick from each set. @Craig and @Clodoaldo already told you about that. I am returning the meta row with the highest id.

Details for DISTINCT ON:

  • Select first row in each GROUP BY group?

Optimize performance

Still, this is not always the fastest solution. Depending on data distribution there are various other query styles. For this simple case involving another join, this one ran considerably faster in a test with big tables:

SELECT p.*, sub.meta_id, m.product_id, m.price, m.flag
FROM (
SELECT product_id, max(id) AS meta_id
FROM meta
GROUP BY 1
) sub
JOIN meta m ON m.id = sub.meta_id
JOIN products p ON p.id = sub.product_id;

If you didn't use the non-descriptive id as a column name, we would not run into naming collisions and could simply write SELECT p.*, m.*. (I never use id as a column name.)
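To see the "aggregate first, join later" shape in action, here is a minimal sketch run on SQLite through Python's sqlite3 module rather than on Postgres. The table contents are made up for illustration; only the table and column names products, meta, product_id, id and price come from the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE meta (id INTEGER PRIMARY KEY, product_id INTEGER, price INTEGER);
    -- Hypothetical sample data: product 1 has three meta rows, product 2 has one.
    INSERT INTO products VALUES (1, 'apple'), (2, 'pear');
    INSERT INTO meta VALUES (10, 1, 5), (11, 1, 7), (12, 2, 3), (13, 1, 9);
""")

# A plain join returns one row per meta row, so product 1 shows up three times.
plain_join = conn.execute(
    "SELECT p.id, m.id FROM products p JOIN meta m ON m.product_id = p.id"
).fetchall()

# Reduce meta to one id per product first, then join back for the other columns.
latest = conn.execute("""
    SELECT p.id, p.name, sub.meta_id, m.price
    FROM (
        SELECT product_id, max(id) AS meta_id
        FROM meta
        GROUP BY product_id
    ) sub
    JOIN meta m ON m.id = sub.meta_id
    JOIN products p ON p.id = sub.product_id
    ORDER BY p.id
""").fetchall()

print(len(plain_join), latest)
```

Each product appears once in latest, paired with its meta row that has the highest id. Relative timings on SQLite say nothing about Postgres, so treat this as a check of the result shape only.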

If performance is your paramount requirement, consider more options:

  • a MATERIALIZED VIEW with pre-aggregated data from meta, if your data does not change (much).
  • a recursive CTE emulating a loose index scan for a big meta table with many rows per product (relatively few distinct product_id).

    This is the only way I know to use an index for a DISTINCT query over the whole table.
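A rough sketch of the recursive CTE idea, run on SQLite via Python's sqlite3 module so it can be tried without a Postgres server. The sample rows and the index name are made up. Each step asks only for the smallest product_id greater than the previous one, which is the kind of lookup an index on product_id can answer; whether a given planner actually uses the index needs checking on your own system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE meta (id INTEGER PRIMARY KEY, product_id INTEGER NOT NULL);
    CREATE INDEX meta_product_id_idx ON meta (product_id);
    -- Hypothetical data: many rows, few distinct product_id values.
    INSERT INTO meta (product_id) VALUES (1), (1), (1), (2), (2), (5), (5), (5), (5);
""")

distinct_ids = [row[0] for row in conn.execute("""
    WITH RECURSIVE t(product_id) AS (
        SELECT min(product_id) FROM meta          -- start at the smallest value
        UNION ALL
        SELECT (SELECT min(product_id)            -- hop to the next larger value
                FROM meta
                WHERE product_id > t.product_id)
        FROM t
        WHERE t.product_id IS NOT NULL            -- stop once nothing larger exists
    )
    SELECT product_id FROM t
    WHERE product_id IS NOT NULL                  -- drop the terminating NULL row
    ORDER BY product_id
""")]

print(distinct_ids)
```

The result is the same set that SELECT DISTINCT product_id FROM meta would return, reached by one lookup per distinct value instead of a pass over every row.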

Finding Duplicates: GROUP BY and DISTINCT giving different answers

Your counting logic is off, and mine was too, until I came up with a simple example to better understand your question. Imagine a simple table with only one column, text:

text
----
A
B
B
C
C
C

Running SELECT COUNT(*) just yields 6 records, as expected. SELECT DISTINCT text returns 3 records, for A, B and C. Finally, SELECT text with GROUP BY text HAVING COUNT(*) > 1 returns only two records, for the B and C groups.

None of these numbers add up. The issue here is that a distinct select returns records which are not duplicated in addition to records which are. Also, a given duplicate record could occur more than twice. Your current comparison is somewhat apples to oranges.
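The three counts are easy to reproduce. This is a small sketch on SQLite through Python's sqlite3 module, using the six-row table from above; the table name t is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (text TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in "ABBCCC"])

# All rows, duplicates included.
total = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# One row per value, whether or not that value was duplicated.
distinct_values = conn.execute(
    "SELECT DISTINCT text FROM t ORDER BY text"
).fetchall()

# Only the values that occur more than once, with how often they occur.
duplicated = conn.execute("""
    SELECT text, COUNT(*)
    FROM t
    GROUP BY text
    HAVING COUNT(*) > 1
    ORDER BY text
""").fetchall()

# Rows that would have to go to leave one copy of each value.
surplus = total - len(distinct_values)

print(total, len(distinct_values), len(duplicated), surplus)
```

Here surplus is 3 (one extra B and two extra C), which matches neither the DISTINCT count nor the HAVING count in general; in this tiny example it only happens to equal the number of distinct values.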

Edit:

If you want to remove all duplicates in your six-column table, leaving only one distinct record from all columns, then try using a deletable CTE:

WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ScanNumber, DB_ID, PluginID,
PluginID_Version, Result, ActualValue
ORDER BY (SELECT NULL)) rn
FROM DBAScanResults
)

DELETE
FROM cte
WHERE rn > 1;

SQL Server DISTINCT and GROUP BY still returning duplicates (subquery format)

If you only want 1 image per site and it doesn't matter which image, you can use a CROSS APPLY instead of an INNER JOIN:

SELECT 
a.*
, b.Bytes
FROM
(
SELECT DISTINCT
a_inner.Number
, a_inner.Latitude
, a_inner.Longitude
, b_inner.RetiredOn
, a_inner.Name
, a_inner.Zipcode
, b_inner.Oid
FROM
"AM-Martin".dbo.CpCore_Site a_inner
INNER JOIN "AM-Martin".dbo.CpSm_Face b_inner on b_inner.SiteId = a_inner.Oid
WHERE
b_inner.RetiredOn LIKE '%9999%'
AND (b_inner.Number LIKE N'%LA%' OR b_inner.Number LIKE N'%LC%' OR b_inner.Number LIKE N'%BH%')
AND b_inner.Latitude > 0.0
) AS a
CROSS APPLY(SELECT TOP 1
Bytes
FROM "AM-Martin_bin".dbo.CpCore_Image b
WHERE a.Oid = b.OwnerId) b;

You can alter which image you pick for a site with multiple images by adding an ORDER BY in the CROSS APPLY subquery so that the image you want comes first, or by adding an additional filter.
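SQLite has no CROSS APPLY, so the sketch below swaps in a different technique that plays the same role: a join against a correlated subquery with ORDER BY and LIMIT 1. It runs through Python's sqlite3 module, and the site and image tables are simplified, hypothetical stand-ins for the real ones. The point is only to show how the ORDER BY decides which image wins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE site (oid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE image (id INTEGER PRIMARY KEY, owner_id INTEGER, bytes TEXT);
    -- Hypothetical data: site 1 has two images, site 2 has one, site 3 has none.
    INSERT INTO site VALUES (1, 'north'), (2, 'south'), (3, 'east');
    INSERT INTO image VALUES (1, 1, 'n-old'), (2, 1, 'n-new'), (3, 2, 's-only');
""")

rows = conn.execute("""
    SELECT s.oid, s.name, i.bytes
    FROM site s
    JOIN image i ON i.id = (
        SELECT id
        FROM image
        WHERE owner_id = s.oid
        ORDER BY id DESC      -- newest image first; change this to pick another
        LIMIT 1
    )
    ORDER BY s.oid
""").fetchall()

print(rows)
```

As with CROSS APPLY, a site without any image drops out of the result. An OUTER APPLY (or a LEFT JOIN in this sketch) would keep it with a NULL image instead.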

Duplicates with INNER JOIN even with DISTINCT

The problem was kinda easy to resolve: I added a WHERE clause and the duplicates were gone.

SELECT * FROM fotos INNER JOIN occasions ON occasions.Internnummer = fotos.Internnummer WHERE VolgNr = 0 

Mysql DISTINCT still return duplicate value

You need DISTINCT inside GROUP_CONCAT() only:

SELECT u.name,
GROUP_CONCAT(DISTINCT f.name) AS friends
................................................

Note that SELECT DISTINCT ... does not make sense in your query because you are using GROUP BY which returns distinct rows for each user.
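A small sketch of the difference, run on SQLite through Python's sqlite3 module (SQLite also accepts DISTINCT inside group_concat). The friends and posts tables and their rows are made up; posts is there only to multiply the joined rows the way a second one-to-many join does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friends (user_id INTEGER, name TEXT);
    CREATE TABLE posts (user_id INTEGER, title TEXT);
    -- Hypothetical data: one user with two friends and two posts.
    INSERT INTO users VALUES (1, 'ann');
    INSERT INTO friends VALUES (1, 'bob'), (1, 'cy');
    INSERT INTO posts VALUES (1, 'p1'), (1, 'p2');
""")

# The two joins produce 2 x 2 = 4 rows for the user, so every friend is seen twice.
plain, deduped = conn.execute("""
    SELECT GROUP_CONCAT(f.name),
           GROUP_CONCAT(DISTINCT f.name)
    FROM users u
    JOIN friends f ON f.user_id = u.id
    JOIN posts p ON p.user_id = u.id
    GROUP BY u.id
""").fetchone()

print(plain, "|", deduped)
```

The first column repeats every friend once per post, while the second lists each friend once. The order of names inside each string is not guaranteed.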

MySQL many-to-many JOIN returning duplicates

To follow on from the comments about performance: you need a DISTINCT in your query. Try:

SELECT DISTINCT Name FROM users
INNER JOIN users_jobs
ON users_jobs.user_id = users.id
WHERE users.gender = 'male'

If you're looking to get all the columns but keep the ids distinct, you can use a GROUP BY. Try:

SELECT * FROM users
INNER JOIN users_jobs
ON users_jobs.user_id = users.id
WHERE users.gender = 'male'
GROUP BY users.id

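This will also affect performance, so it depends on what you prioritize the most. Note that SELECT * combined with GROUP BY users.id relies on MySQL's permissive handling of GROUP BY; with ONLY_FULL_GROUP_BY enabled the query is rejected unless every selected column is functionally dependent on users.id.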

Using GROUP BY or DISTINCT with a LEFT JOIN

Your detail query -- the query returning every row, rather than the deduplicated version with DISTINCT or GROUP BY -- is finding more than one row in users matching each row in orders. So, it is dutifully returning all those rows.

To solve your problem correctly you need to figure out why there are multiple users rows for each order. That is, for some values of orders.user_id there are multiple values of users.id.

That seems a little strange to me, but I do not understand your data model. You probably need to investigate this data anomaly. A conventional schema would have each user able to place multiple orders, but each order relating to only one user. In that schema this query would yield one row per order but still include users with no orders:

SELECT u.id AS user_id, o.id AS order_id
FROM users AS u
LEFT JOIN orders AS o ON o.user_id = u.id

Could it be that is what you want?

Contrary to some people's belief, GROUP BY orders.id and SELECT DISTINCT orders.id, users.id are not the same thing. In fact, your proposed use of GROUP BY misuses the notorious MySQL extension to GROUP BY. Standard SQL will reject your GROUP BY. It will only accept GROUP BY orders.id, users.id, which is indeed equivalent to DISTINCT.
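To check the equivalence on a toy case, here is a sketch on SQLite through Python's sqlite3 module. The data is made up, and the users table deliberately has no primary key so that a repeated id can stand in for the anomaly described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER);   -- no primary key, so an id can repeat
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    -- Hypothetical data: user 1 is stored twice.
    INSERT INTO users VALUES (1), (1), (2);
    INSERT INTO orders VALUES (100, 1), (101, 2);
""")

joined = "FROM orders AS o LEFT JOIN users AS u ON u.id = o.user_id"

# The detail query dutifully returns order 100 once per matching users row.
detail = conn.execute(f"SELECT o.id, u.id {joined}").fetchall()

# Deduplicating with DISTINCT on both columns ...
via_distinct = conn.execute(
    f"SELECT DISTINCT o.id, u.id {joined} ORDER BY o.id, u.id"
).fetchall()

# ... gives the same rows as grouping by both columns.
via_group_by = conn.execute(
    f"SELECT o.id, u.id {joined} GROUP BY o.id, u.id ORDER BY o.id, u.id"
).fetchall()

print(len(detail), via_distinct == via_group_by)
```

Both deduplicated forms hide the repeated users row rather than explain it, which is why looking into the data comes first.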

Is SELECT DISTINCT bad practice when you know a join will produce duplicates?

I don't know if there is necessarily a best practice in this case. Generally, you want your join semantics to match the logical representation of your data and the desired result. In other words, if logically the join criteria should be across multiple columns because this is how part descriptions logically relate to a part, then that is what you should do.

Otherwise, I default to whatever is more readable and performs better. Placing the DISTINCT inside a sub-query results in a somewhat bloated query for others to try and understand. Also, the DBMS might even perform worse with the sub-query approach, since it could prevent indexes from being used.

Obviously, I don't understand your data model, but each of these queries could actually produce different results depending on how a part description is logically associated to a part.

It appears that a part description is associated to a part based on both a part number and bin number (this fact is based on the structure of the PART_DETAIL table). However, in this specific case, all part numbers (but with different bin numbers) happen to have the same part description.

What if the descriptions were not the same (e.g. different for each part number/bin number combination)? Then only your more specific join would return the correct results (i.e. the part description for a specific part). So again, it all goes back to writing your query such that the logic used matches the logical representation of your data and the result set you are looking for.
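Here is a quick sketch of that "what if", run on SQLite through Python's sqlite3 module. I don't know the real schema, so the part table, the column names and the rows are all assumptions; only the PART_DETAIL name comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE part (part_no TEXT, bin_no TEXT);
    CREATE TABLE part_detail (part_no TEXT, bin_no TEXT, description TEXT);
    -- Hypothetical data: same part number in two bins, with different descriptions.
    INSERT INTO part VALUES ('P1', 'B1'), ('P1', 'B2');
    INSERT INTO part_detail VALUES ('P1', 'B1', 'bolt, zinc'),
                                   ('P1', 'B2', 'bolt, steel');
""")

# Joining on part number only: DISTINCT can no longer collapse the rows,
# and each bin gets paired with the other bin's description as well.
loose = conn.execute("""
    SELECT DISTINCT p.part_no, p.bin_no, d.description
    FROM part p
    JOIN part_detail d ON d.part_no = p.part_no
    ORDER BY p.bin_no, d.description
""").fetchall()

# Joining on part number and bin number: one correct description per row.
exact = conn.execute("""
    SELECT p.part_no, p.bin_no, d.description
    FROM part p
    JOIN part_detail d ON d.part_no = p.part_no AND d.bin_no = p.bin_no
    ORDER BY p.bin_no
""").fetchall()

print(len(loose), exact)
```

With identical descriptions the DISTINCT version would have collapsed to the same two rows, which is exactly why the problem stays hidden until the data changes.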

This is just my 2 cents.