Remove Duplicates from SQL Union


UNION removes duplicates; UNION ALL does not.
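A minimal sketch of the difference, using Python's built-in sqlite3 module and an in-memory database. The tables t1 and t2 and their contents are made up for illustration; they are not from the original question.

```python
import sqlite3

# Hypothetical demo tables; t1 contains a duplicate of its own (2 appears twice).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (v INTEGER);
    CREATE TABLE t2 (v INTEGER);
    INSERT INTO t1 VALUES (1), (2), (2);
    INSERT INTO t2 VALUES (2), (3);
""")

# UNION deduplicates the combined result, including duplicates inside t1.
union_rows = sorted(
    conn.execute("SELECT v FROM t1 UNION SELECT v FROM t2").fetchall()
)

# UNION ALL simply concatenates both inputs.
union_all_rows = sorted(
    conn.execute("SELECT v FROM t1 UNION ALL SELECT v FROM t2").fetchall()
)

print(union_rows)
print(union_all_rows)
```

Note that UNION also collapses the repeated value that exists only inside t1, not just the value shared between the two tables.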

SQL UNION ALL to eliminate duplicates

But in the example, the first query has a condition on column a, whereas the second query has a condition on column b. This probably came from a query that's hard to optimize:

SELECT * FROM mytable WHERE a=X OR b=Y

This query is hard to optimize with simple B-tree indexing. Does the engine search an index on column a? Or on column b? Either way, searching the other term requires a table-scan.

Hence the trick of using UNION to split the query in two, one query per search term. Each subquery can use the best index for its term. Then combine the results using UNION.

But the two subsets may overlap, because some rows where b=Y may also have a=X in which case such rows occur in both subsets. Therefore you have to do duplicate elimination, or else see some rows twice in the final result.

SELECT * FROM mytable WHERE a=X 
UNION DISTINCT
SELECT * FROM mytable WHERE b=Y

UNION DISTINCT is expensive because typical implementations sort the rows to find duplicates, just as they do for SELECT DISTINCT ....

It also feels like even more "wasted" work if the two subsets of rows you are unioning share many rows: there are a lot of rows to eliminate.

But there's no need to eliminate duplicates if you can guarantee that the two sets of rows do not overlap. If you can rely on that, duplicate elimination would always be a no-op, so the query can skip that step and the costly sorting that comes with it.

If you change the queries so that they are guaranteed to select non-overlapping subsets of rows, that's a win.

SELECT * FROM mytable WHERE a=X 
UNION ALL
SELECT * FROM mytable WHERE b=Y AND a!=X

These two sets are guaranteed to have no overlap. If the first set has rows where a=X and the second set has rows where a!=X then there can be no row that is in both sets.

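The second query therefore only catches some of the rows where b=Y, but any row where a=X AND b=Y is already included in the first set. One caveat: if a can be NULL, then a!=X evaluates to NULL rather than true for those rows, so rows with b=Y and a NULL are dropped. If the column is nullable, write b=Y AND (a!=X OR a IS NULL) instead.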

So the query achieves an optimized search for two OR terms, without producing duplicates, and requiring no UNION DISTINCT operation.
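The rewrite can be checked on a small table. This is only a sketch with Python's sqlite3 module: the id column, the index names, the sample rows, and the values X=1 and Y=2 are assumptions added for illustration. It also shows what happens when column a is nullable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER);
    CREATE INDEX idx_a ON mytable (a);
    CREATE INDEX idx_b ON mytable (b);
    INSERT INTO mytable (id, a, b) VALUES
        (1, 1, 0),      -- matches a=X only
        (2, 0, 2),      -- matches b=Y only
        (3, 1, 2),      -- matches both terms
        (4, NULL, 2),   -- matches b=Y, but a is NULL
        (5, 0, 0);      -- matches neither
""")
X, Y = 1, 2  # hypothetical search values


def ids(sql, params):
    """Run a query and return the selected ids as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql, params))


# Reference result: the original OR query.
or_ids = ids("SELECT id FROM mytable WHERE a = ? OR b = ?", (X, Y))

# UNION ALL with a != X: no duplicates, but the NULL row is lost.
naive_ids = ids(
    "SELECT id FROM mytable WHERE a = ? "
    "UNION ALL "
    "SELECT id FROM mytable WHERE b = ? AND a != ?",
    (X, Y, X),
)

# NULL-safe variant: same rows as the OR query, still without duplicates.
null_safe_ids = ids(
    "SELECT id FROM mytable WHERE a = ? "
    "UNION ALL "
    "SELECT id FROM mytable WHERE b = ? AND (a != ? OR a IS NULL)",
    (X, Y, X),
)

print(or_ids, naive_ids, null_safe_ids)
```

If a is declared NOT NULL, the plain a != X form is enough and the two variants return the same rows.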

Fastest way to remove duplicates from UNION ALL (without using UNION) on PostgreSQL?

If there are many duplicates among fruits and / or veggies, but not so many between fruits and veggies (like the names in your example suggest), and since you have an index for both of them, emulating an index skip scan (a.k.a. loose index scan) will work wonders:

WITH RECURSIVE fruit AS (
   (
   SELECT fruit
   FROM recipes
   ORDER BY 1
   LIMIT 1
   )
   UNION ALL
   SELECT (SELECT fruit
           FROM recipes
           WHERE fruit > t.fruit
           ORDER BY 1
           LIMIT 1)
   FROM fruit t
   WHERE t.fruit IS NOT NULL
   )
, veggie AS (
   (
   SELECT veggie
   FROM recipes
   ORDER BY 1
   LIMIT 1
   )
   UNION ALL
   SELECT (SELECT veggie
           FROM recipes
           WHERE veggie > t.veggie
           ORDER BY 1
           LIMIT 1)
   FROM veggie t
   WHERE t.veggie IS NOT NULL
   )
SELECT DISTINCT healthy_food
FROM (
   SELECT fruit AS healthy_food FROM fruit
   UNION ALL
   SELECT veggie AS healthy_food FROM veggie
   ) sub
WHERE healthy_food IS NOT NULL;

Just DISTINCT instead of DISTINCT ON (like you tried) in the outer SELECT, since we are dealing with a single column.

See:

  • SELECT DISTINCT is slower than expected on my table in PostgreSQL
  • Optimize GROUP BY query to retrieve latest row per user

You might as well use UNION instead of UNION ALL + DISTINCT in the outer SELECT. Only avoided that because you explicitly asked for it. But I don't see the point.
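For experimenting without a PostgreSQL server, the same idea can be sketched against SQLite through Python's sqlite3 module. This is an adaptation, not the query above: SQLite does not accept a parenthesized SELECT with its own ORDER BY / LIMIT as a member of a compound query, so the seed row is wrapped in a scalar subquery, and it filters NULLs explicitly because SQLite sorts NULLs first. The CTE names f and v, the index names, and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE recipes (fruit TEXT, veggie TEXT);
    CREATE INDEX recipes_fruit_idx ON recipes (fruit);
    CREATE INDEX recipes_veggie_idx ON recipes (veggie);
    INSERT INTO recipes (fruit, veggie) VALUES
        ('apple', 'kale'), ('apple', 'leek'), ('banana', 'kale'),
        ('tomato', 'tomato'), ('cherry', NULL);
""")

# Each CTE hops from one distinct value to the next larger one,
# so every step is a single index lookup instead of a full scan.
SKIP_SCAN = """
WITH RECURSIVE f(name) AS (
    SELECT (SELECT fruit FROM recipes
            WHERE fruit IS NOT NULL ORDER BY fruit LIMIT 1)
    UNION ALL
    SELECT (SELECT r.fruit FROM recipes AS r
            WHERE r.fruit > f.name ORDER BY r.fruit LIMIT 1)
    FROM f
    WHERE f.name IS NOT NULL
),
v(name) AS (
    SELECT (SELECT veggie FROM recipes
            WHERE veggie IS NOT NULL ORDER BY veggie LIMIT 1)
    UNION ALL
    SELECT (SELECT r.veggie FROM recipes AS r
            WHERE r.veggie > v.name ORDER BY r.veggie LIMIT 1)
    FROM v
    WHERE v.name IS NOT NULL
)
SELECT name FROM f WHERE name IS NOT NULL
UNION
SELECT name FROM v WHERE name IS NOT NULL
"""

skip_scan_names = sorted(row[0] for row in conn.execute(SKIP_SCAN))

# Straightforward query used only to cross-check the result.
plain_names = sorted(row[0] for row in conn.execute(
    "SELECT fruit FROM recipes WHERE fruit IS NOT NULL "
    "UNION "
    "SELECT veggie FROM recipes WHERE veggie IS NOT NULL"
))

print(skip_scan_names)
```

The final UNION only has to deduplicate the few distinct values that survive each CTE, which is where the saving comes from when the columns contain many repeats.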

Remove duplicates after UNION in SQL

You can use the same process of removing duplicates as you have used for both tables.
It would look something like this:

WITH cteUnion AS
( SELECT *, ROW_NUMBER() OVER (PARTITION BY V1,V2,V3 ORDER BY V1) AS rn
  FROM (
    (WITH cte1 AS(
       SELECT v1, v2, v3, v4,
              row_number()over (PARTITION BY V1, V2 ORDER BY V1) rn
       FROM T1)
     SELECT V1, V2, V3, V4
     FROM cte1 WHERE rn=1)
    UNION
    (WITH cte2 AS(
       SELECT v1, v2, v3, v4,
              row_number()over (PARTITION BY V1, V2 ORDER BY V1) rn
       FROM T2)
     SELECT V1, V2, V3, V4
     FROM cte2 WHERE rn=1)
  ) AS u -- "union" is a reserved word, so it cannot serve as the alias
)
SELECT *
FROM cteUnion
WHERE rn = 1

or you can use DISTINCT if you just want columns V1,V2,V3:

SELECT DISTINCT V1,V2,V3
FROM (
  (WITH cte1 AS(
     SELECT v1, v2, v3, v4,
            row_number()over (PARTITION BY V1, V2 ORDER BY V1) rn
     FROM T1)
   SELECT V1, V2, V3, V4
   FROM cte1 WHERE rn=1)
  UNION
  (WITH cte2 AS(
     SELECT v1, v2, v3, v4,
            row_number()over (PARTITION BY V1, V2 ORDER BY V1) rn
     FROM T2)
   SELECT V1, V2, V3, V4
   FROM cte2 WHERE rn=1)
) AS u -- "union" is a reserved word, so it cannot serve as the alias
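The ROW_NUMBER() approach above can be tried end to end with a flattened variant that keeps every CTE at the top level, which is more portable than nesting WITH inside a subquery. This sketch uses Python's sqlite3 module and assumes an SQLite build with window functions (3.25 or later). The sample rows are invented, and the ORDER BY inside each window is changed to V3, V4 (then V4) purely so that the surviving row is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (V1 INTEGER, V2 INTEGER, V3 TEXT, V4 INTEGER);
    CREATE TABLE T2 (V1 INTEGER, V2 INTEGER, V3 TEXT, V4 INTEGER);
    INSERT INTO T1 VALUES (1, 1, 'a', 10), (1, 1, 'b', 20), (2, 2, 'c', 30);
    INSERT INTO T2 VALUES (1, 1, 'a', 5), (3, 3, 'd', 40);
""")

DEDUP_AFTER_UNION = """
WITH d1 AS (   -- one row per (V1, V2) in T1
    SELECT V1, V2, V3, V4,
           ROW_NUMBER() OVER (PARTITION BY V1, V2 ORDER BY V3, V4) AS rn
    FROM T1
),
d2 AS (        -- one row per (V1, V2) in T2
    SELECT V1, V2, V3, V4,
           ROW_NUMBER() OVER (PARTITION BY V1, V2 ORDER BY V3, V4) AS rn
    FROM T2
),
u AS (         -- UNION only removes rows identical in all four columns
    SELECT V1, V2, V3, V4 FROM d1 WHERE rn = 1
    UNION
    SELECT V1, V2, V3, V4 FROM d2 WHERE rn = 1
),
ranked AS (    -- second pass: one row per (V1, V2, V3) across both tables
    SELECT V1, V2, V3, V4,
           ROW_NUMBER() OVER (PARTITION BY V1, V2, V3 ORDER BY V4) AS rn
    FROM u
)
SELECT V1, V2, V3, V4
FROM ranked
WHERE rn = 1
ORDER BY V1, V2, V3
"""

deduped = conn.execute(DEDUP_AFTER_UNION).fetchall()
print(deduped)
```

The rows (1, 1, 'a', 10) and (1, 1, 'a', 5) survive the UNION because they differ in V4; only the second ROW_NUMBER() pass reduces them to one.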

How to remove duplicate using union all

Use this if, and only if, you have to use UNION ALL; otherwise I would go with Handoko Chen's solution.

Select Distinct *
From (
Select * From Employee1
Union All
Select * From Employee2
) A

How to remove duplicates out of the UNION but ignore one column

You can use row_number(). If you want one row per first name (what your question implies), then:

select t.*
from (select t.*,
             row_number() over (partition by firstname order by department) as seqnum
      from MyTable t
     ) t
where seqnum = 1;

This will choose "Accounting" over "Finance" for duplicates.

UNION not removing duplicates

Your resultset actually has no duplicates. Duplicates are rows where all columns have equal values, and no row in your resultset meets that definition.

Presumably, you want aggregation in the outer query:

select column1, max(column2) column2
from (
select part as column1, null as column2 from soldparts where part like ?
union all
select part as column1, vin as column2 from vindata where part like ?
)
group by column1
order by column1 asc

Note that I modified your query to use bind parameters (?); for the sake of security and efficiency, you should learn to use parameterized queries rather than concatenating variables into the query string.
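As a rough illustration of both points, here is the aggregated query run through Python's sqlite3 module with the pattern passed as a bind parameter. The table contents and the pattern 'AB%' are invented, and the derived table is given the alias t because many engines require one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE soldparts (part TEXT);
    CREATE TABLE vindata (part TEXT, vin TEXT);
    INSERT INTO soldparts VALUES ('AB1'), ('AB2');
    INSERT INTO vindata VALUES ('AB1', 'VIN123');
""")

QUERY = """
select column1, max(column2) as column2
from (
    select part as column1, null as column2 from soldparts where part like ?
    union all
    select part as column1, vin as column2 from vindata where part like ?
) t
group by column1
order by column1 asc
"""

pattern = "AB%"  # hypothetical user input; it is bound, never concatenated
rows = conn.execute(QUERY, (pattern, pattern)).fetchall()
print(rows)
```

max() ignores NULL, so a part that appears in both tables collapses to a single row carrying its vin, while a part found only in soldparts keeps NULL.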

Union two selects removing duplicates base on some columns (not the full row)

Try:

select * from table1
union
select * from table2
where not exists(select 1 from table1
where table2.parent = table1.parent
and table2.type = table1.type)
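A small sketch of how this behaves, using Python's sqlite3 module. The note column and the sample rows are assumptions added so that rows matching on (parent, type) still differ elsewhere, which is exactly the case a plain UNION cannot handle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (parent TEXT, type TEXT, note TEXT);
    CREATE TABLE table2 (parent TEXT, type TEXT, note TEXT);
    INSERT INTO table1 VALUES ('p1', 'x', 'from t1'), ('p2', 'y', 'from t1');
    INSERT INTO table2 VALUES ('p1', 'x', 'from t2'), ('p3', 'z', 'from t2');
""")

# Rows from table2 are kept only when table1 has no row
# with the same (parent, type) pair.
rows = sorted(conn.execute("""
    select * from table1
    union
    select * from table2
    where not exists (select 1 from table1
                      where table2.parent = table1.parent
                        and table2.type = table1.type)
""").fetchall())

# Plain UNION for comparison: both ('p1', 'x', ...) rows survive.
plain_union = sorted(conn.execute(
    "select * from table1 union select * from table2"
).fetchall())

print(rows)
print(len(plain_union))
```

When the two tables collide on (parent, type), the row from table1 wins, because only the second query is filtered.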

UNION does not remove duplicates in result

You can try the approach below, which applies aggregation on top of your query:

select OmrådesNr,OmrådesBenämning,max(Antal) as Antal
from
(
SELECT Cams_Global.dbo.A960.OmrådesNr, Cams_Global.dbo.A960.OmrådesBenämning, 0 AS Antal
FROM A406 INNER JOIN
Cams_Global.dbo.A960 ON A406.OmrådesNr = Cams_Global.dbo.A960.OmrådesNr
WHERE (A406.Kund IN (5566321537))
UNION
SELECT A960_1.OmrådesNr, A960_1.OmrådesBenämning, COUNT(A806.Aordernr) AS Antal
FROM A806 INNER JOIN
A406 AS A406_1 ON A806.Ställeid = A406_1.Ställeid INNER JOIN
A400 ON A806.Objektid = A400.Objektid INNER JOIN
A402 ON A400.ObjGrupp = A402.Objgrupp INNER JOIN
Cams_Global.dbo.A957 ON A806.LevId = Cams_Global.dbo.A957.LevId RIGHT OUTER JOIN
Cams_Global.dbo.A960 AS A960_1 ON A406_1.OmrådesNr = A960_1.OmrådesNr
WHERE (A806.Beställning = 0) AND (Cams_Global.dbo.A957.LevIdGrupp IN (1001, 1000)) AND
(A806.ProtokollSparad = 0) AND (A406_1.Kund = 5566321537)
GROUP BY A960_1.OmrådesBenämning, A960_1.OmrådesNr
)A group by OmrådesNr,OmrådesBenämning

Remove duplicates from union in SQL Views (Toad)

UNION eliminates duplicates between the two views across all columns. There are several reasons that could cause the duplicates on (ITEM, CODE):

  1. either the records where this tuple of columns is duplicated have different values in the other columns; since these records are not true duplicates, UNION does not eliminate them. In this case, your solution with the CTE is the correct approach.

  2. or one of the views (maybe both) already contains partial duplicates on (ITEM, CODE). True duplicates cannot be the cause: UNION removes identical rows from the combined result, even when they all come from the same query.

In both cases, as an optimization, you could possibly skip the intermediate view by deduplicating the result of the UNION query directly: wrap the entire UNION query in a CTE and use ROW_NUMBER() to remove the partial duplicates, as shown in your existing code.

A final thought: since you noticed that UNION does not actually remove any records, you would be better off using UNION ALL instead: this means less work for your RDBMS, and will make the queries that address the final view more efficient.
