Join Two Different Tables and Remove Duplicated Entries


You can use the UNION operator. UNION removes duplicate rows, so only distinct rows are returned. Both SELECT statements must return the same number of columns, with compatible types.

SELECT * FROM table1
UNION
SELECT * FROM Table2
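To see the difference between UNION and UNION ALL, here is a minimal sketch that runs the query above against an in-memory SQLite database from Python. The column names (name, post_code) and the sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (name TEXT, post_code TEXT);
    CREATE TABLE table2 (name TEXT, post_code TEXT);
    INSERT INTO table1 VALUES ('Ann', '1000'), ('Bob', '2000');
    INSERT INTO table2 VALUES ('Bob', '2000'), ('Cid', '3000');
""")

# UNION drops the row that appears in both tables.
union_rows = conn.execute(
    "SELECT * FROM table1 UNION SELECT * FROM table2 ORDER BY name"
).fetchall()

# UNION ALL keeps every row, including the repeated one.
union_all_rows = conn.execute(
    "SELECT * FROM table1 UNION ALL SELECT * FROM table2 ORDER BY name"
).fetchall()

print(len(union_rows), len(union_all_rows))
```

Note that UNION compares whole rows: two rows count as duplicates only if every selected column matches.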

Edit: To store the data from both tables without duplicates, do this:

INSERT INTO TABLE1
SELECT * FROM TABLE2 A
WHERE NOT EXISTS (SELECT 1 FROM TABLE1 X
WHERE A.NAME = X.NAME AND
A.post_code = x.post_code)

This inserts the rows from table2 whose combination of name and post_code does not already exist in table1.
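A small sketch of the INSERT ... WHERE NOT EXISTS approach, again using SQLite from Python. The two-column schema and the sample rows are assumptions; your real tables will have more columns, but the columns of table2 must line up with those of table1 for SELECT * to work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (name TEXT, post_code TEXT);
    CREATE TABLE table2 (name TEXT, post_code TEXT);
    INSERT INTO table1 VALUES ('Ann', '1000'), ('Bob', '2000');
    INSERT INTO table2 VALUES ('Bob', '2000'), ('Bob', '9999'), ('Cid', '3000');
""")

# Copy only the table2 rows whose (name, post_code) pair is not in table1 yet.
conn.execute("""
    INSERT INTO table1
    SELECT * FROM table2 a
    WHERE NOT EXISTS (
        SELECT 1 FROM table1 x
        WHERE a.name = x.name AND a.post_code = x.post_code
    )
""")

merged = conn.execute(
    "SELECT name, post_code FROM table1 ORDER BY name, post_code"
).fetchall()
print(merged)
```

('Bob', '2000') is skipped because it already exists, while ('Bob', '9999') is inserted because only the name matches.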

Alternatively, you can create a new table and leave table1 and table2 untouched:

CREATE TABLE TABLENAME AS
SELECT * FROM table1
UNION
SELECT * FROM Table2

Best way to combine two tables, remove duplicates, but keep all other non-duplicate values in SQL

If I understand your question correctly, you want to combine two large tables with thousands of columns that are (hopefully) the same in both tables, using the email column as the match condition. Where a record exists in both tables, you want to keep the version from Table 2.

I had to do something similar a few days ago so maybe you can modify my query for your purposes:

WITH only_in_table_1 AS(
SELECT *
FROM table_1 A
WHERE NOT EXISTS
(SELECT * FROM table_2 B WHERE B.email_field = A.email_field))
SELECT * FROM table_2
UNION ALL
SELECT * FROM only_in_table_1

If the columns/fields aren't the same between the tables, you can use a full outer join on only_in_table_1 and table_2 instead.
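Here is a minimal sketch of the email-based query on two tiny tables, run through SQLite from Python. The city column and the sample addresses are assumptions for illustration. The inner SELECT * is written as SELECT 1, which makes no difference inside EXISTS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (email_field TEXT, city TEXT);
    CREATE TABLE table_2 (email_field TEXT, city TEXT);
    INSERT INTO table_1 VALUES ('a@example.com', 'Oslo'), ('b@example.com', 'Rome');
    INSERT INTO table_2 VALUES ('b@example.com', 'Paris'), ('c@example.com', 'Lima');
""")

# Rows of table_1 with no matching email in table_2, plus all of table_2.
combined = conn.execute("""
    WITH only_in_table_1 AS (
        SELECT *
        FROM table_1 a
        WHERE NOT EXISTS (
            SELECT 1 FROM table_2 b WHERE b.email_field = a.email_field
        )
    )
    SELECT * FROM table_2
    UNION ALL
    SELECT * FROM only_in_table_1
    ORDER BY email_field
""").fetchall()
print(combined)
```

For b@example.com the table_2 row ('Paris') wins, because the table_1 row with the same email is filtered out before the UNION ALL.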

SQL Join Two Tables Remove Duplicates

You can use an INNER JOIN, which returns only the records that have matching values in both tables. Adding DISTINCT removes repeated rows from the result. Write your query as:

select distinct t1.mov_id, t1.mov_name, t2.actor_name from MOVIES t1 inner join ACTOR t2 on t1.actor_id=t2.actor_id_2;
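A quick way to check what DISTINCT changes is to run the same join with and without it. This sketch uses SQLite from Python; the column types and the placeholder rows are assumptions, with one movie/actor pair stored twice on purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MOVIES (mov_id INTEGER, mov_name TEXT, actor_id INTEGER);
    CREATE TABLE ACTOR (actor_id_2 INTEGER, actor_name TEXT);
    -- The same movie/actor pair is stored twice on purpose.
    INSERT INTO MOVIES VALUES (1, 'Movie A', 10), (1, 'Movie A', 10), (2, 'Movie B', 20);
    INSERT INTO ACTOR VALUES (10, 'Actor X'), (20, 'Actor Y');
""")

join_sql = """
    SELECT {distinct} t1.mov_id, t1.mov_name, t2.actor_name
    FROM MOVIES t1
    INNER JOIN ACTOR t2 ON t1.actor_id = t2.actor_id_2
    ORDER BY t1.mov_id
"""
plain = conn.execute(join_sql.format(distinct="")).fetchall()
deduped = conn.execute(join_sql.format(distinct="DISTINCT")).fetchall()
print(len(plain), len(deduped))
```

DISTINCT only collapses rows that are identical in every selected column, so selecting an extra column that differs between the rows brings the duplicates back.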

How to join two tables while removing repeated entries in one column of one table

Use DISTINCT ON (a PostgreSQL-specific extension):

SELECT DISTINCT ON (vcsID) *
FROM config c
JOIN data d ON d.configID = c.ID
ORDER BY vcsID, "timestamp" DESC;

This assumes you want to pick the latest row from each group of identical vcsID values, hence the ORDER BY. If you really don't care which row you get for each vcsID, you don't need ORDER BY. Either way, the leading columns in ORDER BY have to match the DISTINCT ON expressions, so you cannot ORDER BY c.id, as you seem to want. For that, wrap the query in a subquery and order in the outer query.

Detailed explanation for DISTINCT ON and alternative solutions:

  • Select first row in each GROUP BY group?

Aside: don't use basic type names like timestamp as identifiers.
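DISTINCT ON is not available outside PostgreSQL. One commonly used alternative is the ROW_NUMBER() window function, sketched here with SQLite from Python (this needs SQLite 3.25 or newer). It is a different technique from the answer's query, not a drop-in copy of it. The schema is an assumption: vcsID is taken to live in config, and data gets a hypothetical value column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE config (ID INTEGER, vcsID TEXT);
    CREATE TABLE data (configID INTEGER, "timestamp" TEXT, value TEXT);
    INSERT INTO config VALUES (1, 'abc'), (2, 'abc'), (3, 'xyz');
    INSERT INTO data VALUES
        (1, '2020-01-01', 'old'),
        (2, '2020-02-01', 'new'),
        (3, '2020-01-15', 'only');
""")

# Number the rows inside each vcsID group, newest first, and keep row 1.
# The outer query is then free to order by any column, here ID.
latest = conn.execute("""
    SELECT vcsID, ID, "timestamp", value
    FROM (
        SELECT c.vcsID, c.ID, d."timestamp", d.value,
               ROW_NUMBER() OVER (
                   PARTITION BY c.vcsID
                   ORDER BY d."timestamp" DESC
               ) AS rn
        FROM config c
        JOIN data d ON d.configID = c.ID
    )
    WHERE rn = 1
    ORDER BY ID
""").fetchall()
print(latest)
```

The inner query plays the role of DISTINCT ON with its ORDER BY, and the outer ORDER BY corresponds to the "order in the outer query" step described above.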

remove duplicates from INNER JOIN two tables SQL

You made a subselect of the second inner join, and the join with that subselect has no join criteria:

SELECT id, name, job_id, job_type, job_name, updated_at
FROM
(SELECT service.id,service.service_name FROM services as service
INNER JOIN positions p ON service.id = p.service_id
) As Tab1
INNER JOIN -- this join has no join criteria
(SELECT job.job_id, job.job_type, job.job_name, job.updated_at
FROM jobs as job
INNER JOIN positions p ON job.id = p.job_id
) AS Tab2

You don't need the subselects in the first place. Try something like:

SELECT service.id, service.name, job_id, job_type, job_name, updated_at
FROM
services as service
INNER JOIN positions p ON service.id = p.service_id
INNER JOIN jobs as job ON job.id = p.job_id

how to remove duplicate in select statement with joining of two tables in sql

If a row has multiple join partners on the other side of your join, that row will always be duplicated, once per partner. You will need two separate queries to aggregate SO-Amount and INV-Amount.

--- EDIT ---

Consider this simple example. We have three tables: one that stores the company departments, one that stores the annual revenue of the departments, and one that stores the monthly costs of these departments.

Table 1

DepartmentId | DepartmentName         | NumberEmployees
5234         | "Software Development" | 20
3465         | "Sales"                | 120

Table 2

DepartmentId | Year | Revenue
5234         | 2015 | 2,000,000
5234         | 2014 | 1,500,000

Table 3

DepartmentId | Year | Month | Cost
5234         | 2015 | Jan   | 120,000
5234         | 2015 | Feb   | 150,000
5234         | 2014 | Jan   | 80,000

Our task is now to sum up the overall revenue of department 5234 as well as its overall costs.

If we join table 1 and table 2 we get:

DepartmentId | DepartmentName         | NumberEmployees | Year | Revenue
5234         | "Software Development" | 20              | 2015 | 2,000,000
5234         | "Software Development" | 20              | 2014 | 1,500,000

With this table we could calculate the overall revenue.

If we join table 1 and 3 we get:

DepartmentId | DepartmentName         | NumberEmployees | Year | Month | Cost
5234         | "Software Development" | 20              | 2015 | Jan   | 120,000
5234         | "Software Development" | 20              | 2015 | Feb   | 150,000
5234         | "Software Development" | 20              | 2014 | Jan   | 80,000

With this table you can calculate the overall costs.

What you don't want to do, though, is join all three tables, because then you get:

DepartmentId | DepartmentName         | NumberEmployees | Year | Revenue   | Month | Cost
5234         | "Software Development" | 20              | 2015 | 2,000,000 | Jan   | 120,000
5234         | "Software Development" | 20              | 2015 | 2,000,000 | Feb   | 150,000
5234         | "Software Development" | 20              | 2014 | 1,500,000 | Jan   | 80,000

As you can see, the 2015 revenue is duplicated because there are two cost entries for 2015 (Jan and Feb). If you use this table to compute both the overall revenue and the overall cost, you will end up with the wrong revenue.

So to wrap up and relate to your problem: You should use two separate queries to calculate your aggregations.
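The department example can be reproduced with SQLite from Python. This is only a sketch: the table names departments, revenues and costs are made up here (the answer calls them Table 1 to 3), and the three-table join is assumed to match on department and year, as in the joined table shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (DepartmentId INTEGER, DepartmentName TEXT, NumberEmployees INTEGER);
    CREATE TABLE revenues (DepartmentId INTEGER, Year INTEGER, Revenue INTEGER);
    CREATE TABLE costs (DepartmentId INTEGER, Year INTEGER, Month TEXT, Cost INTEGER);
    INSERT INTO departments VALUES (5234, 'Software Development', 20), (3465, 'Sales', 120);
    INSERT INTO revenues VALUES (5234, 2015, 2000000), (5234, 2014, 1500000);
    INSERT INTO costs VALUES (5234, 2015, 'Jan', 120000), (5234, 2015, 'Feb', 150000),
                             (5234, 2014, 'Jan', 80000);
""")

# Joining all three tables repeats the 2015 revenue once per 2015 cost row.
wrong_revenue = conn.execute("""
    SELECT SUM(r.Revenue)
    FROM departments d
    JOIN revenues r ON r.DepartmentId = d.DepartmentId
    JOIN costs c ON c.DepartmentId = d.DepartmentId AND c.Year = r.Year
    WHERE d.DepartmentId = 5234
""").fetchone()[0]

# Aggregating each table in its own query avoids the duplication.
total_revenue = conn.execute(
    "SELECT SUM(Revenue) FROM revenues WHERE DepartmentId = 5234"
).fetchone()[0]
total_cost = conn.execute(
    "SELECT SUM(Cost) FROM costs WHERE DepartmentId = 5234"
).fetchone()[0]

print(wrong_revenue, total_revenue, total_cost)
```

The joined query reports 5,500,000 because the 2,000,000 for 2015 is counted twice, while the separate queries give 3,500,000 in revenue and 350,000 in costs.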

Joining two data.tables in r: removing overlap duplicates while keeping duplicates in each separate dataset

You could remove the overlaps coming from y:

l = list(dtx, dty)
dtxy = rbindlist(l, use.names = TRUE)

overlaps = merge(dtx,dty,by=c("ID","date","code"))[,.(ID,date,code,dataset = dataset.y)]

dtresultnew <- overlaps[dtxy,.(ID,date,code,x.dataset,i.dataset),on = .(ID,date,code,dataset)][
is.na(x.dataset),.(ID,date,code,dataset=i.dataset)]

identical(dtresult[order(ID,date,code)],dtresultnew[order(ID,date,code)])
[1] TRUE

Join two tables on multiple columns without duplicate rows

I would use exists:

select b.*
from b
where exists (select 1 from a where a.id = b.id1) or
exists (select 1 from a where a.id = b.id2);

In most databases, this would be the most efficient method for this type of logic. I'm not 100% sure that this is true in Hive, but it is definitely worth a try.

An alternative approach would be left joins:

select b.*
from b left join
a a1
on b.id1 = a1.id left join
a a2
on b.id2 = a2.id
where a1.id is not null or a2.id is not null;

This might perform better in Hive if EXISTS is not well optimized there.
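One caveat when choosing between the two: the LEFT JOIN version only avoids duplicate rows if a.id is unique, whereas EXISTS never multiplies rows of b. A small sketch with SQLite from Python, where the label column and the sample rows are assumptions and id 1 is stored twice in a on purpose:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id1 INTEGER, id2 INTEGER, label TEXT);
    -- id 1 appears twice in a to show the difference.
    INSERT INTO a VALUES (1), (1), (2);
    INSERT INTO b VALUES (1, 2, 'both'), (1, 9, 'first'), (8, 9, 'none');
""")

# EXISTS returns each qualifying row of b exactly once.
exists_rows = conn.execute("""
    SELECT b.* FROM b
    WHERE EXISTS (SELECT 1 FROM a WHERE a.id = b.id1)
       OR EXISTS (SELECT 1 FROM a WHERE a.id = b.id2)
    ORDER BY b.label
""").fetchall()

# The LEFT JOIN version repeats a row of b once per matching row of a.
left_join_rows = conn.execute("""
    SELECT b.* FROM b
    LEFT JOIN a a1 ON b.id1 = a1.id
    LEFT JOIN a a2 ON b.id2 = a2.id
    WHERE a1.id IS NOT NULL OR a2.id IS NOT NULL
    ORDER BY b.label
""").fetchall()

print(len(exists_rows), len(left_join_rows))
```

Both queries select the same rows of b, but the join version returns each of them twice here. If a.id is a primary key, the two results are identical.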


