join two different tables and remove duplicated entries
You can use the UNION clause. UNION checks for duplicates and returns only distinct rows:
SELECT * FROM table1
UNION
SELECT * FROM Table2
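To see the difference between UNION and UNION ALL, here is a small sketch using Python's built-in sqlite3 module. The columns and sample rows are made up for illustration; only the table names come from the query above.

```python
import sqlite3

# In-memory database with two hypothetical tables that share one identical row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (name TEXT, post_code TEXT);
    CREATE TABLE table2 (name TEXT, post_code TEXT);
    INSERT INTO table1 VALUES ('Ann', '1000'), ('Bob', '2000');
    INSERT INTO table2 VALUES ('Bob', '2000'), ('Cid', '3000');
""")

# UNION removes rows that are identical in every column.
union_rows = conn.execute(
    "SELECT * FROM table1 UNION SELECT * FROM table2 ORDER BY name"
).fetchall()

# UNION ALL keeps every row, including the repeated ('Bob', '2000').
union_all_rows = conn.execute(
    "SELECT * FROM table1 UNION ALL SELECT * FROM table2 ORDER BY name"
).fetchall()

print(union_rows)
print(union_all_rows)
```

Note that UNION only treats rows as duplicates when all selected columns match, so both tables need the same column layout.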
Edit: To store the data from both tables without duplicates, do this:
INSERT INTO TABLE1
SELECT * FROM TABLE2 A
WHERE NOT EXISTS (SELECT 1 FROM TABLE1 X
WHERE A.NAME = X.NAME AND
A.post_code = x.post_code)
This inserts only those rows from table2 whose name and post_code combination does not already exist in table1.
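A minimal sketch of that INSERT with NOT EXISTS, again run through Python's sqlite3 module. The sample rows are hypothetical; the column names name and post_code are taken from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (name TEXT, post_code TEXT);
    CREATE TABLE table2 (name TEXT, post_code TEXT);
    INSERT INTO table1 VALUES ('Ann', '1000'), ('Bob', '2000');
    INSERT INTO table2 VALUES ('Bob', '2000'), ('Bob', '9999'), ('Cid', '3000');
""")

# Copy only the table2 rows whose (name, post_code) pair is missing from table1.
conn.execute("""
    INSERT INTO table1
    SELECT * FROM table2 a
    WHERE NOT EXISTS (SELECT 1 FROM table1 x
                      WHERE a.name = x.name
                        AND a.post_code = x.post_code)
""")

merged = conn.execute(
    "SELECT name, post_code FROM table1 ORDER BY name, post_code"
).fetchall()
print(merged)
```

('Bob', '9999') is inserted because only the name matches; a row counts as a duplicate here only when both columns match.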
Alternatively, you can create a new table and leave table1 and table2 untouched:
CREATE TABLE TABLENAME AS
SELECT * FROM table1
UNION
SELECT * FROM Table2
Best way to combine two tables, remove duplicates, but keep all other non-duplicate values in SQL
If I understand your question correctly, you want to join two large tables with thousands of columns that (hopefully) are the same in both tables, using the email column as the join condition, and replace records that appear in both tables with the records from Table 2.
I had to do something similar a few days ago so maybe you can modify my query for your purposes:
WITH only_in_table_1 AS(
SELECT *
FROM table_1 A
WHERE NOT EXISTS
(SELECT * FROM table_2 B WHERE B.email_field = A.email_field))
SELECT * FROM table_2
UNION ALL
SELECT * FROM only_in_table_1
If the columns/fields aren't the same between the tables, you can use a full outer join on only_in_table_1 and table_2.
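The CTE approach above can be tried out with Python's sqlite3 module. This is only a sketch: the city column and the sample rows are assumptions, while table_1, table_2, email_field and only_in_table_1 come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (email_field TEXT, city TEXT);
    CREATE TABLE table_2 (email_field TEXT, city TEXT);
    INSERT INTO table_1 VALUES ('a@example.org', 'Old'), ('b@example.org', 'Berlin');
    INSERT INTO table_2 VALUES ('a@example.org', 'New'), ('c@example.org', 'Cairo');
""")

# Keep every table_2 row, plus the table_1 rows whose email is not in table_2.
combined = conn.execute("""
    WITH only_in_table_1 AS (
        SELECT *
        FROM table_1 a
        WHERE NOT EXISTS
            (SELECT 1 FROM table_2 b WHERE b.email_field = a.email_field))
    SELECT * FROM table_2
    UNION ALL
    SELECT * FROM only_in_table_1
    ORDER BY email_field
""").fetchall()
print(combined)
```

For the shared email the table_2 version ('New') wins, which is the "replace duplicates with the records from Table 2" behaviour described above. UNION ALL is safe here because the CTE already excludes the overlap.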
SQL Join Two Tables Remove Duplicates
You can use an inner join, which selects only the records that have matching values in both tables, and add DISTINCT to drop repeated result rows. Write your query as:
select distinct t1.mov_id, t1.mov_name, t2.actor_name from MOVIES t1 inner join ACTOR t2 on t1.actor_id=t2.actor_id_2;
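As a rough illustration of what DISTINCT changes, the sketch below runs the same join with and without it in Python's sqlite3 module. The table and column names follow the query above; the rows are invented, with the actor deliberately listed twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MOVIES (mov_id INTEGER, mov_name TEXT, actor_id INTEGER);
    CREATE TABLE ACTOR (actor_id_2 INTEGER, actor_name TEXT);
    INSERT INTO MOVIES VALUES (10, 'Movie A', 1);
    -- The same actor appears twice, so a plain join repeats the movie row.
    INSERT INTO ACTOR VALUES (1, 'Actor A'), (1, 'Actor A');
""")

join_sql = """
    SELECT {distinct} t1.mov_id, t1.mov_name, t2.actor_name
    FROM MOVIES t1
    INNER JOIN ACTOR t2 ON t1.actor_id = t2.actor_id_2
"""

plain_rows = conn.execute(join_sql.format(distinct="")).fetchall()
distinct_rows = conn.execute(join_sql.format(distinct="DISTINCT")).fetchall()

print(plain_rows)
print(distinct_rows)
```

DISTINCT hides the repetition in the result, but it does not fix duplicated rows in the underlying table.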
How to join two tables while removing repeated entries in one column of one table
Use DISTINCT ON:
SELECT DISTINCT ON (vcsID) *
FROM config c
JOIN data d ON d.configID = c.ID
ORDER BY vcsID, "timestamp" DESC;
This assumes you want to pick the latest row from each group of identical vcsID, hence the ORDER BY. If you really don't care which row you get for each vcsID, you don't need ORDER BY. Either way, the leading columns in ORDER BY have to match the DISTINCT ON expressions, so you cannot ORDER BY c.id, as you seem to want. You'd need to wrap this in a subquery and order in the outer query.
Detailed explanation for DISTINCT ON and alternative solutions:
- Select first row in each GROUP BY group?
Aside: don't use basic type names like timestamp as identifiers.
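DISTINCT ON is a PostgreSQL extension. One commonly used alternative in other databases is ROW_NUMBER() in a subquery, sketched below with Python's sqlite3 module (this assumes SQLite 3.25 or newer for window functions). The column names vcs_id, config_id, ts and payload and all rows are hypothetical stand-ins for the schema in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE config (id INTEGER PRIMARY KEY, vcs_id TEXT);
    CREATE TABLE data (config_id INTEGER, ts TEXT, payload TEXT);
    INSERT INTO config VALUES (1, 'repo-a'), (2, 'repo-a'), (3, 'repo-b');
    INSERT INTO data VALUES
        (1, '2020-01-01', 'old'),
        (2, '2020-03-01', 'new'),
        (3, '2020-02-01', 'only');
""")

# Number the joined rows per vcs_id, newest first, and keep row 1 of each group.
latest = conn.execute("""
    SELECT vcs_id, ts, payload
    FROM (
        SELECT c.vcs_id, d.ts, d.payload,
               ROW_NUMBER() OVER (PARTITION BY c.vcs_id
                                  ORDER BY d.ts DESC) AS rn
        FROM config c
        JOIN data d ON d.config_id = c.id
    ) AS ranked
    WHERE rn = 1
    ORDER BY vcs_id
""").fetchall()
print(latest)
```

Because the filtering happens in the subquery, the outer query is free to order by any column, which is the same wrap-and-reorder idea mentioned above.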
remove duplicates from INNER JOIN two tables SQL
You turned the second inner join into a subselect, and the join with that subselect has no join criteria:
SELECT id, name, job_id, job_type, job_name, updated_at
FROM
(SELECT service.id,service.service_name FROM services as service
INNER JOIN positions p ON service.id = p.service_id
) As Tab1
INNER JOIN -- this join has no join criteria
(SELECT job.job_id, job.job_type, job.job_name, job.updated_at
FROM jobs as job
INNER JOIN positions p ON job.id = p.job_id
) AS Tab2
You don't need the subselects in the first place. Try something like:
SELECT service.id, service.name, job.job_id, job.job_type, job.job_name, job.updated_at
FROM
services as service
INNER JOIN positions p ON service.id = p.service_id
INNER JOIN jobs as job ON job.id = p.job_id
how to remove duplicate in select statement with joining of two tables in sql
If a row has multiple join partners on the other side of your join, that row is always duplicated. You will need two separate queries to aggregate SO-Amount and INV-Amount.
--- EDIT ---
Consider this simple example: We have three tables. One that saves company departments, one that stores the annual revenue for the departments and one that stores the monthly costs of these departments.
Table 1
DepartmentId | DepartmentName | NumberEmployees
5234 | "Software Development" | 20
3465 | "Sales" | 120
Table 2
DepartmentId | Year | Revenue
5234 | 2015 | 2,000,000
5234 | 2014 | 1,500,000
Table 3
DepartmentId | Year | Month | Cost
5234 | 2015 | Jan | 120,000
5234 | 2015 | Feb | 150,000
5234 | 2014 | Jan | 80,000
Our task is now to sum up the overall revenue of department 5234 as well as the overall costs.
If we join table 1 and table 2 we get:
DepartmentId | DepartmentName | NumberEmployees| Year | Revenue
5234 | "Software Development" | 20 | 2015 | 2,000,000
5234 | "Software Development" | 20 | 2014 | 1,500,000
With this table we could calculate the overall revenue.
If we join table 1 and 3 we get:
DepartmentId | DepartmentName | NumberEmployees | Year | Month | Cost
5234 | "Software Development" | 20 | 2015 | Jan | 120,000
5234 | "Software Development" | 20 | 2015 | Feb | 150,000
5234 | "Software Development" | 20 | 2014 | Jan | 80,000
With this table you can calculate the overall costs.
What you don't want to do, though, is join all three tables, because then you get:
DepartmentId | DepartmentName | NumberEmployees| Year | Revenue | Month | Cost
5234 | "Software Development" | 20 | 2015 | 2,000,000 | Jan | 120,000
5234 | "Software Development" | 20 | 2015 | 2,000,000 | Feb | 150,000
5234 | "Software Development" | 20 | 2014 | 1,500,000 | Jan | 80,000
As you can see, the 2015 revenue is duplicated because there are two cost entries for 2015 (Jan and Feb). If you use this table to compute both the overall revenue and the overall cost, you will end up with the wrong revenue.
So to wrap up and relate to your problem: You should use two separate queries to calculate your aggregations.
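The department example can be reproduced with Python's sqlite3 module. The table and column names below (department, revenue, cost and so on) are assumed names for Table 1, Table 2 and Table 3; the figures are the ones from the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (department_id INTEGER, department_name TEXT,
                             number_employees INTEGER);
    CREATE TABLE revenue (department_id INTEGER, year INTEGER, revenue INTEGER);
    CREATE TABLE cost (department_id INTEGER, year INTEGER, month TEXT,
                       cost INTEGER);
    INSERT INTO department VALUES (5234, 'Software Development', 20),
                                  (3465, 'Sales', 120);
    INSERT INTO revenue VALUES (5234, 2015, 2000000), (5234, 2014, 1500000);
    INSERT INTO cost VALUES (5234, 2015, 'Jan', 120000),
                            (5234, 2015, 'Feb', 150000),
                            (5234, 2014, 'Jan', 80000);
""")

# Joining all three tables repeats the 2015 revenue once per 2015 cost row.
wrong_revenue, joined_cost = conn.execute("""
    SELECT SUM(r.revenue), SUM(c.cost)
    FROM department d
    JOIN revenue r ON r.department_id = d.department_id
    JOIN cost c ON c.department_id = d.department_id AND c.year = r.year
    WHERE d.department_id = 5234
""").fetchone()

# Two separate aggregations avoid the row multiplication.
total_revenue = conn.execute(
    "SELECT SUM(revenue) FROM revenue WHERE department_id = 5234"
).fetchone()[0]
total_cost = conn.execute(
    "SELECT SUM(cost) FROM cost WHERE department_id = 5234"
).fetchone()[0]

print(wrong_revenue, total_revenue, total_cost)
```

The three-way join reports 5,500,000 in revenue instead of 3,500,000, because the 2,000,000 for 2015 is counted twice.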
Joining two data.tables in r: removing overlap duplicates while keeping duplicates in each separate dataset
You could remove the overlaps coming from y:
l = list(dtx, dty)
dtxy = rbindlist(l, use.names = TRUE)
overlaps = merge(dtx,dty,by=c("ID","date","code"))[,.(ID,date,code,dataset = dataset.y)]
dtresultnew <- overlaps[dtxy,.(ID,date,code,x.dataset,i.dataset),on = .(ID,date,code,dataset)][
is.na(x.dataset),.(ID,date,code,dataset=i.dataset)]
identical(dtresult[order(ID,date,code)],dtresultnew[order(ID,date,code)])
[1] TRUE
Join two tables on multiple columns without duplicate rows
I would use exists:
select b.*
from b
where exists (select 1 from a where a.id = b.id1) or
exists (select 1 from a where a.id = b.id2);
In most databases, this would be the most efficient method for this type of logic. I'm not 100% sure that this is true in Hive, but it is definitely worth a try.
An alternative approach would be left joins:
select b.*
from b left join
a a1
on b.id1 = a1.id left join
a a2
on b.id2 = a2.id
where a1.id is not null or a2.id is not null;
This might perform better in Hive if exists is not optimized well.
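To compare the two variants, the sketch below runs both against the same data using Python's sqlite3 module (SQLite rather than Hive, so it says nothing about Hive performance). The label column and the rows are made up; a, b, id, id1 and id2 follow the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id1 INTEGER, id2 INTEGER, label TEXT);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (1, 9, 'x'), (8, 2, 'y'), (7, 7, 'z'), (1, 2, 'w');
""")

# Variant 1: keep a b row if either of its ids exists in a.
exists_rows = conn.execute("""
    SELECT b.*
    FROM b
    WHERE EXISTS (SELECT 1 FROM a WHERE a.id = b.id1)
       OR EXISTS (SELECT 1 FROM a WHERE a.id = b.id2)
    ORDER BY b.label
""").fetchall()

# Variant 2: left join a twice and keep rows where at least one join matched.
left_join_rows = conn.execute("""
    SELECT b.*
    FROM b
    LEFT JOIN a a1 ON b.id1 = a1.id
    LEFT JOIN a a2 ON b.id2 = a2.id
    WHERE a1.id IS NOT NULL OR a2.id IS NOT NULL
    ORDER BY b.label
""").fetchall()

print(exists_rows)
print(left_join_rows)
```

The results match here because a.id is unique. If a contained repeated ids, the left join variant would return a b row once per match, while the exists variant would still return it once.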