Count Table Difference Between Two Tables by a Complex Key in Hive

Comparing records in two Hive tables having the same schema

This is what you can do:

Join the two tables on their unique key (I assume your tables have a unique identifier).
Then compare the hash of all the other columns, combined with Hive's hash function, to find the rows that differ. The query will look like this:

select * from tab1 a join tab2 b
on a.id = b.id
where hash(a.col1, a.col2....) <> hash(b.col1, b.col2...);
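When the unique key spans several columns, the same idea extends to a composite join condition, and wrapping it in count(*) gives the number of differing rows. A minimal sketch, assuming hypothetical key columns k1 and k2 and compared columns col1 and col2 (these names are placeholders, not taken from the original tables). Keep in mind that hash returns an int, so two different rows can collide and go unnoticed:

```sql
-- Hypothetical columns: k1 and k2 form the composite key; col1 and col2 are compared.
select count(*) as diff_rows
from tab1 a
join tab2 b
  on a.k1 = b.k1 and a.k2 = b.k2
where hash(a.col1, a.col2) <> hash(b.col1, b.col2);
```

This counts only keys present in both tables; rows that exist on one side only need an outer join instead.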

How to apply multiple counts in a Hive query

Use GROUP BY with count(*) as the aggregate in your select query.

Try this query:

select act,count(*) from <table_name> group by act;
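If several different counts are needed in one pass, rather than one count per group, conditional aggregation is a common pattern. A sketch, assuming the act column holds values such as 'login' and 'logout' (made-up values used only for illustration):

```sql
-- 'login' and 'logout' are illustrative values, not taken from the original data.
select
  count(*) as total_rows,
  count(distinct act) as distinct_acts,
  sum(case when act = 'login' then 1 else 0 end) as login_rows,
  sum(case when act = 'logout' then 1 else 0 end) as logout_rows
from <table_name>;
```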

Comparing two tables for equality in Hive

The first one excludes rows where t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, or t2.c3 is null. That means that you are effectively doing an inner join.

The second one will find rows that exist in t1 but not in t2.

To also find rows that exist in t2 but not in t1 you can do a full outer join. The following SQL assumes that all columns are NOT NULL:

select count(*) from table1 t1
full outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t1.key is null /* this condition matches rows that only exist in t2 */
or t2.key is null /* this condition matches rows that only exist in t1 */
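If some columns can be NULL, the equality join above treats NULLs as non-matching and reports those rows as differences. One alternative, offered here as an assumption rather than as part of the original answer, is to stack both tables and group on every column, since GROUP BY puts NULLs into the same group:

```sql
-- in_t1 and in_t2 are helper columns introduced for this sketch.
select count(*) as mismatched_rows
from (
  select key, c1, c2, c3, sum(in_t1) as n1, sum(in_t2) as n2
  from (
    select key, c1, c2, c3, 1 as in_t1, 0 as in_t2 from table1
    union all
    select key, c1, c2, c3, 0 as in_t1, 1 as in_t2 from table2
  ) u
  group by key, c1, c2, c3
) g
where n1 <> n2;
```

A result of 0 means both tables hold the same rows, duplicates included.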

Hive: Joining two tables with different keys

It's a little difficult to do this in Hive, as there are many limitations. This is how I solved it, but there could be a better way.

I named your tables as below.
Table1 = EmpActivity
Table2 = ActivityMas

The challenge comes from the null fields in Table2. I created a view and used UNION ALL to combine the results of two distinct queries.

Create view actView AS Select * from ActivityMas Where Activityid ='';

SELECT * From (
Select EmpActivity.EmpId, EmpActivity.Category, ActivityMas.categdesc
from EmpActivity JOIN ActivityMas
ON EmpActivity.Category = ActivityMas.Category
AND EmpActivity.ActivityId = ActivityMas.ActivityId
UNION ALL
Select EmpActivity.EmpId, EmpActivity.Category, ActView.categdesc from EmpActivity
JOIN ActView ON EmpActivity.Category = ActView.Category
) result;

You have to wrap the UNION ALL in a top-level SELECT, because UNION ALL is not directly supported in top-level statements. This runs 3 MR jobs in total. Below is the result I got.

44127 10 billable
44128 12 billable
44130 15 Non-billable
44132 43 Benefits
44131 33 Benefits
44126 33 Training
44129 33 Bench

Best way to compare three columns in Hive SQL

Logically, one of your conditions is redundant. Given

col1 = col2

if col1 != col3 then col2 != col3 as well.

Therefore it's enough to use:

select * from T1 where col1 = col2 and col1 != col3;

This filter can be applied on the map side, so a WHERE clause is likely good enough.

If you want to require that 2 out of the 3 columns match, you could use GROUP BY with HAVING to reduce the number of comparisons.
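For the two-out-of-three case, one simple option that avoids grouping altogether (a different technique from the GROUP BY / HAVING idea mentioned above) is to count the matching pairs per row. Exactly one equal pair means exactly two of the three columns match, because two equal pairs would force all three to be equal. A sketch, assuming none of the columns is NULL:

```sql
-- Each if(...) contributes 1 when that pair of columns is equal, 0 otherwise.
select *
from T1
where if(col1 = col2, 1, 0)
    + if(col1 = col3, 1, 0)
    + if(col2 = col3, 1, 0) = 1;
```

Changing the final comparison to >= 1 would also accept rows where all three columns match.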


