How to get array/bag of elements from Hive group by operator?

The built-in aggregate function collect_set (documented here) gets you almost what you want. It would actually work on your example input:

SELECT F1, collect_set(F2)
FROM sample_table
GROUP BY F1

Unfortunately, it also removes duplicate elements, and I imagine this isn't your desired behavior. I find it odd that collect_set exists without a duplicate-preserving counterpart. Someone else apparently thought the same thing, and the top two answers to that question give you the UDAF you need.
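Note that on Hive 0.13.0 and later there is also a built-in collect_list, which works like collect_set but keeps duplicates:

SELECT F1, collect_list(F2)
FROM sample_table
GROUP BY F1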

How to count elements in group, when the element can be assigned to many groups

This should do the trick:

select
    main.group,
    count(1)
from
(
    select
        explode(split(group, ",")) as group
    from
        main
) main
group by
    main.group

If your query is more complex than the sample you provided, you might also like to read about lateral view.

Ref. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
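If your Hive version supports it, the same explode can be written with a lateral view, which also lets you keep other columns of the row alongside the exploded value. A sketch (the alias names here are illustrative):

select g.group_name, count(1)
from main
lateral view explode(split(main.group, ",")) g as group_name
group by g.group_name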

how to write case and group by in hive query

When you group by course and dept, you get multiple values for the status column (coming from different records), and these need to be handled: any column in the select list that is not part of the group by must be inside an aggregate function.

Here is a solution using the sum() function.

select course, dept,
    case when sum(case when status in ('fail', 'detained') then 1 else 0 end) > 0 then 'fail'
         when sum(case when status in ('absent') then 1 else 0 end) > 0 then 'absent'
         when sum(case when status in ('pass') then 1 else 0 end) > 0 then 'pass'
         else 'no_result'
    end as final_status
from college
group by
    course, dept

Rank/ group in hive

I'm not able to get the results in a single pass, so I have used intermediate tables. Hopefully the steps are self-explanatory. Check this out.

select * from sam;

+------------------+-----------+-------------+--+
| sam.record_date | sam.area | sam.reason |
+------------------+-----------+-------------+--+
| 2017-10-11 | A | a |
| 2017-10-12 | A | a |
| 2017-10-13 | B | b |
| 2017-10-14 | CDEF | A:B |
| 2017-10-15 | E | e |
| 2017-10-16 | F | f |
| 2017-10-17 | CDEF | E:F |
+------------------+-----------+-------------+--+

create table sam2 as
select s1.*, s2.record_date as dt2
from sam s1, (select record_date, area, reason from sam where area = 'CDEF') s2
where s2.record_date >= s1.record_date;

+-------------------+------------+--------------+-------------+--+
| sam2.record_date | sam2.area | sam2.reason | sam2.dt2 |
+-------------------+------------+--------------+-------------+--+
| 2017-10-11 | A | a | 2017-10-14 |
| 2017-10-12 | A | a | 2017-10-14 |
| 2017-10-13 | B | b | 2017-10-14 |
| 2017-10-14 | CDEF | A:B | 2017-10-14 |
| 2017-10-11 | A | a | 2017-10-17 |
| 2017-10-12 | A | a | 2017-10-17 |
| 2017-10-13 | B | b | 2017-10-17 |
| 2017-10-14 | CDEF | A:B | 2017-10-17 |
| 2017-10-15 | E | e | 2017-10-17 |
| 2017-10-16 | F | f | 2017-10-17 |
| 2017-10-17 | CDEF | E:F | 2017-10-17 |
+-------------------+------------+--------------+-------------+--+

create table sam3 as
select s.*, row_number() over (partition by record_date order by dt2) rn
from sam2 s;

+----------------+---------+-----------+-------------+-----+--+
| s.record_date | s.area | s.reason | s.dt2 | rn |
+----------------+---------+-----------+-------------+-----+--+
| 2017-10-11 | A | a | 2017-10-14 | 1 |
| 2017-10-11 | A | a | 2017-10-17 | 2 |
| 2017-10-12 | A | a | 2017-10-14 | 1 |
| 2017-10-12 | A | a | 2017-10-17 | 2 |
| 2017-10-13 | B | b | 2017-10-14 | 1 |
| 2017-10-13 | B | b | 2017-10-17 | 2 |
| 2017-10-14 | CDEF | A:B | 2017-10-14 | 1 |
| 2017-10-14 | CDEF | A:B | 2017-10-17 | 2 |
| 2017-10-15 | E | e | 2017-10-17 | 1 |
| 2017-10-16 | F | f | 2017-10-17 | 1 |
| 2017-10-17 | CDEF | E:F | 2017-10-17 | 1 |
+----------------+---------+-----------+-------------+-----+--+

select s.*, dense_rank() over (order by dt2) rk from sam3 s where rn = 1 order by record_date;

+----------------+---------+-----------+-------------+-------+-----+--+
| s.record_date | s.area | s.reason | s.dt2 | s.rn | rk |
+----------------+---------+-----------+-------------+-------+-----+--+
| 2017-10-11 | A | a | 2017-10-14 | 1 | 1 |
| 2017-10-12 | A | a | 2017-10-14 | 1 | 1 |
| 2017-10-13 | B | b | 2017-10-14 | 1 | 1 |
| 2017-10-14 | CDEF | A:B | 2017-10-14 | 1 | 1 |
| 2017-10-15 | E | e | 2017-10-17 | 1 | 2 |
| 2017-10-16 | F | f | 2017-10-17 | 1 | 2 |
| 2017-10-17 | CDEF | E:F | 2017-10-17 | 1 | 2 |
+----------------+---------+-----------+-------------+-------+-----+--+
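Depending on your Hive version, the intermediate tables can likely be avoided: a windowed min() with a ROWS frame finds, for each row, the next 'CDEF' date at or after it, and dense_rank() over that date yields the same rk column. A sketch, assuming your Hive supports ROWS window frames:

select t.*, dense_rank() over (order by dt2) as rk
from (
    select s.*,
           min(case when area = 'CDEF' then record_date end)
               over (order by record_date
                     rows between current row and unbounded following) as dt2
    from sam s
) t
order by record_date;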

Hive check elements in array

There are some collection functions in Hive (see the collection functions here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) which you can use here.

You can use the array_contains(Array<T>, value) function to check if item 1 is present and the size(Array<T>) function to make sure the length is 1. If both conditions are satisfied, you will get the desired output.
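Assuming a hypothetical table some_table with an array column named items, the check could look like this:

select *
from some_table
where array_contains(items, 1)
  and size(items) = 1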

How does Pig's COGROUP operator work?

There is a very clear example in Hadoop: The Definitive Guide. I hope the snippet below helps you understand the cogroup concept.

grunt> DUMP A;

(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)

grunt> DUMP B;

(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)

grunt> D = COGROUP A BY $0, B BY $1;

grunt> DUMP D;

(0,{},{(Ali,0)})
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})

COGROUP generates a tuple for each unique grouping key. The first field of each tuple is the key, and the remaining fields are bags of tuples from the relations with a matching key: the first bag contains the matching tuples from relation A, and the second bag contains the matching tuples from relation B.

If a relation has no matching tuples for a particular key, its bag is empty. For example, since no one has bought a scarf (with ID 1), the second bag in the tuple for that row is empty. This is an example of an outer join, which is the default type for COGROUP.


