Find Top 10 Latest Record for Each Buyer_Id for Yesterday's Date

Find TOP 10 latest record for each BUYER_ID for yesterday's date

SELECT FIRST 10 *
FROM TestingTable1
WHERE buyer_id = 34512201
ORDER BY created_time DESC;
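
The query above is limited to one hard-coded buyer and has no date filter. As a rough illustration of the per-buyer logic the title asks for (this is plain Python, not a Hive query), here is a minimal sketch. The function name, the tuple layout and the sample rows are assumptions made up for the example.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta
from heapq import nlargest

def top_n_latest_per_buyer(rows, today, n=10):
    """rows: iterable of (buyer_id, item_id, created_time) tuples (hypothetical layout).

    Keeps only rows created yesterday (relative to `today`), then returns
    the n most recent rows per buyer_id, newest first.
    """
    yesterday = today - timedelta(days=1)
    by_buyer = defaultdict(list)
    for row in rows:
        if row[2].date() == yesterday:
            by_buyer[row[0]].append(row)
    return {buyer: nlargest(n, group, key=lambda r: r[2])
            for buyer, group in by_buyer.items()}

# Made-up sample data: two buyers, one row falls outside yesterday.
sample = [
    (34512201, 1001, datetime(2012, 7, 9, 8, 0)),
    (34512201, 1002, datetime(2012, 7, 9, 9, 30)),
    (34512201, 1003, datetime(2012, 7, 8, 23, 59)),  # two days before "today": dropped
    (77700001, 2001, datetime(2012, 7, 9, 12, 0)),
]
result = top_n_latest_per_buyer(sample, today=date(2012, 7, 10), n=1)
```

In SQL terms this corresponds to numbering the rows of each buyer_id by created_time descending and keeping numbers 1 through 10.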

Returning the latest record for each category

So maybe this ROW_NUMBER()-based solution will work for you.
However, I have doubts about your WHERE condition. Maybe just remove it, unless you really do want to hide a job from yesterday that happens to be the latest job for a side?

SELECT
    job_num
FROM
(
    SELECT job_num,
           ROW_NUMBER() OVER (PARTITION BY RIGHT(LEFT(Skid_Serial, 2), 1)
                              ORDER BY Kit_Timestamp DESC) AS RN,
           Kit_Timestamp
    FROM Line_Cummins_MDC_Kit_Cell1.dbo.Line_Cummins_MDC_Kit LK
    WHERE CONVERT(date, kit_timestamp) > DATEADD(day, -1, CONVERT(date, GETDATE()))
) T
WHERE RN = 1
ORDER BY Kit_Timestamp DESC

Hive getting top n records in group by query

You can do it with a rank() UDF described here: http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/

SELECT page-id, user-id, clicks
FROM (
SELECT page-id, user-id, rank(user-id) as rank, clicks
FROM mytable
DISTRIBUTE BY page-id, user-id
SORT BY page-id, user-id, clicks desc
) a
WHERE rank < 5
ORDER BY page-id, rank
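
As a rough mental model of what the DISTRIBUTE BY / SORT BY plus rank() pattern does: rows arrive sorted by group key, and a counter restarts whenever the key changes. The Python sketch below imitates that. It is not the UDF's actual source; the function name, the `start` parameter (assumed to be 0 here) and the sample rows are made up for illustration.

```python
def running_rank(sorted_rows, key, start=0):
    """Yield (rank, row) for rows that are already sorted by group key.

    The counter restarts at `start` each time key(row) changes, which is the
    behaviour assumed here for a stateful rank-style UDF.
    """
    last_key = object()  # sentinel that never equals a real key
    counter = start
    for row in sorted_rows:
        k = key(row)
        if k != last_key:
            counter = start
            last_key = k
        yield counter, row
        counter += 1

# Made-up (page_id, user_id, clicks) rows.
rows = [("p1", "u1", 3), ("p1", "u2", 9), ("p2", "u3", 5), ("p1", "u3", 7)]
# Rough equivalent of SORT BY page_id, clicks DESC.
rows.sort(key=lambda r: (r[0], -r[2]))
# Keep the two highest-click rows per page.
top2 = [row for rank, row in running_rank(rows, key=lambda r: r[0]) if rank < 2]
```

The point to notice is that such a counter only makes sense when its argument is the column the rows are grouped by, and when the rows reach it already sorted by that column.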

Hive Query : To calculate max indicator value based on priority and date

with your_table as -------use your table instead of this subquery
(
select stack(6,

1 ,'A', 'y','n', '2009/01/01',
1 ,'B', 'n','y', '2019/02/09',
1 ,'C', null,'' , '2018/05/07',
2 ,'A', null,'y', '2005/02/02',
2 ,'B', null,'y', '2006/05/05',
2 ,'C', 'n', null, '2018/01/01'

) as (Ik, priority, ind1, ind2, date)
) -------use your table instead of this subquery

select ik,
max(case when priority ='A' and ind1='y' then 'y' else last_ind1 end) ind1,
max(case when priority ='A' and ind2='y' then 'y' else last_ind2 end) ind2
from
(
select Ik, priority, ind1, ind2, date,
last_value(ind1) over (partition by Ik order by date) last_ind1,
last_value(ind2) over (partition by Ik order by date) last_ind2
from your_table -------use your table instead
)s
group by ik;

Result:

ik  ind1  ind2
1   y     y
2   n     y

Find TOP 3 of two columns match

select bid, pid, [time]
from (
    select bid, pid, [time],
           rank() over (partition by bid, pid order by [time] desc) as k
    from #temp
) as x
where k <= 3
order by bid, pid, [time] desc

Oh, I'm in SQL Server. I don't think you are...

Anyway, my recommendation is to move your rank function inside the nested select you have. In the outer select you want to filter where it is three or less. I don't know your syntax, and I probably shouldn't have answered this question. Sorry... lol

Compare with the example here:
http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
Your rank() is in the outer select; it needs to be in the inner one. Leave the < 4 or <= 3 (or whatever) in the outer WHERE clause, though. Your query looks almost exactly like that example; it just needs a few changes.

Based on the link and my absolute LACK of knowledge of Hive, I think you might want this:

SELECT bid, pid, time
FROM (
SELECT bid, pid, rank(time) as rank, time
FROM $compTable
DISTRIBUTE BY bid, pid
SORT BY bid, pid, time desc
) a
WHERE rank < 4
ORDER BY bid, pid, time desc

And I can't test or compile this, because honestly I had no clue what Hive was before you posted your question. (Small world, I know. So sad, so true.)

Apache Hive Query HiveQL

After reading some more docs and the hints from the linked questions:

SELECT dealer, make, rank, type FROM (
SELECT dealer, make, rank() OVER (PARTITION BY type ORDER BY count DESC) AS rank, type FROM (
SELECT dealer, make, count(*) AS count, type FROM Sales WHERE dealer = "Xyz" GROUP BY dealer, type, make
) CountedSales
) RankedSales
WHERE RankedSales.rank < 3;

The inner query does the counting, the middle query applies rank(), and the outer query filters on rank.

Sales table contents

hive> select * from Sales;
OK
Xyz Highlander SUV NULL
Xyz Highlander SUV NULL
Xyz Rouge SUV NULL
Xyz Rouge SUV NULL
Xyz Prius HATCH NULL
Xyz Prius HATCH NULL
Xyz Prius HATCH NULL
Xyz Versa HATCH NULL
Xyz S3 SEDAN NULL
Xyz S3 SEDAN NULL
Xyz S3 SEDAN NULL
Xyz A8 SEDAN NULL
Xyz A8 SEDAN NULL
Xyz A8 SEDAN NULL
Xyz A8 SEDAN NULL
Time taken: 0.054 seconds, Fetched: 15 row(s)

Now the actual query.

hive> SELECT dealer, make, rank, type FROM (                                                                          
> SELECT dealer, make, rank() OVER (PARTITION BY type ORDER BY count DESC) AS rank, type FROM (
> SELECT dealer, make, count(*) AS count, type FROM Sales WHERE dealer = "Xyz" GROUP BY dealer, type, make
> ) CountedSales
> ) RankedSales
> WHERE RankedSales.rank < 3;
...
Execution completed successfully
MapredLocal task succeeded
OK
Xyz Prius 1 HATCH
Xyz Versa 2 HATCH
Xyz A8 1 SEDAN
Xyz S3 2 SEDAN
Xyz Rouge 1 SUV
Xyz Highlander 1 SUV
Time taken: 28.491 seconds, Fetched: 6 row(s)
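
Note that SUV comes back with two rank-1 rows: Rouge and Highlander both have a count of 2, and rank() gives tied rows the same rank. The Python sketch below reproduces the count-then-rank steps on the same data to show where that comes from. The function name is made up, and the NULL fourth column of the Sales table is left out.

```python
from collections import Counter

def ranked_makes(sales, limit=3):
    """sales: list of (dealer, make, type) tuples.

    Counts rows per (dealer, make, type), then applies SQL-style RANK()
    within each type ordered by count descending: tied counts share a
    rank. Keeps only rows with rank < limit.
    """
    counts = Counter(sales)
    by_type = {}
    for (dealer, make, typ), n in counts.items():
        by_type.setdefault(typ, []).append((n, dealer, make))
    out = []
    for typ in sorted(by_type):
        group = sorted(by_type[typ], key=lambda t: -t[0])
        for n, dealer, make in group:
            # RANK() = 1 + number of rows in the partition with a strictly higher count
            rank = 1 + sum(1 for other in group if other[0] > n)
            if rank < limit:
                out.append((dealer, make, rank, typ))
    return out

sales = (
    [("Xyz", "Highlander", "SUV")] * 2 + [("Xyz", "Rouge", "SUV")] * 2
    + [("Xyz", "Prius", "HATCH")] * 3 + [("Xyz", "Versa", "HATCH")]
    + [("Xyz", "S3", "SEDAN")] * 3 + [("Xyz", "A8", "SEDAN")] * 4
)
rows = ranked_makes(sales)
```

If exactly N rows per group are needed regardless of ties, a row-numbering function is the usual alternative to rank().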

Joining two Tables in Hive using HiveQL(Hadoop)

EDIT - PART 1
Okay - For some reason I am going to explain myself - so to start with I stumbled upon this question because of the SQL tag, and saw Hive, and started to not look and just skip it. BUT then I noticed it had been over a day and you had gotten no answers. I looked - I saw a SQL logic correction in the original query posted that I knew would be needed and would help, so I posted ONLY because no one had answered. I will try to address this last question - but after that I am keeping my advice to myself, as I may be giving bad advice. Good luck! I tried! And you seem to be getting answers now, so...

In TSQL, I could solve this entire problem with the below single query:

SELECT * 
FROM SO_Table1HIVE A
FULL OUTER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND (B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)

It would return everything, including the rows where only buyer_id/user_id matches. It won't pair a buyer_id/user_id row that has no match on either time or product in the other table, but it will return it as a separate row with NULLs in the other table's columns. I would not match these anyway: there is no accurate information provided to do it with, as explained below.

END EDIT PART 1

If you can't do FULL OUTER JOIN with OR in Hive, the simplest way to meet the original criteria is to UNION ALL 2 INNER JOINs. On one of the queries, in addition to joining the matching user_ids, join on the PRODUCT_ID AND in your WHERE look for TIMESTAMPS that don't match CREATED_TIME. On the second query, in addition to joining the matching user_ids, join on the times AND in your WHERE look for products that don't match.

EDIT PART 2 - UPDATE FOR COMMENT QUESTION ADDITIONAL CRITERIA

If I understand the last criteria it is any record in either table that has a matching user_id = buyer_id, but nothing else matches. The FULL OUTER JOIN with OR condition will return them, but there isn't enough provided info for a way to relate the records to each other. We can easily identify them, but have no way to tie them back to each other. If you do so and you have more than one record without a match in either OR both tables, there are going to be multiple entries for each.

Any query I wrote to try to tie them without more info (and probably with) would be a guess and inaccurate.

For example, in the first table if there were these 2 (sample fake) records with nothing matching in the second except user_id:

1015826235  420003038067  2011-11-03 19:40:21.000
1015826235  720003038067  2004-11-03 19:40:21.000

AND in table2 - these non matching:

1015826235  {"product_id":520003038067,"timestamps":"10...
1015826235  {"product_id":620003038067,"timestamps":"10...

You can identify them, but if you match them without more criteria you get 4 instead of 2:

1015826235  420003038067  2011-11-03 19:40:21.000  1015826235  520003038067
1015826235  420003038067  2011-11-03 19:40:21.000  1015826235  620003038067
1015826235  720003038067  2004-11-03 19:40:21.000  1015826235  520003038067
1015826235  720003038067  2004-11-03 19:40:21.000  1015826235  620003038067

My suggestion would be simply to identify them and show them, as below.

BUYER_ID    ITEM_ID       CREATED_TIME             USER_ID     PRODUCTID     timestamps
----------------------------------------------------------------------------------------------------
NULL        NULL          NULL                     1015826235  520003038067  2009-11-11 22:21:11.000
NULL        NULL          NULL                     1015826235  620003038067  2008-11-11 22:21:11.000
1015826235  420003038067  2011-11-03 19:40:21.000  NULL        NULL          NULL
1015826235  720003038067  2004-11-03 19:40:21.000  NULL        NULL          NULL

END EDIT PART 2 - UPDATE FOR COMMENT QUESTION ADDITIONAL CRITERIA - PART 1

I am working in TSQL, so I can't test an exact query in your syntax, but the join concepts are the same, and this will return what you want. I took your query and attempted your syntax; modify as needed. I tested it in TSQL. You may be able to take this and improve on it with functionality in HiveQL. There are other ways to do this, but this is the most straightforward one, and it will translate to HiveQL.

REMOVED, YOU GOT THIS PART AND IT IS INCLUDED LATER

(Again, modify syntax as needed)

SELECT *
FROM (
SELECT BUYER_ID,ITEM_ID,CREATED_TIME,PRODUCT_ID,TIMESTAMPS
FROM testingtable2 LATERAL VIEW
explode(purchased_item) exploded_table as prod_and_ts)
prod_and_ts
INNER JOIN table2 A ON A.BUYER_ID = prod_and_ts.[USER_ID] AND prod_and_ts.timestamps = UNIX_TIMESTAMP (table2.created_time)
WHERE prod_and_ts.product_id <> A.ITEM_ID
UNION ALL
SELECT BUYER_ID,ITEM_ID,CREATED_TIME,PRODUCT_ID,TIMESTAMPS
FROM testingtable2 LATERAL VIEW
explode(purchased_item) exploded_table as prod_and_ts)
prod_and_ts
INNER JOIN table2 A ON A.BUYER_ID = prod_and_ts.[USER_ID] AND prod_and_ts.product_id = A.ITEM_ID
WHERE prod_and_ts.timestamps <> UNIX_TIMESTAMP (table2.created_time)
) X

And here is my tested TSQL version with my table names for reference:

SELECT * 
FROM(
SELECT *
FROM SO_Table1HIVE A
INNER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND B.t1time = A.Created_TIME
WHERE B.PRODUCTID <> A.ITEM_ID
UNION ALL
SELECT *
FROM SO_Table1HIVE A
INNER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND B.PRODUCTID = A.ITEM_ID
WHERE B.t1time <> A.Created_TIME
) X

EDIT PART 3 - UPDATE FOR COMMENT QUESTION ADDITIONAL CRITERIA - PART 2

In TSQL the entire query (no unions) can be run using a FULL OUTER JOIN with an OR condition on the join

SELECT * 
FROM SO_Table1HIVE A
FULL OUTER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND (B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)

If you can't simply do the above, the SQL logic for the new criteria (grab the rows that don't match from both tables and show NULLs for the other table's columns) is to use a RIGHT JOIN and a LEFT JOIN.
RIGHT JOIN grabs anything in the first table that matches the second, plus everything in the second; LEFT JOIN does the opposite. Add the new queries to your UNION.

TSQL EXAMPLE - MODIFY FOR HIVE

SELECT * 
FROM SO_Table1HIVE A
RIGHT JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND (B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)
WHERE A.BUYER_ID IS NULL
UNION ALL
SELECT *
FROM SO_Table1HIVE A
LEFT JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND (B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)
WHERE B.[USER_ID] IS NULL
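
The RIGHT JOIN / LEFT JOIN pair above amounts to two anti-joins: keep the rows of each table that have no partner in the other under the join condition. A small Python sketch of that logic follows. The function name and tuple layouts are assumptions, the unmatched rows reuse the fake sample values from earlier, and the matching pair with item 111 is made up.

```python
def unmatched_rows(table1, table2):
    """table1 rows: (buyer_id, item_id, created_time)
    table2 rows: (user_id, product_id, t1time)

    Returns rows from either side with no partner under the condition
    buyer_id = user_id AND (t1time = created_time OR product_id = item_id),
    padded with None where the other table's columns would be NULL.
    """
    def matches(a, b):
        return a[0] == b[0] and (b[2] == a[2] or b[1] == a[1])

    # RIGHT JOIN ... WHERE A.BUYER_ID IS NULL: table2 rows with no table1 partner.
    only_in_2 = [(None, None, None) + b for b in table2
                 if not any(matches(a, b) for a in table1)]
    # LEFT JOIN ... WHERE B.USER_ID IS NULL: table1 rows with no table2 partner.
    only_in_1 = [a + (None, None, None) for a in table1
                 if not any(matches(a, b) for b in table2)]
    return only_in_2 + only_in_1

table1 = [
    (1015826235, 420003038067, "2011-11-03 19:40:21"),
    (1015826235, 720003038067, "2004-11-03 19:40:21"),
    (1015826235, 111, "2010-01-01 00:00:00"),  # made-up row that does match
]
table2 = [
    (1015826235, 520003038067, "2009-11-11 22:21:11"),
    (1015826235, 620003038067, "2008-11-11 22:21:11"),
    (1015826235, 111, "2010-01-01 00:00:00"),  # made-up row that does match
]
result = unmatched_rows(table1, table2)
```

Each unmatched row appears exactly once, which avoids the 2 x 2 blow-up you get when unmatched rows are paired on user_id alone.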

Or, if you want to grab them and match them as duplicates, add this to the UNION:

TSQL

SELECT * 
FROM SO_Table1HIVE A
JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID]
WHERE B.t1time NOT IN(SELECT Created_TIME FROM SO_Table1HIVE)
AND A.Created_TIME NOT IN(SELECT t1time FROM SO_Table2HIVE)
AND B.PRODUCTID NOT IN(SELECT ITEM_ID FROM SO_Table1HIVE)
AND A.ITEM_ID NOT IN(SELECT PRODUCTID FROM SO_Table2HIVE)

Again, Good luck!

Custom Mapper and Reducer vs HiveQL

The answer to your question is two-fold.

Firstly, if there is some processing that you can express in Hive QL syntax, I would argue that Hive's performance is comparable to that of writing custom map-reduce. The only catch here is when you have some extra information about your data that you make use of in your map-reduce code but not through Hive. For example, if your data is sorted, you may make use of this information when processing your file-splits in the mapper, whereas unless Hive is made aware of this sort order, it can't use that information to its advantage. Oftentimes there is a way to specify such extra information (through metadata or config properties), but sometimes there may not even be a way to specify this information for use by Hive.

Secondly, sometimes the processing can be convoluted enough that it is not easily expressible in SQL-like statements. These cases typically involve having to store intermediate state during your processing. Hive UDAFs alleviate this problem to some extent. However, if you need something more custom, I have always preferred plugging in a custom mapper and/or reducer using the Hive Transform functionality. It allows you to take advantage of map-reduce within the context of a Hive query, letting you mix and match Hive's SQL-like functionality with custom map-reduce scripts, all in the same query.
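
To make the point about state concrete, here is a minimal sketch of the kind of script that could be plugged in through Hive's TRANSFORM clause. Hive streams rows to the script as tab-separated lines on standard input and reads tab-separated lines back from its standard output. The column layout (key, value), the running-total logic and the function names are assumptions for illustration, and the rows for each key are assumed to arrive together.

```python
import sys

def running_totals(lines):
    """Yield 'key<TAB>value<TAB>running_total' for 'key<TAB>value' input lines.

    The current key and its running total are carried from one row to the
    next, which is the kind of state that is awkward in plain SQL-like
    statements. Assumes rows for each key arrive together (e.g. CLUSTER BY key).
    """
    current_key, total = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            current_key, total = key, 0
        total += int(value)
        yield f"{key}\t{value}\t{total}"

def main():
    # Entry point to call when the file is used as the TRANSFORM script.
    for out in running_totals(sys.stdin):
        print(out)

# Local check with made-up input lines instead of stdin.
sample = ["a\t1\n", "a\t4\n", "b\t2\n"]
result = list(running_totals(sample))
```

In a Hive query, a script like this would typically be shipped with ADD FILE and invoked along the lines of SELECT TRANSFORM(key, value) USING 'python running_totals.py' AS (key, value, running_total), reading from an inner query that is clustered by key. The file name and column names here are hypothetical.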

Long story short: if your processing is easily expressible through a Hive QL query, I don't see much reason to write map-reduce code to achieve the same thing. One of the main reasons Hive was created was to allow people like us to write SQL-like queries instead of writing map-reduce. If we end up writing map-reduce instead of quintessential Hive queries (for performance reasons or otherwise), one could argue that Hive hasn't done a good job at its primary objective. On the other hand, if you have some information about your data that Hive can't take advantage of, you might be better off writing a custom map-reduce implementation that makes use of that information. But then again, there is no need to write an entire map-reduce program when you can simply plug in the mappers and reducers using the Hive Transform functionality, as mentioned before.


