Is Limit Clause in Hive Really Random

Is LIMIT clause in HIVE really random?

Even though the documentation states it returns rows at random, that's not actually true.

It returns rows "chosen at random" only in the sense that, without a WHERE or ORDER BY clause, they come back in whatever order they happen to appear in storage. This means it's not really random (or randomly chosen) as you might think; it's just that the order the rows are returned in can't be predicted.

As soon as you slap an ORDER BY x DESC LIMIT 5 on there, it returns the last 5 rows of whatever you're selecting from.

To get rows returned at random, you would need to use something like: ORDER BY rand() LIMIT 1

However, it can have a speed impact if your indexes aren't set up properly. Usually I do a min/max to get the range of IDs in the table, generate a random number between them, and then select those records (in your case, it would be just 1 record). That tends to be faster than having the database do the work, especially on a large dataset.
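As a sketch of that min/max technique (the table and column names here are made up for illustration; the random id is generated by the calling application, not by Hive):

-- step 1: get the id range
select min(id) as min_id, max(id) as max_id from my_table;

-- step 2: pick a random number between min_id and max_id outside Hive,
-- then fetch the first row at or above it
select * from my_table where id >= ${random_id} limit 1;

This trades perfect uniformity for speed: rows after a gap in the id sequence are slightly more likely to be picked.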

HIVE: How does 'LIMIT' on 'SELECT * from' work under-the-hood?

If no optimizer is applied, Hive ends up scanning the entire table. But Hive optimizes this with hive.fetch.task.conversion, released as part of HIVE-2925, to ease simple queries with simple conditions and avoid running MR/Tez at all.

Supported values are none, minimal and more.

none: Disable hive.fetch.task.conversion (value added in Hive 0.14.0 with HIVE-8389)

minimal: SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only

more: SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)

Your question is more likely about what happens when minimal or more is set: Hive just scans through the added files and reads rows until it reaches leastRows(). For more detail, refer to the linked Hive source code (gitCode) and configuration reference (Config).
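As a hedged sketch of how this setting is used (the threshold property caps the input size, in bytes, that a fetch task may read; my_table is an illustrative name):

set hive.fetch.task.conversion=more;
set hive.fetch.task.conversion.threshold=1073741824;

-- with the above, a simple query like this runs as a fetch task,
-- reading rows directly instead of launching an MR/Tez job
select * from my_table limit 10;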

Hive 'limit' in subquery executes after full query

The limit is not applied after the case, but before and during processing the case - it actually gets applied twice. Although it is a coincidence, in this case the two applications of limit correspond to the inner and the outer query, respectively.

In the query plan you can see that the Map phase just selects a single column ("expressions: domainname") and also reduces the number of results to 10 (from 157462377267). This corresponds to the inner query. Then the Reduce phase applies the case ("expressions: CASE WHEN ((_col0 rlike '.*:443$')) THEN (1) ELSE (0) END") and also reduces the number of rows to 10, but you can see that the expected number of input rows is already 10 in this phase. The Reduce phase corresponds to the outer query.

The reason why the limit is applied twice is the distributed execution. Since at the end of the Map phase you want to minimize the amount of data sent to the Reducers, it makes sense to apply the limit there. After the limit is reached, the Mapper won't process any more of the input. This is not enough, however, since each Mapper may produce up to 10 results, adding up to ten times the number of Mappers, so the Reduce phase has to apply the limit again. Because of this mechanism, in general you should apply the limit directly instead of creating a subquery for this sole purpose.

To summarize, in my interpretation the query plan looks good - the limit is processed in the places where it should. This answers your question about why the limit gets applied before the case. Sadly, it does not explain why it takes so much time though.

Update: Please see ozw1z5rd's answer about why this query is slow in spite of using limit. It explains that using a subquery causes a MapReduce job to be launched, while a direct query avoids that.
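The recommendation above can be sketched as follows (using the column and pattern from the query plan; the table name is illustrative):

-- subquery form: launches a MapReduce job, limit applied in both phases
select case when domainname rlike '.*:443$' then 1 else 0 end
from (select domainname from my_table limit 10) t;

-- direct form: same result, and with fetch-task conversion enabled
-- it can avoid launching a job at all
select case when domainname rlike '.*:443$' then 1 else 0 end
from my_table
limit 10;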

How to list first 10 rows in each category in Hive SQL

You can use row_number() in most databases, including Hive. For 10 examples per category, for instance:

select t.*
from (select t.*,
             row_number() over (partition by category order by category) as seqnum
      from t
     ) t
where seqnum <= 10;

(Ordering by the partition column gives an arbitrary 10 rows per category; order by something else if you want a specific 10.)

Limit by month rather than all results

Here is one method:

select t.*
from (select t.*,
             row_number() over (partition by month order by rand()) as seqnum
      from t
     ) t
where seqnum <= 1000;

Huge Number of Ids in IN clause in Hadoop Hive query

You have a few options:

Join.
Put all the IDs into a file in HDFS, and create a table on top of the file's directory.

CREATE EXTERNAL TABLE table_ids(item_id int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/hive/data' --location (directory) in HDFS where the file is
;
select item_name from table a
inner join table_ids b on a.item_id=b.item_id

Using in_file:
Put all the IDs into a file, one id per row.

select item_name from table where in_file(item_id, '/tmp/myfilename'); --local file

Using join with stack, if it fits in memory:

select item_name from table a
inner join
(
select stack(10, --the number of IDs, add more IDs
0, 1, 2, 3, 4, 5, 6, 7, 8, 9) as (item_id)
) b
on a.item_id=b.item_id

Transforming hive IN subselect query combined with WHERE replacement

select id from foo 
left semi join
(SELECT id_2 FROM bar WHERE x=true RAND() LIMIT 100) x
ON foo.id=x.id_2

Should be like this.

I just don't understand this part: x=true RAND()

Also, like your query, this doesn't handle NULLs.


