Is LIMIT clause in HIVE really random?
Even though the documentation states that it returns rows at random, that's not actually true.
It returns "chosen rows at random" only in the sense of however they happen to appear in the table when no WHERE/ORDER BY clause is applied. That means it's not really random (or randomly chosen) in the way you might think; it's just that the order the rows are returned in can't be determined.
As soon as you add an order by x DESC limit 5 to the query, it returns the last 5 rows of whatever you're selecting from.
To get rows returned at random, you would need to use something like: order by rand() LIMIT 1
However, it can have a speed impact if your indexes aren't set up properly. Usually I do a min/max to get the range of IDs in the table, pick a random number between them, and then select the matching records (in your case, just 1 record). That tends to be faster than having the database do the work, especially on a large dataset.
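A minimal sketch of that min/max approach in HiveQL; the table items and its numeric id column are hypothetical stand-ins for your own schema:

```sql
-- Sketch of the min/max technique; `items` and its `id` column are hypothetical.
-- Works best when ids are dense (few gaps); gaps skew the distribution.
SELECT i.*
FROM items i
CROSS JOIN (
  -- Compute one random target inside the id range.
  SELECT floor(rand() * (MAX(id) - MIN(id)) + MIN(id)) AS target
  FROM items
) r
WHERE i.id >= r.target
ORDER BY i.id
LIMIT 1;
```

The WHERE i.id >= r.target guard (rather than an exact equality) makes the query tolerate gaps in the id sequence by taking the next existing row at or after the random target.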
HIVE: How does 'LIMIT' on 'SELECT * from' work under-the-hood?
If no optimizer is applied, Hive ends up scanning the entire table. But Hive optimizes this with hive.fetch.task.conversion, released as part of HIVE-2925, to handle simple queries with simple conditions without running an MR/Tez job at all. Supported values are none, minimal, and more.
none: Disable hive.fetch.task.conversion (value added in Hive 0.14.0 with HIVE-8389)
minimal: SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only
more: SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)
Your question is really about what happens when minimal or more is set.
Hive simply scans through the table's files and reads rows until it reaches the limit (see leastRows() in the fetch task code). For more detail, refer to the linked source on GitHub, the configuration documentation, and the answer here.
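As a quick illustration (the table name events is hypothetical), you can toggle the setting per session and compare whether a job gets launched:

```sql
-- Hypothetical table `events`. With fetch-task conversion disabled,
-- even this trivial query launches an MR/Tez job.
SET hive.fetch.task.conversion=none;
SELECT * FROM events LIMIT 10;

-- With 'more', the same query is served by a simple fetch task:
-- Hive reads the table's files directly and stops after 10 rows.
SET hive.fetch.task.conversion=more;
SELECT * FROM events LIMIT 10;
```

Running EXPLAIN on the query under each setting shows the difference: a full map-reduce stage in the first case versus a plain fetch stage in the second.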
Hive 'limit' in subquery executes after full query
The limit is not applied after the case, but before and during processing of the case - it actually gets applied twice. Although it is a coincidence, in this case the two applications of limit correspond to the inner and the outer query, respectively.
In the query plan you can see that the Map phase just selects a single column ("expressions: domainname") and also reduces the number of results to 10 (from 157462377267). This corresponds to the inner query. Then the Reduce phase applies the case ("expressions: CASE WHEN ((_col0 rlike '.*:443$')) THEN (1) ELSE (0) END") and also reduces the number of rows to 10, but you can see that the expected number of input rows is already 10 in this phase. The Reduce phase corresponds to the outer query.
The reason why the limit is applied twice is the distributed execution. Since at the end of the Map phase you want to minimize the amount of data sent to the Reducers, it makes sense to apply the limit there. Once the limit is reached, a Mapper won't process any more of its input. This is not enough on its own, however, since each Mapper may produce up to 10 results, adding up to ten times the number of Mappers, so the Reduce phase has to apply the limit again. Because of this mechanism, in general you should apply the limit directly instead of creating a subquery for this sole purpose.
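To illustrate that last point, the two forms below compute the same result (the table name logs is hypothetical; domainname and the CASE expression come from the query plan quoted above), but the direct form avoids the extra query layer:

```sql
-- Subquery form: the limit lives in the inner query, and the outer
-- SELECT still has to be planned as its own stage.
SELECT CASE WHEN (domainname rlike '.*:443$') THEN 1 ELSE 0 END AS is_443
FROM (SELECT domainname FROM logs LIMIT 10) t;

-- Direct form: apply the limit at the top level instead.
SELECT CASE WHEN (domainname rlike '.*:443$') THEN 1 ELSE 0 END AS is_443
FROM logs
LIMIT 10;
```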
To summarize, in my interpretation the query plan looks good - the limit is processed in the places where it should be. This answers your question about why the limit gets applied before the case. Sadly, it does not explain why it takes so much time.
Update: Please see ozw1z5rd's answer about why this query is slow in spite of using limit. It explains that using a subquery causes a MapReduce job to be launched, while a direct query avoids that.
How to list first 10 rows in each category in Hive SQL
You can use row_number()
in most databases, including Hive. For 10 examples per category, for instance:
select t.*
from (select t.*,
row_number() over (partition by category order by category) as seqnum
from t
) t
where seqnum <= 10;
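Note that ordering the window by the partition column itself makes the 10 rows per category arbitrary; if "first" should mean something specific, order by a different column. A sketch assuming a hypothetical created_at column:

```sql
-- Picks the 10 earliest rows per category, assuming a hypothetical
-- `created_at` column; substitute whatever column defines "first" for you.
select t.*
from (select t.*,
             row_number() over (partition by category order by created_at) as seqnum
      from t
     ) t
where seqnum <= 10;
```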
Limit by month rather than all results
Here is one method:
select t.*
from (select t.*,
             row_number() over (partition by month order by rand()) as seqnum
      from t
     ) t
where seqnum <= 1000;
Huge Number of Ids in IN clause in Hadoop Hive query
You have a few options:
Join.
Put all id's into a file in HDFS, create table on top of file directory.
CREATE EXTERNAL TABLE table_ids(item_id int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
location '/hive/data' --location (directory) in HDFS where the file is
;
select item_name from table a
inner join table_ids b on a.item_id=b.item_id
Using in_file:
Put all ids into file, one id in a row.
select item_name from table where in_file(item_id, '/tmp/myfilename'); --local file
Using join with stack, if it fits in memory:
select item_name from table a
inner join
(
select stack(10, --the number of IDs, add more IDs
0, 1, 2, 3, 4, 5, 6, 7, 8, 9) as (item_id)
) b
on a.item_id=b.item_id
Transforming hive IN subselect query combined with WHERE replacement
select id from foo
left semi join
(SELECT id_2 FROM bar WHERE x=true RAND() LIMIT 100) x
ON foo.id=x.id_2
It should be like this.
I just don't understand this part: x=true RAND()
Also, note that this doesn't handle NULLs, just like your original query.