Why Is Select Count(*) Slower Than Select * in Hive

Why is Select 1 faster than Select count(*)?

Since Oracle doesn't support IF EXISTS in PL/SQL, CodeByMidnight's suggestion to use EXISTS would normally be done with something like

SELECT 1
  INTO l_local_variable
  FROM dual
 WHERE EXISTS (
         SELECT 1
           FROM some_table
          WHERE some_column = some_condition );

Oracle knows that it can stop processing the WHERE EXISTS clause as soon as one row is found, so it doesn't have to potentially count a large number of rows that match the criteria. This is less of a concern, of course, if you are checking to see whether a row with a particular key exists than if you are checking a condition involving unindexed columns or checking a condition that might result in a large number of rows being returned.

(Note: I wish I could post this as a comment on CodeByMidnight's post, but comments can't include formatted code).

UPDATE: Given the clarification the original poster made in their comment, the short, definitive answer is that a SELECT 1 or SELECT COUNT(1) is no faster than a SELECT COUNT(*). Contrary to whatever coding guidelines you are looking at, COUNT(*) is the preferred way of counting all the rows. There was an old myth that a COUNT(1) was faster. At a minimum, that hasn't been true in any version of Oracle released in the past decade and it is unlikely that it was ever true. It was a widely held belief, however. Today, code that does a COUNT(1) rather than a COUNT(*) generally makes me suspect that the author is prone to believe various Oracle myths which is why I would suggest using COUNT(*).

Hive multiple distinct on query running slow?

An alternative approach, if the distinct counts are not too large (very large arrays will cause OOM errors): size(collect_set()) gives you the distinct count.

select count(*), size(collect_set(a)), size(collect_set(b)) from test.tablename;
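A minimal Python sketch (with hypothetical data) of why this works and where the OOM risk comes from: collect_set() materializes the full set of distinct values in memory, and size() is just its cardinality.

```python
# Hypothetical rows standing in for test.tablename.
rows = [
    {"a": "x", "b": 1},
    {"a": "y", "b": 1},
    {"a": "x", "b": 2},
    {"a": "x", "b": 2},
]

def collect_set(rows, col):
    """Mimic Hive's collect_set(): gather the unique values of a column
    into one in-memory set -- this set is what can blow up for
    high-cardinality columns."""
    return {r[col] for r in rows}

count_star = len(rows)                    # count(*)
distinct_a = len(collect_set(rows, "a"))  # size(collect_set(a))
distinct_b = len(collect_set(rows, "b"))  # size(collect_set(b))

print(count_star, distinct_a, distinct_b)  # 4 2 2
```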

Hive count distinct UDAF2

I'll try to explain.

Part 1:

What did the answerer mean by "Those 'distinct by' keys will be a part of the partitioning key of the MapReduce Shuffle phase"? Could you explain more about it?
The UDAF GenericUDAFCount is capable of both count and count distinct. How does it work to achieve count distinct?

Let's take the following query as an example:

select category, count(distinct brand) from market group by category;

One MapReduce Job will be launched for this query.

The distinct-by keys are the expressions (columns) inside count(distinct ...); in this case, brand.

The partition-by keys are the fields used to calculate a hash code for a record at the map phase. This hash value is then used to decide which partition a record should go to. Usually, the partition-by keys lie in the group by part of the SQL query; in this case, it's category.

The actual output key of the mappers will be the composition of the partition-by key and the distinct-by key. In the case above, a mapper's output key might look like (drink, Pepsi).

This design makes all rows with the same group-by key fall into the same reducer.

The value part of mappers' output doesn’t matter here.

Later, at the Shuffle phase, records are sorted according to the sort-by keys, which are the same as the output keys.

Then at the reduce phase, at each individual reducer, all records are sorted first by category, then by brand. This makes it easy to compute the count(distinct) aggregation: each distinct (category, brand) pair is guaranteed to be processed only once, so the aggregation has been turned into a count(*) within each group. The input key of a call to the reduce method will be one of these distinct pairs. The reducer keeps track of the composite key; whenever the category part changes, we know a new group has arrived and we start counting that group from 1.
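The whole pipeline above can be sketched in a few lines of Python (hypothetical data; a real job would hash-partition the composite keys across many reducers, but the counting logic is the same):

```python
# Sketch of: select category, count(distinct brand) from market group by category;
# as a single MapReduce job with composite keys.

market = [
    ("drink", "Pepsi"), ("drink", "Coke"), ("drink", "Pepsi"),
    ("snack", "Lays"),  ("snack", "Lays"),
]

# Map phase: output key = (partition-by key, distinct-by key); the value is irrelevant.
mapped = [((category, brand), None) for category, brand in market]

# Shuffle phase: sort by the full composite key. Partitioning uses only the
# category part, so all rows of a category land on the same reducer.
shuffled = sorted(mapped, key=lambda kv: kv[0])

# Reduce phase: each distinct (category, brand) pair is processed once, so
# count(distinct brand) becomes a plain count of key changes per category.
result = {}
prev_key = None
for key, _ in shuffled:
    if key != prev_key:  # a new distinct (category, brand) pair
        category = key[0]
        result[category] = result.get(category, 0) + 1
        prev_key = key

print(result)  # {'drink': 2, 'snack': 1}
```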

Part 2:

Why will there be only one reducer in this case?
When calculating a count distinct without a group by, like this:

    select count(distinct brand) from market

There will be just one reducer taking all the work. Why? Because the partition-by key doesn't exist; put another way, all records have the same hash code, so they all fall into the same reducer.

Part 3:

Why does the weird inner query cause more partitions?

The inner query's partition-by key is its group by key, id. There's a good chance that the id values are fairly evenly distributed, so the records are processed by many different reducers. After the inner query, it's safe to conclude that all the id values are distinct from each other, so a simple count(1) is all that's needed.

But do note that the outer query will still launch only one reducer. Why doesn't it suffer? Because no detailed values are needed for count(1); map-side aggregation hugely cuts down the amount of data processed by the reducers.

One more thing: this rewrite is not guaranteed to perform better, since it introduces an extra MR stage.
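The two-stage rewrite can be sketched like this (hypothetical data and reducer count; stage 1 is the inner group by, stage 2 is the outer count(1) whose mappers pre-aggregate partial counts):

```python
# Sketch of: select count(1) from (select id from tbl group by id) t
# Stage 1 spreads the group-by across many reducers; stage 2's mappers emit
# partial counts, so its single reducer only has to sum a few numbers.

ids = [3, 1, 2, 3, 1, 4, 2]
NUM_REDUCERS = 3

# Stage 1: group by id. Records are hash-partitioned on id, and each reducer
# ends up holding a disjoint set of distinct ids.
stage1_parts = {r: set() for r in range(NUM_REDUCERS)}
for i in ids:
    stage1_parts[hash(i) % NUM_REDUCERS].add(i)

# Stage 2: count(1) over stage 1's output. Map-side aggregation turns each
# partition into a single partial count before the shuffle.
partial_counts = [len(s) for s in stage1_parts.values()]
total = sum(partial_counts)  # the lone reducer just sums the partials

print(total)  # 4 distinct ids
```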


