Why is Select 1 faster than Select count(*)?
Since Oracle doesn't support IF EXISTS in PL/SQL, CodeByMidnight's suggestion to use EXISTS would normally be done with something like
SELECT 1
INTO l_local_variable
FROM dual
WHERE EXISTS(
SELECT 1
FROM some_table
WHERE some_column = some_condition );
Oracle knows that it can stop processing the WHERE EXISTS clause as soon as one row is found, so it doesn't have to potentially count a large number of rows that match the criteria. This is less of a concern, of course, if you are checking to see whether a row with a particular key exists than if you are checking a condition involving unindexed columns or checking a condition that might result in a large number of rows being returned.
(Note: I wish I could post this as a comment on CodeByMidnight's post, but comments can't include formatted code).
UPDATE: Given the clarification the original poster made in their comment, the short, definitive answer is that a SELECT 1
or SELECT COUNT(1)
is no faster than a SELECT COUNT(*)
. Contrary to whatever coding guidelines you are looking at, COUNT(*)
is the preferred way of counting all the rows. There was an old myth that a COUNT(1)
was faster. At a minimum, that hasn't been true in any version of Oracle released in the past decade and it is unlikely that it was ever true. It was a widely held belief, however. Today, code that does a COUNT(1)
rather than a COUNT(*)
generally makes me suspect that the author is prone to believe various Oracle myths which is why I would suggest using COUNT(*)
.
Hive multiple distinct on query running slow?
Alternative approach if you do not have too big counts (too big arrays will cause OOM). size(collect_set())
will give you the distinct count.
select count(*), size(collect_set(a)), size(collect_set(b)) from test.tablename;
Hive count distinct UDAF2
I'll try to explain.
Part 1:
What did the answerer mean by "Those 'distinct by' keys will be a part of the partitioning key of the MapReduce Shuffle phase"? Could you explain more about it?
The UDAFGenericUDAFCount
is capable of bothcount
andcount distinct
. How does it work to achievecount distinct
?
Let's take the following query as an example:
select category, count(distinct brand) from market group by category;
One MapReduce Job will be launched for this query.
distinct-by
keys are the expressions(columns) within count(distinct ...
, in this case, brand
.
partition-by
keys are the fields used to calculate a hash code for a record at map
phase. And then this hash value is used to decided which partition a record should go. Usually, partition-by
keys lies in the group by
part of a SQL query. In this case, it's category
.
The actual output-key
of mappers will be the composition of partition-by
key and a distinct-by
key. For the above case, a mapper's output key may be like (drink, Pepsi).
This design makes all rows with the same group-by
key fall into the same reducer.
The value
part of mappers' output doesn’t matter here.
Later at the Shuffle phase, records are sort according to the sort-by
keys, which is the same as the output key
.
Then at reduce
phase, at each individual reducer, all records are sorted first by category then by brand. This makes it easy to get the result of the count(distinct )
aggregation. Each distinct (category, brand) pair is guaranteed to be processed only once. The aggregation has been turned into a count(*)
at each group. The input key of a call to the reduce
method will be one of these distinct pairs. Reducer processes keep track of the composited key. Whenever the category part changes, we know a new group has come and we start counting this group from 1.
Part 2:
Why there will be only one reducer in this case?
When calculatingcount distinct
withoutgroup by
like this:
select count(distinct brand) from market
There will be just one reducer taking all the work. Why? Because the partition-by
key doesn’t exist, or we can say that all records has the same hash code. So they will fall into the same reducer.
Part 3:
Why the weird inner query will cause more partitions?
The inner query's partition-by
key is the group by
key, id
. There’s a chance that id
values are quite evenly distributed, so records are processed by many different reducers. Then after the inner query, it's safe to conclude that all the id
are different from each other. So now a simple count(1)
is all that's needed.
But do note that the output will launch only one reducer. Why doesn’t it suffer? Because no detailed values are needed for count(1)
, map-side aggregation hugely cut down the amount of data processed by reducers.
One more thing, this rewriting is not guaranteed to perform better since it introduces an extra MR stage.
Related Topics
Oracle SQL Order by in Subquery Problems!
Check Users in a Security Group in SQL Server
Are There Reasons for Not Storing Boolean Values in SQL as Bit Data Types
Why Does Connect by Level on a Table Return Extra Rows
Sql Server Store Multiple Values in SQL Variable
Sql Server 2008 Hierarchy Data Type Performance
T-Sql Stop or Abort Command in SQL Server
Select Same Column from Multiple Tables Only Where Something = Something
Write a Postgres Get or Create SQL Query
Indexing and Query High Dimensional Data in Postgresql
How to Get the Sum of All Column Values in the Last Row of a Resultset
How to Use The Same Table Twice in a Select Query
Return SQL Rows Where Field Contains Only Non-Alphanumeric Characters
Sql Server 2005 Using Charindex() to Split a String
Calling Shell Script from Pl/Sql, But Shell Gets Executed as Grid User, Not Oracle
How to Convert Result of an Select SQL Query into a New Table in Ms Access