Oracle Subquery Funniness

How does a correlated subquery with no aliases work?

This is your query:

SELECT *
FROM basetbl
WHERE b IN (SELECT b FROM nob WHERE c = 1);

First, you should always qualify column names, so the query should look more like this:

SELECT bt.*
FROM basetbl bt
WHERE bt.b IN (SELECT n.b FROM nob n WHERE n.c = 1);

Written that way, you would get an error, because nob has no column named b.

With the unqualified b, Oracle instead applies its default scoping rules (which are essentially the same in all SQL databases): when it doesn't find b in nob, it looks in the outer query. The query is therefore interpreted as:

SELECT bt.*
FROM basetbl bt
WHERE bt.b IN (SELECT bt.b FROM nob n WHERE n.c = 1);
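In practice this means the filter degenerates: the subquery returns bt.b once for every row of nob with c = 1, so the IN test compares bt.b against itself. The query therefore behaves like the following sketch (an equivalent rewrite for illustration, not something Oracle literally produces):

```sql
-- Equivalent behaviour: every row of basetbl with a non-NULL b is
-- returned, as long as nob has at least one row with c = 1, because
-- bt.b IN (bt.b, bt.b, ...) is always true for non-NULL bt.b.
SELECT bt.*
FROM basetbl bt
WHERE bt.b IS NOT NULL
  AND EXISTS (SELECT 1 FROM nob n WHERE n.c = 1);
```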

ORACLE - to subquery or not to subquery?

Either NOT EXISTS or NOT IN would work; perhaps you were fooled by the ORs in the query. Because AND binds more tightly than OR, you have to add some parentheses:

Using in, embedding the second query exactly as you wrote it:

SELECT ENTITY_ID
FROM KRIM_ENTITY_NM_T KE
WHERE ( TO_NUMBER(KE.PREFIX_NM) >= 5 OR
        lower(KE.TITLE_NM) LIKE '%professor%' OR
        lower(KE.TITLE_NM) LIKE '%senior lecturer%' OR
        lower(KE.SUFFIX_NM) LIKE '%professor%' OR
        lower(KE.SUFFIX_NM) LIKE '%senior lecturer%' )
  AND KE.ENTITY_ID NOT IN (
        SELECT MBR_ID
        FROM KRIM_GRP_MBR_T KG
        WHERE KG.MBR_TYP_CD = 'P' AND KG.GRP_ID = '2012'
      )

Using exists:

SELECT ENTITY_ID
FROM KRIM_ENTITY_NM_T KE
WHERE ( TO_NUMBER(KE.PREFIX_NM) >= 5 OR
        lower(KE.TITLE_NM) LIKE '%professor%' OR
        lower(KE.TITLE_NM) LIKE '%senior lecturer%' OR
        lower(KE.SUFFIX_NM) LIKE '%professor%' OR
        lower(KE.SUFFIX_NM) LIKE '%senior lecturer%' )
  AND NOT EXISTS (
        SELECT *
        FROM KRIM_GRP_MBR_T KG
        WHERE KG.MBR_TYP_CD = 'P' AND
              KG.GRP_ID = '2012' AND
              KG.MBR_ID = KE.ENTITY_ID
      )

oracle: tune correlated subqueries in select clause

This is quite a loaded question. There is not enough information here (e.g. execution plan, record counts, available indexes, physical location of data in the table, etc.) to tune the query properly. The best anyone can do is guess. That being said, here is my best guess:

select distinct
    tab_1.col_1,
    tab_1.col_2 AS report,
    tab_2.col_3 AS depot,
    COUNT(DISTINCT case when tab_3.col_8 = 'Parts Shortage' then tab_3.col_4 end)
        over (partition by tab_1.col_1, tab_1.col_2) AS p_shortages,
    COUNT(DISTINCT case when tab_3.col_8 = 'Parts Shortage' then TRIM(tab_3.col_10) end)
        over (partition by tab_1.col_1, tab_1.col_2) AS parts_affected,
    COUNT(case when tab_3.col_8 = 'Parts Shortage' and tab_3.col_7 = 1 then tab_3.col_7 end)
        over (partition by tab_1.col_1, tab_1.col_2) AS parts_category1,
    COUNT(case when tab_3.col_8 = 'Unsrv Asset' then tab_3.col_4 end)
        over (partition by tab_1.col_1, tab_1.col_2) AS u_shortages,
    COUNT(case when tab_3.col_8 = 'Unsrv Asset' then tab_3.col_10 end)
        over (partition by tab_1.col_1, tab_1.col_2) AS u_affected,
    COUNT(DISTINCT case when tab_3.col_8 = 'Unsrv Asset' and tab_3.col_7 = 1 then TRIM(tab_3.col_7) end)
        over (partition by tab_1.col_1, tab_1.col_2) AS unsrv_cat1,
    TO_CHAR(tab_1.col_11, 'MM/DD/YY') AS report_date
from tab_1
left outer join tab_2
    on tab_2.col_1 = tab_1.col_1
left outer join tab_3
    on tab_3.col_9 = tab_1.col_1
    and tab_3.col_6 = tab_1.col_2
    and tab_3.col_5 is null;

After looking closer at the query above to explain what I did, I modified it further. That was because I was grouping by TO_CHAR(tab_1.col_11, 'MM/DD/YY'), but saw this wasn't part of the correlated subquery criteria, so it had to change (unless the combination of tab_1.col_1 and col_2 is unique by itself).

So now to try to explain this:

Basically, what is happening in the original query is that for every row of tab_1, you are running several queries on tab_3. So instead of that, I changed it to an outer join on tab_3. Since I don't know the data, this had to be an outer join, because a correlated subquery won't eliminate any rows from the final output, whereas an inner join might. I joined to tab_3 only once, since all of the subqueries were joining tab_3 back to tab_1 using the same fields, and I moved the specific logic for each subquery into a case expression inside the count, so that the case returns null (and thus doesn't get counted) when the criteria aren't met.

This newest version uses analytic functions to get the calculations at the correct level (tab_1.col_1 and col_2, which is what the original subquery was joining on). Since converting the subquery to an outer join could create more rows (if there isn't a 1-to-1 match between the tables), I added the distinct to get only one row for each row in tab_1; all of the generated rows for a given tab_1 row will be identical. If there were already duplicates in tab_1, you will have to do something a little more in-depth to keep the number of records the same.

Hopefully this makes sense. If you have questions, feel free to ask and I will do my best to explain further.

--------------------------- More explanation

@shawno: Kind of, but I don't really think of it as a loop. Using your simplified example, let's pretend your tables look like this:

TAB_1:

col_1    col_2
-----    -----
A        B
C        D

TAB_3:

col_9    col_6    col_4
-----    -----    -----
A        B        X
A        B        Y
A        B        Z
C        D        X
C        D        X

Using the subquery method, you look at each row of tab_1 and then run a query against tab_3:

  • For the row col_1 = A, col_2 = B, run a select count(distinct(col_4)) on tab_3 where col_9 = A and col_6 = B. This returns the value 3, which is what the subquery returns for that row.

  • For the row col_1 = C, col_2 = D, run a select count(distinct(col_4)) on tab_3 where col_9 = C and col_6 = D. This returns the value 1, which is what the subquery returns for that row.
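Written out, the subquery method corresponds to a correlated subquery in the SELECT list, roughly like this (a sketch using the column names from the simplified example):

```sql
SELECT t1.col_1,
       t1.col_2,
       -- re-executed conceptually for every row of tab_1
       (SELECT COUNT(DISTINCT t3.col_4)
          FROM tab_3 t3
         WHERE t3.col_9 = t1.col_1
           AND t3.col_6 = t1.col_2) AS cnt
  FROM tab_1 t1;
```

For the row (A, B) this yields 3, and for (C, D) it yields 1, matching the walkthrough above.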

Using the join method, you first join the tables giving you data like:

col_1    col_2    col_9    col_6    col_4
-----    -----    -----    -----    -----
A        B        A        B        X
A        B        A        B        Y
A        B        A        B        Z
C        D        C        D        X
C        D        C        D        X

So now you just have to run a query over that data, doing a count(distinct(col_4)) for each value of col_1, col_2. If you know what your data looks like, you can create a more efficient query, but the idea remains the same.
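Expressed as a query, the join method over the simplified tables might look like this (a sketch; a GROUP BY stands in for the analytic functions used in the full answer):

```sql
SELECT t1.col_1,
       t1.col_2,
       -- one count per (col_1, col_2) group instead of one subquery per row
       COUNT(DISTINCT t3.col_4) AS cnt
  FROM tab_1 t1
  LEFT OUTER JOIN tab_3 t3
    ON t3.col_9 = t1.col_1
   AND t3.col_6 = t1.col_2
 GROUP BY t1.col_1, t1.col_2;
```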

Hope this makes it a little more clear!

Wrong SELECT subquery in 'IN' condition

The query is correct if table_B does not have a column named ID_NO but table_A does. In that case you have a correlated subquery, where the subquery's select ID_NO refers to the outer ID_NO attribute of table_A (which presumably makes no sense, but is valid as far as the compiler is concerned).

Consider the following schema:

create table table_a (
id_no int
);

create table table_b (
other_id_no int
);

insert into table_a values (1),(2);
insert into table_b values (1),(3);

Then the following query compiles, but it always yields an empty result, because it effectively means where id_no not in (id_no):

select * from  table_a where id_no not in (select id_no from table_b);

When dealing with subqueries, I suggest using table aliases in order to avoid such unintended behaviour. For example, the following query does not compile, and the error message hints at what is wrong:

select * from  table_a a where a.id_no not in (select b.id_no from table_b b);
Error: Unknown column 'b.id_no' in 'field list'

Correcting the error then leads to:

select * from  table_a a where a.id_no not in (select b.other_id_no from table_b b);

SQL - query inside NOT IN takes longer than the complete query?

It seems you are thinking that your main query implies the following steps:

(1) Run the subquery.
(2) Check each row in table1 against the result set from the subquery.

Therefore, you think that running the subquery separately must take less time than running the whole query.

But SQL is not a procedural language, and the structure of the query does not necessarily imply the steps that will be followed to execute the query.

As Guffa answered, the optimizer will come up with (what it believes is) an optimal plan to execute each query. These execution plans are not always obvious from looking at the query, and in some cases can indeed be quite counter-intuitive.

I think that it is most likely, in this case, that the optimizer has come up with a quicker method for checking whether a value exists in table2 than simply querying all of table2 at once. It could be the transformation Guffa showed (although that still does not tell you the exact execution plan being used).

I would guess that table1 has significantly fewer rows than table2, and that an index exists on table2.columnB. Then all the database has to do is fetch the rows from table1 and probe the index for each of those values to check for existence. But this is only one possibility.
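That strategy corresponds roughly to an anti-join rewrite, probing the index once per row of table1 instead of materializing all of table2. A sketch (column name columnA is assumed; the actual plan can only be confirmed with the execution plan, and note that NOT IN and NOT EXISTS treat NULLs differently, so this is illustrative only):

```sql
-- One possible shape of the optimizer's rewrite: for each row of
-- table1, probe the index on table2.columnB rather than scanning table2.
SELECT t1.*
  FROM table1 t1
 WHERE NOT EXISTS (SELECT 1
                     FROM table2 t2
                    WHERE t2.columnB = t1.columnA);
```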

Also, as Michael Buen pointed out, differences in the size of the result set returned can also impact your perceived performance. My intuition is that this is secondary to the execution plan differences, but it can be significant.

Select List of Column Names / Aliases from Custom Sub-Query

Here is how to do it in PL/SQL; I don't know if it is possible with plain Oracle SQL alone. You could always encapsulate it in some kind of function if needed.

DECLARE
   TYPE RefCursor_Type IS REF CURSOR;

   D_RefCur           RefCursor_Type;
   D_DescriptionTable DBMS_SQL.DESC_TAB2;
   D_ColumnCount      INTEGER;
   D_CursorHandle     INTEGER;
BEGIN
   OPEN D_RefCur
      FOR 'SELECT FIRST_NAME AS COL1, LAST_NAME AS COL2, ADDRESS AS COL3 FROM PEOPLE';

   D_CursorHandle := DBMS_SQL.TO_CURSOR_NUMBER(D_RefCur);

   DBMS_SQL.DESCRIBE_COLUMNS2(D_CursorHandle,
                              D_ColumnCount,
                              D_DescriptionTable);

   FOR idx IN 1 .. D_ColumnCount
   LOOP
      DBMS_OUTPUT.put_line(D_DescriptionTable(idx).col_name);
   END LOOP;
END;

Avoiding Nested Queries

It really depends; I have had situations where I improved queries by using subqueries.

The factors that I am aware of are:

  • whether the subquery uses fields from the outer query for comparison (i.e. whether it is correlated)
  • whether the relation between the outer query and the subquery is covered by indexes
  • if there are no usable indexes on the joins and the subquery is not correlated and returns a small result, it might be faster to use it
  • I have also run into situations in MySQL where taking a query that uses ORDER BY, removing the ORDER BY, turning the query into a simple subquery, and then sorting the result improved performance

Anyway, it is always good to test different variants (with SQL_NO_CACHE please), and turning correlated queries into joins is a good practice.

I would even go so far to call it a very useful practice.

If correlated queries are the first thing that comes to your mind, it may be that you are thinking primarily in terms of procedural operations rather than set operations; when dealing with relational databases, it is very useful to fully adopt the set perspective on the data model and the transformations on it.

EDIT:
Procedural vs Relational
Thinking in terms of set operations vs. procedural operations boils down to equivalences in set algebra expressions; for example, a selection on a union is equivalent to a union of selections. As far as the result is concerned, there is no difference between the two.

But when you compare the two as procedures (applying the selection criteria to every element of a union, versus forming the union and then applying the selection), they are distinctly different, and might have very different properties (for example CPU, I/O, and memory utilization).
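As a concrete SQL illustration of that equivalence (table names t1, t2 and the predicate x = 1 are placeholders):

```sql
-- Selection over a union: merge first, then filter.
SELECT *
  FROM (SELECT * FROM t1
        UNION
        SELECT * FROM t2) u
 WHERE u.x = 1;

-- Union of selections: filter first, then merge. Same result set,
-- but potentially a very different amount of intermediate work.
SELECT * FROM t1 WHERE x = 1
UNION
SELECT * FROM t2 WHERE x = 1;
```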

The idea behind relational databases is that you do not try to describe how to get the result (the procedure), but only what you want, and the database management system decides on the best path (procedure) to fulfil your request. This is why SQL is called a 4th-generation language (4GL).

One of the tricks that helps you do that is reminding yourself that tuples have no inherent order (set elements are unordered).
Another is realizing that relational algebra is quite comprehensive and allows translating requests (requirements) directly into SQL, provided the semantics of your model represent the problem space well; in other words, provided the meaning attached to the names of your tables and relationships is right, which is to say, provided your database is designed well.

Therefore, you do not have to think how, only what.

In your case it was just a preference for correlated queries, so I may not be telling you anything new, but you emphasized that point, hence the comment.

I think that if you were completely comfortable with all the rules that transform queries from one form into another (rules such as distributivity), you would not prefer correlated subqueries; you would see all the forms as equal.

(Note: the above discusses the theoretical background, which is important for database design; in practice the concepts deviate: not all equivalent rewrites of a query execute equally fast, clustered primary keys do give tables an inherent order on disk, etc. But these are only deviations; the fact that not all equivalent queries execute equally fast is an imperfection of the actual DBMS, not of the concepts behind it.)


