Is SQL IN Bad for Performance?

Is SQL IN bad for performance?

There are several considerations when writing a query using the IN operator that can affect performance.

First, most databases internally rewrite IN clauses using the OR logical connective. So col IN ('a','b','c') is rewritten to: (col = 'a') OR (col = 'b') OR (col = 'c'). The execution plans for both forms will likely be equivalent, assuming you have an index on col.

Second, when using either IN or OR with a variable number of arguments, you force the database to re-parse the query and rebuild an execution plan each time the arguments change. Building the execution plan for a query can be an expensive step. Most databases cache the execution plans for the queries they run, using the EXACT query text as a key. If you execute a similar query but with different argument values in the predicate, you will most likely cause the database to spend a significant amount of time parsing and building execution plans. This is why bind variables are strongly recommended as a way to ensure optimal query performance.
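As a minimal sketch of bind variables on SQL Server (the table, columns, and values here are hypothetical), sp_executesql keeps the query text constant across calls, so one cached plan is reused no matter which values are bound:

```sql
-- Parameterized query: the text is identical on every call, only the bound
-- values change, so the database can reuse a single cached execution plan.
EXEC sp_executesql
    N'SELECT * FROM orders WHERE customer_id = @cust AND status = @st',
    N'@cust INT, @st VARCHAR(10)',
    @cust = 42, @st = 'OPEN';
```

Note that a variable-length IN list defeats this on its own: each list length produces a different query text and therefore a different cache entry.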

Third, many databases have a limit on the complexity of the queries they can execute; one of those limits is the number of logical connectives that can be included in the predicate. In your case, a few dozen values are unlikely to reach the built-in limit of the database, but if you expect to pass hundreds or thousands of values to an IN clause, it can definitely happen, in which case the database will simply reject the query.

Fourth, queries that include IN and OR in the predicate cannot always be optimally rewritten in a parallel environment. There are various cases where parallel server optimizations do not get applied; MSDN has a decent introduction to optimizing queries for parallelism. Generally though, queries that use the UNION ALL operator are trivially parallelizable in most databases, and are preferred to logical connectives (like OR and IN) when possible.
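As a sketch of that rewrite (the table and column names are made up), a two-value OR predicate and its UNION ALL equivalent look like this:

```sql
-- OR form: one plan has to cover both predicates.
SELECT * FROM sales WHERE region = 'EU' OR region = 'US';

-- UNION ALL form: each branch is planned (and can be parallelized) on its own.
-- Only equivalent when the predicates are disjoint; a row satisfying both
-- would be returned twice.
SELECT * FROM sales WHERE region = 'EU'
UNION ALL
SELECT * FROM sales WHERE region = 'US';
```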

SQL IN Query Performance: Better to Split It or Not?

Let's say 1/3 of the visitors will watch through all of the 1000 ids while 2/3 of them will only watch the first 50.

Since you want to optimize the response based on how you assume visitors will behave: what would be better for performance/workload, one query for all 1000 ids, or splitting them into about 20 queries of 50 ids each? So when the first 50 have been watched, query for the next 50, and so on.

Yes, you are correct: you should limit the returned response.
Here is one example of how you can implement your requirement (I don't know much MySQL, but this is how you could get the desired result):

SELECT * FROM `table`
WHERE `id` IN (23221, 42422, 2342342....)
ORDER BY `id`
LIMIT 10 OFFSET 10

If it were SQL Server:

CREATE PROCEDURE sp_SomeName
    @id VARCHAR(8000),
    @skip INT,
    @take INT
AS
BEGIN
    -- Split the comma-separated @id list (STRING_SPLIT requires SQL Server 2016+)
    -- instead of hard-coding the values in the IN clause
    SELECT t.*
    FROM some_table AS t
    WHERE t.id IN (SELECT value FROM STRING_SPLIT(@id, ','))
    ORDER BY t.id
    OFFSET @skip ROWS          -- if 0, start selecting from the first row
    FETCH NEXT @take ROWS ONLY -- if 10, this is the maximum returned
END

What the query above does is: it gets all the rows for the posted ids, then orders them by id in ascending order. From there it returns just the first 10/50/100; the next time it returns the following 10/50/100, or whatever your take and skip choices are. Hope this helps!

Is IN (SELECT ...) Bad for Performance?

The SQL Server query optimizer is smart enough not to run the same subquery over and over again. If anything, the temp table is less optimal because of the additional steps after getting the results.

You can see this by looking at the SQL query execution plan.

Sample Image

Edit: After looking into this further, the subquery can also run more than once. Apparently the query optimizer can also do a lot of interesting things, like converting your IN to a JOIN to improve performance. There's lots of information on it here: Number of times a nested query is executed
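The IN-to-JOIN conversion mentioned above can be sketched roughly like this (hypothetical tables): the optimizer treats the IN subquery as a semi-join, which is logically the same as an EXISTS:

```sql
-- IN with a subquery...
SELECT o.*
FROM orders AS o
WHERE o.customer_id IN (SELECT c.id FROM customers AS c WHERE c.active = 1);

-- ...is typically executed as a semi-join, logically equivalent to:
SELECT o.*
FROM orders AS o
WHERE EXISTS (SELECT 1 FROM customers AS c
              WHERE c.id = o.customer_id AND c.active = 1);
```

A semi-join only tests for a match and never duplicates rows from the outer table, which is why the optimizer can use it safely here.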

Nonetheless, view your execution plan to see what your RDBMS's query optimizer decided to do.

SQL IN Clause Slows Down Performance

The problem was the indexes on the table that holds the user data.

Here is the solution:

1- Write your query in the Query Editor and click the "Display Estimated Execution Plan" button.

Sample Image

2- In the "Execution Plan" window, SQL Server gives you a hint and a query for an index that should be created on the table.

Sample Image

3- Right-click in the Execution Plan window and choose "Missing Index Details".

Sample Image

4- In the query page that opens, rename the index (change [] to whatever name you want) and run the query.

Sample Image

5- Then run your own query, the one that was slow as I mentioned in my question. After these steps my query ran in 4 seconds instead of 38.

SQL Server bad performance with a lot of ORs and repeating criteria using UPPER()

If you are able to make changes to the query, then remove the UPPER. This can be removed straightforwardly if you are on a case-insensitive collation (by far the most common case); otherwise you will need to add logic to ensure the values are uppercased before being added to the query. UPPER is not constant folded and can give worse plans than simple string literals, as shown in the various examples below.

Example Data

CREATE TABLE [Table]
(
    [COL1] VARCHAR(20),
    [COL2] VARCHAR(10),
    PRIMARY KEY ([COL1], [COL2])
)

INSERT INTO [Table]
SELECT TOP 100 'CONST_VALUE', CONCAT('v', ROW_NUMBER() OVER (ORDER BY @@SPID))
FROM sys.all_columns

Query 1

SELECT *
FROM [Table]
WHERE
(
    ([COL1] = 'CONST_VALUE' AND [COL2] = 'V1') OR
    ([COL1] = 'CONST_VALUE' AND [COL2] = 'V1') OR
    ([COL1] = 'CONST_VALUE' AND [COL2] = 'V4')
)

The execution plan for this has an index seek operator. Looking at the properties of the plan shows that the seek actually contains two different multi-column seek predicates, not three: it would be an error to perform the 'V1' seek twice and return those rows twice, even though it appears in the WHERE clause twice.

Sample Image

Query 2

SELECT *
FROM [Table]
WHERE
(
    ([COL1] = 'CONST_VALUE' AND [COL2] = UPPER('v1')) OR
    ([COL1] = 'CONST_VALUE' AND [COL2] = UPPER('V1')) OR
    ([COL1] = 'CONST_VALUE' AND [COL2] = UPPER('v2'))
)

This execution plan looks promising, but on closer inspection the seek is only on the single column COL1. As all rows in the table have the value 'CONST_VALUE', in this case the seek achieves nothing and all the work is done by a residual predicate.

Sample Image

Query 3

SELECT *
FROM [Table] WITH (FORCESEEK)
WHERE
(
    ([COL1] = 'CONST_VALUE' AND [COL2] = UPPER('v1')) OR
    ([COL1] = 'CONST_VALUE' AND [COL2] = UPPER('V1')) OR
    ([COL1] = 'CONST_VALUE' AND [COL2] = UPPER('v2'))
)

This is the same as the previous query but with a FORCESEEK hint added. The results of UPPER are not constant folded at compile time for some reason, so SQL Server adds extra operators to the plan to evaluate the UPPER and then collapses the identical results down to perform the two needed multi-column index seeks.

Sample Image

Query 4

SELECT *
FROM [Table]
WHERE
(
    ([COL1] = UPPER('CONST_VALUE') AND [COL2] = UPPER('v1')) OR
    ([COL1] = UPPER('CONST_VALUE') AND [COL2] = UPPER('V1')) OR
    ([COL1] = UPPER('CONST_VALUE') AND [COL2] = UPPER('v2'))
)

Now SQL Server gives up and just does a scan.

Sample Image

Query 5

SELECT *
FROM [Table]
WHERE [COL1] = UPPER('CONST_VALUE') AND
(
    [COL2] = UPPER('v1') OR
    [COL2] = UPPER('V1') OR
    [COL2] = UPPER('v2')
)

This rewrite gives the same execution plan as Query 2, with a seek on COL1 and a residual predicate on COL2. This is not useful with my example data but would be with more realistic cases.

Query 6

SELECT *
FROM sys.all_objects
where 'v1' <> 'v1'

SQL Server detects the contradiction at compile time and gives a very simple plan

Sample Image

Query 7

SELECT *
FROM sys.all_objects
where UPPER('v1') <> UPPER('v1')

Despite the fact that the expressions are deterministic and have exactly the same input values, no contradiction detection occurs.

Sample Image

How can I prove that the following T-SQL is bad for performance?

There is dynamic SQL, so no cache plans, meaning plans generated every time

Not necessarily true. Dynamic SQL can (and does) use cached plans just as well as static SQL. For dynamic search conditions, resolving to dynamic SQL is often the right answer. See Dynamic Search Conditions in T-SQL for more details.
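A minimal sketch of that approach (assuming this runs inside a procedure that declares @customer and @status; the orders table is hypothetical): the statement is built dynamically but stays parameterized, so each distinct shape of the query compiles once and its plan is reused:

```sql
DECLARE @sql NVARCHAR(MAX) = N'SELECT * FROM orders WHERE 1 = 1';
IF @customer IS NOT NULL SET @sql += N' AND customer_id = @customer';
IF @status   IS NOT NULL SET @sql += N' AND status = @status';

-- Parameterized dynamic SQL: each combination of conditions produces its own
-- query text, cached and reused on later calls with the same shape.
EXEC sp_executesql @sql,
    N'@customer INT, @status VARCHAR(10)',
    @customer = @customer, @status = @status;
```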

There is an INSERT SELECT pattern, so table locking is more likely to be an issue.

Not necessarily true, especially with a @tempTable.

have re-written some of the Stored Procedures in this way instead

Using multiple OR conditions like that is an anti-pattern. You are forcing the query optimizer to come up with a single plan that works for any value of all those parameters. Usually the only such plan is a scan that ignores any index. The original code was better.
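The anti-pattern being described is the catch-all query; a sketch (with a hypothetical table and parameters) and the usual mitigation:

```sql
-- Catch-all: one cached plan must work for every parameter combination,
-- so the optimizer typically falls back to a scan.
SELECT * FROM orders
WHERE (@customer IS NULL OR customer_id = @customer)
  AND (@status   IS NULL OR status = @status);

-- OPTION (RECOMPILE) trades plan reuse for a plan built for the current
-- values, letting the always-true branches be optimized away.
SELECT * FROM orders
WHERE (@customer IS NULL OR customer_id = @customer)
  AND (@status   IS NULL OR status = @status)
OPTION (RECOMPILE);
```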

Can anyone offer ... any better ways of proving my suspicions?

Yes. Measure. Use a methodology like Waits and Queues. Don't rely on your intuition. Find the bottlenecks and address them accordingly.


