Why Would an in Condition Be Slower Than "=" in Sql

Why would an IN condition be slower than = in sql?

Summary: This is a known problem in MySQL and was fixed in MySQL 5.6.x. The problem is due to a missing optimization when a subquery using IN is incorrectly indentified as dependent subquery instead of an independent subquery.


When you run EXPLAIN on the original query it returns this:


1 'PRIMARY' 'question_law_version' 'ALL' '' '' '' '' 10148 'Using where'
2 'DEPENDENT SUBQUERY' 'question_law_version' 'ALL' '' '' '' '' 10148 'Using where'
3 'DEPENDENT SUBQUERY' 'question_law' 'ALL' '' '' '' '' 10040 'Using where'

When you change IN to = you get this:


1 'PRIMARY' 'question_law_version' 'ALL' '' '' '' '' 10148 'Using where'
2 'SUBQUERY' 'question_law_version' 'ALL' '' '' '' '' 10148 'Using where'
3 'SUBQUERY' 'question_law' 'ALL' '' '' '' '' 10040 'Using where'

Each dependent subquery is run once per row in the query it is contained in, whereas the subquery is run only once. MySQL can sometimes optimize dependent subqueries when there is a condition that can be converted to a join but here that is not the case.

Now this of course leaves the question of why MySQL believes that the IN version needs to be a dependent subquery. I have made a simplified version of the query to help investigate this. I created two tables 'foo' and 'bar' where the former contains only an id column, and the latter contains both an id and a foo id (though I didn't create a foreign key constraint). Then I populated both tables with 1000 rows:

CREATE TABLE foo (id INT PRIMARY KEY NOT NULL);
CREATE TABLE bar (id INT PRIMARY KEY, foo_id INT NOT NULL);

-- populate tables with 1000 rows in each

SELECT id
FROM foo
WHERE id IN
(
SELECT MAX(foo_id)
FROM bar
);

This simplified query has the same problem as before - the inner select is treated as a dependent subquery and no optimization is performed, causing the inner query to be run once per row. The query takes almost one second to run. Changing the IN to = again allows the query to run almost instantly.

The code I used to populate the tables is below, in case anyone wishes to reproduce the results.

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt <= cnt DO
INSERT
INTO filler
SELECT _cnt;
SET _cnt = _cnt + 1;
END WHILE;
END
$$

DELIMITER ;

CALL prc_filler(1000);

INSERT foo SELECT id FROM filler;
INSERT bar SELECT id, id FROM filler;

SQL IN clause slower than individual queries

This is a known deficiency in MySQL.

It is often true that using UNION performs better than a range query like the one you show. MySQL doesn't employ indexes very intelligently for expressions using IN (...). A similar hole exists in the optimizer for boolean expressions with OR.

See http://www.mysqlperformanceblog.com/2006/08/10/using-union-to-implement-loose-index-scan-to-mysql/ for some explanation and detailed benchmarks.

The optimizer is being improved all the time. A deficiency in one version of MySQL may be improved in a subsequent version. So it's worth testing your queries on different versions.

It is also advantageous to use UNION ALL instead of simply UNION. Both queries use a temporary table to store results, but the difference is that UNION applies DISTINCT to the result set, which incurs an additional un-indexed sort.

IN vs OR in the SQL WHERE clause

I assume you want to know the performance difference between the following:

WHERE foo IN ('a', 'b', 'c')
WHERE foo = 'a' OR foo = 'b' OR foo = 'c'

According to the manual for MySQL if the values are constant IN sorts the list and then uses a binary search. I would imagine that OR evaluates them one by one in no particular order. So IN is faster in some circumstances.

The best way to know is to profile both on your database with your specific data to see which is faster.

I tried both on a MySQL with 1000000 rows. When the column is indexed there is no discernable difference in performance - both are nearly instant. When the column is not indexed I got these results:

SELECT COUNT(*) FROM t_inner WHERE val IN (1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000);
1 row fetched in 0.0032 (1.2679 seconds)

SELECT COUNT(*) FROM t_inner WHERE val = 1000 OR val = 2000 OR val = 3000 OR val = 4000 OR val = 5000 OR val = 6000 OR val = 7000 OR val = 8000 OR val = 9000;
1 row fetched in 0.0026 (1.7385 seconds)

So in this case the method using OR is about 30% slower. Adding more terms makes the difference larger. Results may vary on other databases and on other data.

Sql View with WHERE clause runs slower than a raw query

Why this happens? Does the view retrieve all records and then apply Where clause?

Yes, in this particular case, SQL Server will first execute the original underlying query, and then apply a WHERE filter on top of that intermediate result.

Is there a way to speed up the second query? (without applying indexes)

A SQL view generally performs as well as the underlying query. So, if Barcode is a good way to filter off many records, then adding an index to Barcode is the way to go. Other than this, there is not much you can do to speed up the view.

One option would be to create a materialized view, which is basically just a table whose data is generated by your view's query. Selecting all records from a materialized view, with no additional restrictions, should have a speed limited only by the time of data transfer.

IN vs OR of Oracle, which faster?

IN is preferable to OR -- OR is a notoriously bad performer, and can cause other issues that would require using parenthesis in complex queries.

Better option than either IN or OR, is to join to a table containing the values you want (or don't want). This table for comparison can be derived, temporary, or already existing in your schema.

SQL 'like' vs '=' performance

See https://web.archive.org/web/20150209022016/http://myitforum.com/cs2/blogs/jnelson/archive/2007/11/16/108354.aspx

Quote from there:

the rules for index usage with LIKE
are loosely like this:

  • If your filter criteria uses equals =
    and the field is indexed, then most
    likely it will use an INDEX/CLUSTERED
    INDEX SEEK

  • If your filter criteria uses LIKE,
    with no wildcards (like if you had a
    parameter in a web report that COULD
    have a % but you instead use the full
    string), it is about as likely as #1
    to use the index. The increased cost
    is almost nothing.

  • If your filter criteria uses LIKE, but
    with a wildcard at the beginning (as
    in Name0 LIKE '%UTER') it's much less
    likely to use the index, but it still
    may at least perform an INDEX SCAN on
    a full or partial range of the index.

  • HOWEVER, if your filter criteria uses
    LIKE, but starts with a STRING FIRST
    and has wildcards somewhere AFTER that
    (as in Name0 LIKE 'COMP%ER'), then SQL
    may just use an INDEX SEEK to quickly
    find rows that have the same first
    starting characters, and then look
    through those rows for an exact match.


(Also keep in mind, the SQL engine
still might not use an index the way
you're expecting, depending on what
else is going on in your query and
what tables you're joining to. The
SQL engine reserves the right to
rewrite your query a little to get the
data in a way that it thinks is most
efficient and that may include an
INDEX SCAN instead of an INDEX SEEK)

Is SQL IN bad for performance?

There are several considerations when writing a query using the IN operator that can have an effect on performance.

First, IN clauses are generally internally rewritten by most databases to use the OR logical connective. So col IN ('a','b','c') is rewritten to: (COL = 'a') OR (COL = 'b') or (COL = 'c'). The execution plan for both queries will likely be equivalent assuming that you have an index on col.

Second, when using either IN or OR with a variable number of arguments, you are causing the database to have to re-parse the query and rebuild an execution plan each time the arguments change. Building the execution plan for a query can be an expensive step. Most databases cache the execution plans for the queries they run using the EXACT query text as a key. If you execute a similar query but with different argument values in the predicate - you will most likely cause the database to spend a significant amount of time parsing and building execution plans. This is why bind variables are strongly recommended as a way to ensure optimal query performance.

Third, many database have a limit on the complexity of queries they can execute - one of those limits is the number of logical connectives that can be included in the predicate. In your case, a few dozen values are unlikely to reach the built-in limit of the database, but if you expect to pass hundreds or thousands of value to an IN clause - it can definitely happen. In which case the database will simply cancel the query request.

Fourth, queries that include IN and OR in the predicate cannot always be optimally rewritten in a parallel environment. There are various cases where parallel server optimization do not get applied - MSDN has a decent introduction to optimizing queries for parallelism. Generally though, queries that use the UNION ALL operator are trivially parrallelizable in most databases - and are preferred to logical connectives (like OR and IN) when possible.

SQL Server query is very slow when number of items inside IN clause more than 4

First, fix the query. Simple rule: Never use commas in the FROM clause.

select t1.id, t2.id, t1.name, t2.name 
from table1 t1 join
table2 t2
on t2.id = t1.ref_id left join
table3 t3
on t2.id = t3.id
where t1.ref_id in ('id1', 'id2', 'id3', 'id4', 'id5', ...);

Based on the query, you have no need for table3 -- unless you care about duplicate rows. I would remove it.

Then, you need to consider indexes. I would suggest table1(ref_id, id, name) and table2(id, name).

Also, if ref_id is really a number, then don't put single quotes around the values in the list. Mixing strings and numbers can confuse the optimizer.

SELECT COUNT(*) is slow, even with where clause

InnoDB uses clustered primary keys, so the primary key is stored along with the row in the data pages, not in separate index pages. In order to do a range scan you still have to scan through all of the potentially wide rows in data pages; note that this table contains a TEXT column.

Two things I would try:

  1. run optimize table. This will ensure that the data pages are physically stored in sorted order. This could conceivably speed up a range scan on a clustered primary key.
  2. create an additional non-primary index on just the change_event_id column. This will store a copy of that column in index pages which be much faster to scan. After creating it, check the explain plan to make sure it's using the new index.

(you also probably want to make the change_event_id column bigint unsigned if it's incrementing from zero)



Related Topics



Leave a reply



Submit