This Select Query Takes 180 Seconds to Finish

(Posting my comment as an answer, as apparently it did make a difference!)

Any difference if you change the IN to = ?

If anyone wants to investigate this further I've just done a test and found it very easy to reproduce.

Create Table

CREATE TABLE `filler` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
)

Create Procedure

DELIMITER $$

CREATE PROCEDURE `prc_filler`(cnt INT)
BEGIN
  DECLARE _cnt INT;
  SET _cnt = 1;
  WHILE _cnt <= cnt DO
    INSERT INTO filler
    SELECT _cnt;
    SET _cnt = _cnt + 1;
  END WHILE;
END
$$

DELIMITER ;

Populate Table

  call prc_filler(5000)

Query 1

SELECT id
FROM   filler
WHERE  id = ( SELECT MAX(id)
              FROM   filler
              WHERE  id = ( SELECT MIN(id)
                            FROM   filler
                            WHERE  id BETWEEN 2000 AND 3000 ) )

EXPLAIN output for the = query: http://img689.imageshack.us/img689/5592/equals.png

Query 2 (same problem)

SELECT id
FROM   filler
WHERE  id IN ( SELECT MAX(id)
               FROM   filler
               WHERE  id IN ( SELECT MIN(id)
                              FROM   filler
                              WHERE  id BETWEEN 2000 AND 3000 ) )

EXPLAIN output for the IN query: http://img291.imageshack.us/img291/8129/52037513.png

Why does this WHERE clause make my query 180 times slower?

After the chat room discussion, and after actually creating tables/columns to match the query, I've come up with the following query.

I have started my inner-most query on the sex, products (for shop_id) and favorites tables. Since you described that ProductX at ShopA has Product ID = 1, but the same ProductX at ShopB has Product ID = 2 (example only), each product is ALWAYS unique per shop and never duplicated. That said, I can get the product and shop_id WITH the count of favorites (if any) in this query, yet group on just the product_id; since shop_id won't change per product, I am using MAX(). Since you are always looking by a date of "yesterday" and gender (sex = 0, female), I would have the sex table indexed on (date, sex, product_id); I would guess you are not adding thousands of items every day. Products obviously would have an index on product_id (primary key), and favorites SHOULD have an index on product_id. A sketch of those indexes follows.
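
A minimal DDL sketch of those indexes, using the table and column names described above (adjust to your real schema; the index names are my own placeholders):

ALTER TABLE sex       ADD INDEX ix_sex_date_sex_product (`date`, sex, product_id);
ALTER TABLE favorites ADD INDEX ix_favorites_product (product_id);
-- products.product_id is assumed to already be the PRIMARY KEY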

From that result (alias "sxFav") we can then do a direct join to the sex and products tables on that product_id to get any additional information you may want, such as the name of the shop, the date the product was added, the product description, etc. This result is then ordered by the shop_id the product is being sold from, the date, and finally the product ID (but you may consider grabbing a description column in the inner query and using that as the sort-by). This results in alias "PreQuery".

With the ordering all proper by shop, we can now add the @MySQLVariable references to assign each product a row number, similar to how you originally attempted, only resetting back to 1 when the shop ID changes.

SELECT
      PreQuery.*,
      @num := IF( @current_shop_id = PreQuery.shop_id, @num + 1, 1 ) AS RowPerShop,
      @current_shop_id := PreQuery.shop_id AS shop_dummy
   FROM
      ( SELECT
              sxFav.product_id,
              sxFav.shop_id,
              sxFav.favorites_count
           FROM
              ( SELECT
                      sex.product_id,
                      MAX( p.shop_id ) AS shop_id,
                      SUM( CASE WHEN F.current = 1 AND F.closeted = 1 THEN  1
                                WHEN F.current = 1 AND F.closeted = 0 THEN -1
                                ELSE 0 END ) AS favorites_count
                   FROM sex
                        JOIN products p
                           ON sex.product_id = p.product_id
                        LEFT JOIN Favorites F
                           ON sex.product_id = F.product_id
                   WHERE sex.date >= SUBDATE( NOW(), INTERVAL 1 DAY )
                     AND sex.sex = 0
                   GROUP BY sex.product_id ) sxFav

              JOIN sex
                 ON sxFav.product_id = sex.product_id

              JOIN products p
                 ON sxFav.product_id = p.product_id
           ORDER BY
              sxFav.shop_id,
              sex.date,
              sxFav.product_id ) PreQuery,

      ( SELECT @num := 0,
               @current_shop_id := 0 ) AS SQLVars

Now, if you are looking for specific "paging" information (such as 7 entries per shop), wrap the ENTIRE query above into something like...

select * from ( entire query above ) AS Final where RowPerShop between 1 and 7

(or between 8 and 14, 15 and 21, etc. as needed)
or, with 1-based page numbers,

RowPerShop between (PageYouAreShowing - 1) * RowsPerPage + 1 and PageYouAreShowing * RowsPerPage
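
For example, page 3 at 7 rows per shop would be (note that MySQL requires an alias on the derived table, here Final):

SELECT Final.*
FROM ( /* entire query above */ ) AS Final
WHERE Final.RowPerShop BETWEEN 15 AND 21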

Oddly slow behaviour IN clause SQL

I would reverse it... start with your list of IDs that are qualified, then use THAT to join back to the original table to get the data. WHERE IN sub-selects are a frequent root of performance hits.

SELECT
T1.id,
T1.a,
T1.b,
T1.c,
T1.d
FROM
( SELECT id, count(*)
FROM table1
WHERE a = 1
AND time >= '2012-03-13 10:41:34.8431'
group by id
having count(*) > 4 ) as PreQualified
JOIN table1 T1
ON PreQualified.ID = T1.ID
AND T1.a = 1
AND T1.time >= '2012-03-13 10:41:34.8431'

I re-applied the same criteria to the JOIN clause since I don't know the rest of the data: you could have an ID with rows earlier than the time indicated that you want to exclude too, so the join retains the premise that each row must AT LEAST have a = 1 and be on/after the time specified.

By starting with a pre-filtered/qualified list of just IDs, you don't have to join every other ID, then throw them out when their total count is not enough.

Per input from Ami, I'm using his context to clarify dependent subqueries...

You should view the EXPLAIN result on your queries to determine if your query is running the subquery for each row of the outer query, which will appear in your EXPLAIN result as a DEPENDENT QUERY.

You can often turn a dependent subquery into a DERIVED table by joining on it, as is being done here.
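
As a generic before/after sketch (the table and column names here are hypothetical, not from the question):

-- may appear as DEPENDENT SUBQUERY and be re-run per outer row on older MySQL:
SELECT t.id
FROM t
WHERE t.id IN ( SELECT x.t_id FROM x WHERE x.flag = 1 );

-- derived-table rewrite: the inner result is materialized once, then joined:
SELECT t.id
FROM ( SELECT DISTINCT x.t_id FROM x WHERE x.flag = 1 ) AS d
JOIN t ON t.id = d.t_id;

The DISTINCT keeps the join from multiplying rows, preserving the IN semantics.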

Why would an IN condition be slower than = in sql?

Summary: This is a known problem in MySQL that was fixed in MySQL 5.6.x. The problem is due to a missing optimization: a subquery using IN is incorrectly identified as a dependent subquery instead of an independent subquery.


When you run EXPLAIN on the original query it returns this:


id | select_type        | table                | type | rows  | Extra
 1 | PRIMARY            | question_law_version | ALL  | 10148 | Using where
 2 | DEPENDENT SUBQUERY | question_law_version | ALL  | 10148 | Using where
 3 | DEPENDENT SUBQUERY | question_law         | ALL  | 10040 | Using where

(possible_keys, key, key_len and ref were all empty)

When you change IN to = you get this:


id | select_type | table                | type | rows  | Extra
 1 | PRIMARY     | question_law_version | ALL  | 10148 | Using where
 2 | SUBQUERY    | question_law_version | ALL  | 10148 | Using where
 3 | SUBQUERY    | question_law         | ALL  | 10040 | Using where

(again with empty possible_keys, key, key_len and ref)

Each dependent subquery is run once per row of the query it is contained in, whereas a plain subquery is run only once. MySQL can sometimes optimize dependent subqueries when there is a condition that can be converted to a join, but that is not the case here.

Now this of course leaves the question of why MySQL believes that the IN version needs to be a dependent subquery. I made a simplified version of the query to help investigate this. I created two tables, 'foo' and 'bar', where the former contains only an id column and the latter contains an id and a foo_id (though I didn't create a foreign key constraint), and populated both tables with 1000 rows:

CREATE TABLE foo (id INT PRIMARY KEY NOT NULL);
CREATE TABLE bar (id INT PRIMARY KEY, foo_id INT NOT NULL);

-- populate tables with 1000 rows in each

SELECT id
FROM foo
WHERE id IN
(
SELECT MAX(foo_id)
FROM bar
);

This simplified query has the same problem as before - the inner select is treated as a dependent subquery and no optimization is performed, causing the inner query to be run once per row. The query takes almost one second to run. Changing the IN to = again allows the query to run almost instantly.
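
For reference, this is the = form of the same simplified query, with only the operator changed:

SELECT id
FROM foo
WHERE id =
(
SELECT MAX(foo_id)
FROM bar
);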

The code I used to populate the tables is below, in case anyone wishes to reproduce the results.

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt <= cnt DO
INSERT
INTO filler
SELECT _cnt;
SET _cnt = _cnt + 1;
END WHILE;
END
$$

DELIMITER ;

CALL prc_filler(1000);

INSERT foo SELECT id FROM filler;
INSERT bar SELECT id, id FROM filler;

slow query time for mysql select

That query is going to sequentially scan the entire table because its only filter derives from a calculated value (distance). Adding a WHERE clause that filters by an indexed column will definitely help eliminate some overhead, but only if the DB believes there's enough data in the table to warrant using the index instead of scanning the table. Thus, make sure you ANALYZE the table as well.

The logic with the distance being computed in the query is incredibly ugly, but I can see why you wouldn't want to transmit 700k+ rows over the wire every time you select from the table. It looks like you may be doing spatial calculations, and it might be wise to investigate spatial data types and indexes.
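
A minimal sketch of that direction, assuming a hypothetical places table (SPATIAL indexes require a NOT NULL geometry column; InnoDB gained support in MySQL 5.7, before that MyISAM was required):

ALTER TABLE places ADD COLUMN location POINT NOT NULL;
CREATE SPATIAL INDEX idx_places_location ON places (location);

-- filter by bounding box via the index instead of computing distance per row:
SELECT id
FROM places
WHERE MBRContains(
        ST_GeomFromText('POLYGON((-1 -1, -1 1, 1 1, 1 -1, -1 -1))'),
        location);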

Edit: Also, regarding your question about a WHERE clause, the general clause order is:

SELECT fields [, aggregate fields]
FROM table
WHERE row_filter
GROUP BY fields
HAVING group_filter
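
For instance, a sketch with hypothetical names, showing that the WHERE prunes rows before grouping while the HAVING prunes groups after aggregation:

SELECT customer_id, COUNT(*) AS order_count
FROM orders                          -- hypothetical table
WHERE status = 'shipped'             -- row filter; can use an index
GROUP BY customer_id
HAVING COUNT(*) > 10;                -- group filter; applied after aggregation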

Mysql Exists vs IN -- correlated subquery vs subquery?

This is an RDBMS-agnostic answer, but it may help nonetheless. In my understanding, the correlated (aka dependent) subquery is perhaps the most often falsely accused culprit for bad performance.

The problem (as it is most often described) is that it processes the inner query for every row of the outer query. Therefore, if the outer query returns 1,000 rows, and the inner query returns 10,000, then your query has to slog through 10,000,000 rows (outer×inner) to produce a result. Compared to the 11,000 rows (outer+inner) from a non-correlated query over the same result sets, that ain't good.

However, this is just the worst-case scenario. In many cases, the DBMS will be able to exploit indexes to drastically reduce the row count. Even if only the inner query can use an index, the 10,000-row scan becomes roughly 13 index seeks per outer row (log2 of 10,000 ≈ 13), which drops the total down to about 13,000.

The EXISTS operator can stop processing inner rows after the first match, cutting the query cost down further, especially when most outer rows match at least one inner row.
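
A generic sketch of the two forms (the table and column names are hypothetical):

-- IN: conceptually builds the inner result, then probes it for each outer row
SELECT o.id
FROM outer_t AS o
WHERE o.id IN ( SELECT i.outer_id FROM inner_t AS i );

-- EXISTS: explicitly correlated, but can stop at the first matching inner row
SELECT o.id
FROM outer_t AS o
WHERE EXISTS ( SELECT 1 FROM inner_t AS i WHERE i.outer_id = o.id );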

In some rare cases, I have seen SQL Server 2008R2 optimise correlated subqueries to a merge join (which traverses both sets only once - best possible scenario) where a suitable index can be found in both inner and outer queries.

The real culprit for bad performance is not necessarily correlated subqueries, but nested scans.

Serious MySQL query performance issues after adding condition

Please provide SHOW CREATE TABLE.

I would hope to see these composite indexes:

`val`: (entityId, attributeId)   -- order is not critical

Alas, because code is LONGTEXT, this is not possible for entity: INDEX(type, code, entityId). Hence this will not be very efficient:

SELECT entityId
FROM entity
WHERE code = v9.Value
  AND type = 97
LIMIT 1

I see a LIMIT without an ORDER BY -- do you care which value you get?

Probably that would be better written as

WHERE EXISTS ( SELECT 1
               FROM entity
               WHERE entityID = e3.entityID
                 AND code = v9.Value
                 AND type = 97 )

(Are you sure about the mixture of e3 and v9?)

Wrapping...

This forces the LEFT JOIN to become a plain JOIN, and it gets rid of the inner ORDER BY.

Then the Optimizer probably decides it is best to start with 68e9145e-43eb-4581-9727-4212be41bef5, which I call val AS v11:

JOIN val AS v11 ON ( v11.entityId = e2.id
                 AND v11.attributeId = 1614
                 AND v11.Value = 'bar2' )

If this is an EAV table, then all it does is verify that attribute 1614 of that entity has the value 'bar2'. This does not seem like a sensible test.

This is in addition to my former recommendation.

I would prefer to see EXPLAIN SELECT ....

EAV

Assuming val is a traditional EAV table, this would probably be much better:

CREATE TABLE `val` (
  `attributeId` int(11) NOT NULL,
  `entityId` int(11) NOT NULL,
  `Value` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  PRIMARY KEY (`entityId`, `attributeId`),
  KEY `IX_val_attributeId` (`attributeId`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

The two IDs have no practical use (unless I am missing something). If you are forced to use them because of a framework, that is unfortunate. Promoting (entityId, attributeId) to be the PK makes fetching value a little faster.
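
As an illustrative sketch, this is the point lookup that the composite PRIMARY KEY serves (the literal values are placeholders):

SELECT `Value`
FROM val
WHERE entityId = 123         -- placeholder entity id
  AND attributeId = 1614;    -- attribute id used earlier in this answer

In InnoDB this becomes a single probe into the clustered index.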

There is no useful way to include a LONGTEXT in any index, so some of my previous suggestions need changing.


