This Select Query Takes 180 Seconds to Finish

(Posting my comment as an answer, as apparently it did make a difference!)

Any difference if you change the IN to = ?

If anyone wants to investigate this further I've just done a test and found it very easy to reproduce.

Create Table

CREATE TABLE `filler` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
)

Create Procedure

DELIMITER $$

CREATE PROCEDURE `prc_filler`(cnt INT)
BEGIN
  DECLARE _cnt INT;
  SET _cnt = 1;
  WHILE _cnt <= cnt DO
    INSERT INTO filler
    SELECT _cnt;
    SET _cnt = _cnt + 1;
  END WHILE;
END
$$

DELIMITER ;

Populate Table

  call prc_filler(5000)

Query 1

SELECT id
FROM   filler
WHERE  id = ( SELECT MAX(id)
              FROM   filler
              WHERE  id = ( SELECT MIN(id)
                            FROM   filler
                            WHERE  id BETWEEN 2000 AND 3000 ) )

EXPLAIN output for the = query: http://img689.imageshack.us/img689/5592/equals.png

Query 2 (same problem)

SELECT id
FROM   filler
WHERE  id IN ( SELECT MAX(id)
               FROM   filler
               WHERE  id IN ( SELECT MIN(id)
                              FROM   filler
                              WHERE  id BETWEEN 2000 AND 3000 ) )

EXPLAIN output for the IN query: http://img291.imageshack.us/img291/8129/52037513.png

Why does this WHERE clause make my query 180 times slower?

After the chat room discussion, and after actually creating tables/columns to match the query, I've come up with the following query.

I have started my inner-most query on the sex, products (for shop_id) and favorites tables. Since you described that ProductX at ShopA has Product ID = 1, but the same ProductX at ShopB has Product ID = 2 (example only), each product is ALWAYS unique per shop and never duplicated. That said, I can get the product and shop_id WITH the count of favorites (if any) in this query, yet group on just the product_id; since shop_id won't change per product, I am using MAX(). Since you are always looking by a date of "yesterday" and gender (sex = 0, female), I would have the sex table indexed on (date, sex, product_id); I would guess you are not adding thousands of items every day. Products obviously would have an index on product_id (primary key), and favorites SHOULD have an index on product_id. A sketch of those indexes follows.
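
A minimal DDL sketch of those indexes, using the table and column names described above (adjust to your real schema; the index names are my own placeholders):

ALTER TABLE sex       ADD INDEX ix_sex_date_sex_product (`date`, sex, product_id);
ALTER TABLE favorites ADD INDEX ix_favorites_product (product_id);
-- products.product_id is assumed to already be the PRIMARY KEY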

From that result (alias "sxFav") we can then do a direct join to the sex and products tables on that product_id to get any additional information you may want, such as the name of the shop, the date the product was added, the product description, etc. This result is then ordered by the shop_id the product is being sold from, the date, and finally the product ID (but you may consider grabbing a description column in the inner query and using that as the sort-by). This results in alias "PreQuery".

With the ordering all proper by shop, we can now add the @MySQLVariable references to assign each product a row number, similar to how you originally attempted, only resetting back to 1 when the shop ID changes.

SELECT
      PreQuery.*,
      @num := IF( @current_shop_id = PreQuery.shop_id, @num + 1, 1 ) AS RowPerShop,
      @current_shop_id := PreQuery.shop_id AS shop_dummy
   FROM
      ( SELECT
              sxFav.product_id,
              sxFav.shop_id,
              sxFav.favorites_count
           FROM
              ( SELECT
                      sex.product_id,
                      MAX( p.shop_id ) AS shop_id,
                      SUM( CASE WHEN F.current = 1 AND F.closeted = 1 THEN  1
                                WHEN F.current = 1 AND F.closeted = 0 THEN -1
                                ELSE 0 END ) AS favorites_count
                   FROM sex
                        JOIN products p
                           ON sex.product_id = p.product_id
                        LEFT JOIN Favorites F
                           ON sex.product_id = F.product_id
                   WHERE sex.date >= SUBDATE( NOW(), INTERVAL 1 DAY )
                     AND sex.sex = 0
                   GROUP BY sex.product_id ) sxFav

              JOIN sex
                 ON sxFav.product_id = sex.product_id

              JOIN products p
                 ON sxFav.product_id = p.product_id
           ORDER BY
              sxFav.shop_id,
              sex.date,
              sxFav.product_id ) PreQuery,

      ( SELECT @num := 0,
               @current_shop_id := 0 ) AS SQLVars

Now, if you are looking for specific "paging" information (such as 7 entries per shop), wrap the ENTIRE query above into something like...

select * from ( entire query above ) AS Final where RowPerShop between 1 and 7

(or between 8 and 14, 15 and 21, etc. as needed)
or, with 1-based page numbers,

RowPerShop between (PageYouAreShowing - 1) * RowsPerPage + 1 and PageYouAreShowing * RowsPerPage
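
For example, page 3 at 7 rows per shop would be (note that MySQL requires an alias on the derived table, here Final):

SELECT Final.*
FROM ( /* entire query above */ ) AS Final
WHERE Final.RowPerShop BETWEEN 15 AND 21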

Oddly slow behaviour IN clause SQL

I would reverse it... start with your list of IDs that are qualified, then use THAT to join back to the original table to get the data. WHERE IN sub-selects are a frequent root of performance hits.

SELECT
T1.id,
T1.a,
T1.b,
T1.c,
T1.d
FROM
( SELECT id, count(*)
FROM table1
WHERE a = 1
AND time >= '2012-03-13 10:41:34.8431'
group by id
having count(*) > 4 ) as PreQualified
JOIN table1 T1
ON PreQualified.ID = T1.ID
AND T1.a = 1
AND T1.time >= '2012-03-13 10:41:34.8431'

I re-applied the same criteria to the JOIN clause since I don't know the rest of the data: you could have an ID with rows earlier than the time indicated that you want to exclude too, so the join retains the premise that each row must AT LEAST have a = 1 and be on/after the time specified.

By starting with a pre-filtered/qualified list of just IDs, you don't have to join every other ID, then throw them out when their total count is not enough.

Per input from Ami, I'm using his context to clarify dependent subqueries...

You should view the EXPLAIN result on your queries to determine if your query is running the subquery for each row of the outer query, which will appear in your EXPLAIN result as a DEPENDENT QUERY.

You can often turn a dependent subquery into a DERIVED table by joining on it, as is being done here.
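
As a generic before/after sketch (the table and column names here are hypothetical, not from the question):

-- may appear as DEPENDENT SUBQUERY and be re-run per outer row on older MySQL:
SELECT t.id
FROM t
WHERE t.id IN ( SELECT x.t_id FROM x WHERE x.flag = 1 );

-- derived-table rewrite: the inner result is materialized once, then joined:
SELECT t.id
FROM ( SELECT DISTINCT x.t_id FROM x WHERE x.flag = 1 ) AS d
JOIN t ON t.id = d.t_id;

The DISTINCT keeps the join from multiplying rows, preserving the IN semantics.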

Why would an IN condition be slower than = in sql?

Summary: This is a known problem in MySQL that was fixed in MySQL 5.6.x. The problem is due to a missing optimization: a subquery using IN is incorrectly identified as a dependent subquery instead of an independent subquery.


When you run EXPLAIN on the original query it returns this:


id | select_type        | table                | type | rows  | Extra
 1 | PRIMARY            | question_law_version | ALL  | 10148 | Using where
 2 | DEPENDENT SUBQUERY | question_law_version | ALL  | 10148 | Using where
 3 | DEPENDENT SUBQUERY | question_law         | ALL  | 10040 | Using where

(possible_keys, key, key_len and ref were all empty)

When you change IN to = you get this:


id | select_type | table                | type | rows  | Extra
 1 | PRIMARY     | question_law_version | ALL  | 10148 | Using where
 2 | SUBQUERY    | question_law_version | ALL  | 10148 | Using where
 3 | SUBQUERY    | question_law         | ALL  | 10040 | Using where

(again with empty possible_keys, key, key_len and ref)

Each dependent subquery is run once per row of the query it is contained in, whereas a plain subquery is run only once. MySQL can sometimes optimize dependent subqueries when there is a condition that can be converted to a join, but that is not the case here.

Now this of course leaves the question of why MySQL believes that the IN version needs to be a dependent subquery. I made a simplified version of the query to help investigate this. I created two tables, 'foo' and 'bar', where the former contains only an id column and the latter contains an id and a foo_id (though I didn't create a foreign key constraint), and populated both tables with 1000 rows:

CREATE TABLE foo (id INT PRIMARY KEY NOT NULL);
CREATE TABLE bar (id INT PRIMARY KEY, foo_id INT NOT NULL);

-- populate tables with 1000 rows in each

SELECT id
FROM foo
WHERE id IN
(
SELECT MAX(foo_id)
FROM bar
);

This simplified query has the same problem as before - the inner select is treated as a dependent subquery and no optimization is performed, causing the inner query to be run once per row. The query takes almost one second to run. Changing the IN to = again allows the query to run almost instantly.
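
For reference, this is the = form of the same simplified query, with only the operator changed:

SELECT id
FROM foo
WHERE id =
(
SELECT MAX(foo_id)
FROM bar
);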

The code I used to populate the tables is below, in case anyone wishes to reproduce the results.

CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt <= cnt DO
INSERT
INTO filler
SELECT _cnt;
SET _cnt = _cnt + 1;
END WHILE;
END
$$

DELIMITER ;

CALL prc_filler(1000);

INSERT foo SELECT id FROM filler;
INSERT bar SELECT id, id FROM filler;

slow query time for mysql select

That query is going to sequentially scan the entire table because its only filter derives from a calculated value (distance). Adding a WHERE clause that filters by an indexed column will definitely help eliminate some overhead, but only if the DB believes there's enough data in the table to warrant using the index instead of scanning the table. Thus, make sure you ANALYZE the table as well.

The logic with the distance being computed in the query is incredibly ugly, but I can see why you wouldn't want to transmit 700k+ rows over the wire every time you select from the table. It looks like you may be doing spatial calculations, and it might be wise to investigate spatial data types and indexes.
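
A minimal sketch of that direction, assuming a hypothetical places table (SPATIAL indexes require a NOT NULL geometry column; InnoDB gained support in MySQL 5.7, before that MyISAM was required):

ALTER TABLE places ADD COLUMN location POINT NOT NULL;
CREATE SPATIAL INDEX idx_places_location ON places (location);

-- filter by bounding box via the index instead of computing distance per row:
SELECT id
FROM places
WHERE MBRContains(
        ST_GeomFromText('POLYGON((-1 -1, -1 1, 1 1, 1 -1, -1 -1))'),
        location);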

Edit: Also, regarding your question about a WHERE clause, the general clause order is:

SELECT fields [, aggregate fields]
FROM table
WHERE row_filter
GROUP BY fields
HAVING group_filter
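
For instance, a sketch with hypothetical names, showing that the WHERE prunes rows before grouping while the HAVING prunes groups after aggregation:

SELECT customer_id, COUNT(*) AS order_count
FROM orders                          -- hypothetical table
WHERE status = 'shipped'             -- row filter; can use an index
GROUP BY customer_id
HAVING COUNT(*) > 10;                -- group filter; applied after aggregation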

Mysql Exists vs IN -- correlated subquery vs subquery?

This is an RDBMS-agnostic answer, but it may help nonetheless. In my understanding, the correlated (aka dependent) subquery is perhaps the most often falsely accused culprit for bad performance.

The problem (as it is most often described) is that it processes the inner query for every row of the outer query. Therefore, if the outer query returns 1,000 rows, and the inner query returns 10,000, then your query has to slog through 10,000,000 rows (outer×inner) to produce a result. Compared to the 11,000 rows (outer+inner) from a non-correlated query over the same result sets, that ain't good.

However, this is just the worst-case scenario. In many cases, the DBMS will be able to exploit indexes to drastically reduce the row count. Even if only the inner query can use an index, the 10,000-row scan becomes roughly 13 index seeks per outer row (log2 of 10,000 ≈ 13), which drops the total down to about 13,000.

The EXISTS operator can stop processing inner rows after the first match, cutting the query cost down further, especially when most outer rows match at least one inner row.
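
A generic sketch of the two forms (the table and column names are hypothetical):

-- IN: conceptually builds the inner result, then probes it for each outer row
SELECT o.id
FROM outer_t AS o
WHERE o.id IN ( SELECT i.outer_id FROM inner_t AS i );

-- EXISTS: explicitly correlated, but can stop at the first matching inner row
SELECT o.id
FROM outer_t AS o
WHERE EXISTS ( SELECT 1 FROM inner_t AS i WHERE i.outer_id = o.id );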

In some rare cases, I have seen SQL Server 2008R2 optimise correlated subqueries to a merge join (which traverses both sets only once - best possible scenario) where a suitable index can be found in both inner and outer queries.

The real culprit for bad performance is not necessarily correlated subqueries, but nested scans.

Serious MySQL query performance issues after adding condition

Please provide SHOW CREATE TABLE.

I would hope to see these composite indexes:

`val`: (entityId, attributeId)   -- order is not critical

Alas, because code is LONGTEXT, this is not possible for entity: INDEX(type, code, entityId). Hence this will not be very efficient:

SELECT entityId
FROM entity
WHERE code = v9.Value
  AND type = 97
LIMIT 1

I see a LIMIT without an ORDER BY -- do you care which value you get?

Probably that would be better written as

WHERE EXISTS ( SELECT 1
               FROM entity
               WHERE entityID = e3.entityID
                 AND code = v9.Value
                 AND type = 97 )

(Are you sure about the mixture of e3 and v9?)

Wrapping...

This forces the LEFT JOIN to become a plain JOIN, and it gets rid of the inner ORDER BY.

Then the Optimizer probably decides it is best to start with 68e9145e-43eb-4581-9727-4212be41bef5, which I call val AS v11:

JOIN val AS v11 ON ( v11.entityId = e2.id
                 AND v11.attributeId = 1614
                 AND v11.Value = 'bar2' )

If this is an EAV table, then all it does is verify that attribute 1614 of that entity has the value 'bar2'. This does not seem like a sensible test.

This is in addition to my former recommendation.

I would prefer to see EXPLAIN SELECT ....

EAV

Assuming val is a traditional EAV table, this would probably be much better:

CREATE TABLE `val` (
  `attributeId` int(11) NOT NULL,
  `entityId` int(11) NOT NULL,
  `Value` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  PRIMARY KEY (`entityId`, `attributeId`),
  KEY `IX_val_attributeId` (`attributeId`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

The two IDs have no practical use (unless I am missing something). If you are forced to use them because of a framework, that is unfortunate. Promoting (entityId, attributeId) to be the PK makes fetching value a little faster.
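
As an illustrative sketch, this is the point lookup that the composite PRIMARY KEY serves (the literal values are placeholders):

SELECT `Value`
FROM val
WHERE entityId = 123         -- placeholder entity id
  AND attributeId = 1614;    -- attribute id used earlier in this answer

In InnoDB this becomes a single probe into the clustered index.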

There is no useful way to include a LONGTEXT in any index, so some of my previous suggestions need changing.


