How to Further Optimize a Derived Table Query Which Performs Better Than the Joined Equivalent

How can I further optimize a derived table query which performs better than the JOINed equivalent?

Well, I found a solution. It took a lot of experimentation, and I think a good bit of blind luck, but here it is:

CREATE TABLE magic ENGINE=MEMORY
SELECT
    s.shop_id AS shop_id,
    s.id AS shift_id,
    st.dow AS dow,
    st.start AS start,
    st.end AS end,
    su.user_id AS manager_id
FROM shifts s
JOIN shift_times st ON s.id = st.shift_id
JOIN shifts_users su ON s.id = su.shift_id
JOIN shift_positions sp ON su.shift_position_id = sp.id AND sp.level = 1;

ALTER TABLE magic ADD INDEX (shop_id, dow);

CREATE TABLE tickets_extra ENGINE=MyISAM
SELECT
    t.id AS ticket_id,
    (
        SELECT m.manager_id
        FROM magic m
        WHERE DAYOFWEEK(t.created) = m.dow
            AND TIME(t.created) BETWEEN m.start AND m.end
            AND m.shop_id = t.shop_id
    ) AS manager_created,
    (
        SELECT m.manager_id
        FROM magic m
        WHERE DAYOFWEEK(t.resolved) = m.dow
            AND TIME(t.resolved) BETWEEN m.start AND m.end
            AND m.shop_id = t.shop_id
    ) AS manager_resolved
FROM tickets t;
DROP TABLE magic;

Lengthy Explanation

Now I'll explain why this works, along with my thought process and the steps that got me here.

First, I knew the query I was trying was suffering because of the huge derived table and the subsequent JOINs onto it. I was taking my well-indexed tickets table, joining all the shift_times data onto it, then letting MySQL chew on that while it attempted to join the shifts and shift_positions tables. This derived behemoth would be an unindexed mess of up to 2 million rows.

Now, I knew this was happening. The reason I was going down this road, though, was because the "proper" way to do this, using strictly JOINs, was taking even longer. This is due to the nasty bit of chaos required to determine who the manager of a given shift is. I have to join down to shift_times to find out what the correct shift even is, while simultaneously joining down to shift_positions to figure out the user's level. I don't think the MySQL optimizer handles this very well; it ends up creating a HUGE monstrosity of a temporary table from the joins, then filtering out what doesn't apply.
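For reference, the strictly-JOINs version was shaped roughly like this (reconstructed from memory, so treat the details as approximate):

SELECT t.id AS ticket_id, su.user_id AS manager_id
FROM tickets t
JOIN shifts s ON s.shop_id = t.shop_id
JOIN shift_times st ON st.shift_id = s.id
    AND st.dow = DAYOFWEEK(t.created)
    AND TIME(t.created) BETWEEN st.start AND st.end
JOIN shifts_users su ON su.shift_id = s.id
JOIN shift_positions sp ON sp.id = su.shift_position_id AND sp.level = 1;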

So, as the derived table seemed to be the "way to go", I stubbornly persisted in it for a while. I tried punting it down into a JOIN clause; no improvement. I tried creating a temporary table from the derived table, but again it was too slow, as the temp table was unindexed.

I came to realize that I had to handle this calculation of shifts, times, and positions sanely. I thought maybe a VIEW would be the way to go. What if I created a VIEW that contained this information: (shop_id, shift_id, dow, start, end, manager_id)? Then I would simply have to join the tickets table by shop_id and the whole DAYOFWEEK/TIME calculation, and I'd be in business. Of course, I failed to remember that MySQL handles VIEWs rather assily. It doesn't materialize them at all; it simply runs the query you would have used to get the view for you. So by joining tickets onto this, I was essentially running my original query - no improvement.
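For what it's worth, the VIEW I tried was essentially the magic SELECT from above wrapped in a view (the name here is just illustrative):

CREATE VIEW shift_managers AS
SELECT s.shop_id, s.id AS shift_id, st.dow, st.start, st.end, su.user_id AS manager_id
FROM shifts s
JOIN shift_times st ON s.id = st.shift_id
JOIN shifts_users su ON s.id = su.shift_id
JOIN shift_positions sp ON su.shift_position_id = sp.id AND sp.level = 1;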

So, instead of a VIEW I decided to use a TEMPORARY TABLE. This worked well if I only fetched one of the managers (created or resolved) at a time, but it was still pretty slow. Also, I found out that MySQL won't let you refer to the same TEMPORARY table twice in the same query (I would have to join my temporary table twice to differentiate between manager_created and manager_resolved). This is a big WTF, as I can do it as long as I don't specify "TEMPORARY" - which is where CREATE TABLE magic ENGINE=MEMORY came into play.
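To illustrate the limitation, a minimal sketch (the names here are made up); referencing the temporary table twice fails with something like:

CREATE TEMPORARY TABLE magic_tmp ENGINE=MEMORY
SELECT 1 AS manager_id;

SELECT a.manager_id, b.manager_id
FROM magic_tmp a
JOIN magic_tmp b;
-- ERROR 1137 (HY000): Can't reopen table: 'a'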

With this pseudo-temporary table in hand, I tried my JOIN for just manager_created again. It performed well, but was still rather slow. Yet when I JOINed again to get manager_resolved in the same query, the query time ticked back up into the stratosphere. The EXPLAIN showed the full table scan of tickets (~2 million rows), as expected, and the JOINs onto the magic table at ~2,087 rows each. Again, I seemed to be running into fail.
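The double JOIN in question looked roughly like this (sketched from memory):

SELECT t.id AS ticket_id,
    mc.manager_id AS manager_created,
    mr.manager_id AS manager_resolved
FROM tickets t
LEFT JOIN magic mc ON mc.shop_id = t.shop_id
    AND mc.dow = DAYOFWEEK(t.created)
    AND TIME(t.created) BETWEEN mc.start AND mc.end
LEFT JOIN magic mr ON mr.shop_id = t.shop_id
    AND mr.dow = DAYOFWEEK(t.resolved)
    AND TIME(t.resolved) BETWEEN mr.start AND mr.end;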

I now began to think about how to avoid the JOINs altogether, and that's when I found some obscure, ancient message board post where someone suggested using subselects (I can't find the link in my history). This is what led to the second SELECT query shown above (the one that creates tickets_extra). Selecting just a single manager field performed well, but again with both it was crap. I looked at the EXPLAIN and saw this:

*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: t
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 173825
Extra:
*************************** 2. row ***************************
id: 3
select_type: DEPENDENT SUBQUERY
table: m
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 2037
Extra: Using where
*************************** 3. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
table: m
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 2037
Extra: Using where
3 rows in set (0.00 sec)

Ack, the dreaded DEPENDENT SUBQUERY. It's often suggested to avoid these, as MySQL will usually execute them outside-in, running the inner query once for every row of the outer. I ignored this and wondered: "Well... what if I just indexed this stupid magic table?" Thus, the ADD INDEX (shop_id, dow) was born.

Check this out:

mysql> CREATE TABLE magic ENGINE=MEMORY
<snip>
Query OK, 3220 rows affected (0.40 sec)

mysql> ALTER TABLE magic ADD INDEX (shop_id, dow);
Query OK, 3220 rows affected (0.02 sec)

mysql> CREATE TABLE tickets_extra ENGINE=MyISAM
<snip>
Query OK, 1933769 rows affected (24.18 sec)

mysql> drop table magic;
Query OK, 0 rows affected (0.00 sec)

Now THAT'S what I'm talkin' about!

Conclusion

This is definitely the first time I've created a non-TEMPORARY table on the fly, and INDEXed it on the fly, simply to run a single query efficiently. I guess I always assumed that adding an index on the fly was a prohibitively expensive operation. (Adding an index to my 2-million-row tickets table can take over an hour.) Yet for a mere 3,000 rows this is a cakewalk.

Don't be afraid of DEPENDENT SUBQUERIES, creating TEMPORARY tables that really aren't, indexing on the fly, or aliens. They can all be good things in the right situation.

Thanks for all the help StackOverflow. :-D

Which has better performance: derived tables or temporary tables?

A derived table is a logical construct.

It may be stored in tempdb, built at runtime by reevaluating the underlying statement each time it is accessed, or even optimized out entirely.

A temporary table is a physical construct: a table in tempdb that is actually created and populated with the values.

Which one is better depends on the query they are used in, the statement that is used to derive a table, and many other factors.

For instance, CTEs (common table expressions) in SQL Server can (and most probably will) be reevaluated each time they are used. This query:

WITH q (uuid) AS
(
    SELECT NEWID()
)
SELECT *
FROM q
UNION ALL
SELECT *
FROM q

will most probably yield two different NEWID() values.

In this case, a temporary table should be used since it guarantees that its values persist.
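A minimal sketch of the temporary table equivalent (SQL Server syntax): NEWID() is evaluated once when #q is populated, so both branches of the UNION see the same value.

SELECT NEWID() AS uuid
INTO #q;                 -- materialized once

SELECT * FROM #q
UNION ALL
SELECT * FROM #q;        -- same uuid twice

DROP TABLE #q;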

On the other hand, this query:

SELECT *
FROM (
    SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM master
) q
WHERE rn BETWEEN 80 AND 100

is better with a derived table, because using a temporary table will require fetching all values from master, while this solution will just scan the first 100 records using the index on id.
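For contrast, a sketch of what the temporary table version would have to do: number and copy every row of master up front, instead of stopping after the first 100 index entries.

SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS rn
INTO #q                  -- copies and numbers the whole table
FROM master;

SELECT * FROM #q
WHERE rn BETWEEN 80 AND 100;

DROP TABLE #q;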

How do I tell the MySQL Optimizer to use the index on a derived table?

There is a solution to this in MySQL Server 5.6 - the preview release (at the time of this writing).

http://dev.mysql.com/doc/refman/5.6/en/from-clause-subquery-optimization.html

I'm not sure, though, whether the MySQL optimizer will re-use indexes that already exist when it "adds indexes to the derived table".

Consider the following query:

SELECT * FROM t1
JOIN (SELECT * FROM t2) AS derived_t2 ON t1.f1=derived_t2.f1;

The documentation says: "The optimizer constructs an index over column f1 from derived_t2 if doing so would permit the use of ref access for the lowest cost execution plan."

OK, that's great, but does the optimizer re-use indexes from t2? In other words, what if an index existed for t2.f1? Does this index get re-used, or does the optimizer recreate this index for the derived table? Who knows?

EDIT: The best solution until MySQL 5.6 is to create a temporary table, create an index on that table, and then run the SELECT query on the temp table.
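Using the example tables above, that workaround would look something like this (a sketch):

CREATE TEMPORARY TABLE derived_t2
SELECT * FROM t2;

ALTER TABLE derived_t2 ADD INDEX (f1);

SELECT * FROM t1
JOIN derived_t2 ON t1.f1 = derived_t2.f1;

DROP TEMPORARY TABLE derived_t2;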

How do I ask for help optimizing & fixing queries in MySQL?

Use SHOW CREATE TABLE


This tells me more about your tables than your words ever could:

mysql> show create table magic\G
*************************** 1. row ***************************
Table: magic
Create Table: CREATE TABLE `magic` (
`id` int(11) DEFAULT NULL,
`what` varchar(255) DEFAULT NULL,
`the` datetime DEFAULT NULL,
`heck` text,
`soup_is_good` double DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

CAVEAT: If you have 70 columns in your table, omit the unnecessary ones. What's necessary?

  • Fields JOINed on
  • Fields SELECTed
  • Fields WHEREed on

Use EXPLAIN


This allows me to see how best to optimize your currently working, yet presumably slow query:

mysql> explain select *     from magic\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: magic
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1
Extra:
1 row in set (0.00 sec)

Use \G


Having to scroll right is generally an inconvenience.

Usual:

mysql> select * from magic;
+------------+-------------------------------+---------------------+-------------------+--------------+
| id         | what                          | the                 | heck              | soup_is_good |
+------------+-------------------------------+---------------------+-------------------+--------------+
| 1000000000 | A really long text string yay | 2009-07-29 22:28:17 | OOOH A TEXT FIELD |        100.5 |
+------------+-------------------------------+---------------------+-------------------+--------------+
1 row in set (0.00 sec)

Better:

mysql> select * from magic\G
*************************** 1. row ***************************
id: 1000000000
what: A really long text string yay
the: 2009-07-29 22:28:17
heck: OOOH A TEXT FIELD
soup_is_good: 100.5
1 row in set (0.00 sec)

CAVEAT: \G obviously turns one row of data into several lines. This becomes equally cumbersome for several rows of data. Do what looks best.

Use an external pastebin for obnoxiously large chunks of data:

  • Pastie
  • gist.github

Let us know your expectations


  • Slow? - We don't know what slow is to you. Seconds, minutes, hours? It helps to know.
  • Faster - We don't know this either. What's your expectation of fast?
  • Frequency - Is this a query that you plan to run just once? Daily? Hundreds or thousands of times a day? This helps us know when it's Good Enough.

How to optimize execution plan for query with multiple outer joins to huge tables, group by and order by clauses?

Bill Karwin suggests the query might perform better if it used an index with a leading column of manufacture. I second that suggestion, especially if that column is very selective.

I also note that we're doing a GROUP BY t.id, where id is the PRIMARY KEY of the table.

No columns from any tables other than tracking are referenced in the SELECT list.

This suggests we're really only interested in returning rows from t, and not on creating duplicates due to multiple outer joins.

Seems like the COUNT() aggregate has the potential to return an inflated count if there are multiple matching rows in tracking_items and bikes, cars, trucks. If there are three matching rows from cars and four matching rows from bikes, the outer joins produce 3 x 4 = 12 joined rows, so the COUNT() aggregate is going to return a value of 12 rather than 7. (Or maybe there is some guarantee in the data such that there won't ever be multiple matching rows.)

If manufacture is very selective, and the query can make use of an index on it to return a reasonably small set of rows from tracking ...

And since we aren't returning any columns from any tables other than tracking, apart from a count of related items ...

I would be tempted to test correlated subqueries in the SELECT list, to get the count, and filter out the zero count rows using a HAVING clause.

Something like this:

SELECT SQL_NO_CACHE `t`.*
     , ( ( SELECT COUNT(1)
             FROM `tracking_items` `tic`
             JOIN `cars` `c`
               ON `c`.`car_id` = `tic`.`tracking_object_id`
              AND `c`.`car_text` LIKE '%europe%'
            WHERE `tic`.`tracking_id` = `t`.`id`
              AND `tic`.`tracking_type` = 1
         )
       + ( SELECT COUNT(1)
             FROM `tracking_items` `tib`
             JOIN `bikes` `b`
               ON `b`.`bike_id` = `tib`.`tracking_object_id`
              AND `b`.`bike_text` LIKE '%europe%'
            WHERE `tib`.`tracking_id` = `t`.`id`
              AND `tib`.`tracking_type` = 2
         )
       + ( SELECT COUNT(1)
             FROM `tracking_items` `tit`
             JOIN `trucks` `tr`
               ON `tr`.`truck_id` = `tit`.`tracking_object_id`
              AND `tr`.`truck_text` LIKE '%europe%'
            WHERE `tit`.`tracking_id` = `t`.`id`
              AND `tit`.`tracking_type` = 3
         )
       ) AS cnt_filtered_items
  FROM `tracking` `t`
 WHERE `t`.`manufacture` IN ('1256703406078', '9600048390403', '1533405067830')
HAVING cnt_filtered_items > 0
 ORDER
    BY `t`.`date_last_activity` ASC
     , `t`.`id` ASC

We'd expect that the query could make effective use of an index on tracking with leading column of manufacture.

And on the tracking_items table, we want an index with leading columns of type and tracking_id. And including tracking_object_id in that index would mean the query could be satisfied from the index, without visiting the underlying pages.

For the cars, bikes and trucks tables the query should make use of an index with leading column of car_id, bike_id, and truck_id respectively. There's no getting around a scan of the car_text, bike_text, truck_text columns for the matching string... the best we can do is narrow down the number of rows that need to have that check performed.

This approach (just the tracking table in the outer query) should eliminate the need for the GROUP BY, and with it the work required to identify and collapse duplicate rows.

BUT this approach, replacing joins with correlated subqueries, is best suited to queries where the outer query returns a SMALL number of rows. Those subqueries get executed for every row processed by the outer query, so it's imperative that those subqueries have suitable indexes available. Even with those tuned, there is still potential for horrible performance on large sets.

This does still leave us with a "Using filesort" operation for the ORDER BY.


If the count of related items should be the product of a multiplication, rather than addition, we could tweak the query to achieve that. (We'd have to muck with the return of zeros, and the condition in the HAVING clause would need to be changed.)

If there wasn't a requirement to return a COUNT() of related items, then I would be tempted to move the correlated subqueries from the SELECT list down into EXISTS predicates in the WHERE clause.
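Roughly like this (a sketch showing just the cars branch; bikes and trucks would follow the same pattern):

SELECT SQL_NO_CACHE t.*
  FROM tracking t
 WHERE t.manufacture IN ('1256703406078', '9600048390403', '1533405067830')
   AND ( EXISTS ( SELECT 1
                    FROM tracking_items tic
                    JOIN cars c
                      ON c.car_id = tic.tracking_object_id
                     AND c.car_text LIKE '%europe%'
                   WHERE tic.tracking_id = t.id
                     AND tic.tracking_type = 1
                )
         -- OR EXISTS ( ... same shape for bikes, tracking_type = 2 ... )
         -- OR EXISTS ( ... same shape for trucks, tracking_type = 3 ... )
       )
 ORDER BY t.date_last_activity ASC, t.id ASC;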


Additional notes: seconding the comments from Rick James regarding indexing... there appear to be redundant indexes defined, i.e.

KEY `manufacture` (`manufacture`)
KEY `manufacture_date_last_activity` (`manufacture`, `date_last_activity`)

The index on the singleton column isn't necessary, since there is another index that has the column as the leading column.

Any query that can make effective use of the manufacture index will be able to make effective use of the manufacture_date_last_activity index. That is to say, the manufacture index could be dropped.

The same applies for the tracking_items table, and these two indexes:

KEY `tracking_id` (`tracking_id`)
KEY `tracking_id_tracking_object_id` (`tracking_id`,`tracking_object_id`)

The tracking_id index could be dropped, since it's redundant.
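That is, using the index names shown above:

ALTER TABLE tracking DROP INDEX manufacture;
ALTER TABLE tracking_items DROP INDEX tracking_id;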

For the query above, I would suggest adding a covering index:

KEY `tracking_items_IX3` (`tracking_id`,`tracking_type`,`tracking_object_id`)

-or- at a minimum, a non-covering index with those two columns leading:

KEY `tracking_items_IX3` (`tracking_id`,`tracking_type`)

Optimize query selecting a period

There is one caveat to my solution: you must be using the MyISAM engine for the events table. If you cannot use MyISAM then this solution won't work, because only MyISAM supports Spatial Indexes.

So, assuming that the above isn't an issue for you, the following should work and give you good performance:

This solution makes use of MySQL's support for Spatial Data (see documentation here). While spatial data types can be added to a variety of storage engines, only MyISAM supports Spatial R-Tree Indexes (see documentation here), which are needed in order to get the performance we're after. One other limitation is that spatial data types only work with numerical data, so you cannot use this technique with string-based range queries.

I won't go into the details of the theory behind how spatial types work and why the spatial index is useful, but you should look at Jeremy Cole's explanation here regarding how to use spatial data types and indexes for GeoIP lookups. Also look at the comments, as they raise some useful points and alternatives if you need raw performance and can give up some accuracy.

The basic premise is that we can take the start/end values and use the two of them to create four distinct points, one for each corner of a rectangle straddling the x-axis on an xy grid, and then do a quick lookup into the spatial index to determine whether the particular point in time we care about falls within the rectangle or not. As mentioned previously, see Jeremy Cole's explanation for a more thorough overview of how this works.

In your particular case we will need to do the following:

1) Alter the table to be a MyISAM table (note: you shouldn't do this unless you are fully aware of the consequences of such a change, like the lack of transactions and the table-locking behavior associated with MyISAM).

alter table events engine = MyISAM;

2) Next we add the new column that will hold the spatial data. We will use the polygon data type as we need to be able to hold a full rectangle.

alter table events add column time_poly polygon NOT NULL;

3) Next we populate the new column with the data (please keep in mind that any processes that update or insert into the events table will need to be modified to make sure they also populate the new column). Since the start and end ranges are times, we will need to convert them to numbers with the unix_timestamp function (see documentation here for how it works).

update events set time_poly := LINESTRINGFROMWKB(LINESTRING(
    POINT(unix_timestamp(start_time), -1),
    POINT(unix_timestamp(end_time), -1),
    POINT(unix_timestamp(end_time), 1),
    POINT(unix_timestamp(start_time), 1),
    POINT(unix_timestamp(start_time), -1)
));
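For example, new rows would need the same expression applied (a sketch; the timestamps here are just placeholders):

INSERT INTO events (start_time, end_time, time_poly)
VALUES ('2009-02-18 15:00:00', '2009-02-18 17:00:00',
    LINESTRINGFROMWKB(LINESTRING(
        POINT(unix_timestamp('2009-02-18 15:00:00'), -1),
        POINT(unix_timestamp('2009-02-18 17:00:00'), -1),
        POINT(unix_timestamp('2009-02-18 17:00:00'), 1),
        POINT(unix_timestamp('2009-02-18 15:00:00'), 1),
        POINT(unix_timestamp('2009-02-18 15:00:00'), -1)
    )));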

4) Next we add the spatial index to the table (as mentioned previously, this will only work on a MyISAM table; otherwise it will produce the error "ERROR 1464 (HY000): The used table type doesn't support SPATIAL indexes").

alter table events add SPATIAL KEY `IXs_time_poly` (`time_poly`);

5) Next you will need to use the following select in order to make use of the spatial index when querying the data.

SELECT * 
FROM events force index (IXs_time_poly)
WHERE MBRCONTAINS(events.time_poly, POINTFROMWKB(POINT(unix_timestamp('2009-02-18 16:27:12'), 0)));

The force index is there to make 100% sure that MySQL will use the index for the lookup. If everything went well, running an EXPLAIN on the above select should show something similar to the following:

mysql> explain SELECT *
    -> FROM events force index (IXs_time_poly)
    -> WHERE MBRCONTAINS(events.time_poly, POINTFROMWKB(POINT(unix_timestamp('2009-02-18 16:27:12'), 0)));
+----+-------------+-------+-------+---------------+---------------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key           | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+---------------+---------+------+------+-------------+
|  1 | SIMPLE      | B     | range | IXs_time_poly | IXs_time_poly | 32      | NULL |    1 | Using where |
+----+-------------+-------+-------+---------------+---------------+---------+------+------+-------------+
1 row in set (0.00 sec)

Please refer to Jeremy Cole's analysis for details about the performance benefits of this method as compared with a between clause.

Let me know if you have any questions.

Thanks,

-Dipin

Please help optimizing a long running query (left outer join, with 2 derived tables)

Subselects are usually the quickest way to a slow query. I'm not sure exactly what you are trying to do, but you could select, for each BN in pdata, the row with the greatest FPE using the following query:

SELECT p.* FROM pdata p 
LEFT JOIN pdata p0 ON p.BN = p0.BN AND p.FPE < p0.FPE
WHERE p0.BN IS NULL

Similarly, if you had some column in tdata that was unique (or unique among rows with the same BN):

SELECT t.* FROM tdata t 
LEFT JOIN tdata t0 ON t.BN = t0.BN AND t.SOMEUNIQUEKEY != t0.SOMEUNIQUEKEY
WHERE t0.BN IS NULL

Something quirky about subselects in MySQL: they are usually much slower than an equivalent join. I think it's a bug.
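For comparison, the subselect version that the first join replaces would be something like:

SELECT p.*
FROM pdata p
WHERE p.FPE = ( SELECT MAX(p0.FPE)
                FROM pdata p0
                WHERE p0.BN = p.BN );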

EDIT

Hooper, the ichthyologist, was unclear on how the LEFT JOIN pdata p0 ON p.BN = p0.BN ... WHERE p0.BN IS NULL stuff worked. Let me go through it step by step with a much simpler example. You have a table names, with a last name and a first name, and you want to find the unique last names (that is, every last name held by only one person). The data are as follows:

last   first
Smith  Will
Smith  John
Smith  Adam
Jones  John

First, try the LEFT JOIN by itself:

SELECT n1.last, n1.first, n2.last, n2.first FROM names n1 
LEFT JOIN names n2 ON n1.last = n2.last and n1.first != n2.first

that will return

last   first  last   first
Smith  Will   Smith  John
Smith  John   Smith  Will
Smith  Adam   Smith  John
Smith  Will   Smith  Adam
Smith  John   Smith  Adam
Smith  Adam   Smith  Will
Jones  John   NULL   NULL

Notice those NULLs in the last row? That was no boating accident; that is the difference between an ordinary inner join and a left join. An inner join (find all pairs of rows with the same last name and a different first name) would have found the first six rows but ignored the unpaired seventh. The only function of the LEFT JOIN is to pad out with NULLs anything not matched by the ON clause.

Now we pull only that row:

SELECT n1.last, n1.first, n2.last, n2.first FROM names n1 
LEFT JOIN names n2 ON n1.last = n2.last and n1.first != n2.first
WHERE n2.last IS NULL

And (assuming no nulls in the underlying data), we get only the row that the ON clause failed to match.

Ta, and if I may say so, da.


