How to Optimize MySQL Query (Group and Order)


I'm tracking views to different pages, and I want to know the highest page per session, in order to know how far they've clicked through (they're required to view every page all the way to the end) in any given session.

Ordering before grouping is a highly unreliable way to do this.

MySQL extends the GROUP BY syntax: you can use nonaggregated columns that are not named in the GROUP BY clause in the SELECT and ORDER BY clauses.

In this case, an arbitrary value of page is returned for each session.

The documentation explicitly states that you should make no assumptions about which value it will be:

Do not use this feature if the columns you omit from the GROUP BY part are not constant in the group. The server is free to return any value from the group, so the results are indeterminate unless all values are the same.

However, in practice, the values from the first row scanned are returned.

Since you are using ORDER BY page DESC in your subquery, this row happens to be the row with the maximal page per session.

You shouldn't rely on this: the behaviour is undocumented, and if a different row is returned in a future version, it will not be considered a bug.
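For reference, the fragile order-then-group pattern being discussed looks roughly like this (a sketch; the views table and its columns are assumed from the question):

```sql
-- Anti-pattern: relies on undocumented behaviour, do not use.
-- The inner ORDER BY puts the highest page first; the outer GROUP BY
-- then merely *happens* to keep the first row scanned per session.
SELECT session, page
FROM (
    SELECT session, page
    FROM views
    WHERE user_id = '1'
    ORDER BY page DESC
) ordered
GROUP BY session
```

Note that under the default ONLY_FULL_GROUP_BY SQL mode in MySQL 5.7 and later, this query is rejected outright.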

But you don't even have to do such nasty tricks.

Just use aggregate functions:

SELECT MAX(page)
FROM views
WHERE user_id = '1'
GROUP BY session

This is a documented and clean way to do what you want.

Create a composite index on (user_id, session, page) for the query to run faster.
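In CREATE INDEX form (the index name is illustrative):

```sql
-- Covers the WHERE (user_id), the GROUP BY (session), and MAX(page)
CREATE INDEX idx_user_session_page ON views (user_id, session, page);
```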

If you need all columns from your table, not only the aggregated ones, use this syntax:

SELECT v.*
FROM (
    SELECT DISTINCT user_id, session
    FROM views
) vo
JOIN views v
    ON v.id =
    (
        SELECT id
        FROM views vi
        WHERE vi.user_id = vo.user_id
          AND vi.session = vo.session
        ORDER BY page DESC
        LIMIT 1
    )

This assumes that id is a PRIMARY KEY on views.
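On MySQL 8.0 and later, a window function offers another documented way to pick the top row per group (a sketch against the same assumed views table):

```sql
-- One row per (user_id, session): the row with the highest page
SELECT *
FROM (
    SELECT v.*,
           ROW_NUMBER() OVER (PARTITION BY user_id, session
                              ORDER BY page DESC) AS rn
    FROM views v
) t
WHERE rn = 1;
```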

Optimization of group by order by

My answer is rather long-winded, but I hope you will learn several things from it. I also give you two possible improvements.

"Prevent the use of temp tables" and "Prevent 'filesort'". Neither of these is the real goal. The real goal is a faster query.

GROUP BY one_thing
ORDER BY something_else

will always (I think) need at least one temp and filesort, sometimes two. It is simply necessary to achieve your goal.

On the flip side, a temp+filesort needed to support a SELECT is not necessarily a disk-based "file". It is often merely an in-memory set of data (actually a MEMORY table).

Let's look further at what you have:

Filter on a.timestamp -- but a "range"
GROUP BY a.player_id
ORDER BY an aggregate -- not known up front, so no way to use an index.

If the optimizer does things in the order given, it could

  1. use an index starting with timestamp for filtering, and write that to a tmp table
  2. sort to do the GROUP BY
  3. sort again to do the ORDER BY.

(I may be pessimistic about how the GROUP BY processing is done. Use EXPLAIN FORMAT=JSON SELECT... to get more insight.)

You suggested a composite INDEX(timestamp, player_id). Well, that won't be useful since the first part is used in a range. Think of this: You have a long list of people and their birth-years. And you want all those with last name starting with 'B' and you want to group them by birth year. What would be the optimal way to arrange the list so you are not copying things over and sorting them? Then add on sorting by the most common birth year.

Back to the composite index. As a general rule, if you are using the first column in the index in a 'range' context, the rest of the index goes unused.

So, the most useful index for the given query is merely INDEX(timestamp). Correction: INDEX(timestamp, player_id) is better because it is a "covering index", hence avoids reaching into the data. EXPLAIN gives you the clue with Using index.

Please provide SHOW CREATE TABLE for both tables; I am having to guess from here out...

I guess that player has PRIMARY KEY(player_id), correct?

You are using LEFT because buyout queries reference non-existent players? Seems unlikely, so I will guess that you added LEFT for no valid reason.

Also, I'll guess you said COUNT(a.player_id) instead of COUNT(*) for no valid reason.

Once you get rid of the LEFT, we can try another formulation of the query:

SELECT b.player_id,
       ( SELECT COUNT(*)
         FROM buyout_calculator_query
         WHERE player_id = b.player_id
           AND timestamp > 259200
       ) AS views,
       b.firstname, b.lastname, b.link_id
FROM player AS b
ORDER BY views DESC

See if that runs faster. It has a "correlated subquery" but avoids the GROUP BY. Please add this to buyout_calculator_query: INDEX(player_id, timestamp).
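In CREATE INDEX form (the index name is illustrative):

```sql
-- Lets each correlated COUNT(*) be answered from the index alone
CREATE INDEX idx_player_ts ON buyout_calculator_query (player_id, timestamp);
```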

Going a step further, this may (or may not) be better:

SELECT b.player_id, a.views, b.firstname, b.lastname, b.link_id
FROM
    ( SELECT player_id, COUNT(*) AS views
      FROM buyout_calculator_query
      WHERE timestamp > 259200
      GROUP BY player_id
    ) AS a
JOIN player AS b USING(player_id)
ORDER BY a.views DESC

This will be "Using index" if you have INDEX(player_id, timestamp); that is an extra boost from avoiding bouncing between the index and the data. The subquery's GROUP BY needs no filesort, since the index already delivers rows in player_id order; the subquery still materializes a tmp table, and the outer ORDER BY will need a sort.

Optimize SQL query when GROUP BY and ORDER BY expressions are different?

GROUP BY is normally used with functions like SUM() to aggregate records. Your query doesn't seem to require the GROUP BY clause as such. Would the following work better?

SELECT *
FROM product
WHERE Category1 = 'PC'
  AND Category2 = 'desktop'
ORDER BY product_code, reviews DESC, popularity DESC
LIMIT 10

You would create an index to match of course.
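A matching index would put the equality columns first, then the sort columns (the index name is illustrative). Note that the ORDER BY mixes ascending and descending directions, which a single index can only serve fully on MySQL 8.0+ with descending index keys:

```sql
-- DESC key parts are honored from MySQL 8.0 onward; earlier
-- versions parse but ignore them.
CREATE INDEX idx_cat_sort
    ON product (Category1, Category2, product_code, reviews DESC, popularity DESC);
```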

http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html

GROUP BY query optimization

The EXPLAIN verifies the (game, finish, user) index was used in the query. That seems like the best possible index to me. Could it be a hardware issue? What is your system RAM and CPU?

Optimizing ORDER BY

This is a very interesting query. During its optimisation you may discover and understand a lot of new information about how MySQL works. I am not sure that I will have time to write everything in detail at once, but I can update this gradually.

Why it is slow

There are basically two scenarios: a quick one and a slow one.

In the quick scenario you walk over a table in some predefined order, and probably at the same time quickly fetch some data by id for each row from other tables. In this case you stop walking as soon as you have enough rows for your LIMIT clause. Where does the order come from? From a b-tree index that you have on the table, or from the order of a result set in a subquery.

In a slow scenario you do not have that predefined order, and MySQL has to implicitly put all data into a temporary table, sort the table on some field and return the n rows from your LIMIT clause. If any of the fields that you put into that temporary table is of type TEXT (not VARCHAR), MySQL does not even try to keep that table in RAM and flushes and sorts it on disk (hence additional IO processing).

First thing to fix

There are many situations when you cannot build an index that will allow you to follow its order (for example, when you ORDER BY columns from different tables), so the rule of thumb in such situations is to minimise the data that MySQL will put in the temporary table. How can you do it? Select only the identifiers of the rows in a subquery, and once you have the ids, join them back to the table itself and to the other tables to fetch the content. That is, you make a small ordered table and then use the quick scenario. (This slightly contradicts the spirit of SQL in general, but each flavor of SQL has its own means to optimise queries this way.)

Incidentally, your SELECT -- everything is ok here comment looks funny, since the SELECT list is the first place where things are not ok.

SELECT p.*,
       u.name user_name, u.status user_status,
       c.name city_name, t.name town_name, d.name dist_name,
       pm.meta_name, pm.meta_email, pm.meta_phone,
       (SELECT CONCAT('{',
                      '"id":"', pc.id, '",',
                      '"content":"', REPLACE(pc.content, '"', '\\"'), '",',
                      '"date":"', pc.date, '",',
                      '"user_id":"', pcu.id, '",',
                      '"user_name":"', pcu.name, '"}')
        FROM post_comments pc
        LEFT JOIN users pcu ON (pcu.id = pc.user_id)
        WHERE pc.post_id = p.id
        ORDER BY pc.id DESC
        LIMIT 1) AS last_comment
FROM (
    SELECT id
    FROM posts p
    WHERE p.status = 'published'
    ORDER BY
        (CASE WHEN p.created_at >= UNIX_TIMESTAMP(NOW() - INTERVAL p.reputation DAY)
              THEN p.reputation ELSE NULL END) DESC,
        p.id DESC
    LIMIT 0, 10
) ids
JOIN posts p ON ids.id = p.id -- mind the join for the p data
LEFT JOIN users u ON (u.id = p.user_id)
LEFT JOIN citys c ON (c.id = p.city_id)
LEFT JOIN towns t ON (t.id = p.town_id)
LEFT JOIN dists d ON (d.id = p.dist_id)
LEFT JOIN post_metas pm ON (pm.post_id = p.id)
;

That is the first step, and even now you can see that you no longer perform the useless LEFT JOINs and JSON serialisations for the rows you do not need. (I dropped GROUP BY p.id, because I do not see which LEFT JOIN might result in several rows, and you do not do any aggregation.)

yet to write about:

  • indexes
  • reformulate CASE clause (use UNION ALL)
  • probably forcing an index
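For the curious, the UNION ALL reformulation hinted at above could look roughly like this (an untested sketch): each branch of the CASE becomes its own SELECT, and a carried-along sort key restores the combined order.

```sql
SELECT id
FROM (
    -- "fresh" posts: ordered by reputation
    (SELECT id, reputation AS sort_rep
     FROM posts
     WHERE status = 'published'
       AND created_at >= UNIX_TIMESTAMP(NOW() - INTERVAL reputation DAY)
     ORDER BY reputation DESC, id DESC
     LIMIT 10)
    UNION ALL
    -- the rest: NULL sorts after any reputation under DESC
    (SELECT id, NULL
     FROM posts
     WHERE status = 'published'
       AND created_at < UNIX_TIMESTAMP(NOW() - INTERVAL reputation DAY)
     ORDER BY id DESC
     LIMIT 10)
) u
ORDER BY sort_rep DESC, id DESC
LIMIT 10;
```

Each branch takes its own top 10, so the outer LIMIT 10 is always satisfied; note that the per-row reputation inside the date condition still prevents a pure index range scan in either branch.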

MySQL optimization with joins, group by, order by count

Have you tried narrowing your tables before joining them all together?

SELECT
    customer.rep_id AS `ID`,
    COUNT(*) AS `Count`,
    contact.name
FROM (
    SELECT id, rep_id
    FROM customer
    JOIN (
        SELECT customer_id
        FROM appointment
        WHERE appointment.date >= '2017-05-01'
          AND appointment.date < '2017-06-01'
          AND appointment.current = 1
          AND appointment.`status` = 'completed'
    ) AS appointment
        ON customer.id = appointment.customer_id
    WHERE customer.rep_id != 0
      AND customer.saved = 0
      AND customer.deleted = 0
) AS customer
JOIN contact
    ON customer.rep_id = contact.user_id
JOIN (
    SELECT id
    FROM user
    WHERE user.active = 1
      AND user.deleted = 0
) AS user
    ON contact.user_id = user.id
GROUP BY customer.rep_id
ORDER BY `Count` DESC
LIMIT 50

Optimizing MySQL query using GROUP BY on time functions

Create a single composite index on (jobid, start, location, step),

then GROUP BY in that order first, and sort:

SELECT location, step, COUNT(*), AVG(foo), YEAR(start), MONTH(start), DAY(start)
FROM table
WHERE jobid = 'xxx'
  AND start BETWEEN '2010-01-01' AND '2010-01-08'
GROUP BY YEAR(start), MONTH(start), DAY(start), location, step
ORDER BY location, step, YEAR(start), MONTH(start), DAY(start)

UPDATE

It looks like MySQL cannot use the index when the YEAR, MONTH and DAY functions are used, since:

  1. After removing start from the WHERE clause, EXPLAIN still shows Using filesort.
  2. Adding 3 columns y = YEAR(start), m = MONTH(start), d = DAY(start), creating an index on (jobid, y, m, d, location, step), and updating the query to WHERE ... AND y = 2010 AND m = 12 AND d BETWEEN 1 AND 08 does remove the Using temporary; Using filesort.

Keeping 3 extra columns seems like a bad idea, since the performance difference in the GROUP BY shouldn't matter that much whether it uses a temporary table or not.
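If the extra columns do turn out to help, MySQL 5.7+ generated columns can keep them maintained automatically, avoiding the manual bookkeeping (a sketch; the table name jobs stands in for the table shown above, and the column and index names are illustrative):

```sql
-- STORED generated columns are kept in sync by MySQL itself
-- and can be indexed like ordinary columns.
ALTER TABLE jobs
    ADD COLUMN y SMALLINT AS (YEAR(start)) STORED,
    ADD COLUMN m TINYINT  AS (MONTH(start)) STORED,
    ADD COLUMN d TINYINT  AS (DAY(start)) STORED,
    ADD INDEX idx_job_ymd (jobid, y, m, d, location, step);
```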

Optimizing Left Join With Group By and Order By (MariaDb)

how about:

WITH a AS (
    SELECT u.id, u.display_name, u.cell_phone, u.email
    FROM users u
    WHERE u.is_deleted = 0
    GROUP BY u.id
    LIMIT 0, 10
)
SELECT a.id, a.display_name, a.cell_phone, a.email,
       uv.year, uv.make, uv.model, uv.id AS user_vehicle_id
FROM a
LEFT JOIN user_vehicles uv
    ON uv.user_id = a.id AND uv.current_owner = 1
ORDER BY a.display_name;

The intention is to take a subset of users before joining it with user_vehicles.
Disclaimer: I haven't verified whether it's faster, but I have had similar experiences in the past where this helps.


