SQL Query Previous Row Optimisation

SELECT t1.FileName, t1.CreatedDate, t2.CreatedDate AS PrevCreatedDate
FROM
  (SELECT FileName, CreatedDate,
          ROW_NUMBER() OVER (PARTITION BY FileName ORDER BY CreatedDate) AS OrderNo
   FROM MyTable) t1
LEFT JOIN
  (SELECT FileName, CreatedDate,
          ROW_NUMBER() OVER (PARTITION BY FileName ORDER BY CreatedDate) AS OrderNo
   FROM MyTable) t2
  ON (t1.FileName = t2.FileName AND t1.OrderNo = t2.OrderNo + 1)

Or, better, use WITH, since the two subqueries are identical:

WITH t(ObjectID, FileName, CreatedDate, OrderNo) AS
  (SELECT ObjectID, FileName, CreatedDate,
          ROW_NUMBER() OVER (PARTITION BY FileName ORDER BY CreatedDate) AS OrderNo
   FROM MyTable)
SELECT t1.ObjectID, t1.FileName, t1.CreatedDate, t2.CreatedDate AS PrevCreatedDate,
       DATEDIFF(ss, '1900-01-01 00:00:00',
                COALESCE(t1.CreatedDate - t2.CreatedDate, 0)) AS secondsTaken
FROM t t1 LEFT JOIN t t2
  ON (t1.FileName = t2.FileName AND t1.OrderNo = t2.OrderNo + 1)
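
On SQL Server 2012+ (or any database with window functions), the self-join can be avoided entirely with LAG(). A minimal sketch, assuming the same MyTable columns; note that secondsTaken is NULL (not 0) for the first row per file, unlike the COALESCE variant above:

SELECT ObjectID, FileName, CreatedDate,
       LAG(CreatedDate) OVER (PARTITION BY FileName ORDER BY CreatedDate) AS PrevCreatedDate,
       -- NULL for the first row of each FileName, where there is no previous row
       DATEDIFF(ss,
                LAG(CreatedDate) OVER (PARTITION BY FileName ORDER BY CreatedDate),
                CreatedDate) AS secondsTaken
FROM MyTable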

Optimizing Query. Want to pick the last record without using max in sub-query

It's likely slow because it's running a correlated subquery for every row of the outer query. There are two solutions that tend to run more efficiently.

One is to use a derived table, which uses a subquery, but it only executes the subquery once to prepare the derived table.

SELECT B.RECORDID, A.ITEMCODE, A.ITEMNAME, A.STOCKINHAND, B.SALEPRICE
FROM ITEMMASTER A
JOIN STOCKENTRY B ON A.ITEMID = B.ITEMID
JOIN (SELECT ITEMID, MAX(RECORDID) AS MAXRECORDID
      FROM STOCKENTRY
      GROUP BY ITEMID) M
  ON (M.ITEMID, M.MAXRECORDID) = (B.ITEMID, B.RECORDID)
WHERE A.STOCKINHAND > 0
  AND B.SALEPRICE > 0
  AND B.INVOICEDATE IS NOT NULL
ORDER BY A.ITEMNAME, B.INVOICEDATE;

The other solution is to use an exclusion join to find the row in B such that no other row exists with the same ITEMID and a greater RECORDID. With correct indexes (e.g. a compound index on (ITEMID, RECORDID); a DDL sketch follows the query), this should perform very well.

SELECT B.RECORDID, A.ITEMCODE, A.ITEMNAME, A.STOCKINHAND, B.SALEPRICE
FROM ITEMMASTER A
JOIN STOCKENTRY B ON A.ITEMID = B.ITEMID
LEFT OUTER JOIN STOCKENTRY B2
  ON B.ITEMID = B2.ITEMID AND B.RECORDID < B2.RECORDID
WHERE B2.ITEMID IS NULL
  AND A.STOCKINHAND > 0
  AND B.SALEPRICE > 0
  AND B.INVOICEDATE IS NOT NULL
ORDER BY A.ITEMNAME, B.INVOICEDATE;
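
The compound index mentioned above could be created like this (the index name is illustrative):

CREATE INDEX IDX_STOCKENTRY_ITEM_RECORD ON STOCKENTRY (ITEMID, RECORDID);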

This type of problem comes up frequently on Stack Overflow. I've added the greatest-n-per-group tag to the question so you can see other cases.


Re @RPK's comment:

I don't use MySQL QB myself, and that app has changed so many times I can't advise on how to use it. But in the mysql monitor (command-line), I use a combination of EXPLAIN and PROFILING to give me stats.

However, you made a comment about not being able to modify (or create?) indexes. That's going to hamstring your attempts to optimize.

Optimizing a SQL query by only looking in the last X rows (Not simply LIMIT)

In a database you cannot write a query that checks the "last x rows". A relational database does not guarantee that rows are physically stored in a specific order, and therefore SQL does not let you express that. If you can translate "the last x rows" into an actual constraint based on the data contained in the rows, then it becomes possible.
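
For example, with an auto-increment id column, "the last 1000 rows" can be approximated by a constraint on id. A sketch; the table name is a placeholder, and gaps in the id sequence make this approximate:

SELECT *
FROM mytable
WHERE id > (SELECT MAX(id) FROM mytable) - 1000;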

Taking your example, the worst operation the database has to do is sorting the full result set before returning the data. This happens regardless of the LIMIT clause, because only after all the rows have been sorted do you know which ones have the highest ids.

However, if there is an index on (columna, id), in that order, the database engine can use the index, which is already sorted, to go through the rows much faster, resulting in a faster response time.
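
A sketch of such an index; the table and index names are placeholders:

CREATE INDEX idx_columna_id ON mytable (columna, id);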

optimize select rows based on previous select result

First, the distinct in the subquery should be unnecessary. I'm not sure if MySQL optimizes it away. So, start with:

SELECT DISTINCT f.`value`
FROM `fields` f
WHERE f.aid = 8
  AND f.product_id IN (SELECT f2.product_id
                       FROM `fields` f2
                       WHERE f2.aid = 6
                         AND f2.`value` = 3);

For this query, you want an index on fields(aid, value, product_id).
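
A sketch of that index (the index name is a placeholder):

CREATE INDEX idx_fields_aid_value_product ON `fields` (aid, `value`, product_id);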

In earlier versions of MySQL, it would be better to replace the IN subquery with EXISTS. If your query finishes in one second now, then you are probably on a more recent version.
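
For reference, a sketch of the EXISTS form for older versions (same logic, same fields table):

SELECT DISTINCT f.`value`
FROM `fields` f
WHERE f.aid = 8
  AND EXISTS (SELECT 1
              FROM `fields` f2
              WHERE f2.product_id = f.product_id
                AND f2.aid = 6
                AND f2.`value` = 3);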

Optimize updating first, last, and second to last ranked value

Since we have the all-important index on (user_id, created_at), I suggest:

UPDATE users u
SET    first_at    = h.first_at
     , latest_at   = h.latest_at
     , previous_at = h.previous_at
FROM  (
   SELECT u.id, f.first_at, l.last[1] AS latest_at, l.last[2] AS previous_at
   FROM   users u
   CROSS  JOIN LATERAL (
      SELECT ARRAY (
         SELECT h.created_at
         FROM   history h
         WHERE  h.user_id = u.id
         AND    h.type = 'SomeType'  -- ??
         ORDER  BY h.created_at DESC
         LIMIT  2
         ) AS last
      ) l
   CROSS  JOIN LATERAL (
      SELECT created_at AS first_at
      FROM   history h
      WHERE  h.user_id = u.id
      AND    h.type = 'SomeType'  -- ??
      ORDER  BY created_at
      LIMIT  1
      ) f
   WHERE  u.id BETWEEN $1 AND $2
   ) h
WHERE  u.id = h.id
AND   (u.first_at    IS DISTINCT FROM h.first_at
    OR u.latest_at   IS DISTINCT FROM h.latest_at
    OR u.previous_at IS DISTINCT FROM h.previous_at);

This works with non-unique timestamps per user_id, too.

And it's very efficient if there are many rows per user. It's designed to avoid a sequential scan on the big table and make heavy use of the index on (user_id, created_at) instead.
Related:

  • Optimize GROUP BY query to retrieve latest row per user

Assuming most or all users get updated this way, we don't need an index on users. (For the purpose of this UPDATE, no index would be best.)

If there is only a single row in table history for a user, then previous_at is set to NULL. (Your original query has the same effect.)

Only users for which qualifying history rows are found get updated.

This added WHERE clause skips empty updates that would not change anything (but would otherwise be carried out at full cost):

AND   (u.first_at    IS DISTINCT FROM h.first_at
    OR u.latest_at   IS DISTINCT FROM h.latest_at
    OR u.previous_at IS DISTINCT FROM h.previous_at)

See:

  • How do I (or can I) SELECT DISTINCT on multiple columns?

The only uncertainty is the predicate WHERE type = 'SomeType'. If it's selective, a partial index with the same predicate would be better. Then we could even get index-only scans ...
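
A sketch of such a partial index (the index name is a placeholder):

CREATE INDEX history_sometype_idx ON history (user_id, created_at)
WHERE type = 'SomeType';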

Since the new query should be much faster, you might update more (or all) users at once.

Get latest row by checking a value in previous row

This won't win any medals as far as performance is concerned but...

select *
from t
where status = 'failed'
and coalesce((
    -- previous status
    -- coalesce is needed when there is no previous row
    select status
    from t as x
    where x.job_name = t.job_name
      and x.start_date < t.start_date
    order by x.start_date desc
    limit 1
    ), 'success') = 'success'
and not exists (
    -- no next record exists
    select 1
    from t as x
    where x.job_name = t.job_name
      and x.start_date > t.start_date
    )

SQL: selecting rows where column value changed from previous row

SELECT a.*
FROM tableX AS a
WHERE a.StatusA <>
      ( SELECT b.StatusA
        FROM tableX AS b
        WHERE a.System = b.System
          AND a.Timestamp > b.Timestamp
        ORDER BY b.Timestamp DESC
        LIMIT 1
      )

But you can try this as well (with an index on (System, Timestamp)):

SELECT System, Timestamp, StatusA, StatusB
FROM
  ( SELECT (@statusPre <> StatusA AND @systemPre = System) AS statusChanged
         , System, Timestamp, StatusA, StatusB
         , @statusPre := StatusA
         , @systemPre := System
    FROM tableX
       , (SELECT @statusPre := NULL, @systemPre := NULL) AS d
    ORDER BY System
           , Timestamp
  ) AS good
WHERE statusChanged ;
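
On MySQL 8.0+, window functions make the user-variable trick unnecessary. A sketch with LAG() against the same tableX; the first row per System has a NULL prevStatusA and is filtered out, matching the behavior above:

SELECT System, Timestamp, StatusA, StatusB
FROM ( SELECT System, Timestamp, StatusA, StatusB
            , LAG(StatusA) OVER (PARTITION BY System ORDER BY Timestamp) AS prevStatusA
       FROM tableX
     ) AS t
WHERE StatusA <> prevStatusA;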

How to optimize query to compute row-dependent datetime relationships?

First, if referential integrity is enforced with FK constraints, you can drop the patient table from the query completely:

SELECT COUNT(DISTINCT patient_id)  -- still not optimal
FROM event a
JOIN event o USING (patient_id)
JOIN event m USING (patient_id)
WHERE a.category = 'admission'
  AND o.category = 'operation'
  AND m.category = 'medication'
  AND m.date > o.date
  AND o.date > a.date;

Next, get rid of the repeated multiplication of rows and the DISTINCT to counter that in the outer SELECT by using EXISTS semi-joins instead:

SELECT COUNT(*)
FROM event a
WHERE EXISTS (
   SELECT FROM event o
   WHERE  o.patient_id = a.patient_id
   AND    o.category = 'operation'
   AND    o.date > a.date
   AND    EXISTS (
      SELECT FROM event m
      WHERE  m.patient_id = a.patient_id
      AND    m.category = 'medication'
      AND    m.date > o.date
      )
   )
AND a.category = 'admission';

Note, there can still be duplicates in the admission, but that's probably a principal problem in your data model / query design, and would need clarification as discussed in the comments.

If you indeed want to lump all cases of the same patient together for some reason, there are various ways to get the earliest admission for each patient in the initial step - and repeat a similar approach for every additional step. Probably fastest for your case (re-introducing the patient table to the query):

SELECT count(*)
FROM patient p
CROSS JOIN LATERAL (  -- get earliest admission
   SELECT e.date
   FROM   event e
   WHERE  e.patient_id = p.id
   AND    e.category = 'admission'
   ORDER  BY e.date
   LIMIT  1
   ) a
CROSS JOIN LATERAL (  -- get earliest operation after that
   SELECT e.date
   FROM   event e
   WHERE  e.patient_id = p.id
   AND    e.category = 'operation'
   AND    e.date > a.date
   ORDER  BY e.date
   LIMIT  1
   ) o
WHERE EXISTS (  -- the *last* step can still be a plain EXISTS
   SELECT FROM event m
   WHERE  m.patient_id = p.id
   AND    m.category = 'medication'
   AND    m.date > o.date
   );

See:

  • Select first row in each GROUP BY group?
  • Optimize GROUP BY query to retrieve latest record per user

You might optimize your table design by shortening the lengthy (and redundant) category names. Use a lookup table and only store an integer (or even int2 or "char") as FK.

For best performance (and this is crucial) have a multicolumn index on (patient_id, category, date DESC) and make sure all three columns are defined NOT NULL. The order of index expressions is important. DESC is mostly optional here; Postgres can use the index with default ASC sort order almost as efficiently in your case.
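
A sketch of that index (the index name is a placeholder):

CREATE INDEX event_patient_category_date_idx
ON event (patient_id, category, date DESC);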

If VACUUM (preferably in the form of autovacuum) can keep up with write operations or you have a read-only situation to begin with, you'll get very fast index-only scans out of this.

Related:

  • Optimizing queries on a range of timestamps (two columns)
  • Select Items that has one item but not the other
  • How does PostgreSQL perform ORDER BY if a b-tree index is built on that field?

To implement your additional time frames (your "advanced use case"), build on the second query since we have to consider all events again.

You should really have case IDs or something more definitive to tie operation to admission and medication to operation etc. where relevant. (Could simply be the id of the referenced event!) Dates / timestamps alone are error-prone.

SELECT COUNT(*)                 -- to count cases
-- COUNT(DISTINCT patient_id)   -- to count patients
FROM event a
WHERE EXISTS (
   SELECT FROM event o
   WHERE  o.patient_id = a.patient_id
   AND    o.category = 'operation'
   AND    o.date >= a.date      -- or ">"
   AND    o.date <  a.date + 7  -- based on data type "date"!
   AND    EXISTS (
      SELECT FROM event m
      WHERE  m.patient_id = a.patient_id
      AND    m.category = 'medication'
      AND    m.date >= o.date       -- or ">"
      AND    m.date <  o.date + 30  -- syntax for timestamp is different
      )
   )
AND a.category = 'admission';
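
For timestamp columns, the range conditions flagged in the comments above would use interval arithmetic instead. A sketch of the changed predicates:

AND m.date >= o.date
AND m.date <  o.date + interval '30 days'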

About date / timestamp arithmetic:

  • How to get the end of a day?

