SQL Query Previous Row Optimisation
SELECT t1.FileName, t1.CreatedDate, t2.CreatedDate AS PrevCreatedDate
FROM
(SELECT FileName, CreatedDate,
ROW_NUMBER() OVER(PARTITION BY FileName ORDER BY CreatedDate) AS OrderNo
FROM MyTable) t1
LEFT JOIN
(SELECT FileName, CreatedDate,
ROW_NUMBER() OVER(PARTITION BY FileName ORDER BY CreatedDate) AS OrderNo
FROM MyTable) t2
ON (t1.FileName = t2.FileName AND t1.OrderNo = t2.OrderNo + 1)
Or better, use a WITH clause (CTE), since the two subqueries are identical:
WITH t(ObjectID, FileName, CreatedDate, OrderNo) AS
(SELECT ObjectID, FileName, CreatedDate,
ROW_NUMBER() OVER(PARTITION BY FileName ORDER BY CreatedDate) AS OrderNo
FROM MyTable)
SELECT t1.ObjectID, t1.FileName, t1.CreatedDate, t2.CreatedDate AS PrevCreatedDate,
COALESCE(DATEDIFF(ss, t2.CreatedDate, t1.CreatedDate), 0) AS secondsTaken
FROM t t1 LEFT JOIN t t2
ON (t1.FileName = t2.FileName AND t1.OrderNo = t2.OrderNo + 1)
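On engines that support window functions (SQL Server 2012+, Postgres, MySQL 8, SQLite 3.25+), the self-join can be replaced entirely with LAG(). A minimal sketch using SQLite; the table and column names follow the example above, but the rows are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE MyTable (FileName TEXT, CreatedDate TEXT)")
con.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [("a.txt", "2020-01-01 10:00:00"),
     ("a.txt", "2020-01-01 10:05:30"),
     ("b.txt", "2020-01-02 09:00:00")],
)

rows = con.execute("""
    SELECT FileName,
           CreatedDate,
           -- previous row's date within the same file, NULL for the first row
           LAG(CreatedDate) OVER (PARTITION BY FileName
                                  ORDER BY CreatedDate) AS PrevCreatedDate,
           -- seconds since the previous row (NULL when there is no previous row)
           CAST(strftime('%s', CreatedDate) AS INTEGER)
             - CAST(strftime('%s',
                     LAG(CreatedDate) OVER (PARTITION BY FileName
                                            ORDER BY CreatedDate)) AS INTEGER)
             AS secondsTaken
    FROM MyTable
    ORDER BY FileName, CreatedDate
""").fetchall()
for r in rows:
    print(r)
```

One pass over the index, no self-join, no ROW_NUMBER bookkeeping.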
Optimizing Query. Want to pick the last record without using max in sub-query
It's likely slow because it's running a correlated subquery for every row of the outer query. There are two solutions that tend to run more efficiently.
One is to use a derived table, which uses a subquery, but it only executes the subquery once to prepare the derived table.
SELECT B.RECORDID, A.ITEMCODE, A.ITEMNAME, A.STOCKINHAND, B.SALEPRICE
FROM ITEMMASTER A
JOIN STOCKENTRY B ON A.ITEMID = B.ITEMID
JOIN (SELECT ITEMID, MAX(RECORDID) AS MAXRECORDID
FROM STOCKENTRY GROUP BY ITEMID) M
ON (M.ITEMID, M.MAXRECORDID) = (B.ITEMID, B.RECORDID)
WHERE A.STOCKINHAND > 0
AND B.SALEPRICE > 0
AND B.INVOICEDATE IS NOT NULL
ORDER BY A.ITEMNAME, B.INVOICEDATE;
The other solution is to use an exclusion join to find the row in B such that no other row exists with the same itemid and a greater recordid. With the correct indexes (e.g. a compound index on (ITEMID, RECORDID)), this should perform very well.
SELECT B.RECORDID, A.ITEMCODE, A.ITEMNAME, A.STOCKINHAND, B.SALEPRICE
FROM ITEMMASTER A
JOIN STOCKENTRY B ON A.ITEMID = B.ITEMID
LEFT OUTER JOIN STOCKENTRY B2
ON B.ITEMID = B2.ITEMID AND B.RECORDID < B2.RECORDID
WHERE B2.ITEMID IS NULL
AND A.STOCKINHAND > 0
AND B.SALEPRICE > 0
AND B.INVOICEDATE IS NOT NULL
ORDER BY A.ITEMNAME, B.INVOICEDATE;
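Both approaches can be checked against toy data. A sketch in SQLite with invented rows (the WHERE filters from the original query are dropped here, since the point is only that the two formulations pick the same "latest stock entry per item"):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ITEMMASTER (ITEMID INTEGER, ITEMCODE TEXT, ITEMNAME TEXT, STOCKINHAND INTEGER);
CREATE TABLE STOCKENTRY (RECORDID INTEGER, ITEMID INTEGER, SALEPRICE REAL, INVOICEDATE TEXT);
INSERT INTO ITEMMASTER VALUES (1,'C1','Widget',5),(2,'C2','Gadget',3);
INSERT INTO STOCKENTRY VALUES (10,1,9.5,'2020-01-01'),(11,1,9.9,'2020-02-01'),
                              (20,2,4.0,'2020-01-15');
""")

# Derived-table approach: compute MAX(RECORDID) per item once, then join back.
derived = con.execute("""
    SELECT B.RECORDID, A.ITEMNAME, B.SALEPRICE
    FROM ITEMMASTER A
    JOIN STOCKENTRY B ON A.ITEMID = B.ITEMID
    JOIN (SELECT ITEMID, MAX(RECORDID) AS MAXRECORDID
          FROM STOCKENTRY GROUP BY ITEMID) M
      ON M.ITEMID = B.ITEMID AND M.MAXRECORDID = B.RECORDID
    ORDER BY A.ITEMNAME
""").fetchall()

# Exclusion-join approach: keep rows for which no later row exists.
exclusion = con.execute("""
    SELECT B.RECORDID, A.ITEMNAME, B.SALEPRICE
    FROM ITEMMASTER A
    JOIN STOCKENTRY B ON A.ITEMID = B.ITEMID
    LEFT OUTER JOIN STOCKENTRY B2
      ON B.ITEMID = B2.ITEMID AND B.RECORDID < B2.RECORDID
    WHERE B2.ITEMID IS NULL
    ORDER BY A.ITEMNAME
""").fetchall()

print(derived)
```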
This type of problem comes up frequently on Stack Overflow. I've added the greatest-n-per-group tag to the question so you can see other cases.
Re @RPK's comment:
I don't use MySQL QB myself, and that app has changed so many times I can't advise on how to use it. But in the mysql monitor (command-line), I use a combination of EXPLAIN and PROFILING to give me stats.
However, you made a comment about not being able to modify (or create?) indexes. That's going to hamstring your attempts to optimize.
Optimizing a SQL query by only looking in the last X rows (Not simply LIMIT)
In a database you cannot do a query that checks the "last x rows". A relational database does not guarantee that the rows are physically stored in a specific order. And therefore SQL will not allow you to express that. If you can translate that into an actual constraint based on the data contained in the rows then that would be possible to achieve.
Taking your example, the worst operation the database has to do is sorting the full result set before returning the data. This is regardless of the LIMIT clause, because only after it has run through all the rows and sorted them does it know which rows have the highest ids.
However, if there is an index on (columna, id), in that order, the database engine can use the index, which is already sorted, to step through the rows much faster, resulting in a faster response time.
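A sketch of that effect in SQLite (table and column names are invented): with a composite index on (columna, id), a "WHERE columna = ? ORDER BY id DESC LIMIT n" query can be answered by walking the index backwards instead of sorting the whole filtered set.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, columna TEXT)")
# 1000 rows, alternating 'y' (even i) and 'x' (odd i); ids are assigned 1..1000
con.executemany("INSERT INTO tbl (columna) VALUES (?)",
                [("x",) if i % 2 else ("y",) for i in range(1000)])
con.execute("CREATE INDEX idx_tbl_columna_id ON tbl (columna, id)")

# Newest 3 ids for columna = 'x'
latest = [r[0] for r in con.execute(
    "SELECT id FROM tbl WHERE columna = 'x' ORDER BY id DESC LIMIT 3")]
print(latest)

# The query plan should show an index search, not a full scan + sort
plan = " ".join(r[3] for r in con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM tbl WHERE columna = 'x' ORDER BY id DESC LIMIT 3"))
print(plan)
```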
optimize select rows based on previous select result
First, the DISTINCT in the subquery should be unnecessary. I'm not sure if MySQL optimizes it away. So, start with:
SELECT DISTINCT f.`value`
FROM `fields` f
WHERE f.aid = 8 AND
f.product_id IN (SELECT f2.product_id
FROM `fields` f2
WHERE f2.aid = 6 AND f2.`value` = 3
);
For this query, you want an index on fields(aid, value, product_id).
In earlier versions of MySQL, it would be better to replace the IN subquery with EXISTS. If your query finishes in one second now, then you are probably on a more recent version.
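A sketch of both forms on invented toy data (SQLite here, but the rewrite is the same in MySQL): rows with aid = 6 and value = '3' tag the products, then we pull the aid = 8 values for those products.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fields (product_id INTEGER, aid INTEGER, value TEXT)")
con.executemany("INSERT INTO fields VALUES (?,?,?)", [
    (1, 6, '3'), (1, 8, 'red'),
    (2, 6, '3'), (2, 8, 'blue'),
    (3, 6, '9'), (3, 8, 'green'),   # aid=6 value is '9', so product 3 is excluded
])
con.execute("CREATE INDEX ix_fields ON fields (aid, value, product_id)")

# IN form, as in the answer above
in_form = sorted(r[0] for r in con.execute("""
    SELECT DISTINCT f.value
    FROM fields f
    WHERE f.aid = 8
      AND f.product_id IN (SELECT f2.product_id
                           FROM fields f2
                           WHERE f2.aid = 6 AND f2.value = '3')
"""))

# EXISTS rewrite, the classic alternative for older MySQL versions
exists_form = sorted(r[0] for r in con.execute("""
    SELECT DISTINCT f.value
    FROM fields f
    WHERE f.aid = 8
      AND EXISTS (SELECT 1 FROM fields f2
                  WHERE f2.product_id = f.product_id
                    AND f2.aid = 6 AND f2.value = '3')
"""))
print(in_form)
```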
Optimize updating first, last, and second to last ranked value
Since we have the all-important index on (user_id, created_at), I suggest:
UPDATE users u
SET first_at = h.first_at
, latest_at = h.latest_at
, previous_at = h.previous_at
FROM (
SELECT u.id, f.first_at, l.last[1] AS latest_at, l.last[2] AS previous_at
FROM users u
CROSS JOIN LATERAL (
SELECT ARRAY (
SELECT h.created_at
FROM history h
WHERE h.user_id = u.id
AND h.type = 'SomeType' -- ??
ORDER BY h.created_at DESC
LIMIT 2
) AS last
) l
CROSS JOIN LATERAL (
SELECT created_at AS first_at
FROM history h
WHERE h.user_id = u.id
AND h.type = 'SomeType' -- ??
ORDER BY created_at
LIMIT 1
) f
WHERE u.id BETWEEN $1 AND $2
) h
WHERE u.id = h.id
AND (u.first_at IS DISTINCT FROM h.first_at
OR u.latest_at IS DISTINCT FROM h.latest_at
OR u.previous_at IS DISTINCT FROM h.previous_at);
This works with non-unique timestamps per user_id, too.
And it's very efficient if there are many rows per user: it's designed to avoid a sequential scan on the big table and make heavy use of the index on (user_id, created_at) instead.
Related:
- Optimize GROUP BY query to retrieve latest row per user
Assuming most or all users get updated this way, we don't need an index on users. (For the purpose of this UPDATE, no index would be best.)
If there is only a single row in table history for a user, then previous_at is set to NULL. (Your original query has the same effect.)
Only users for which qualifying history rows are found are updated.
This added WHERE clause skips updates that would not change anything (and would otherwise be carried out at full cost):
AND (u.first_at IS DISTINCT FROM h.first_at
OR u.latest_at IS DISTINCT FROM h.latest_at
OR u.previous_at IS DISTINCT FROM h.previous_at)
See:
- How do I (or can I) SELECT DISTINCT on multiple columns?
The only uncertainty is the WHERE type = 'SomeType' predicate. If that's selective, a partial index with the same predicate would be better. Then we could even get index-only scans ...
Since the new query should be much faster, you might update more (or all) users at once.
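The LATERAL/array trick above is Postgres-specific. For illustration only, here is a window-function analog of the same per-user aggregates (first, latest, and second-latest timestamp), runnable on SQLite 3.25+; the schema follows the answer, the data is invented. This is a different technique, not a translation of the UPDATE itself:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE history (user_id INTEGER, created_at TEXT)")
con.executemany("INSERT INTO history VALUES (?,?)", [
    (1, '2020-01-01'), (1, '2020-02-01'), (1, '2020-03-01'),
    (2, '2020-05-01'),   # single row: previous_at stays NULL
])

rows = con.execute("""
    SELECT user_id,
           MIN(created_at)                           AS first_at,
           MAX(created_at)                           AS latest_at,
           -- rn = 2 is the second-newest row per user, if any
           MAX(CASE WHEN rn = 2 THEN created_at END) AS previous_at
    FROM (SELECT user_id, created_at,
                 ROW_NUMBER() OVER (PARTITION BY user_id
                                    ORDER BY created_at DESC) AS rn
          FROM history) AS s
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()
print(rows)
```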
Get latest row by checking a value in previous row
This won't win any medals as far as performance is concerned but...
select *
from t
where status = 'failed'
and coalesce((
-- previous status
-- coalesce is needed when there is no previous row
select status
from t as x
where x.job_name = t.job_name and x.start_date < t.start_date
order by x.start_date desc
limit 1
), 'success') = 'success'
and not exists (
-- no next record exists
select 1
from t as x
where x.job_name = t.job_name and x.start_date > t.start_date
)
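The query above, run on invented toy data in SQLite: it picks each job's latest run, but only where that run failed and the run before it (if any) succeeded.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (job_name TEXT, start_date TEXT, status TEXT)")
con.executemany("INSERT INTO t VALUES (?,?,?)", [
    ('etl',  '2020-01-01', 'success'),
    ('etl',  '2020-01-02', 'failed'),   # qualifies: latest run, previous succeeded
    ('load', '2020-01-01', 'failed'),   # has a later run, so excluded
    ('load', '2020-01-02', 'failed'),   # latest, but previous run also failed
    ('new',  '2020-01-01', 'failed'),   # no previous run: COALESCE supplies 'success'
])

rows = con.execute("""
    select *
    from t
    where status = 'failed'
    and coalesce((
        select status
        from t as x
        where x.job_name = t.job_name and x.start_date < t.start_date
        order by x.start_date desc
        limit 1
    ), 'success') = 'success'
    and not exists (
        select 1
        from t as x
        where x.job_name = t.job_name and x.start_date > t.start_date
    )
    order by job_name
""").fetchall()
print(rows)
```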
SQL: selecting rows where column value changed from previous row
SELECT a.*
FROM tableX AS a
WHERE a.StatusA <>
( SELECT b.StatusA
FROM tableX AS b
WHERE a.System = b.System
AND a.Timestamp > b.Timestamp
ORDER BY b.Timestamp DESC
LIMIT 1
)
But you can try this as well (with an index on (System, Timestamp)):
SELECT System, Timestamp, StatusA, StatusB
FROM
( SELECT (@statusPre <> statusA AND @systemPre=System) AS statusChanged
, System, Timestamp, StatusA, StatusB
, @statusPre := StatusA
, @systemPre := System
FROM tableX
, (SELECT @statusPre:=NULL, @systemPre:=NULL) AS d
ORDER BY System
, Timestamp
) AS good
WHERE statusChanged ;
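On MySQL 8+ (where user variables in SELECT are deprecated) or any engine with window functions, LAG() expresses the same idea more reliably. A sketch on invented data in SQLite 3.25+: keep rows whose StatusA differs from the previous row of the same System.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE tableX (System TEXT, Timestamp TEXT, StatusA TEXT, StatusB TEXT)")
con.executemany("INSERT INTO tableX VALUES (?,?,?,?)", [
    ('s1', '2020-01-01 00:00', 'up',   'ok'),
    ('s1', '2020-01-01 01:00', 'up',   'ok'),   # unchanged: filtered out
    ('s1', '2020-01-01 02:00', 'down', 'ok'),   # changed: kept
    ('s2', '2020-01-01 00:30', 'up',   'ok'),   # first row of s2: no previous row
])

changed = con.execute("""
    SELECT System, Timestamp, StatusA
    FROM (SELECT System, Timestamp, StatusA,
                 LAG(StatusA) OVER (PARTITION BY System
                                    ORDER BY Timestamp) AS prevStatus
          FROM tableX) AS s
    WHERE prevStatus IS NOT NULL AND StatusA <> prevStatus
    ORDER BY System, Timestamp
""").fetchall()
print(changed)
```

Like the correlated-subquery version, rows without a previous row are excluded; include `prevStatus IS NULL` in the WHERE clause instead if each system's first row should count as a change.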
How to optimize query to compute row-dependent datetime relationships?
First, if referential integrity is enforced with FK constraints, you can drop the patient
table from the query completely:
SELECT COUNT(DISTINCT patient) -- still not optimal
FROM event a
JOIN event o USING (patient_id)
JOIN event m USING (patient_id)
WHERE a.category = 'admission'
AND o.category = 'operation'
AND m.category = 'medication'
AND m.date > o.date
AND o.date > a.date;
Next, get rid of the repeated multiplication of rows, and of the DISTINCT in the outer SELECT that counters it, by using EXISTS semi-joins instead:
SELECT COUNT(*)
FROM event a
WHERE EXISTS (
SELECT FROM event o
WHERE o.patient_id = a.patient_id
AND o.category = 'operation'
AND o.date > a.date
AND EXISTS (
SELECT FROM event m
WHERE m.patient_id = a.patient_id
AND m.category = 'medication'
AND m.date > o.date
)
)
AND a.category = 'admission';
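The nested EXISTS chain, demonstrated on invented toy data. SQLite is used here, which unlike Postgres requires a non-empty select list, hence `SELECT 1` instead of the bare `SELECT FROM`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE event (patient_id INTEGER, category TEXT, date TEXT)")
con.executemany("INSERT INTO event VALUES (?,?,?)", [
    (1, 'admission',  '2020-01-01'),
    (1, 'operation',  '2020-01-05'),
    (1, 'medication', '2020-01-06'),   # full admission -> operation -> medication chain
    (2, 'admission',  '2020-02-01'),
    (2, 'medication', '2020-02-02'),   # no operation in between: not counted
])

n = con.execute("""
    SELECT COUNT(*)
    FROM event a
    WHERE EXISTS (
        SELECT 1 FROM event o
        WHERE o.patient_id = a.patient_id
          AND o.category = 'operation'
          AND o.date > a.date
          AND EXISTS (
              SELECT 1 FROM event m
              WHERE m.patient_id = a.patient_id
                AND m.category = 'medication'
                AND m.date > o.date)
      )
      AND a.category = 'admission'
""").fetchone()[0]
print(n)
```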
Note, there can still be duplicate admissions per patient; that's probably a fundamental problem in your data model / query design and would need clarification, as discussed in the comments.
If you indeed want to lump all cases of the same patient together for some reason, there are various ways to get the earliest admission for each patient in the initial step - and repeat a similar approach for every additional step. Probably fastest for your case (re-introducing the patient table to the query):
SELECT count(*)
FROM patient p
CROSS JOIN LATERAL ( -- get earliest admission
SELECT e.date
FROM event e
WHERE e.patient_id = p.id
AND e.category = 'admission'
ORDER BY e.date
LIMIT 1
) a
CROSS JOIN LATERAL ( -- get earliest operation after that
SELECT e.date
FROM event e
WHERE e.patient_id = p.id
AND e.category = 'operation'
AND e.date > a.date
ORDER BY e.date
LIMIT 1
) o
WHERE EXISTS ( -- the *last* step can still be a plain EXISTS
SELECT FROM event m
WHERE m.patient_id = p.id
AND m.category = 'medication'
AND m.date > o.date
);
See:
- Select first row in each GROUP BY group?
- Optimize GROUP BY query to retrieve latest record per user
You might optimize your table design by shortening the lengthy (and redundant) category names. Use a lookup table and only store an integer (or even int2 or "char") value as FK.
For best performance (and this is crucial) have a multicolumn index on (patient_id, category, date DESC) and make sure all three columns are defined NOT NULL. The order of index expressions is important. DESC is mostly optional here: Postgres can use the index with default ASC sort order almost as efficiently in your case.
If VACUUM (preferably in the form of autovacuum) can keep up with write operations, or you have a read-only situation to begin with, you'll get very fast index-only scans out of this.
Related:
- Optimizing queries on a range of timestamps (two columns)
- Select Items that has one item but not the other
- How does PostgreSQL perform ORDER BY if a b-tree index is built on that field?
To implement your additional time frames (your "advanced use case"), build on the second query since we have to consider all events again.
You should really have case IDs or something more definitive to tie operation to admission and medication to operation etc. where relevant. (Could simply be the id of the referenced event!) Dates / timestamps alone are error-prone.
SELECT COUNT(*) -- to count cases
-- COUNT(DISTINCT patient_id) -- to count patients
FROM event a
WHERE EXISTS (
SELECT FROM event o
WHERE o.patient_id = a.patient_id
AND o.category = 'operation'
AND o.date >= a.date -- or ">"
AND o.date < a.date + 7 -- based on data type "date"!
AND EXISTS (
SELECT FROM event m
WHERE m.patient_id = a.patient_id
AND m.category = 'medication'
AND m.date >= o.date -- or ">"
AND m.date < o.date + 30 -- syntax for timestamp is different
)
)
AND a.category = 'admission';
About date / timestamp arithmetic:
- How to get the end of a day?