SQL Server equivalent of PostgreSQL distinct on ()
You can try ROW_NUMBER(), but it can affect performance.
;WITH CTE AS
(
SELECT *, ROW_NUMBER() OVER(PARTITION BY a ORDER BY b) Corr
FROM my_table
)
SELECT *
FROM CTE
WHERE Corr = 1
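To see what the ROW_NUMBER pattern returns, here is a minimal sketch that runs the same query shape against SQLite through Python's sqlite3 module instead of SQL Server. The sample rows and the extra column c are made up for illustration, and window functions need SQLite 3.25 or later.

```python
import sqlite3

# Hypothetical sample data: a is the key, b decides which row counts as "first".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (a INTEGER, b INTEGER, c TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [(1, 20, "x"), (1, 10, "y"), (2, 5, "z"), (2, 7, "w")],
)

# Number the rows inside each partition of a, ordered by b, and keep row 1.
first_rows = conn.execute(
    """
    WITH cte AS (
        SELECT a, b, c,
               ROW_NUMBER() OVER (PARTITION BY a ORDER BY b) AS corr
        FROM my_table
    )
    SELECT a, b, c FROM cte WHERE corr = 1 ORDER BY a
    """
).fetchall()
print(first_rows)
```

Each value of a contributes exactly one row, the one with the smallest b.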
Converting SELECT DISTINCT ON queries from Postgresql to MySQL
There's not an exact equivalent to convert a Postgresql query that makes use of SELECT DISTINCT ON to MySQL.
Postgresql SELECT DISTINCT ON
In Postgresql, the following query will eliminate all rows where the expressions (col1, col2, col3)
match, and it will only keep the "first col4, col5 row" for each set of matched rows:
SELECT DISTINCT ON (col1, col2, col3) col4, col5
FROM tablename
So if your table is like this:
col1 | col2 | col3 | col4 | col5
--------------------------------
1 | 2 | 3 | 777 | 888
1 | 2 | 3 | 888 | 999
3 | 3 | 3 | 555 | 555
our query will keep just one row for (1,2,3) and one row for (3,3,3). The resulting rows will then be:
col4 | col5
-----------
777 | 888
555 | 555
Note that the "first row" of each set is unpredictable; our first row might just as well be (888, 999) unless we specify an ORDER BY:
SELECT DISTINCT ON (col1, col2, col3) col4, col5
FROM tablename
ORDER BY col1, col2, col3, col4
(the DISTINCT on expressions must match the leftmost ORDER BY expressions, but the ORDER BY can contain additional expressions).
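The behaviour described above can be modelled in a few lines of plain Python: sort by the ORDER BY columns, group on the DISTINCT ON columns, and keep the first row of each group. This is only a sketch of the semantics, not of how Postgresql executes the query; the function name distinct_on and the index-based column addressing are illustrative choices.

```python
from itertools import groupby
from operator import itemgetter

# Rows of the example table as (col1, col2, col3, col4, col5), deliberately unsorted.
rows = [
    (1, 2, 3, 888, 999),
    (1, 2, 3, 777, 888),
    (3, 3, 3, 555, 555),
]

def distinct_on(rows, key_idx, order_idx, select_idx):
    """Sort, keep the first row per key, then project the selected columns."""
    ordered = sorted(rows, key=itemgetter(*order_idx))
    # groupby only merges adjacent rows, which is why the sort must lead with the key.
    firsts = [next(group) for _, group in groupby(ordered, key=itemgetter(*key_idx))]
    return [tuple(row[i] for i in select_idx) for row in firsts]

# DISTINCT ON (col1, col2, col3) col4, col5 ... ORDER BY col1, col2, col3, col4
result = distinct_on(rows, key_idx=(0, 1, 2), order_idx=(0, 1, 2, 3), select_idx=(3, 4))
print(result)
```

The groupby step also shows why the DISTINCT ON expressions have to be the leftmost ORDER BY expressions: without that, rows with the same key would not be adjacent.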
MySQL extension to GROUP BY
MySQL extends the use of GROUP BY so that we can select nonaggregated columns not named in the GROUP BY clause (as of MySQL 5.7.5 this requires the ONLY_FULL_GROUP_BY SQL mode to be disabled, since it is enabled by default). Whenever we select nonaggregated columns the server is free to choose any value from each group for that column, so the resulting values are indeterminate.
So this Postgresql query:
SELECT DISTINCT ON (col1, col2, col3) col4, col5
FROM tablename
can be considered equivalent to this MySQL query:
SELECT col4, col5
FROM tablename
GROUP BY col1, col2, col3
Both Postgresql and MySQL will return the "first row" for each (col1, col2, col3), and in both cases the row returned is unpredictable because we didn't specify an ORDER BY clause.
A lot of people would be very tempted to convert this Postgresql query with an ORDER BY:
SELECT DISTINCT ON (col1, col2, col3) col4, col5
FROM tablename
ORDER BY col1, col2, col3, col4
with this one:
SELECT col4, col5
FROM (
SELECT col1, col2, col3, col4, col5
FROM tablename
ORDER BY col1, col2, col3, col4
) s
GROUP BY col1, col2, col3
The idea here is to apply an ORDER BY to a subquery so that when MySQL groups by col1, col2, col3 it will keep the first encountered value for col4 and col5. The idea is appealing, but it's wrong! MySQL is free to choose any value for col4 and col5, and we don't know which values are encountered first; that depends on the optimizer. So I would correct it to this:
SELECT t1.col4, t1.col5
FROM tablename t1 INNER JOIN (SELECT col1, col2, col3, MIN(col4) as m_col4
FROM tablename
GROUP BY col1, col2, col3) s
ON t1.col1=s.col1
AND t1.col2=s.col2
AND t1.col3=s.col3
AND t1.col4=s.m_col4
GROUP BY
t1.col1, t1.col2, t1.col3, t1.col4
but this is starting to get more complicated.
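As a rough check of the join-on-MIN approach, the sketch below runs it against the three-row example table, using SQLite through Python's sqlite3 module as a stand-in for MySQL. It leaves out the final GROUP BY because the sample data has no ties on col4 within a group; with ties, that extra grouping (or another tie-breaker) would be needed to get back to one row per group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (col1 INT, col2 INT, col3 INT, col4 INT, col5 INT)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?, ?, ?, ?)",
    [(1, 2, 3, 888, 999), (1, 2, 3, 777, 888), (3, 3, 3, 555, 555)],
)

# The inner query finds the smallest col4 per group; the join pulls back
# the full row that carries that value, so col5 comes from the same row.
joined = conn.execute(
    """
    SELECT t1.col4, t1.col5
    FROM tablename t1
    INNER JOIN (
        SELECT col1, col2, col3, MIN(col4) AS m_col4
        FROM tablename
        GROUP BY col1, col2, col3
    ) s
      ON  t1.col1 = s.col1
      AND t1.col2 = s.col2
      AND t1.col3 = s.col3
      AND t1.col4 = s.m_col4
    ORDER BY t1.col1, t1.col2, t1.col3
    """
).fetchall()
print(joined)
```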
Conclusion
As a general rule, there's no exact way to convert a Postgresql DISTINCT ON query to a MySQL query, but there are a lot of workarounds. Depending on the query itself, the result might be as simple as the original or it might become very complicated.
DISTINCT ON equivalent solution for MYSQL? - Repex included
If you want to filter the pairs of invoice_num and invoice_suffix to instances where they appear on only one row in the table (perhaps that's what you mean by "distinct," although that explanation is throwing others off, or it's throwing me off), then you can write the query you might be describing as an inline view, and then join to it to get the other column(s), like this:
select x.*
from distinct_repex x
join (
select invoice_num,
invoice_suffix
from distinct_repex
group by invoice_num,
invoice_suffix
having count(*) = 1
) y
on x.invoice_num = y.invoice_num
and x.invoice_suffix = y.invoice_suffix;
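Here is a small sketch of what that query keeps, run on SQLite via Python's sqlite3 module. The table contents are invented for illustration, since the original sample data is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE distinct_repex (invoice_num INT, invoice_suffix TEXT, company_name TEXT)"
)
# Hypothetical rows: the pair (100, 'A') appears twice, every other pair once.
conn.executemany(
    "INSERT INTO distinct_repex VALUES (?, ?, ?)",
    [
        (100, "A", "Acme"),
        (100, "A", "Acme"),
        (200, "B", "Bolt"),
        (300, "C", "Core"),
        (300, "D", "Core"),
    ],
)

# Keep only the rows whose (invoice_num, invoice_suffix) pair occurs exactly once.
unique_pairs = conn.execute(
    """
    SELECT x.*
    FROM distinct_repex x
    JOIN (
        SELECT invoice_num, invoice_suffix
        FROM distinct_repex
        GROUP BY invoice_num, invoice_suffix
        HAVING count(*) = 1
    ) y
      ON  x.invoice_num = y.invoice_num
      AND x.invoice_suffix = y.invoice_suffix
    ORDER BY x.invoice_num, x.invoice_suffix
    """
).fetchall()
print(unique_pairs)
```

Note that the repeated pair is dropped entirely rather than collapsed to one row, which is the difference from distinct on.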
If what I've described is not what you want, you might want to show the other columns associated with a particular record within a given invoice_num and invoice_suffix pair, based on the order of other column(s) of the table. That is what distinct on does in Postgres. What's confusing is that you didn't specify which column(s) you want ordered, and your example data has only a single third column, whose values don't even differ between the rows, so it's hard to tell what you want. If in theory the example data included varying company names, and you wanted the first company in alphabetical order, that would look like the below. If your real table has a 4th, 5th, 6th... column, and you want the "chosen record" dictated by the order of certain other columns, you would have to tweak the below accordingly or provide a better example and explanation of what you want.
select y.invoice_num,
y.invoice_suffix,
y.company_name
from (
select x.*,
row_number() over( partition by invoice_num,
invoice_suffix
order by company_name
) as rn
from distinct_repex x
) y
where y.rn = 1;
The above should work in MySQL (8.0 or later, which added window functions) and Microsoft SQL Server. It would be helpful if you said which database you are actually using.
Oracle equivalent of Postgres' DISTINCT ON?
The same effect can be replicated in Oracle either by using the first_value() function or by using one of the rank() or row_number() functions.
Both variants also work in Postgres.
first_value()
select distinct col1,
first_value(col2) over (partition by col1 order by col2 asc)
from tmp
first_value gives the first value for the partition, but repeats it for each row, so it is necessary to use it in combination with distinct to get a single row for each partition.
row_number() / rank()
select col1, col2 from (
select col1, col2,
row_number() over (partition by col1 order by col2 asc) as rownumber
from tmp
) foo
where rownumber = 1
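Replacing row_number() with rank() in this example yields the same result as long as no two rows in a partition tie on col2; with ties, rank() assigns 1 to every tied row, so each of them is returned.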
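The difference between the two functions only shows up when rows tie on the ordering column. The sketch below makes that visible with SQLite (3.25 or later) through Python's sqlite3 module rather than Oracle; the data and the helper first_per_partition are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp (col1 TEXT, col2 INTEGER)")
# Hypothetical data: partition 'a' has two rows tied on the lowest col2.
conn.executemany(
    "INSERT INTO tmp VALUES (?, ?)",
    [("a", 1), ("a", 1), ("a", 2), ("b", 5)],
)

def first_per_partition(func):
    # func comes from a fixed whitelist, so interpolating it into the SQL is safe.
    assert func in ("row_number", "rank")
    sql = f"""
        SELECT col1, col2 FROM (
            SELECT col1, col2,
                   {func}() OVER (PARTITION BY col1 ORDER BY col2 ASC) AS rn
            FROM tmp
        ) foo
        WHERE rn = 1
        ORDER BY col1
    """
    return conn.execute(sql).fetchall()

with_row_number = first_per_partition("row_number")  # one row per partition
with_rank = first_per_partition("rank")              # every tied row survives
print(with_row_number, with_rank)
```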
A feature of this variant is that it can be used to fetch the first N rows for a given partition (e.g. "last 3 updated") simply by changing rownumber = 1 to rownumber <= N.
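A minimal sketch of the first-N variant, again on SQLite (3.25 or later) through Python's sqlite3 module, with invented data and a hypothetical helper first_n that passes N as a bound parameter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp (col1 TEXT, col2 INTEGER)")
conn.executemany(
    "INSERT INTO tmp VALUES (?, ?)",
    [("a", 3), ("a", 1), ("a", 2), ("b", 20), ("b", 10)],
)

def first_n(n):
    """Return the n rows with the smallest col2 inside each col1 partition."""
    return conn.execute(
        """
        SELECT col1, col2 FROM (
            SELECT col1, col2,
                   row_number() OVER (PARTITION BY col1 ORDER BY col2 ASC) AS rownumber
            FROM tmp
        ) foo
        WHERE rownumber <= ?
        ORDER BY col1, col2
        """,
        (n,),
    ).fetchall()

top_two = first_n(2)
print(top_two)
```

Partitions with fewer than N rows simply return all of their rows.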
What is the difference between Postgres DISTINCT vs DISTINCT ON?
DISTINCT and DISTINCT ON have completely different semantics.
First the theory
DISTINCT applies to an entire tuple. Once the result of the query is computed, DISTINCT removes any duplicate tuples from the result.
For example, assume a table R with the following contents:
#table r;
a | b
---+---
1 | a
2 | b
3 | c
3 | d
2 | e
1 | a
(6 rows)
SELECT distinct * from R returns:
# select distinct * from r;
a | b
---+---
1 | a
3 | d
2 | e
2 | b
3 | c
(5 rows)
Note that distinct applies to the entire list of projected attributes: thus
select distinct * from R
is semantically equivalent to
select distinct a,b from R
You cannot issue
select a, distinct b From R
DISTINCT must follow SELECT. It applies to the entire tuple, not to an attribute of the result.
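The whole-tuple behaviour is easy to check. The sketch below loads the six rows of R into SQLite through Python's sqlite3 module (standing in for Postgres) and compares the plain and DISTINCT results; an ORDER BY is added only to make the output order stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r (a INTEGER, b TEXT)")
conn.executemany(
    "INSERT INTO r VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c"), (3, "d"), (2, "e"), (1, "a")],
)

all_rows = conn.execute("SELECT a, b FROM r").fetchall()
# DISTINCT compares complete (a, b) tuples, so only the repeated (1, 'a') collapses.
distinct_rows = conn.execute("SELECT DISTINCT a, b FROM r ORDER BY a, b").fetchall()
print(len(all_rows), distinct_rows)
```

Rows that share a value of a but differ in b, such as (2, 'b') and (2, 'e'), are both kept.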
DISTINCT ON is a postgresql addition to the language. It is similar, but not identical, to group by.
Its syntax is:
SELECT DISTINCT ON (attributeList) <rest as any query>
For example:
SELECT DISTINCT ON (a) * from R
Its semantics can be described as follows. Compute the query as usual (without the DISTINCT ON (a)), but before the projection of the result, sort the current result and group it according to the attribute list in DISTINCT ON (similar to group by). Now, do the projection using the first tuple in each group and ignore the other tuples.
Example:
select * from r order by a;
a | b
---+---
1 | a
1 | a
2 | e
2 | b
3 | c
3 | d
(6 rows)
Then for every different value of a (in this case, 1, 2 and 3), take the first tuple. Which is the same as:
SELECT DISTINCT on (a) * from r;
a | b
---+---
1 | a
2 | b
3 | c
(3 rows)
Some DBMS (most notably sqlite) will allow you to run this query:
SELECT a,b from R group by a;
And this gives you a similar result.
Postgresql will allow this query if and only if there is a functional dependency from a to b. In other words, this query will be valid if for any instance of the relation R, there is only one unique tuple for every value of a (thus selecting the first tuple is deterministic: there is only one tuple).
For instance, if the primary key of R is a, then a->b and:
SELECT a,b FROM R group by a
is identical to:
SELECT DISTINCT on (a) a, b from r;
Now, back to your problem:
First query:
SELECT DISTINCT count(dimension1)
FROM data_table;
computes the count of dimension1 (the number of tuples in data_table where dimension1 is not null). This query returns one tuple, which is always unique (hence DISTINCT is redundant).
Query 2:
SELECT count(*)
FROM (SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
GROUP BY dimension1) AS tmp_table;
This is a query within a query. Let me rewrite it for clarity:
WITH tmp_table AS (
SELECT DISTINCT ON (dimension1)
dimension1 FROM data_table
GROUP by dimension1)
SELECT count(*) from tmp_table
Let us first compute tmp_table. As I mentioned above, let us first ignore the DISTINCT ON and do the rest of the query. This is a group by on dimension1, so this part of the query results in one tuple per different value of dimension1.
Now, the DISTINCT ON. It uses dimension1 again. But dimension1 is already unique (due to the group by), which makes the DISTINCT ON superfluous (it does nothing). The final count is simply a count of all the tuples in the group by.
As you can see, the following queries are equivalent (this applies to any relation with an attribute a):
SELECT DISTINCT ON (a) a
FROM R
and
SELECT a FROM R group by a
and
SELECT DISTINCT a FROM R
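The GROUP BY and DISTINCT forms can be compared directly. The sketch below does so on SQLite through Python's sqlite3 module, which has no DISTINCT ON, so only the last two forms are exercised; the table is the R from the earlier example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r (a INTEGER, b TEXT)")
conn.executemany(
    "INSERT INTO r VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c"), (3, "d"), (2, "e"), (1, "a")],
)

by_group = conn.execute("SELECT a FROM r GROUP BY a ORDER BY a").fetchall()
by_distinct = conn.execute("SELECT DISTINCT a FROM r ORDER BY a").fetchall()

# Counting the grouped rows mirrors the count(*) over tmp_table in Query 2.
group_count = conn.execute(
    "SELECT count(*) FROM (SELECT a FROM r GROUP BY a) AS tmp_table"
).fetchone()[0]
print(by_group, by_distinct, group_count)
```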
Warning
A query that uses DISTINCT ON might be non-deterministic for a given instance of the database, unless an ORDER BY fixes which tuple comes first in each group.
In other words, the query might return different results for the same tables.
One interesting aspect
Distinct ON emulates a bad behaviour of sqlite in a much cleaner way. Assume that R has two attributes a and b:
SELECT a, b FROM R group by a
is an illegal statement in standard SQL. Yet, it runs on sqlite. It simply takes an arbitrary value of b from one of the tuples in the group of same values of a.
In Postgresql this statement is rejected (unless a is the primary key of R, as noted above). Instead, you must use DISTINCT ON and write:
SELECT DISTINCT ON (a) a,b from R
Corollary
DISTINCT ON is useful in a group by when you want to access a value that is functionally dependent on the group by attributes. In other words, if you know that for every group of attributes they always have the same value of the third attribute, then use DISTINCT ON that group of attributes. Otherwise you would have to make a JOIN to retrieve that third attribute.
SELECT DISTINCT ON in Postgresql returns multiple rows
Distinct on returns one row for each combination of the keys following the clause. In this case, each ts would appear once. Which one? That is determined by the order by. The order by keys that follow the distinct on keys determine which row comes first, so that would be the row with the lowest version number.
Your query appears to be producing reasonable results.
If you want one row in the result set with the largest ts, then use order by and limit / fetch first 1 row only:
SELECT *
FROM tbl
WHERE clause = 100 AND
ts <= '2018-04-02 15:11:18.819000' AND
ts > '2018-04-02 15:06:18.819000'
ORDER BY ts DESC, version ASC
FETCH FIRST 1 ROW ONLY;
Note the ts DESC in the ORDER BY, to get the most recent value of ts.
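A small sketch of the single-row query, run on SQLite through Python's sqlite3 module. SQLite has no FETCH FIRST, so LIMIT 1 is used instead; the rows are invented, and ts is stored as fixed-width text so that string comparison matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (clause INTEGER, ts TEXT, version INTEGER)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [
        (100, "2018-04-02 15:07:00.000000", 1),
        (100, "2018-04-02 15:10:00.000000", 2),
        (100, "2018-04-02 15:10:00.000000", 1),
        (100, "2018-04-02 15:20:00.000000", 1),  # outside the time window
        (200, "2018-04-02 15:10:30.000000", 1),  # different clause value
    ],
)

# Newest ts first; among rows sharing that ts, the lowest version wins.
latest = conn.execute(
    """
    SELECT clause, ts, version
    FROM tbl
    WHERE clause = 100
      AND ts <= '2018-04-02 15:11:18.819000'
      AND ts >  '2018-04-02 15:06:18.819000'
    ORDER BY ts DESC, version ASC
    LIMIT 1
    """
).fetchone()
print(latest)
```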
PostgreSQL DISTINCT ON with different ORDER BY
Documentation says:
DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. [...] Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. [...] The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s).
Official documentation
So you'll have to add the address_id to the order by.
Alternatively, if you're looking for the full row that contains the most recently purchased product for each address_id, with that result sorted by purchased_at, then you're trying to solve a greatest-N-per-group problem, which can be solved by the following approaches:
The general solution that should work in most DBMSs:
SELECT t1.* FROM purchases t1
JOIN (
SELECT address_id, max(purchased_at) max_purchased_at
FROM purchases
WHERE product_id = 1
GROUP BY address_id
) t2
ON t1.address_id = t2.address_id AND t1.purchased_at = t2.max_purchased_at
ORDER BY t1.purchased_at DESC
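To illustrate the general join solution, the sketch below runs it on SQLite through Python's sqlite3 module. The purchases rows, and the id column in particular, are assumptions made for the example; the original table definition is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (id INTEGER, address_id INTEGER, product_id INTEGER, purchased_at TEXT)"
)
# Hypothetical purchases; dates are ISO strings so text order is date order.
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        (1, 10, 1, "2024-01-01"),
        (2, 10, 1, "2024-03-01"),
        (3, 20, 1, "2024-02-01"),
        (4, 20, 2, "2024-05-01"),
        (5, 30, 2, "2024-04-01"),
    ],
)

# t2 holds the latest purchase time of product 1 per address; the join
# recovers the full row, and the outer ORDER BY sorts by purchase time.
latest_per_address = conn.execute(
    """
    SELECT t1.* FROM purchases t1
    JOIN (
        SELECT address_id, max(purchased_at) AS max_purchased_at
        FROM purchases
        WHERE product_id = 1
        GROUP BY address_id
    ) t2
      ON t1.address_id = t2.address_id AND t1.purchased_at = t2.max_purchased_at
    ORDER BY t1.purchased_at DESC
    """
).fetchall()
print(latest_per_address)
```

Address 30 never bought product 1, so it does not appear in the result.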
A more PostgreSQL-oriented solution based on @hkf's answer:
SELECT * FROM (
SELECT DISTINCT ON (address_id) *
FROM purchases
WHERE product_id = 1
ORDER BY address_id, purchased_at DESC
) t
ORDER BY purchased_at DESC
Problem clarified, extended and solved here: Selecting rows ordered by some column and distinct on another
postgresql - distinct only on one column
You can use PostgreSQL's Aggregate Functions
SELECT type
,string_agg(NAME, ',') "name"
FROM eatable
GROUP BY type;
Result:
type name
text text
------ ----------------------
fruit apple,banana
veggie brinjal,carrot,cabbage
OR
SELECT type
,array_agg(name) "name"
FROM eatable
GROUP BY type;
Result:
type name
text text[]
------ ------------------------
fruit {apple,banana}
veggie {brinjal,carrot,cabbage}
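The same grouping idea can be tried without a Postgres server. The sketch below uses SQLite's group_concat, which plays a similar role to string_agg, through Python's sqlite3 module. The eatable rows are inferred from the result tables above, and because the concatenation order is not guaranteed, the names are sorted after splitting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eatable (name TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO eatable VALUES (?, ?)",
    [
        ("apple", "fruit"),
        ("banana", "fruit"),
        ("brinjal", "veggie"),
        ("carrot", "veggie"),
        ("cabbage", "veggie"),
    ],
)

# One output row per type, with all names of that type joined by commas.
rows = conn.execute(
    "SELECT type, group_concat(name, ',') AS name FROM eatable GROUP BY type ORDER BY type"
).fetchall()

# Split and sort so the comparison does not depend on aggregation order.
names_by_type = {type_: sorted(names.split(",")) for type_, names in rows}
print(names_by_type)
```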