How to perform the same aggregation on every column, without listing the columns?
You need dynamic SQL for that, which means you have to create a function or run a DO
command. Since you cannot return values directly from the latter, a plpgsql function it is:
CREATE OR REPLACE FUNCTION f_count_all(_tbl text
                                     , OUT columns text[]
                                     , OUT counts bigint[])
  RETURNS record
  LANGUAGE plpgsql AS
$func$
BEGIN
   EXECUTE (
   SELECT 'SELECT
      ARRAY[' || string_agg('''' || quote_ident(attname) || '''', ', ') || ']
    , ARRAY[' || string_agg('count(' || quote_ident(attname) || ')', ', ') || ']
   FROM ' || _tbl
   FROM   pg_attribute
   WHERE  attrelid = _tbl::regclass
   AND    attnum >= 1        -- exclude tableoid & friends (negative attnum)
   AND    NOT attisdropped   -- exclude dropped columns
   GROUP  BY attrelid
   )
   INTO columns, counts;
END
$func$;
Call:
SELECT * FROM f_count_all('myschema.mytable');
Returns:
   columns    |   counts
--------------+------------
 {c1, c2, c3} | {17, 1, 0}
More explanation and links about dynamic SQL and EXECUTE in this related question - or a couple more here on SO, try this search.
Related:
- Count values for every column in a table
You could even try and return a polymorphic record type to get single columns dynamically, but that's rather complex and advanced. Probably too much effort for your case. More in this related answer.
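The catalog-driven pattern above can also be sketched outside PostgreSQL. Here is a minimal, hypothetical Python example using the stdlib sqlite3 module, where PRAGMA table_info plays the role of pg_attribute (table and column names are invented):

```python
import sqlite3

def count_all(conn, table):
    """Build and run 'SELECT count(c1), count(c2), ... FROM table' dynamically."""
    # SQLite's catalog equivalent of pg_attribute is PRAGMA table_info;
    # each row is (cid, name, type, notnull, dflt_value, pk).
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    counts_sql = ", ".join(f'count("{c}")' for c in cols)
    counts = conn.execute(f"SELECT {counts_sql} FROM {table}").fetchone()
    return cols, list(counts)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c1 int, c2 int, c3 int)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(1, 1, None), (2, None, None)])
print(count_all(conn, "t"))  # (['c1', 'c2', 'c3'], [2, 1, 0])
```

Like the plpgsql version, the column list comes from the catalog at run time, so no column is ever listed by hand.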
Apply the same aggregate to every column in a table
First, since count() only counts non-null values, your query can be simplified:

SELECT count(DISTINCT names) AS unique_names
     , count(names)          AS names_not_null
FROM   tbl;

But that's the number of non-null values, which contradicts your description:

count of the number of null values in the column

For that you would use:

count(*) - count(names) AS names_null

since count(*) counts all rows and count(names) only rows with non-null names.
To automate that for all columns, build the SQL statement dynamically from the catalog table pg_attribute. You can use EXECUTE in a PL/pgSQL function to execute it immediately. Find full code examples with links to the manual and explanation under these closely related questions:
- How to perform the same aggregation on every column, without listing the columns?
- postgresql - count (no null values) of each column in a table
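As a quick sanity check of the count(*) - count(col) idiom, here is a minimal sketch using Python's stdlib sqlite3 module with invented sample data (the table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (names text)")
conn.executemany("INSERT INTO people VALUES (?)",
                 [("ann",), ("bob",), (None,), ("ann",), (None,)])

# count(DISTINCT ...) and count(col) skip NULLs; count(*) does not,
# so the difference is exactly the number of NULLs in the column.
row = conn.execute("""
    SELECT count(DISTINCT names)   AS unique_names
         , count(names)            AS names_not_null
         , count(*) - count(names) AS names_null
    FROM   people
""").fetchone()
print(row)  # (2, 3, 2)
```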
Spark SQL: apply aggregate functions to a list of columns
There are multiple ways of applying aggregate functions to multiple columns.
The GroupedData class provides a number of methods for the most common functions, including count, max, min, mean and sum, which can be used directly as follows:
Python:
df = sqlContext.createDataFrame(
    [(1.0, 0.3, 1.0), (1.0, 0.5, 0.0), (-1.0, 0.6, 0.5), (-1.0, 5.6, 0.2)],
    ("col1", "col2", "col3"))
df.groupBy("col1").sum()
## +----+---------+-----------------+---------+
## |col1|sum(col1)| sum(col2)|sum(col3)|
## +----+---------+-----------------+---------+
## | 1.0| 2.0| 0.8| 1.0|
## |-1.0| -2.0|6.199999999999999| 0.7|
## +----+---------+-----------------+---------+
Scala:
val df = sc.parallelize(Seq(
  (1.0, 0.3, 1.0), (1.0, 0.5, 0.0),
  (-1.0, 0.6, 0.5), (-1.0, 5.6, 0.2))
).toDF("col1", "col2", "col3")
df.groupBy($"col1").min().show
// +----+---------+---------+---------+
// |col1|min(col1)|min(col2)|min(col3)|
// +----+---------+---------+---------+
// | 1.0| 1.0| 0.3| 0.0|
// |-1.0| -1.0| 0.6| 0.2|
// +----+---------+---------+---------+
Optionally you can pass a list of columns which should be aggregated:
df.groupBy("col1").sum("col2", "col3")
You can also pass a dictionary / map with columns as the keys and functions as the values:
Python:
exprs = {x: "sum" for x in df.columns}
df.groupBy("col1").agg(exprs).show()
## +----+---------+-----------------+---------+
## |col1|sum(col1)|        sum(col2)|sum(col3)|
## +----+---------+-----------------+---------+
## | 1.0|      2.0|              0.8|      1.0|
## |-1.0|     -2.0|6.199999999999999|      0.7|
## +----+---------+-----------------+---------+
Scala:
val exprs = df.columns.map((_ -> "mean")).toMap
df.groupBy($"col1").agg(exprs).show()
// +----+---------+------------------+---------+
// |col1|avg(col1)| avg(col2)|avg(col3)|
// +----+---------+------------------+---------+
// | 1.0| 1.0| 0.4| 0.5|
// |-1.0| -1.0|3.0999999999999996| 0.35|
// +----+---------+------------------+---------+
Finally you can use varargs:
Python:
from pyspark.sql.functions import min
exprs = [min(x) for x in df.columns]
df.groupBy("col1").agg(*exprs).show()
Scala:
import org.apache.spark.sql.functions.sum
val exprs = df.columns.map(sum(_))
df.groupBy($"col1").agg(exprs.head, exprs.tail: _*)
There are some other ways to achieve a similar effect, but these should be more than enough most of the time.
See also:
- Multiple Aggregate operations on the same column of a spark dataframe
Apply aggregate function to all columns on table with group by
You can use DISTINCT ON to get one row per group and join that with the total scores calculated by a GROUP BY query. With this approach there will be a score column containing the value from some row in each group and a separate column for the total score.
WITH total_scores AS (
    SELECT age, name, SUM(score) AS total_score
    FROM test_table
    GROUP BY age, name
)
SELECT DISTINCT ON (tt.age, tt.name)
       tt.*, ts.total_score
FROM test_table tt
JOIN total_scores ts ON tt.age = ts.age AND tt.name = ts.name;
That said, it seems that you could normalize your data into two tables: one containing the rows that have duplicate values (i.e. everything else except score) and another table containing score and a foreign key pointing to the first table.
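A rough pandas sketch of the same shape of result, one representative row per group plus a total column (column names are assumed from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "age":   [30, 30, 40],
    "name":  ["ann", "ann", "bob"],
    "score": [10, 5, 7],
})

# total per (age, name) group, broadcast back onto every row
df["total_score"] = df.groupby(["age", "name"])["score"].transform("sum")
# keep one (here: the first) row per group, analogous to DISTINCT ON
result = df.drop_duplicates(["age", "name"])
print(result)
```

As in the SQL version, the score column of each surviving row holds the value from one arbitrary row of its group, while total_score holds the group sum.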
Aggregate values in columns ONLY if there is a difference
You can use:
select name, string_agg(distinct surname, ',') as surname, string_agg(age::text, ',')
from t
group by name;
This assumes that all names are unique -- that seems like a strong assumption for most datasets.
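The same idea sketched in pandas with invented sample data: collapse duplicate surnames per name while concatenating every age (the distinct-then-join lambda mirrors string_agg(distinct ...)):

```python
import pandas as pd

df = pd.DataFrame({
    "name":    ["ann", "ann", "bob"],
    "surname": ["lee", "lee", "kim"],
    "age":     [30, 31, 40],
})

out = df.groupby("name", as_index=False).agg(
    # deduplicate before joining, like string_agg(distinct surname, ',')
    surname=("surname", lambda s: ",".join(sorted(set(s)))),
    # plain concatenation, like string_agg(age::text, ',')
    age=("age", lambda s: ",".join(map(str, s))),
)
print(out)
```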
Dealing with many columns and groupby/agg in pandas (without manually assigning all aggregation functions)
You can use a dictionary of aggregation functions:
# default is sum
d = {c: 'sum' for c in df.columns}
# change a few other columns
d.update({'hash': list, 'name': 'first', 'group': 'first', 'trial': 'first'})
# aggregate
df.groupby('group', as_index=False).agg(d)
Alternative option to set up the dictionary:
d = {'first': ['name', 'group', 'trial'],
'sum': ['t1', 't2', 't3', 't4'],
list: ['hash']}
d = {k:v for v,l in d.items() for k in l}
df.groupby('group', as_index=False).agg(d)[df.columns]
NB. you can also combine both!
output:
hash name group trial t1 t2 t3 t4
0 [1AomKHNL56l, zsfAu5Q6I60] EP1 G01 clump 5 7 6 12
1 [2oKA7J1B3GL, v68fGHY8zx1] EP2 G02 green 3 15 9 46
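The dictionary approach run end-to-end on small invented sample data (column names taken from the output shown above; the grouping column is left out of the dict here to keep the sketch version-agnostic):

```python
import pandas as pd

df = pd.DataFrame({
    "hash":  ["1AomKHNL56l", "zsfAu5Q6I60", "2oKA7J1B3GL", "v68fGHY8zx1"],
    "name":  ["EP1", "EP1", "EP2", "EP2"],
    "group": ["G01", "G01", "G02", "G02"],
    "trial": ["clump", "clump", "green", "green"],
    "t1": [2, 3, 1, 2],
    "t2": [3, 4, 7, 8],
})

# default is sum for every non-grouping column
d = {c: "sum" for c in df.columns if c != "group"}
# override a few columns with other functions
d.update({"hash": list, "name": "first", "trial": "first"})

out = df.groupby("group", as_index=False).agg(d)
print(out)
```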
Aggregate multiple columns at once
We can use the formula method of aggregate. The variables on the rhs of ~ are the grouping variables while the . represents all the other variables in 'df1' (from the example, we assume that we need the mean for all the columns except the grouping ones). Specify the dataset and the function (mean):
aggregate(.~id1+id2, df1, mean)
Or we can use summarise_each from dplyr after grouping (group_by):
library(dplyr)
df1 %>%
  group_by(id1, id2) %>%
  summarise_each(funs(mean))
Or using summarise with across (dplyr devel version ‘0.8.99.9000’):
df1 %>%
  group_by(id1, id2) %>%
  summarise(across(starts_with('val'), mean))
Or another option is data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)), group by 'id1' and 'id2', loop through the subset of the data.table (.SD) and get the mean:
library(data.table)
setDT(df1)[, lapply(.SD, mean), by = .(id1, id2)]
data
df1 <- structure(list(id1 = c("a", "a", "a", "a", "b", "b", "b", "b"),
                      id2 = c("x", "x", "y", "y", "x", "y", "x", "y"),
                      val1 = c(1L, 2L, 3L, 4L, 1L, 4L, 3L, 2L),
                      val2 = c(9L, 4L, 5L, 9L, 7L, 4L, 9L, 8L)),
                 .Names = c("id1", "id2", "val1", "val2"),
                 class = "data.frame",
                 row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
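For comparison, the same grouped mean in pandas, using the data from the R example:

```python
import pandas as pd

df1 = pd.DataFrame({
    "id1":  ["a", "a", "a", "a", "b", "b", "b", "b"],
    "id2":  ["x", "x", "y", "y", "x", "y", "x", "y"],
    "val1": [1, 2, 3, 4, 1, 4, 3, 2],
    "val2": [9, 4, 5, 9, 7, 4, 9, 8],
})

# mean of every non-grouping column, per (id1, id2) group
out = df1.groupby(["id1", "id2"], as_index=False).mean()
print(out)
```

Like aggregate(. ~ id1 + id2, df1, mean), every column except the grouping ones is averaged without being listed by hand.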
Why do I need to explicitly specify all columns in a SQL GROUP BY clause - why not GROUP BY *?
It's hard to know exactly what the designers of the SQL language were thinking when they wrote the standard, but here's my opinion.
SQL, as a general rule, requires you to explicitly state your expectations and your intent. The language does not try to "guess what you meant", and automatically fill in the blanks. This is a good thing.
When you write a query, the most important consideration is that it yields correct results. If you made a mistake, it's probably better that the SQL parser informs you, rather than making a guess about your intent and returning results that may not be correct. The declarative nature of SQL (where you state what you want to retrieve rather than the steps to retrieve it) already makes it easy to inadvertently make mistakes. Introducing fuzziness into the language syntax would not make this better.
In fact, every case I can think of where the language allows for shortcuts has caused problems. Take, for instance, natural joins - where you can omit the names of the columns you want to join on and allow the database to infer them based on column names. Once the column names change (as they naturally do over time) - the semantics of existing queries changes with them. This is bad ... very bad - you really don't want this kind of magic happening behind the scenes in your database code.
One consequence of this design choice, however, is that SQL is a verbose language in which you must explicitly express your intent. This can mean writing more code than you would like, and griping about why certain constructs are so verbose ... but at the end of the day - it is what it is.
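To make the natural-join hazard concrete, here is a small sketch using Python's stdlib sqlite3 (table and column names invented): the inferred join key silently changes the moment a second common column name appears.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id int, amount int);
    CREATE TABLE customers (customer_id int, name text);
    INSERT INTO orders VALUES (1, 100);
    INSERT INTO customers VALUES (1, 'ann');
""")
# NATURAL JOIN infers the join key: currently just customer_id
print(conn.execute(
    "SELECT count(*) FROM orders NATURAL JOIN customers").fetchone())  # (1,)

# Add a column named "name" to orders: the inferred key silently becomes
# (customer_id, name), and the unchanged query now matches nothing.
conn.executescript("ALTER TABLE orders ADD COLUMN name text;")
print(conn.execute(
    "SELECT count(*) FROM orders NATURAL JOIN customers").fetchone())  # (0,)
```

The query text never changed; only the schema did, yet its results went from one row to zero, which is exactly the kind of behind-the-scenes magic the answer warns against.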