How to Perform the Same Aggregation on Every Column, Without Listing the Columns

You need dynamic SQL for that, which means you have to create a function or run a DO command. Since you cannot return values directly from the latter, a PL/pgSQL function it is:

CREATE OR REPLACE FUNCTION f_count_all(_tbl text
                                     , OUT columns text[]
                                     , OUT counts bigint[])
  RETURNS record
  LANGUAGE plpgsql AS
$func$
BEGIN
EXECUTE (
   SELECT 'SELECT
             ARRAY[' || string_agg('''' || quote_ident(attname) || '''', ', ') || ']
           , ARRAY[' || string_agg('count(' || quote_ident(attname) || ')', ', ') || ']
           FROM ' || _tbl
   FROM   pg_attribute
   WHERE  attrelid = _tbl::regclass
   AND    attnum >= 1          -- exclude tableoid & friends (neg. attnum)
   AND    NOT attisdropped     -- exclude deleted columns
   GROUP  BY attrelid
   )
INTO columns, counts;
END
$func$;

Call:

SELECT * FROM f_count_all('myschema.mytable');

Returns:

   columns     |   counts
---------------+------------
 {c1, c2, c3}  | {17, 1, 0}
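
For illustration, the statement the function assembles and executes would look roughly like this for the example above (a sketch, assuming the three columns c1, c2, c3 shown in the result):

SELECT ARRAY['c1', 'c2', 'c3']                  -- column names as text array
     , ARRAY[count(c1), count(c2), count(c3)]   -- per-column counts as bigint array
FROM   myschema.mytable;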

More explanation and links about dynamic SQL and EXECUTE can be found in related questions here on SO.

Related:

  • Count values for every column in a table

You could even try to return a polymorphic record type to get individual columns dynamically, but that's rather complex and advanced - probably too much effort for your case.

Apply the same aggregate to every column in a table

First, since COUNT() only counts non-null values, your query can be simplified:

SELECT count(DISTINCT names) AS unique_names
     , count(names)          AS names_not_null
FROM   tbl;

But that's the number of non-null values and contradicts your description:

count of the number of null values in the column

For that you would use:

count(*) - count(names) AS names_null

Since count(*) counts all rows, while count(names) counts only rows with non-null names.
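
Putting the pieces together, a minimal sketch (assuming your table is called tbl and names is a nullable column):

SELECT count(DISTINCT names)   AS unique_names    -- distinct non-null values
     , count(names)            AS names_not_null  -- rows where names IS NOT NULL
     , count(*) - count(names) AS names_null      -- rows where names IS NULL
FROM   tbl;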

To automate that for all columns, build an SQL statement dynamically off the catalog table pg_attribute. You can use EXECUTE in a PL/pgSQL function to execute it immediately. Find full code examples with links to the manual and explanation under these closely related questions:

  • How to perform the same aggregation on every column, without listing the columns?
  • postgresql - count (no null values) of each column in a table

Spark SQL: apply aggregate functions to a list of columns

There are multiple ways of applying aggregate functions to multiple columns.

The GroupedData class provides a number of methods for the most common functions, including count, max, min, mean and sum, which can be used directly as follows:

  • Python:

    df = sqlContext.createDataFrame(
        [(1.0, 0.3, 1.0), (1.0, 0.5, 0.0), (-1.0, 0.6, 0.5), (-1.0, 5.6, 0.2)],
        ("col1", "col2", "col3"))

    df.groupBy("col1").sum().show()

    ## +----+---------+-----------------+---------+
    ## |col1|sum(col1)|        sum(col2)|sum(col3)|
    ## +----+---------+-----------------+---------+
    ## | 1.0|      2.0|              0.8|      1.0|
    ## |-1.0|     -2.0|6.199999999999999|      0.7|
    ## +----+---------+-----------------+---------+
  • Scala

    val df = sc.parallelize(Seq(
      (1.0, 0.3, 1.0), (1.0, 0.5, 0.0),
      (-1.0, 0.6, 0.5), (-1.0, 5.6, 0.2))
    ).toDF("col1", "col2", "col3")

    df.groupBy($"col1").min().show

    // +----+---------+---------+---------+
    // |col1|min(col1)|min(col2)|min(col3)|
    // +----+---------+---------+---------+
    // | 1.0|      1.0|      0.3|      0.0|
    // |-1.0|     -1.0|      0.6|      0.2|
    // +----+---------+---------+---------+

Optionally you can pass a list of columns which should be aggregated:

df.groupBy("col1").sum("col2", "col3")

You can also pass a dictionary / map with columns as the keys and functions as the values:

  • Python

    exprs = {x: "sum" for x in df.columns}
    df.groupBy("col1").agg(exprs).show()

    ## +----+---------+-----------------+---------+
    ## |col1|sum(col1)|        sum(col2)|sum(col3)|
    ## +----+---------+-----------------+---------+
    ## | 1.0|      2.0|              0.8|      1.0|
    ## |-1.0|     -2.0|6.199999999999999|      0.7|
    ## +----+---------+-----------------+---------+
  • Scala

    val exprs = df.columns.map((_ -> "mean")).toMap
    df.groupBy($"col1").agg(exprs).show()

    // +----+---------+------------------+---------+
    // |col1|avg(col1)|         avg(col2)|avg(col3)|
    // +----+---------+------------------+---------+
    // | 1.0|      1.0|               0.4|      0.5|
    // |-1.0|     -1.0|3.0999999999999996|     0.35|
    // +----+---------+------------------+---------+

Finally you can use varargs:

  • Python

    from pyspark.sql.functions import min

    exprs = [min(x) for x in df.columns]
    df.groupBy("col1").agg(*exprs).show()
  • Scala

    import org.apache.spark.sql.functions.sum

    val exprs = df.columns.map(sum(_))
    df.groupBy($"col1").agg(exprs.head, exprs.tail: _*)

There are some other ways to achieve a similar effect, but these should be more than enough most of the time.

See also:

  • Multiple Aggregate operations on the same column of a spark dataframe

Apply aggregate function to all columns on table with group by

You can use DISTINCT ON to get one row per group and join that with the total scores calculated by a GROUP BY query. With this approach there will be a score column containing the value from some (arbitrary) row in each group and a separate column for the total score.

WITH total_scores AS (
   SELECT age, name, SUM(score) AS total_score
   FROM   test_table
   GROUP  BY age, name
)
SELECT DISTINCT ON (tt.age, tt.name)
       tt.*, ts.total_score
FROM   test_table tt
JOIN   total_scores ts ON tt.age = ts.age AND tt.name = ts.name;

That said, it seems that you could normalize your data into two tables: one containing the values that are currently duplicated (i.e. everything except score) and another containing score plus a foreign key pointing to the first table.
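
A sketch of what that normalization could look like, with hypothetical table and column names (assuming age and name are the duplicated values and score varies per row):

CREATE TABLE person (
    person_id serial PRIMARY KEY
  , name      text NOT NULL
  , age       int  NOT NULL
    -- ... any other columns that are duplicated per group
);

CREATE TABLE score (
    score_id  serial PRIMARY KEY
  , person_id int NOT NULL REFERENCES person
  , score     int NOT NULL
);

-- the per-group total then becomes a plain GROUP BY on the score table
SELECT p.*, s.total_score
FROM   person p
JOIN  (
   SELECT person_id, sum(score) AS total_score
   FROM   score
   GROUP  BY person_id
   ) s USING (person_id);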

Aggregate values in columns ONLY if there is a difference

You can use:

select name, string_agg(distinct surname, ',') as surname, string_agg(age, ',')
from t
group by name;

This assumes that all names are unique -- that seems like a strong assumption for most datasets.
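
Note that string_agg() expects text input, so if age is a numeric column it needs a cast, for example (a sketch assuming age is an integer):

select name, string_agg(distinct surname, ',') as surnames, string_agg(age::text, ',') as ages
from t
group by name;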

Dealing with many columns and groupby/agg in pandas (without manually assigning all aggregation functions)

You can use a dictionary of aggregation functions:

# default is sum
d = {c: 'sum' for c in df.columns}

# change a few other columns
d.update({'hash': list, 'name': 'first', 'group': 'first', 'trial': 'first'})

# aggregate
df.groupby('group', as_index=False).agg(d)

Alternative option to set up the dictionary:

d = {'first': ['name', 'group', 'trial'],
     'sum': ['t1', 't2', 't3', 't4'],
     list: ['hash']}

# invert the mapping: each column becomes a key, its function the value
d = {k: v for v, l in d.items() for k in l}

df.groupby('group', as_index=False).agg(d)[df.columns]

NB: you can also combine both approaches!

output:

                         hash name group  trial  t1  t2  t3  t4
0  [1AomKHNL56l, zsfAu5Q6I60]  EP1   G01  clump   5   7   6  12
1  [2oKA7J1B3GL, v68fGHY8zx1]  EP2   G02  green   3  15   9  46

Aggregate multiple columns at once

We can use the formula method of aggregate. The variables on the rhs of ~ are the grouping variables, while the . on the lhs represents all other variables in 'df1' (for this example, we assume we need the mean of every column except the grouping variables). Then we specify the dataset and the function (mean).

aggregate(.~id1+id2, df1, mean)

Or we can use summarise_each from dplyr after grouping (group_by)

library(dplyr)
df1 %>%
group_by(id1, id2) %>%
summarise_each(funs(mean))

Or using summarise with across (dplyr devel version - ‘0.8.99.9000’)

df1 %>% 
group_by(id1, id2) %>%
summarise(across(starts_with('val'), mean))

Or another option is data.table. We convert the 'data.frame' to a 'data.table' with setDT(df1); then, grouped by 'id1' and 'id2', we loop through the subset of the data.table (.SD) and get the mean of each column.

library(data.table)
setDT(df1)[, lapply(.SD, mean), by = .(id1, id2)]

data

df1 <- structure(list(id1 = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2 = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1L, 2L, 3L, 4L, 1L, 4L, 3L, 2L),
  val2 = c(9L, 4L, 5L, 9L, 7L, 4L, 9L, 8L)),
  .Names = c("id1", "id2", "val1", "val2"),
  class = "data.frame",
  row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))

Why do I need to explicitly specify all columns in a SQL GROUP BY clause - why not GROUP BY *?

It's hard to know exactly what the designers of the SQL language were thinking when they wrote the standard, but here's my opinion.

SQL, as a general rule, requires you to explicitly state your expectations and your intent. The language does not try to "guess what you meant", and automatically fill in the blanks. This is a good thing.

When you write a query, the most important consideration is that it yields correct results. If you make a mistake, it's probably better that the SQL parser informs you, rather than guessing your intent and returning results that may not be correct. The declarative nature of SQL (where you state what you want to retrieve rather than the steps for how to retrieve it) already makes it easy to make mistakes inadvertently. Introducing fuzziness into the language syntax would not make this better.

In fact, every case I can think of where the language allows for shortcuts has caused problems. Take, for instance, natural joins - where you can omit the names of the columns you want to join on and allow the database to infer them based on column names. Once the column names change (as they naturally do over time) - the semantics of existing queries changes with them. This is bad ... very bad - you really don't want this kind of magic happening behind the scenes in your database code.
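
To illustrate with a hypothetical schema: if customers and orders share only a customer_id column, NATURAL JOIN behaves as intended, but the moment someone adds a second column with the same name to both tables (say, last_updated), the same query silently starts joining on both columns, while the explicit form keeps meaning the same thing:

-- joins on customer_id today; on (customer_id, last_updated) after the schema change
SELECT *
FROM   customers
NATURAL JOIN orders;

-- explicit join condition: the intent survives later schema changes
SELECT o.*
FROM   customers c
JOIN   orders o ON o.customer_id = c.customer_id;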

One consequence of this design choice, however, is that SQL is a verbose language in which you must explicitly express your intent. This can result in having to write more code than you might like, and in griping about why certain constructs are so verbose ... but at the end of the day - it is what it is.


