Crosstab with a Large or Undefined Number of Categories

Pivot table using crosstab and count

1. Static solution with a limited list of marking values:

SELECT year
     , TO_CHAR(creation_date, 'Month') AS month
     , COUNT(*) FILTER (WHERE marking = 'Delivered') AS "Delivered"
     , COUNT(*) FILTER (WHERE marking = 'Not delivered') AS "Not delivered"
     , COUNT(*) FILTER (WHERE marking = 'Not Received') AS "Not Received"
  FROM invoices
 GROUP BY 1, 2;
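The FILTER clause makes each aggregate count only the rows matching its predicate. The same conditional-counting logic can be sketched in Python (sample rows invented for illustration):

```python
from collections import Counter

# Invented sample rows: (year, month, marking)
rows = [
    (2023, "January",  "Delivered"),
    (2023, "January",  "Delivered"),
    (2023, "January",  "Not delivered"),
    (2023, "February", "Not Received"),
]

# COUNT(*) FILTER (WHERE marking = m) per (year, month) group
counts = Counter((year, month, marking) for year, month, marking in rows)

print(counts[(2023, "January", "Delivered")])      # 2
print(counts[(2023, "January", "Not delivered")])  # 1
```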

2. Fully dynamic solution with a large list of marking values:

This proposal is an alternative to the crosstab solution as proposed in A and B.

The solution proposed here only requires a dedicated composite type, which can be created dynamically; it then relies on the jsonb type and standard functions:

Starting from your query, which counts the number of rows per year, month and marking value:

  • Using the jsonb_object_agg function, the resulting rows are first
    aggregated by year and month into jsonb objects whose keys
    correspond to the marking values and whose values correspond to
    the counts.
  • The resulting jsonb objects are then converted into records using
    the jsonb_populate_record function and the dedicated composite type.
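Sketched in Python (purely illustrative, not the SQL itself): the jsonb_object_agg step builds one mapping per (year, month), and the jsonb_populate_record step projects each mapping onto the fixed, ordered column list of the composite type, so missing markings come out as NULL/None:

```python
# Invented per-group counts, as produced by the inner GROUP BY
grouped = [
    (2023, "January",  "Delivered",    5),
    (2023, "January",  "Not Received", 2),
    (2023, "February", "Delivered",    3),
]

# Step 1: the jsonb_object_agg equivalent - one dict per (year, month)
objects = {}
for year, month, marking, count in grouped:
    objects.setdefault((year, month), {})[marking] = count

# Step 2: the jsonb_populate_record equivalent - project each dict
# onto the fixed column list of the composite type (missing -> None)
composite_type = ["Delivered", "Not delivered", "Not Received"]
records = {
    key: tuple(obj.get(col) for col in composite_type)
    for key, obj in objects.items()
}

print(records[(2023, "January")])  # (5, None, 2)
```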

First we dynamically create a composite type which corresponds to the ordered list of marking values:

CREATE OR REPLACE PROCEDURE create_composite_type() LANGUAGE plpgsql AS $$
DECLARE
  column_list text;
BEGIN
  SELECT string_agg(DISTINCT quote_ident(marking) || ' bigint', ','
                    ORDER BY quote_ident(marking) || ' bigint')
    INTO column_list
    FROM invoices;

  EXECUTE 'DROP TYPE IF EXISTS composite_type';
  EXECUTE 'CREATE TYPE composite_type AS (' || column_list || ')';
END;
$$;

CALL create_composite_type();

Then the expected result is provided by the following query:

SELECT a.year
     , TO_CHAR(a.year_month, 'Month') AS month
     , (jsonb_populate_record( NULL :: composite_type
                             , jsonb_object_agg(a.marking, a.count)
                             )
       ).*
  FROM
     ( SELECT year
            , date_trunc('month', creation_date) AS year_month
            , marking
            , count(*) AS count
         FROM invoices
        GROUP BY 1, 2, 3
     ) AS a
 GROUP BY 1, 2
 ORDER BY month;

Obviously, if the list of marking values can vary over time, you have to call the create_composite_type() procedure again just before executing the query. If you don't refresh composite_type, the query will still work (no error!), but obsolete marking values may linger as columns that are no longer used, and new marking values may be missing from the query result (not displayed as columns).

See the full demo in dbfiddle.

PostgreSQL Crosstab - variable number of columns

As with all things imaginable and unimaginable, there is a way to do this with PostgreSQL. It looks like this:

WITH cte AS (
    WITH minmax AS (
        SELECT min(extract(month FROM order_date))::int,
               max(extract(month FROM order_date))::int
        FROM mytable
    )
    SELECT entity, mon, 0 AS qty
    FROM (SELECT DISTINCT entity FROM mytable) entities,
         (SELECT generate_series(min, max) AS mon FROM minmax) allmonths
    UNION
    SELECT entity, extract(month FROM order_date)::int, qty FROM mytable
)
SELECT entity, array_agg(sum ORDER BY mon) AS values
FROM (
    SELECT entity, mon, sum(qty)
    FROM cte
    GROUP BY 1, 2) sub
GROUP BY 1
ORDER BY 1;

A few words of explanation:

The standard way to produce an array inside a SQL statement is the array_agg() function. Your problem is that you have months without data, and for those array_agg() happily produces nothing, leaving you with arrays of unequal length and no information on where in the time period each value belongs. You can solve this by adding 0's for every combination of 'entity' and the months in the period of interest. That is what this snippet of code does:

SELECT entity, mon, 0 AS qty
FROM (SELECT DISTINCT entity FROM mytable) entities,
     (SELECT generate_series(min, max) AS mon FROM minmax) allmonths

All those 0's are UNIONed to the actual data from 'mytable' and then (in the main query) you can first sum up the quantities by entity and month and subsequently aggregate those sums into an array for each entity. Since it is a double aggregation you need the sub-query. (You could also sum the quantities in the UNION but then you would also need a sub-query because UNIONs don't allow aggregation.)

The minmax CTE can be adjusted to include the year as well (your sample data doesn't need it). Do note that the actual min and max values are immaterial to the index in the array: if min is 743 it will still occupy the first position in the array; those values are only used for GROUPing, not indexing.
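The zero-filling and double aggregation described above can be sketched in Python (sample data and names invented for illustration):

```python
from itertools import product

# Invented data: (entity, month, qty)
data = [("A", 1, 10), ("A", 1, 5), ("A", 3, 7), ("B", 2, 4)]

entities = {e for e, _, _ in data}
months = range(min(m for _, m, _ in data), max(m for _, m, _ in data) + 1)

# The UNION of zero rows with the real rows, as in the cte
rows = [(e, m, 0) for e, m in product(entities, months)] + data

# First aggregation: SUM(qty) per (entity, month)
sums = {}
for e, m, q in rows:
    sums[(e, m)] = sums.get((e, m), 0) + q

# Second aggregation: array_agg of the sums per entity, ordered by month
arrays = {e: [sums[(e, m)] for m in months] for e in entities}

print(arrays["A"])  # [15, 0, 7]
print(arrays["B"])  # [0, 4, 0]
```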

SQLFiddle

For ease of use you could wrap this query up in a SQL language function with parameters for the starting and ending month. Adjust the minmax CTE to produce appropriate min and max values for the generate_series() call and in the UNION filter the rows from 'mytable' to be considered.

Create Postgresql Crosstab Query with multiple categories

I found a solution, though it's not ideal: merge the first two columns with a unique separator, run the crosstab query in a CTE, and split the resulting column back apart. It looks something like:

WITH crosstb AS (
    SELECT *
    FROM crosstab(
        $$
        -- Concatenate date and category columns
        SELECT date || '_' || category, subcategory, count(*)
        -- ...
        GROUP BY 1, 2
        $$,
        -- Specify the value columns so rows with null counts are included
        $$VALUES ('First'), ('Second'), ('Third')$$
    ) AS ct(datecat text, First bigint, Second bigint, Third bigint)
)
SELECT
    -- Split the datecat column back into separate columns again
    (regexp_split_to_array(datecat, '_'))[1]::date AS date,
    (regexp_split_to_array(datecat, '_'))[2] AS category,
    COALESCE(First, 0),
    COALESCE(Second, 0),
    COALESCE(Third, 0)
FROM crosstb;
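The merge-then-split trick (a composite key built with a separator) can be sketched in Python with invented values:

```python
# Build the composite row key, as in the source query of the crosstab
date, category = "2024-01-05", "electronics"
datecat = date + "_" + category

# ... the crosstab pivots on datecat ...

# Split it back into the two original columns afterwards
parts = datecat.split("_")
print(parts)  # ['2024-01-05', 'electronics']
```

Note that this relies on the separator never appearing in either column; the `regexp_split_to_array` call in the SQL version has the same constraint.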

Can the categories in the postgres tablefunc crosstab() function be integers?

If I understand correctly your tables look something like this:

CREATE TABLE lab_tests (
id INTEGER PRIMARY KEY,
name VARCHAR(500)
);

CREATE TABLE lab_tests_results (
id INTEGER PRIMARY KEY,
lab_tests_id INTEGER REFERENCES lab_tests (id),
result TEXT
);

And your data looks something like this:

INSERT INTO lab_tests (id, name) 
VALUES (1, 'test1'),
(2, 'test2');

INSERT INTO lab_tests_results (id, lab_tests_id, result)
VALUES (1,1,'result1'),
(2,1,'result2'),
(3,2,'result3'),
(4,2,'result4'),
(5,2,'result5');

First of all, crosstab is part of the tablefunc extension; you need to enable it:

CREATE EXTENSION tablefunc;

You only need to run this once per database, as per this answer.

The final query will look like this:

SELECT *
FROM crosstab(
    'SELECT lt.name::TEXT, lt.id, ltr.result
     FROM lab_tests AS lt
     JOIN lab_tests_results ltr ON ltr.lab_tests_id = lt.id'
) AS ct(test_name text, result_1 text, result_2 text, result_3 text);

Explanation:
The crosstab() function takes the text of a query which should return 3 columns: (1) a row name, (2) a category, and (3) a value. The wrapping query selects everything crosstab() returns and defines the list of output columns after AS: first the row name (test_name), then the value columns (result_1, result_2, result_3). This query yields up to 3 results per test. If there are more than 3 results, the extra ones are silently dropped; if there are fewer than 3, the remaining columns are filled with nulls.

The result for this query is:

test_name | result_1 | result_2 | result_3
----------+----------+----------+---------
test1     | result1  | result2  | <null>
test2     | result3  | result4  | result5
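The fixed-width behavior described above (extra values silently dropped, missing values padded with nulls) can be sketched in Python:

```python
# Invented results per test, as returned by the inner query
results = {
    "test1": ["result1", "result2"],
    "test2": ["result3", "result4", "result5", "result6"],
}

N_COLS = 3  # the number of value columns declared after AS ct(...)

# crosstab keeps at most N_COLS values per row and pads with NULL
table = {
    name: (vals + [None] * N_COLS)[:N_COLS]
    for name, vals in results.items()
}

print(table["test1"])  # ['result1', 'result2', None]
print(table["test2"])  # ['result3', 'result4', 'result5']
```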

How to get a cross tabulation with pandas crosstab that would display the frequency of multiple values of a column variable?

Use sum on a subset of the columns. Note that with a small random DataFrame the generated values differ between runs, so the columns present can differ too; with np.random.seed(100) you get the same test output as in my answer.

table['(1+2+3)'] = table[[1,2,3]].sum(axis=1)

Sample:

np.random.seed(100)
df = pd.DataFrame(np.random.randint(0,5, size=(5,6)), columns=list('ABCDEF'))
table = pd.crosstab(index=df['A'], columns=df['B'])
table['(1+2+3)'] = table[[1,2,3]].sum(axis=1)
print (table)
B  0  1  2  3  4  (1+2+3)
A
0  1  0  0  0  1        0
1  0  0  0  1  0        1
2  0  0  1  0  0        1
3  0  1  0  0  0        1

