Pivot table using crosstab and count
1. Static solution with a limited list of marking values:
SELECT year
, TO_CHAR( creation_date, 'Month') AS month
, COUNT(*) FILTER (WHERE marking = 'Delivered') AS Delivered
, COUNT(*) FILTER (WHERE marking = 'Not delivered') AS "Not delivered"
, COUNT(*) FILTER (WHERE marking = 'Not Received') AS "Not Received"
FROM invoices
GROUP BY 1,2
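The same conditional-count pivot can be emulated portably where FILTER is unavailable by using CASE inside the aggregate. A minimal sketch with Python's stdlib sqlite3 module and a hypothetical in-memory invoices table (SQLite lacks TO_CHAR, so strftime stands in for the month label):

```python
import sqlite3

# Hypothetical in-memory stand-in for the invoices table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (year INT, creation_date TEXT, marking TEXT)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [
        (2023, "2023-01-05", "Delivered"),
        (2023, "2023-01-09", "Delivered"),
        (2023, "2023-01-12", "Not delivered"),
        (2023, "2023-02-02", "Not Received"),
    ],
)

# SUM(CASE ...) is the portable equivalent of COUNT(*) FILTER (WHERE ...).
rows = conn.execute("""
    SELECT year,
           strftime('%m', creation_date) AS month,
           SUM(CASE WHEN marking = 'Delivered'     THEN 1 ELSE 0 END) AS delivered,
           SUM(CASE WHEN marking = 'Not delivered' THEN 1 ELSE 0 END) AS not_delivered,
           SUM(CASE WHEN marking = 'Not Received'  THEN 1 ELSE 0 END) AS not_received
    FROM invoices
    GROUP BY year, month
    ORDER BY year, month
""").fetchall()
print(rows)  # [(2023, '01', 2, 1, 0), (2023, '02', 0, 0, 1)]
```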
2. Full dynamic solution with a large list of marking values:
This proposal is an alternative to the crosstab solutions proposed in A and B. The solution proposed here only requires a dedicated composite type, which can be created dynamically; it then relies on the jsonb type and standard functions:
Starting from your query, which counts the number of rows per year, month and marking value:
- Using the jsonb_object_agg function, the resulting rows are first aggregated by year and month into jsonb objects whose jsonb keys correspond to the marking values and whose jsonb values correspond to the counts.
- The resulting jsonb objects are then converted into records using the jsonb_populate_record function and the dedicated composite type.
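In procedural terms, jsonb_object_agg builds one mapping per (year, month) group, and jsonb_populate_record then projects every mapping onto a fixed, ordered column list, with NULL for absent keys. A minimal Python sketch of those two steps, using hypothetical pre-aggregated rows (the output of the inner GROUP BY query):

```python
# Hypothetical pre-aggregated rows: (year, month, marking, count).
rows = [
    (2023, "January", "Delivered", 2),
    (2023, "January", "Not delivered", 1),
    (2023, "February", "Not Received", 1),
]

# Step 1 -- the jsonb_object_agg analogue: one {marking: count}
# mapping per (year, month) group.
groups = {}
for year, month, marking, count in rows:
    groups.setdefault((year, month), {})[marking] = count

# Step 2 -- the jsonb_populate_record analogue: project each mapping
# onto the ordered "composite type" of all distinct markings, with
# None standing in for SQL NULL where a marking is absent.
columns = sorted({marking for _, _, marking, _ in rows})
records = [
    (year, month, *[mapping.get(c) for c in columns])
    for (year, month), mapping in sorted(groups.items())
]
print(columns)
print(records)
```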
First we dynamically create a composite type which corresponds to the ordered list of marking values:
CREATE OR REPLACE PROCEDURE create_composite_type() LANGUAGE plpgsql AS $$
DECLARE
column_list text ;
BEGIN
SELECT string_agg(DISTINCT quote_ident(marking) || ' bigint', ',' ORDER BY quote_ident(marking) || ' bigint' ASC)
INTO column_list
FROM invoices ;
EXECUTE 'DROP TYPE IF EXISTS composite_type' ;
EXECUTE 'CREATE TYPE composite_type AS (' || column_list || ')' ;
END ;
$$ ;
CALL create_composite_type() ;
Then the expected result is provided by the following query:
SELECT a.year
, TO_CHAR(a.year_month, 'Month') AS month
, (jsonb_populate_record( null :: composite_type
, jsonb_object_agg(a.marking, a.count)
)
).*
FROM
( SELECT year
, date_trunc('month', creation_date) AS year_month
, marking
, count(*) AS count
FROM invoices AS v
GROUP BY 1,2,3
) AS a
GROUP BY 1,2
ORDER BY month
Obviously, if the list of marking values may vary over time, then you have to call the create_composite_type() procedure again just before executing the query. If you don't refresh composite_type, the query will still work (no error!), but some old marking values may be obsolete (not used anymore), and some new marking values may be missing from the query result (not displayed as columns).
See the full demo in dbfiddle.
PostgreSQL Crosstab - variable number of columns
As with all things imaginable and unimaginable, there is a way to do this with PostgreSQL. It looks like this:
WITH cte AS (
WITH minmax AS (
SELECT min(extract(month from order_date))::int,
max(extract(month from order_date))::int
FROM mytable
)
SELECT entity, mon, 0 AS qty
FROM (SELECT DISTINCT entity FROM mytable) entities,
(SELECT generate_series(min, max) AS mon FROM minmax) allmonths
UNION
SELECT entity, extract(month from order_date)::int, qty FROM mytable
)
SELECT entity, array_agg(sum) AS values
FROM (
SELECT entity, mon, sum(qty) FROM cte
GROUP BY 1, 2) sub
GROUP BY 1
ORDER BY 1;
A few words of explanation:
The standard way to produce an array inside a SQL statement is to use the array_agg() function. Your problem is that you have months without data, for which array_agg() happily produces nothing, leaving you with arrays of unequal length and no information on where in the time period the data comes from. You can solve this by adding 0's for every combination of 'entity' and the months in the period of interest. That is what this snippet of code does:
SELECT entity, mon, 0 AS qty
FROM (SELECT DISTINCT entity FROM mytable) entities,
     (SELECT generate_series(min, max) AS mon FROM minmax) allmonths
All those 0's are UNIONed with the actual data from 'mytable' and then (in the main query) you can first sum up the quantities by entity and month and subsequently aggregate those sums into an array for each entity. Since it is a double aggregation you need the sub-query. (You could also sum the quantities in the UNION, but then you would still need a sub-query because UNIONs don't allow aggregation.)
The minmax CTE can be adjusted to include the year as well (your sample data doesn't need it). Do note that the actual min and max values are immaterial to the index in the array: if min is 743, it will still occupy the first position in the array; those values are only used for GROUPing, not indexing.
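The pad-with-zeros idea is easy to sketch outside SQL as well. Assuming rows of (entity, month, qty), the cross product of all entities with the full month range is what guarantees equal-length arrays:

```python
# Hypothetical sample data: (entity, month, qty); month 2 has no rows.
data = [("A", 1, 10), ("A", 3, 5), ("B", 1, 7), ("B", 3, 2), ("B", 3, 1)]

months = range(min(m for _, m, _ in data), max(m for _, m, _ in data) + 1)
entities = sorted({e for e, _, _ in data})

# Equivalent of UNIONing a zero row for every (entity, month) pair:
# start every cell at 0, then add the real quantities on top.
sums = {(e, m): 0 for e in entities for m in months}
for e, m, q in data:
    sums[(e, m)] += q

# Equivalent of array_agg(): one equal-length list per entity.
result = {e: [sums[(e, m)] for m in months] for e in entities}
print(result)  # {'A': [10, 0, 5], 'B': [7, 0, 3]}
```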
See the demo in SQLFiddle.
For ease of use you could wrap this query up in a SQL language function with parameters for the starting and ending month. Adjust the minmax CTE to produce appropriate min and max values for the generate_series() call, and in the UNION filter the rows from 'mytable' to be considered.
Create Postgresql Crosstab Query with multiple categories
I found a solution, though it's not ideal: merge the first two columns with a unique separator character, run the crosstab query inside a CTE, and split the resulting column back apart. It looks something like:
WITH crosstb AS (
SELECT *
FROM crosstab(
$$
-- Concatenate date and category columns
SELECT date || '_' || category, subcategory, count(*)
-- ...
GROUP BY 1, 2
$$,
-- Specify data columns to ensure null-count rows are included
$$VALUES ('First'), ('Second'), ('Third')$$
)
AS ct(datecat text, First numeric, Second numeric, Third numeric)
)
SELECT
-- Split datecat column into separate again
(regexp_split_to_array(datecat, '_'))[1]::date as Date,
(regexp_split_to_array(datecat, '_'))[2] as category,
COALESCE(First, 0),
COALESCE(Second, 0),
COALESCE(Third, 0)
FROM crosstb;
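The merge-pivot-split pattern is not SQL-specific. A hedged Python sketch of the same workaround, with hypothetical data and subcategory names:

```python
from collections import defaultdict

# Hypothetical rows: (date, category, subcategory).
rows = [
    ("2023-01-01", "X", "First"),
    ("2023-01-01", "X", "First"),
    ("2023-01-01", "X", "Second"),
    ("2023-01-01", "Y", "Third"),
]
subcats = ["First", "Second", "Third"]

# Step 1: merge the two key columns with a separator character
# that cannot occur in either value.
counts = defaultdict(lambda: {s: 0 for s in subcats})
for date, cat, sub in rows:
    counts[f"{date}_{cat}"][sub] += 1

# Step 2: pivot, then split the merged key back into two columns.
table = []
for key in sorted(counts):
    date, cat = key.split("_", 1)
    table.append((date, cat, *[counts[key][s] for s in subcats]))
print(table)
```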
Can the categories in the postgres tablefunc crosstab() function be integers?
If I understand correctly your tables look something like this:
CREATE TABLE lab_tests (
id INTEGER PRIMARY KEY,
name VARCHAR(500)
);
CREATE TABLE lab_tests_results (
id INTEGER PRIMARY KEY,
lab_tests_id INTEGER REFERENCES lab_tests (id),
result TEXT
);
And your data looks something like this:
INSERT INTO lab_tests (id, name)
VALUES (1, 'test1'),
(2, 'test2');
INSERT INTO lab_tests_results (id, lab_tests_id, result)
VALUES (1,1,'result1'),
(2,1,'result2'),
(3,2,'result3'),
(4,2,'result4'),
(5,2,'result5');
First of all, crosstab is part of tablefunc, so you need to enable it:
CREATE EXTENSION tablefunc;
You only need to run it once per database, as per this answer.
The final query will look like this:
SELECT *
FROM crosstab(
'SELECT lt.name::TEXT, lt.id, ltr.result
FROM lab_tests AS lt
JOIN lab_tests_results ltr ON ltr.lab_tests_id = lt.id'
) AS ct(test_name text, result_1 text, result_2 text, result_3 text);
Explanation:
The crosstab() function takes the text of a query which should return 3 columns: (1) a row name, (2) a category for grouping, and (3) a value. The wrapping query just selects all the values that crosstab() returns and defines the list of columns afterwards (the part after AS). First is the row name (test_name) and then the values (result_1, result_2, result_3). In my query I'll get up to 3 results: if I have more than 3 results, I won't see them; if I have fewer than 3 results, I'll get nulls.
The result for this query is:
test_name |result_1 |result_2 |result_3
---------------------------------------
test1 |result1 |result2 |<null>
test2 |result3 |result4 |result5
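To make the row-name / category / value contract concrete, here is a small Python emulation of what crosstab() does with that three-column input: collect the values per row name, pad short rows with None (SQL NULL), and drop anything beyond the fixed output column count (the hypothetical data mirrors the tables above):

```python
# Hypothetical three-column crosstab input: (row_name, category, value).
rows = [
    ("test1", 1, "result1"),
    ("test1", 1, "result2"),
    ("test2", 2, "result3"),
    ("test2", 2, "result4"),
    ("test2", 2, "result5"),
]
NCOLS = 3  # like result_1 .. result_3 in the AS ct(...) clause

pivot = {}
for name, _cat, value in rows:
    pivot.setdefault(name, []).append(value)

# Pad with None and truncate to NCOLS, mirroring crosstab's fixed
# output column list.
table = [(name, *(vals + [None] * NCOLS)[:NCOLS]) for name, vals in pivot.items()]
print(table)
```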
How to get a cross tabulation with pandas crosstab that would display the frequency of multiple values of a column variable?
Use sum on a subset of the columns. Note that with a small random df the sampled values will differ from run to run, so the column values would be different; use np.random.seed(100) to get the same test output as in my answer.
table['(1+2+3)'] = table[[1,2,3]].sum(axis=1)
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(0,5, size=(5,6)), columns=list('ABCDEF'))
table = pd.crosstab(index=df['A'], columns=df['B'])
table['(1+2+3)'] = table[[1,2,3]].sum(axis=1)
print (table)
B 0 1 2 3 4 (1+2+3)
A
0 1 0 0 0 1 0
1 0 0 0 1 0 1
2 0 0 1 0 0 1
3 0 1 0 0 0 1
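One caveat with table[[1,2,3]]: if one of those values never occurs in df['B'], the corresponding column is missing from the crosstab and the subset raises a KeyError. A defensive variant of the same computation (assuming the same setup as above) uses reindex to supply missing columns filled with 0:

```python
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randint(0, 5, size=(5, 6)), columns=list('ABCDEF'))
table = pd.crosstab(index=df['A'], columns=df['B'])

# reindex inserts any missing column filled with 0 instead of raising,
# so the row sum is always defined even for absent B values.
table['(1+2+3)'] = table.reindex(columns=[1, 2, 3], fill_value=0).sum(axis=1)
print(table)
```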