Crosstab with 2 (Or More) Row Names

PostgreSQL crosstab query with multiple row name columns

crosstab() expects the following columns from its input query (1st parameter), in this order:

  1. a row_name
  2. (optional) extra columns
  3. a category (matching values in 2nd crosstab parameter)
  4. a value

You don't have a row_name. Add a surrogate row_name with the window function dense_rank().

Your question leaves room for interpretation. Let's add sample rows for demonstration:

INSERT INTO facts (eff_date, update_date, symbol_id, data_type_id, source_id)
VALUES
(now(), now(), 1, 5, 'foo')
, (now(), now(), 1, 6, 'foo')
, (now(), now(), 1, 7, 'foo')
, (now(), now(), 1, 6, 'bar')
, (now(), now(), 1, 7, 'bar')
, (now(), now(), 1, 23, 'bar')
, (now(), now(), 1, 5, 'baz')
, (now(), now(), 1, 23, 'baz'); -- only two rows for 'baz'
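
To make the role of dense_rank() concrete, here is a minimal Python sketch of what it computes for the sample rows: rows sharing the same (eff_date, symbol_id, source_id) receive the same surrogate number, and numbers are consecutive in sort order. The function name and the fixed placeholder date (standing in for now()) are assumptions for illustration only, not part of tablefunc.

```python
# Sketch of dense_rank() OVER (ORDER BY eff_date, symbol_id, source_id).
# Tuples are (eff_date, symbol_id, source_id, data_type_id); the date is a
# placeholder standing in for now().
facts = [
    ("2017-04-10", 1, "foo", 5),
    ("2017-04-10", 1, "foo", 6),
    ("2017-04-10", 1, "foo", 7),
    ("2017-04-10", 1, "bar", 6),
    ("2017-04-10", 1, "bar", 7),
    ("2017-04-10", 1, "bar", 23),
    ("2017-04-10", 1, "baz", 5),
    ("2017-04-10", 1, "baz", 23),
]


def dense_rank(keys):
    """Return a 1-based rank per key; equal keys share a rank, no gaps."""
    rank_of = {k: i for i, k in enumerate(sorted(set(keys)), start=1)}
    return [rank_of[k] for k in keys]


# One surrogate row_name per input row, in input order.
row_names = dense_rank([row[:3] for row in facts])
print(row_names)  # [3, 3, 3, 1, 1, 1, 2, 2]
```

'bar' sorts first, so its rows get 1, 'baz' gets 2 and 'foo' gets 3, which is the grouping crosstab() needs in its first column.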

Interpretation #1: first N values

You want to list the first N values of data_type_id (the smallest, if there are more) for each distinct (source_id, symbol_id, eff_date).

For this, you also need a synthetic category, which can be generated with row_number(). The basic query to produce input for crosstab():

SELECT dense_rank() OVER (ORDER BY eff_date, symbol_id, source_id)::int AS row_name
, eff_date, symbol_id, source_id -- extra columns
, row_number() OVER (PARTITION BY eff_date, symbol_id, source_id
ORDER BY data_type_id)::int AS category
, data_type_id AS value
FROM facts
ORDER BY row_name, category;

Crosstab query:

SELECT *
FROM crosstab(
'SELECT dense_rank() OVER (ORDER BY eff_date, symbol_id, source_id)::int AS row_name
, eff_date, symbol_id, source_id -- extra columns
, row_number() OVER (PARTITION BY eff_date, symbol_id, source_id
ORDER BY data_type_id)::int AS category
, data_type_id AS value
FROM facts
ORDER BY row_name, category'
, 'VALUES (1), (2), (3)'
) AS (row_name int, eff_date timestamp, symbol_id int, source_id char(3)
, datatype_1 int, datatype_2 int, datatype_3 int);

Results:


row_name | eff_date | symbol_id | source_id | datatype_1 | datatype_2 | datatype_3
-------: | :--------------| --------: | :-------- | ---------: | ---------: | ---------:
1 | 2017-04-10 ... | 1 | bar | 6 | 7 | 23
2 | 2017-04-10 ... | 1 | baz | 5 | 23 | null
3 | 2017-04-10 ... | 1 | foo | 5 | 6 | 7
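
As a cross-check of the logic (not of the tablefunc implementation), the same "first N values per group" pivot can be sketched in plain Python. The function name first_n_pivot and the placeholder date are assumptions made for this illustration.

```python
from itertools import groupby

# Tuples are (eff_date, symbol_id, source_id, data_type_id); the date is a
# placeholder standing in for now().
facts = [
    ("2017-04-10", 1, "foo", 5),
    ("2017-04-10", 1, "foo", 6),
    ("2017-04-10", 1, "foo", 7),
    ("2017-04-10", 1, "bar", 6),
    ("2017-04-10", 1, "bar", 7),
    ("2017-04-10", 1, "bar", 23),
    ("2017-04-10", 1, "baz", 5),
    ("2017-04-10", 1, "baz", 23),
]


def first_n_pivot(rows, n=3):
    """Per (eff_date, symbol_id, source_id), keep the n smallest values."""
    result = []
    # Sorting the full tuples orders by group key first, then by value,
    # which mirrors ORDER BY row_name, category in the SQL query.
    for key, group in groupby(sorted(rows), key=lambda r: r[:3]):
        values = [r[3] for r in group][:n]    # categories beyond n are dropped
        values += [None] * (n - len(values))  # missing categories become NULL
        result.append((*key, *values))
    return result


pivoted = first_n_pivot(facts)
for line in pivoted:
    print(line)
```

The three printed tuples correspond to the 'bar', 'baz' and 'foo' rows of the result above, with None where the table shows null.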

Interpretation #2: actual values in column names

You want to append actual values of data_type_id to the column names datatypeValue1, ... DatatypeValueN. One or more of these:

SELECT DISTINCT data_type_id FROM facts ORDER BY 1;

That yields 5, 6, 7, 23 in the example. The displayed values can then simply be boolean (or the redundant value itself). Basic query:

SELECT dense_rank() OVER (ORDER BY eff_date, symbol_id, source_id)::int AS row_name
, eff_date, symbol_id, source_id -- extra columns
, data_type_id AS category
, TRUE AS value
FROM facts
ORDER BY row_name, category;

Crosstab query:

SELECT *
FROM crosstab(
'SELECT dense_rank() OVER (ORDER BY eff_date, symbol_id, source_id)::int AS row_name
, eff_date, symbol_id, source_id -- extra columns
, data_type_id AS category
, TRUE AS value
FROM facts
ORDER BY row_name, category'
, 'VALUES (5), (6), (7), (23)' -- actual values
) AS (row_name int, eff_date timestamp, symbol_id int, source_id char(3)
, datatype_5 bool, datatype_6 bool, datatype_7 bool, datatype_23 bool);

Result:


eff_date | symbol_id | source_id | datatype_5 | datatype_6 | datatype_7 | datatype_23
:--------------| --------: | :-------- | :--------- | :--------- | :--------- | :----------
2017-04-10 ... | 1 | bar | null | t | t | t
2017-04-10 ... | 1 | baz | t | null | null | t
2017-04-10 ... | 1 | foo | t | t | t | null
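
The second interpretation amounts to a presence matrix: one column per actual data_type_id, true where the group contains that value. A minimal Python sketch of that idea follows; presence_pivot and the placeholder date are assumptions for illustration.

```python
# Tuples are (eff_date, symbol_id, source_id, data_type_id); the date is a
# placeholder standing in for now().
facts = [
    ("2017-04-10", 1, "foo", 5),
    ("2017-04-10", 1, "foo", 6),
    ("2017-04-10", 1, "foo", 7),
    ("2017-04-10", 1, "bar", 6),
    ("2017-04-10", 1, "bar", 7),
    ("2017-04-10", 1, "bar", 23),
    ("2017-04-10", 1, "baz", 5),
    ("2017-04-10", 1, "baz", 23),
]


def presence_pivot(rows, categories):
    """One output row per group; True where the category occurs, else None."""
    seen = {}
    for eff_date, symbol_id, source_id, data_type_id in rows:
        seen.setdefault((eff_date, symbol_id, source_id), set()).add(data_type_id)
    return [
        (*key, *(True if c in seen[key] else None for c in categories))
        for key in sorted(seen)
    ]


matrix = presence_pivot(facts, categories=[5, 6, 7, 23])
for line in matrix:
    print(line)
```

As in the SQL version, the category list is fixed up front, so a new data_type_id needs a new entry in that list (and a new output column).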


Related:

  • Crosstab function in Postgres returning a one row output when I expect multiple rows
  • Dynamic alternative to pivot with CASE and GROUP BY
  • Postgres - Transpose Rows to Columns

Pandas Crosstab when used with Pandas cut : Row names of the output crosstab are mismatched

This seems to be an issue with Pandas version 0.23.0. In the latest version, 0.23.2, it correctly assigns the labels.

import pandas as pd
values = [5100,5400,9400,10000,16000,10500,0,0,0,87500,14000,11250]
d = {'continuous': values,'label': [0]*6+[1]*6}
df = pd.DataFrame(data=d)
bins = [-1, 0, 5000000]
df['discrete'] = pd.cut(df['continuous'], bins=bins)

print(pd.crosstab(df.discrete, df.label))

Gives the desired output

label         0  1
discrete
(-1, 0]       0  3
(0, 5000000]  6  3

crosstabs with multiple top-levels rows/cols

For future reference, here's my very inelegant solution, using the tidyverse:

library(dplyr)   # bind_cols(), bind_rows()
library(tibble)  # rownames_to_column()

yelements <- c("Q1", "Q2", "Q3") # etc...
xelements <- c("region", "gender", "age_group", "martial_status")
Rows <- NULL # a helper table to create each set of rows individually
FullXTab <- NULL
for (i in yelements) { # names of factor columns that will form the rows of the xtabs
  for (w in xelements) { # names of factor columns that will form the columns of the xtabs
    x <- ftable(data[c(i, w)])
    x <- as.data.frame(as.matrix(x))
    # prefix each level with the variable name to tell the resulting columns apart
    names(x) <- paste(w, names(x), sep = "_")
    # blank items turn into meaningless "V1" columns, so replace that with NA
    names(x) <- gsub("V1", "NA", names(x))
    if (is.null(Rows)) {
      x <- rownames_to_column(x, "answer") # make the y-axis factor levels into their own column
      Rows <- x
    } else {
      Rows <- bind_cols(Rows, x)
    }
  }
  # name of the y-axis variable, to tell apart variables with similar levels, e.g. question numbers
  Rows$Q <- i
  if (is.null(FullXTab)) {
    FullXTab <- Rows
  } else {
    FullXTab <- bind_rows(FullXTab, Rows)
  }
  Rows <- NULL
}

For each element of yelements, this first creates a set of rows: one table per element of xelements, bound column-wise into one "wide" table. It then binds those sets of rows together into the full table.
I'm sure there's a cleaner way to do this...

cross table with a column containing multiple values in R

Here is one approach for you. You split genres with separate_rows() and create a temporary data frame. Then, you use table() as you did.

library(dplyr)
library(tidyr)

mydf %>%
separate_rows(genres, sep = ", ") -> foo

table(foo$genres, foo$class_rentabilite)

#           High Low Medium
# Action       1   0      1
# Adventure    0   0      1
# Comedy       1   0      0
# Crime        2   0      1
# Drama        3   1      1
# Sci-Fi       0   0      1
# Thriller     1   0      1

DATA

mydf <- structure(list(genres = c("Crime, Drama", "Action, Crime, Drama, Thriller", 
"Action, Adventure, Sci-Fi, Thriller", "Drama", "Crime, Drama",
"Comedy, Drama"), class_rentabilite = c("Medium", "High", "Medium",
"Low", "High", "High")), .Names = c("genres", "class_rentabilite"
), row.names = c(NA, -6L), class = "data.frame")
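
For comparison, the same split-then-count idea can be sketched with the Python standard library. This is only an illustration of the technique (split the multi-valued column into one row per value, then tabulate), not a replacement for the R answer; the variable names are assumptions.

```python
from collections import Counter

# Same six rows as mydf: (genres, class_rentabilite)
movies = [
    ("Crime, Drama", "Medium"),
    ("Action, Crime, Drama, Thriller", "High"),
    ("Action, Adventure, Sci-Fi, Thriller", "Medium"),
    ("Drama", "Low"),
    ("Crime, Drama", "High"),
    ("Comedy, Drama", "High"),
]

# Equivalent of separate_rows(): one (genre, class) pair per listed genre.
counts = Counter(
    (genre, rentability)
    for genres, rentability in movies
    for genre in genres.split(", ")
)

# Equivalent of table(): one line per genre, one count per class.
levels = sorted({rentability for _, rentability in movies})
print("genre", levels)
for genre in sorted({g for g, _ in counts}):
    print(genre, [counts[(genre, level)] for level in levels])
```

A Counter returns 0 for pairs that never occur, so empty cells need no special handling.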

Crosstab function in Postgres returning a one row output when I expect multiple rows

This query generates your desired output:

SELECT id, scenario, period, p1, p2, p3, p4  -- all except aux column rn
FROM crosstab(
'SELECT row_number() OVER (ORDER BY id, scenario, period)::int AS rn
, id, scenario, period, period, ct
FROM m
ORDER BY 1'
, 'VALUES (1), (2), (3), (4)'
) AS (rn int, id int, scenario int, period int, p1 int, p2 int, p3 int, p4 int);

Two special difficulties:

  1. You don't have a single unique column for the row_name yet. I use row_number() to generate the surrogate key: rn. I removed it from the outer SELECT to match your desired result.

    The way you tried it, id is taken to be the row_name and all input rows are aggregated into a single output row.

  2. You want additional columns (scenario and period) in the result, which must come after the row_name and before the category. You must list period twice: once as an extra column to keep the original value in the result, and once as the category, redundant as that may seem.

Basics:

  • PostgreSQL Crosstab Query

Related to this particular case:

  • Pivot on Multiple Columns using Tablefunc

Typically, you would have a query like this:

SELECT id, scenario, p1, p2, p3, p4  -- all except aux column rn
FROM crosstab(
'SELECT rank() OVER (ORDER BY id, scenario)::int AS rn
, id, scenario, period, ct
FROM m
ORDER BY 1'
, 'VALUES (1), (2), (3), (4)'
) AS (rn int, id int, scenario int, p1 int, p2 int, p3 int, p4 int);

With an output like this:

id   scenario   p1   p2   p3   p4
 2          1    1    1    1    1
 2          2    1    1    1    1
 2          3    1    1    1    1

Note the use of rank() instead of row_number() to group same combinations of (id, scenario) together.

The result makes more sense if counts are not all 1.
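
The difference between the two surrogate keys can be sketched in Python. The rows below are hypothetical (the original table m is not shown here), and row_number and rank are illustrative re-implementations of the SQL window functions, assuming no ties for row_number.

```python
# Hypothetical rows of m: (id, scenario, period, ct)
m = [
    (2, 1, 1, 1),
    (2, 1, 2, 1),
    (2, 2, 1, 1),
    (2, 3, 4, 1),
]


def row_number(keys):
    """SQL row_number(): a distinct 1-based position per row in sort order."""
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    numbers = [0] * len(keys)
    for n, i in enumerate(order, start=1):
        numbers[i] = n
    return numbers


def rank(keys):
    """SQL rank(): equal keys share the position of their first occurrence."""
    first_position = {}
    for n, key in enumerate(sorted(keys), start=1):
        first_position.setdefault(key, n)
    return [first_position[key] for key in keys]


# First query: every (id, scenario, period) row gets its own row_name.
per_row = row_number([r[:3] for r in m])
# Second query: rows with the same (id, scenario) share a row_name.
per_group = rank([r[:2] for r in m])
print(per_row)    # [1, 2, 3, 4]
print(per_group)  # [1, 1, 3, 4]
```

With row_number() each input row becomes its own output row; with rank() the two rows for (2, 1) collapse into one output row, and the gap after a tie (1, 1, 3) is harmless because the value only serves as a grouping key.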

Using crosstab on a query with a composite key (multiple columns)

Using ARRAY to solve the composite key problem

I think the real issue is that your sub_location is part of your primary identifier (name) for the purposes of crosstab, and not what crosstab calls an "extra" column.

The "extra" columns are expected to be the same for all rows with the same row_name value.

So in essence, composite keys forming a name must be serialized by the user. You can still make this work by serializing to an SQL ARRAY of type text[], using ARRAY[location, sub_location]::text[].

SELECT *
FROM crosstab(
$$ SELECT ARRAY[location, sub_location]::text[], step, amount FROM loc ORDER BY 1, 2, 3; $$,
$$ SELECT DISTINCT step FROM loc ORDER BY 1; $$
) AS t(location text[], step_1 int, step_2 int, step_3 int );

location | step_1 | step_2 | step_3
-------------+--------+--------+--------
{100,100_A} | 2 | 7 | 6
{100,100_B} | 5 | 8 | 9
(2 rows)
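
The same idea, a composite key folded into a single value that identifies the output row, can be sketched in Python with a tuple as the key. The input rows are reconstructed from the result shown above and are therefore an assumption about the loc table.

```python
# Assumed rows of loc: (location, sub_location, step, amount)
loc = [
    (100, "100_A", 1, 2),
    (100, "100_A", 2, 7),
    (100, "100_A", 3, 6),
    (100, "100_B", 1, 5),
    (100, "100_B", 2, 8),
    (100, "100_B", 3, 9),
]

# Counterpart of the category query: SELECT DISTINCT step ... ORDER BY 1
steps = sorted({step for _, _, step, _ in loc})

# The tuple (location, sub_location) plays the role of the text[] row name.
pivot = {}
for location, sub_location, step, amount in loc:
    row = pivot.setdefault((location, sub_location), dict.fromkeys(steps))
    row[step] = amount

for key in sorted(pivot):
    print(key, [pivot[key][step] for step in steps])
```

A tuple is hashable, so it can serve directly as one key, much like the array is passed to crosstab() as one row_name value.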

Exploiting your sublocation having the actual location in it

Now, because sub_location in your specific case already contains the location, we can make this even shorter by switching the order around. I wouldn't store sub_location with the 100_ prefix, but we can make use of that here. To be clear, this wouldn't work with location: 100, sub_location: 'A', which is the way I would store it.

SELECT *
FROM crosstab(
$$ SELECT sub_location, location, step, amount FROM loc ORDER BY 1, 2, 3; $$,
$$ SELECT DISTINCT step FROM loc ORDER BY 1; $$
) AS t(sub_location text, location int, step_1 int, step_2 int, step_3 int );

sub_location | location | step_1 | step_2 | step_3
--------------+----------+--------+--------+--------
100_A | 100 | 2 | 7 | 6
100_B | 100 | 5 | 8 | 9
(2 rows)

This eliminates the complexity of the call to ARRAY though.

Simplifying for your use case

We can also just drop location at this point, or switch the order around in a parent query.

SELECT *
FROM crosstab(
$$ SELECT sub_location, step, amount FROM loc ORDER BY 1, 2, 3; $$,
$$ SELECT DISTINCT step FROM loc ORDER BY 1; $$
) AS t(location_full text, step_1 int, step_2 int, step_3 int );

location_full | step_1 | step_2 | step_3
---------------+--------+--------+--------
100_A | 2 | 7 | 6
100_B | 5 | 8 | 9
(2 rows)

Not sure which method above works best for you. Don't forget to run CREATE EXTENSION tablefunc; first. Of course, it's subjective whether this is easier than the non-crosstab version.

Cross table with two datasets (one as the row and the other as the column)

Where both datasets are coming from the same relational data source, the simplest way to achieve this would normally be:

  • Replace the existing two datasets with a single dataset, in which the two original datasets are cross-joined to each other;
  • create a crosstab from the new dataset, with the new dataset columns as the data cube groups.
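
The cross join in the first step can be sketched as follows. The two datasets and their field names are hypothetical placeholders; in practice the cross join would normally be written in the data source's own query language.

```python
from itertools import product

# Hypothetical datasets: one supplies the crosstab rows, the other the columns.
row_dataset = ["North", "South"]
column_dataset = ["Q1", "Q2", "Q3"]

# Cross join: every row item paired with every column item.
cross_joined = [
    {"row_item": r, "column_item": c}
    for r, c in product(row_dataset, column_dataset)
]

for record in cross_joined:
    print(record)
```

The single combined dataset has len(row_dataset) * len(column_dataset) records, so every cell of the crosstab has a backing row even before any measure is attached.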

