Matplotlib: Group Boxplots

matplotlib: Group boxplots

How about using colors to differentiate between "apples" and "oranges" and spacing to separate "A", "B" and "C"?

Something like this:

from pylab import plot, show, savefig, xlim, figure, \
hold, ylim, legend, boxplot, setp, axes

# function for setting the colors of the box plots pairs
def setBoxColors(bp):
setp(bp['boxes'][0], color='blue')
setp(bp['caps'][0], color='blue')
setp(bp['caps'][1], color='blue')
setp(bp['whiskers'][0], color='blue')
setp(bp['whiskers'][1], color='blue')
setp(bp['fliers'][0], color='blue')
setp(bp['fliers'][1], color='blue')
setp(bp['medians'][0], color='blue')

setp(bp['boxes'][1], color='red')
setp(bp['caps'][2], color='red')
setp(bp['caps'][3], color='red')
setp(bp['whiskers'][2], color='red')
setp(bp['whiskers'][3], color='red')
setp(bp['fliers'][2], color='red')
setp(bp['fliers'][3], color='red')
setp(bp['medians'][1], color='red')

# Some fake data to plot
A= [[1, 2, 5,], [7, 2]]
B = [[5, 7, 2, 2, 5], [7, 2, 5]]
C = [[3,2,5,7], [6, 7, 3]]

fig = figure()
ax = axes()
hold(True)

# first boxplot pair
bp = boxplot(A, positions = [1, 2], widths = 0.6)
setBoxColors(bp)

# second boxplot pair
bp = boxplot(B, positions = [4, 5], widths = 0.6)
setBoxColors(bp)

# thrid boxplot pair
bp = boxplot(C, positions = [7, 8], widths = 0.6)
setBoxColors(bp)

# set axes limits and labels
xlim(0,9)
ylim(0,9)
ax.set_xticklabels(['A', 'B', 'C'])
ax.set_xticks([1.5, 4.5, 7.5])

# draw temporary red and blue lines and use them to create a legend
hB, = plot([1,1],'b-')
hR, = plot([1,1],'r-')
legend((hB, hR),('Apples', 'Oranges'))
hB.set_visible(False)
hR.set_visible(False)

savefig('boxcompare.png')
show()

grouped box plot

Python Side by side box plots after groupby in Matplotlib

You could add a position to the call to plt.boxplot() to avoid that the boxplot will be drawn twice at the same position:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.rand(10, 1), columns=['A'])
df['B'] = [1, 2, 1, 1, 1, 2, 1, 2, 2, 1]
for n, grp in df.groupby('B'):
plt.boxplot(x='A', data=grp, positions=[n])
plt.xticks([1, 2], ['Label 1', 'Label 2'])
plt.show()

boxplot from groupby

Or you could use seaborn let the grouping. You can replace the numbers in the column to have labels.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame(np.random.rand(10, 1), columns=['A'])
df['B'] = [1, 2, 1, 1, 1, 2, 1, 2, 2, 1]
df['B'] = df['B'].replace({1: 'Label 1', 2: 'Label 2'})
sns.boxplot(data=df, x='B', y='A' )
plt.show()

seaborn boxplot

Boxplot by two groups in pandas

As @Prune mentioned, the immediate issue is that your groupby() returns four groups (AX, AY, BX, BY), so first fix the indexing and then clean up a couple more issues:

  1. Change axs[i] to axs[i//2] to put groups 0 and 1 on axs[0] and groups 2 and 3 on axs[1].
  2. Add positions=[i] to place the boxplots side by side rather than stacked.
  3. Set the title and xticklabels after plotting (I'm not aware of how to do this in the main loop).
for i, g in enumerate(df_plots.groupby(['Group', 'Type'])):
g[1].boxplot(ax=axs[i//2], positions=[i])

for i, ax in enumerate(axs):
ax.set_title('Group: ' + df_plots['Group'].unique()[i])
ax.set_xticklabels(['Type: X', 'Type: Y'])

boxplot output


Note that mileage may vary depending on version:






















matplotlib.__version__pd.__version__
confirmed working3.4.21.3.1
confirmed not working3.0.11.2.4

Grouped boxplot with 2 y axes 2 variables per x tick

By testing your code and comparing it to the answer by Thomas Kühn in the linked question, I see several things that stand out:

  • the data you input for the x parameter has a 1-D shape instead of 2-D. You input one variable so you get one box instead of the three you actually want;
  • the positions argument has not been defined, which causes the boxes of both boxplots to overlap;
  • in the first for loop over res1, the color argument in plt.setp is missing;
  • you have set x tick labels without first setting the x ticks (as cautioned here) which causes an error message.

I offer the following solution which is based more on this answer by ImportanceOfBeingErnest. It solves the issue of shaping the data correctly and it makes use of dictionaries to define many of the parameters that are shared by multiple objects in the plot. This makes it easier to adjust the format to your taste and also makes the code cleaner as it avoids the need for the for loops (over the boxplot elements and the res objects) and the repetition of arguments in functions that share the same parameters.

import numpy as np                 # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2

# Create a random dataset similar to the one in the image you shared
rng = np.random.default_rng(seed=123) # random number generator
data = dict(per = np.repeat([20, 40, 60], [60, 30, 10]),
cap = rng.choice([70, 90, 220, 240, 320, 330, 340, 360, 410], size=100),
cost = rng.integers(low=2050, high=2250, size=100))
df = pd.DataFrame(data)

# Pivot table according to the 'per' categories so that the cap and
# cost variables are grouped by them:
df_pivot = df.pivot(columns=['per'])

# Create a list of the cap and cost grouped variables to be plotted
# in each (twinned) boxplot: note that the NaN values must be removed
# for the plotting function to work.
cap = [df_pivot['cap'][var].dropna() for var in df_pivot['cap']]
cost = [df_pivot['cost'][var].dropna() for var in df_pivot['cost']]

# Create figure and dictionary containing boxplot parameters that are
# common to both boxplots (according to my style preferences):
# note that I define the whis parameter so that values below the 5th
# percentile and above the 95th percentile are shown as outliers
nb_groups = df['per'].nunique()
fig, ax1 = plt.subplots(figsize=(9,6))
box_param = dict(whis=(5, 95), widths=0.2, patch_artist=True,
flierprops=dict(marker='.', markeredgecolor='black',
fillstyle=None), medianprops=dict(color='black'))

# Create boxplots for 'cap' variable: note the double asterisk used
# to unpack the dictionary of boxplot parameters
space = 0.15
ax1.boxplot(cap, positions=np.arange(nb_groups)-space,
boxprops=dict(facecolor='tab:blue'), **box_param)

# Create boxplots for 'cost' variable on twin Axes
ax2 = ax1.twinx()
ax2.boxplot(cost, positions=np.arange(nb_groups)+space,
boxprops=dict(facecolor='tab:orange'), **box_param)

# Format x ticks
labelsize = 12
ax1.set_xticks(np.arange(nb_groups))
ax1.set_xticklabels([f'{label}%' for label in df['per'].unique()])
ax1.tick_params(axis='x', labelsize=labelsize)

# Format y ticks
yticks_fmt = dict(axis='y', labelsize=labelsize)
ax1.tick_params(colors='tab:blue', **yticks_fmt)
ax2.tick_params(colors='tab:orange', **yticks_fmt)

# Format axes labels
label_fmt = dict(size=12, labelpad=15)
ax1.set_xlabel('Percentage of RES integration', **label_fmt)
ax1.set_ylabel('Production Capacity (MW)', color='tab:blue', **label_fmt)
ax2.set_ylabel('Costs (Rupees)', color='tab:orange', **label_fmt)

plt.show()

boxplots_grouped_twinx

Matplotlib documentation: boxplot demo, boxplot function parameters, marker symbols for fliers, label text formatting parameters

Considering that it is quite an effort to set this up, if I were to do this for myself, I would go for side-by-side subplots instead of creating twinned Axes. This can be done quite easily in seaborn using the catplot function which takes care of a lot of the formatting automatically. Seeing as there are only three categories per variable, it is relatively easy to compare the boxplots side-by-side using a different color for each percentage category, as illustrated with this example based on the same data:

import seaborn as sns    # v 0.11.0

# Convert dataframe to long format with 'per' set aside as a grouping variable
df_melt = df.melt(id_vars='per')

# Create side-by-side boxplots of each variable: note that the boxes
# are colored by default
g = sns.catplot(kind='box', data=df_melt, x='per', y='value', col='variable',
height=4, palette='Blues', sharey=False, saturation=1,
width=0.3, fliersize=2, linewidth=1, whis=(5, 95))

g.fig.subplots_adjust(wspace=0.4)
g.set_titles(col_template='{col_name}', size=12, pad=20)

# Format Axes labels
label_fmt = dict(size=10, labelpad=10)
for ax in g.axes.flatten():
ax.set_xlabel('Percentage of RES integration', **label_fmt)
g.axes.flatten()[0].set_ylabel('Production Capacity (MW)', **label_fmt)
g.axes.flatten()[1].set_ylabel('Costs (Rupees)', **label_fmt)

plt.show()

boxplots_sbs

Pandas Grouped Boxplot by Category to Compare 3 Datasets Matplotlib

You can do groupby first then do boxplot on obs, raw, adj columns.

To customize the plot property, you can add return_type='axes' to get matplotlib axes, then you can call the customization function on the axes.

This pandas boxplot makes shared y-axis plot which means you cannot have different y limit in each plot.

I don't know how to disable the shared y-axis in this solution, however, I know how to achieve dynamic y-axis limits in alternative solution. (Please see Update below)

axes = df.groupby('site_name').boxplot(column=['obs','raw','adj'], return_type='axes')

for ax in axes:
ax.set_ylim(0, 20) # Change y-axis limit to 0 - 20

================================================

Update:

If you would like to apply different y limits to each plot, try with subplots in matplotlib.

import matplotlib.pyplot as plt

# Create 2 subplots for SITE A, SITE B in 1 row. By default, this will create 2 subplots that does not share the y-axis.
# To make shared y-axis, use plt.subplots(1, 2, sharey=True)
fig, axs = plt.subplots(1, 2)

# Create a boxplot per site_name and render in the subplots by passing ax=axs
df.groupby('site_name').boxplot(column=['obs', 'raw', 'adj'], ax=axs)

# Apply different y limits
axs[0].set_ylim(0, 50) # For SITE A
axs[1].set_ylim(0, 20) # For SITE B

# display my plots
fig

Sample Image

Using matplotlib to plot a multiple boxplots

You can concat with the city names as MultiIndex, and use seaborn.catplot to plot:

df = pd.concat(dict(zip(city_df['Name'], l)), names=['city']).reset_index(level=0)

import seaborn as sns
sns.catplot(data=df, row=1, x='city', y=2, kind='box', sharey=False)

output:

           city  0        1        2
0 San Franciso 1 kids 0.00094
1 San Franciso 2 adult 0.00120
2 San Franciso 3 elderly 0.00114
3 San Franciso 5 kids 0.00088
4 San Franciso 6 adult 0.00113
5 San Franciso 7 elderly 0.00105
0 Paris 1 kids 0.00044
1 Paris 2 adult 0.00120
2 Paris 3 elderly 0.00114
3 Paris 5 kids 0.00088
4 Paris 6 adult 0.00113
5 Paris 7 elderly 0.00105
0 Tokyo 1 kids 0.00394
1 Tokyo 2 adult 0.00120
2 Tokyo 3 elderly 0.00114
3 Tokyo 5 kids 0.00588
4 Tokyo 6 adult 0.00113
5 Tokyo 7 elderly 0.00105
0 London 1 kids 0.00074
1 London 2 adult 0.00120
2 London 3 elderly 0.00114
3 London 5 kids 0.00088
4 London 6 adult 0.00113
5 London 7 elderly 0.00105
0 Barcelona 1 kids 0.00095
1 Barcelona 2 adult 0.00120
2 Barcelona 3 elderly 0.00114
3 Barcelona 5 kids 0.00043
4 Barcelona 6 adult 0.00113
5 Barcelona 7 elderly 0.00105

Sample Image

Add aggregate of all data to boxplots

You can use the gridspec_kw the set the ratios between the plots (e.g. [1,4] as one subplot has 4 times as many boxes). The spacing between the subplots can be fine-tuned via hspace. axes[0].set_yticklabels() lets you set the label.

import matplotlib.pyplot as plt
import seaborn as sns

data = {"domain": ["econ", "econ", "public_affairs", "culture", "communication", "public_affairs", "communication", "culture", "public_affairs", "econ", "culture", "econ", "communication"],
"score": [0.25, 0.3, 0.5684, 0.198, 0.15, 0.486, 0.78, 0.84, 0.48, 0.81, 0.1, 0.23, 0.5]}
fig, axes = plt.subplots(2, 1, sharex=True,
gridspec_kw={'height_ratios': [1, 4], 'hspace': 0})
sns.set_style('white')
sns.boxplot(ax=axes[0], data=data["score"], orient="h", color='0.6')
axes[0].set_yticklabels(['All'])
sns.boxplot(ax=axes[1], x="score", y="domain", palette='Set2', data=data)
plt.tight_layout()
plt.show()

sns.boxplot together with overall box

An alternative approach is to concatenate the data with a copy and a label that's "All" everywhere. For a pandas dataframe you could use df.copy() and pd.concat(). With just a dictionary of lists, you could simply duplicate the lists.

This way all boxes have exactly the same thickness. As it uses just one ax, it combines more easily with other subplots.

import matplotlib.pyplot as plt
import seaborn as sns

data = {"domain": ["econ", "econ", "public_affairs", "culture", "communication", "public_affairs", "communication", "culture", "public_affairs", "econ", "culture", "econ", "communication"],
"score": [0.25, 0.3, 0.5684, 0.198, 0.15, 0.486, 0.78, 0.84, 0.48, 0.81, 0.1, 0.23, 0.5]}

data_concatenated = {"domain": ['All'] * len(data["domain"]) + data["domain"],
"score": data["score"] * 2}

sns.set_style('darkgrid')
palette = ['yellow'] + list(plt.cm.Set2.colors)
ax = sns.boxplot(x="score", y="domain", palette=palette, data=data_concatenated)
ax.axhline(0.5, color='0.5', ls=':')
plt.tight_layout()
plt.show()

sns.boxplot of concatenated dataframe

Here is another example, working with pandas and seaborn's flights dataset. It shows different ways to make the summary stand out without adding an extra horizontal line:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

flights = sns.load_dataset('flights')
flights_all = flights.copy()
flights_all['year'] = 'All'

sns.set_style('darkgrid')
palette = ['crimson'] + sns.color_palette('crest', len(flights['year'].unique()))

ax = sns.boxplot(x="passengers", y="year", palette=palette, orient='h', data=pd.concat([flights_all, flights]))
ax.axhspan(-0.5, 0.5, color='0.85', zorder=-1)
# ax.axhline(0.5, color='red', ls=':') # optional separator line
# ax.get_yticklabels()[0].set_color('crimson')
ax.get_yticklabels()[0].set_weight('bold')
plt.tight_layout()
plt.show()

sns.boxplots with summary, flights dataset

How to make a nested / grouped boxplot using seaborn of a 3D array

Is this what you want?

Sample Image

While I think seaborn can handle wide data, I personally find it easier to work with "tidy data" (or long data). To convert your dataframe from the "wide" to "long" you can use DataFrame.melt and make sure to preserve your input.

So

>>> mses_frame.melt(ignore_index=False)

variable value
n_comps samples
3 sample_0 obj_0 0.424960
sample_1 obj_0 0.758884
sample_2 obj_0 0.408663
sample_3 obj_0 0.440811
sample_4 obj_0 0.112798
... ... ...
15 sample_95 obj_13 0.172044
sample_96 obj_13 0.381045
sample_97 obj_13 0.364024
sample_98 obj_13 0.737742
sample_99 obj_13 0.762252

[7000 rows x 2 columns]

Again, seaborn probably can work with this somehow (maybe someone else can comment on this) but I find it easier to reset the index so your multi indices become columns

>>> mses_frame.melt(ignore_index=False).reset_index()

n_comps samples variable value
0 3 sample_0 obj_0 0.424960
1 3 sample_1 obj_0 0.758884
2 3 sample_2 obj_0 0.408663
3 3 sample_3 obj_0 0.440811
4 3 sample_4 obj_0 0.112798
... ... ... ... ...
6995 15 sample_95 obj_13 0.172044
6996 15 sample_96 obj_13 0.381045
6997 15 sample_97 obj_13 0.364024
6998 15 sample_98 obj_13 0.737742
6999 15 sample_99 obj_13 0.762252

[7000 rows x 4 columns]

Now you can decide what you want to plot, I think you are saying you want

sns.boxplot(x="variable", y="value", hue="n_comps", 
data=mses_frame.melt(ignore_index=False).reset_index())

Let me know if I've misunderstood something



Related Topics



Leave a reply



Submit