How to Generate Boxplot

how to create multiple boxplots from the same dataframe?

Using ggplot::facet_wrap() makes it easy to make a bunch of panels of a graph from one dataset. However for it to work the variable that defines the different panels has to be in a single column. In this case that means you have to get from the 'wide' format to a 'long' format of your data. For this I suggest tidyr::pivot_longer(). Last point is that you need to treat your label as a factor otherwise you won't get separate boxes as you do in base R because I think that converts the x variable into a factor by default which {ggplot2} won't do (although it will give an informative warning). Finally, if you want to have separate y axes for each plot you can set scales = "free_y" inside facet_wrap().

library(tidyverse)

df <- data.frame(var_1 = c(1,2,3,4,5,6,7,8,9),
var_2 = c(21,23,34,45,23,56,76,54,65),
var_3 = c(6,5,4,3,5,7,3,2,5),
label = c(1,1,1,2,1,2,2,1,2))

df %>%
pivot_longer(-label) %>%
ggplot(aes(factor(label), value)) +
geom_boxplot() +
facet_wrap(vars(name), nrow = 1)

Sample Image

Created on 2022-02-12 by the reprex package (v2.0.1)

How can I generate a series boxplot per hour of day for this dataset?

For example

df <- read.table(sep=",", header=T, text="
datetime,usage,available
2016-05-25 10:00:59.000000,12,96
2016-05-25 09:00:59.000000,8,96
2016-05-25 08:00:59.000000,0,96
2016-05-25 07:00:59.000000,0,96
2016-05-25 06:00:59.000000,0,96
2016-05-25 05:00:59.000000,0,96
2016-05-25 04:00:59.000000,0,96
2016-05-25 03:00:59.000000,0,96
2016-05-25 02:00:59.000000,0,96
2016-05-25 01:00:59.000000,0,96
2016-05-25 00:00:59.000000,0,96
2016-05-24 23:00:59.000000,0,96
2016-05-24 22:00:59.000000,0,96
2016-05-24 21:00:59.000000,0,96
2016-05-24 20:00:59.000000,2,96
2016-05-24 19:00:59.000000,0,96
2016-05-24 18:00:59.000000,8,96
2016-05-24 17:00:59.000000,15,96
2016-05-24 16:00:59.000000,20,96
2016-05-24 15:00:59.000000,19,96
2016-05-24 14:00:59.000000,3,96
2016-05-24 13:00:59.000000,6,96
2016-05-24 12:00:59.000000,9,96
2016-05-24 11:00:59.000000,13,96
2016-05-24 10:00:59.000000,16,96
2016-05-24 09:00:59.000000,11,96
2016-05-24 08:00:59.000000,1,96
2016-05-24 07:00:59.000000,5,96
2016-05-24 06:00:59.000000,2,96
2016-05-24 05:00:59.000000,0,96
2016-05-24 04:00:59.000000,0,96
2016-05-24 03:00:59.000000,0,96
2016-05-24 02:00:59.000000,0,96
2016-05-24 01:00:59.000000,0,96
2016-05-24 00:00:59.000000,0,96
2016-05-23 23:00:59.000000,0,96
2016-05-23 22:00:59.000000,0,96
2016-05-23 21:00:59.000000,0,96
2016-05-23 20:00:59.000000,4,96
2016-05-23 19:00:59.000000,0,96
2016-05-23 18:00:59.000000,0,96
2016-05-23 17:00:59.000000,0,96
2016-05-23 16:00:59.000000,3,96
2016-05-23 15:00:59.000000,5,96
2016-05-23 14:00:59.000000,2,96
2016-05-23 13:00:59.000000,18,96
2016-05-23 12:00:59.000000,10,96
2016-05-23 11:00:59.000000,7,96
2016-05-23 10:00:59.000000,9,96
2016-05-23 09:00:59.000000,1,96
2016-05-23 08:00:59.000000,1,96
2016-05-23 07:00:59.000000,1,96
2016-05-23 06:00:59.000000,1,96
2016-05-23 05:00:59.000000,1,96
2016-05-23 04:00:59.000000,1,96
2016-05-23 03:00:59.000000,1,96
2016-05-23 02:00:59.000000,1,96
2016-05-23 01:00:59.000000,1,96
2016-05-23 00:00:59.000000,1,96")
boxplot(df$usage~as.POSIXlt(df$datetime)$hour)

gives

Sample Image

Create one boxplot per cluster for each column of information for a dataframe

Most seaborn functions work best with the data in "long form".

Here is how the code could look like:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

df = pd.read_html('https://stackoverflow.com/questions/72301993/')[0]
df_long = df.melt(id_vars='cluster', value_vars=df.columns[:-1], var_name='variable', value_name='values')

sns.catplot(kind='box', data=df_long,
col='variable', y='values', x='cluster', hue='cluster', palette="colorblind", sharey=False, colwrap=2)
plt.tight_layout()
plt.show()

sns.catplot kind=box with long dataframe

How to create box plots from columns of dicts in pandas

  • sns.boxplot expects data to be supplied in a long form when specifying x= and y=.
  • In this case, based on the specifications of having each speech type as a separate plot, sns.catplot will be used because there is a col= parameter, which can be used to create separate plots for speech types.
  1. As mentioned in the OP, use .melt to unpivot the wide dataframe.
  2. .json_normalize can be used to convert the the 'value' column (dict type) into a flat table.
    • See Split / Explode a column of dictionaries into separate columns with pandas if there are issues with this step.
  3. Join the flattened table (vals) to dfm with .join.
    • This works because vals and dfm have matching indices.
  4. .melt the dataframe again.
  5. Plot the box plot from the long form dataframe.
  • Tested in python 3.10, pandas 1.4.2, matplotlib 3.5.1, seaborn 0.11.2
import pandas as pd
import seaborn as sns

# load the dict into a dataframe
df = pd.DataFrame(pos)

# unpivot the dataframe
dfm = df.melt(id_vars='dialog_num', var_name='model')

# convert the 'value' column of dicts to a flat table
vals = pd.json_normalize(dfm['value'])

# combine vals to dfm, without the 'value' column
dfm = dfm.iloc[:, 0:-1].join(vals)

# unpivot the dataframe again
dfm = dfm.melt(id_vars=['dialog_num', 'model'])

plot all of the speech types together

p = sns.boxplot(data=dfm, x='value', y='model')

Sample Image

plot speech types separately

  • Most speech types have only a single value, or no values.
p = sns.catplot(kind='box', data=dfm, x='value', y='model', col='variable', col_wrap=4, height=4)

Sample Image

DataFrames at each step

1: dfm.head()

   dialog_num   model                                                             value
0 0 model1 {'INTJ': 0.03, 'ADV': 0.072, 'PRON': 0.133, 'VERB': 0.109}
1 1 model1 {'PRON': 0.13, 'ADJ': 0.03, 'NOUN': 0.2}
2 2 model1 {'PRON': 0.125, 'ADV': 0.083, 'VERB': 0.0625}
3 0 model2 {'INTJ': 0.285, 'AUX': 0.1428, 'ADV': 0.1428, 'ADJ': 0.1428}
4 1 model2 {'PROPN': 0.1, 'VERB': 0.2, 'DET': 0.1, 'ADJ': 0.1, 'NOUN': 0.1}

2: vals.head()

    INTJ     ADV   PRON    VERB     ADJ  NOUN     AUX  PROPN  DET  CCONJ  SPACE  PUNCT  ADP  PART
0 0.030 0.0720 0.133 0.1090 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN 0.130 NaN 0.0300 0.2 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN 0.0830 0.125 0.0625 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 0.285 0.1428 NaN NaN 0.1428 NaN 0.1428 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN 0.2000 0.1000 0.1 NaN 0.1 0.1 NaN NaN NaN NaN NaN

3: dfm.head()

   dialog_num   model   INTJ     ADV   PRON    VERB     ADJ  NOUN     AUX  PROPN  DET  CCONJ  SPACE  PUNCT  ADP  PART
0 0 model1 0.030 0.0720 0.133 0.1090 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1 model1 NaN NaN 0.130 NaN 0.0300 0.2 NaN NaN NaN NaN NaN NaN NaN NaN
2 2 model1 NaN 0.0830 0.125 0.0625 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 0 model2 0.285 0.1428 NaN NaN 0.1428 NaN 0.1428 NaN NaN NaN NaN NaN NaN NaN
4 1 model2 NaN NaN NaN 0.2000 0.1000 0.1 NaN 0.1 0.1 NaN NaN NaN NaN NaN

4: dfm.head()

   dialog_num   model variable  value
0 0 model1 INTJ 0.030
1 1 model1 INTJ NaN
2 2 model1 INTJ NaN
3 0 model2 INTJ 0.285
4 1 model2 INTJ NaN

How to generate rectangular boxplots in matplotlib?

  • numThread is in the notch parameter location in matplotlib.pyplot.boxplot. Remove it.
    • This parameter is False by default.
    • The notches represent the confidence interval (CI) around the median.
# plot without notch
plt.boxplot(data)
plt.ylabel('Time per thread (seconds)')
plt.xlabel('Number of Threads')

Sample Image

How can I make a boxplot for each row of df?

Simply type this:

df.T.boxplot()

The pandas.DataFrame.boxplot-method creates a box for each column. You can work around this by using the pandas.DataFrame.T-method which transposes the dataframe (i.e., rows become columns and vice versa).



Related Topics



Leave a reply



Submit