Reproduce Table and Plot from Journal

Reproduce table and plot from journal

nevermind: it was done with the meta package

The amount of work involved will be significantly influenced by whether or not you want R code to produce that entire page of information, or want to use R to create the graph in the middle and add it to a document (MSWord, etc.) that contains the text info. If the latter approach is ok, then start with a horizontal bar plot (e.g., http://docs.ggplot2.org/current/geom_boxplot.html), use something like theme_classic() to clear background, and use xlim() to scale the x-axis. If you revise your question and add a text representation of an actual data frame (e.g., like in this post: date at which a percentage of maximum was surpassed), me or someone else would probably generate the ggplot code for you.

Forest plot with table ggplot coding

There were a few issues as @efz pointed out. In addition, you need to refactor the labels in your second column to allow them to match up with those in your first. It's probably going to look messy with the y axis labels and title alongside the table, so these could be removed too.

That leaves you something like:

forestdf$colour <- rep(c("white", "gray95"), 18)
p <- ggplot(forestdf, aes(x = rr, y = labels, xmin = rrlow, xmax = rrhigh)) +
  geom_hline(aes(yintercept = labels, colour = colour), size = 7) + 
  geom_pointrange(shape = 22, fill = "black") +
  geom_vline(xintercept = 1, linetype = 3) +
  xlab("Variable") +
  ylab("Adjusted Relative Risk with 95% Confidence Interval") +
  theme_classic() +
  scale_colour_identity() +
  scale_y_discrete(limits = rev(forestdf$labels)) +
  scale_x_log10(limits = c(0.25, 4), 
                breaks = c(0.25, 0.5, 1, 2, 4), 
                labels = c("0.25", "0.5", "1", "2", "4"), expand = c(0,0)) +
  theme(axis.text.y = element_blank(), axis.title.y = element_blank())

names(fplottable) <- c("labels", "eventnum", "arr")
fplottable$labels <- factor(fplottable$labels, rev(levels(forestdf$labels)))
fplottable$colour <- rep(c("white", "gray95"), 18)

data_table <- ggplot(data = fplottable, aes(y = labels)) +
  geom_hline(aes(yintercept = labels, colour = colour), size = 7) +
  geom_text(aes(x = 0, label = labels), hjust = 0) +
  geom_text(aes(x = 5, label = eventnum)) +
  geom_text(aes(x = 7, label = arr), hjust = 1) +
  scale_colour_identity() +
  theme_void() + 
  theme(plot.margin = margin(5, 0, 35, 0))

grid.arrange(data_table,p, ncol = 2)

Sample Image

Optimal/efficient plotting of survival/regression analysis results

Edit I've now put this together into a package on github. I've tested it using output from coxph, lm and glm.

Example:

devtools::install_github("NikNakk/forestmodel")
library("forestmodel")
example(forest_model)

Original code posted on SO (superseded by github package):

I've worked on this specifically for coxph models, though the same technique could be extended to other regression models, especially since it uses the broom package to extract the coefficients. The supplied forest_cox function takes as its arguments the output of coxph. (Data is pulled using model.frame to calculate the number of individuals in each group and to find the reference levels for factors.) It also takes a number of formatting arguments. The return value is a ggplot which can be printed, saved, etc.

The output is modelled on the NEJM figure shown in the question.

library("survival")
library("broom")
library("ggplot2")
library("dplyr")
forest_cox <- function(cox, widths = c(0.10, 0.07, 0.05, 0.04, 0.54, 0.03, 0.17),
                       colour = "black", shape = 15, banded = TRUE) {
  data <- model.frame(cox)
  forest_terms <- data.frame(variable = names(attr(cox$terms, "dataClasses"))[-1],
                             term_label = attr(cox$terms, "term.labels"),
                             class = attr(cox$terms, "dataClasses")[-1], stringsAsFactors = FALSE,
                             row.names = NULL) %>%
    group_by(term_no = row_number()) %>% do({
      if (.$class == "factor") {
        tab <- table(eval(parse(text = .$term_label), data, parent.frame()))
        data.frame(.,
                   level = names(tab),
                   level_no = 1:length(tab),
                   n = as.integer(tab),
                   stringsAsFactors = FALSE, row.names = NULL)
      } else {
        data.frame(., n = sum(!is.na(eval(parse(text = .$term_label), data, parent.frame()))),
                   stringsAsFactors = FALSE)
      }
    }) %>%
    ungroup %>%
    mutate(term = paste0(term_label, replace(level, is.na(level), "")),
           y = n():1) %>%
    left_join(tidy(cox), by = "term")

  rel_x <- cumsum(c(0, widths / sum(widths)))
  panes_x <- numeric(length(rel_x))
  forest_panes <- 5:6
  before_after_forest <- c(forest_panes[1] - 1, length(panes_x) - forest_panes[2])
  panes_x[forest_panes] <- with(forest_terms, c(min(conf.low, na.rm = TRUE), max(conf.high, na.rm = TRUE)))
  panes_x[-forest_panes] <-
    panes_x[rep(forest_panes, before_after_forest)] +
    diff(panes_x[forest_panes]) / diff(rel_x[forest_panes]) *
           (rel_x[-(forest_panes)] - rel_x[rep(forest_panes, before_after_forest)])

  forest_terms <- forest_terms %>%
    mutate(variable_x = panes_x[1],
           level_x = panes_x[2],
           n_x = panes_x[3],
           conf_int = ifelse(is.na(level_no) | level_no > 1,
                             sprintf("%0.2f (%0.2f-%0.2f)", exp(estimate), exp(conf.low), exp(conf.high)),
                             "Reference"),
           p = ifelse(is.na(level_no) | level_no > 1,
                      sprintf("%0.3f", p.value),
                      ""),
           estimate = ifelse(is.na(level_no) | level_no > 1, estimate, 0),
           conf_int_x = panes_x[forest_panes[2] + 1],
           p_x = panes_x[forest_panes[2] + 2]
  )

  forest_lines <- data.frame(x = c(rep(c(0, mean(panes_x[forest_panes + 1]), mean(panes_x[forest_panes - 1])), each = 2),
                                     panes_x[1], panes_x[length(panes_x)]),
                               y = c(rep(c(0.5, max(forest_terms$y) + 1.5), 3),
                                     rep(max(forest_terms$y) + 0.5, 2)),
                               linetype = rep(c("dashed", "solid"), c(2, 6)),
                               group = rep(1:4, each = 2))

  forest_headings <- data.frame(term = factor("Variable", levels = levels(forest_terms$term)),
                         x = c(panes_x[1],
                               panes_x[3],
                               mean(panes_x[forest_panes]),
                               panes_x[forest_panes[2] + 1],
                               panes_x[forest_panes[2] + 2]),
                         y = nrow(forest_terms) + 1,
                         label = c("Variable", "N", "Hazard Ratio", "", "p"),
                         hjust = c(0, 0, 0.5, 0, 1)
  )

  forest_rectangles <- data.frame(xmin = panes_x[1],
                                xmax = panes_x[forest_panes[2] + 2],
                                y = seq(max(forest_terms$y), 1, -2)) %>%
    mutate(ymin = y - 0.5, ymax = y + 0.5)

  forest_theme <- function() {
    theme_minimal() +
    theme(axis.ticks.x = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          axis.title.y = element_blank(),
          axis.title.x = element_blank(),
          axis.text.y = element_blank(),
          strip.text = element_blank(),
          panel.margin = unit(rep(2, 4), "mm")
    )
  }

  forest_range <- exp(panes_x[forest_panes])
  forest_breaks <- c(
    if (forest_range[1] < 0.1) seq(max(0.02, ceiling(forest_range[1] / 0.02) * 0.02), 0.1, 0.02),
    if (forest_range[1] < 0.8) seq(max(0.2, ceiling(forest_range[1] / 0.2) * 0.2), 0.8, 0.2),
    1,
    if (forest_range[2] > 2) seq(2, min(10, floor(forest_range[2] / 2) * 2), 2),
    if (forest_range[2] > 20) seq(20, min(100, floor(forest_range[2] / 20) * 20), 20)
  )

  main_plot <- ggplot(forest_terms, aes(y = y))
  if (banded) {
    main_plot <- main_plot +
      geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
              forest_rectangles, fill = "#EFEFEF")
  }
  main_plot <- main_plot +
    geom_point(aes(estimate, y), size = 5, shape = shape, colour = colour) +
    geom_errorbarh(aes(estimate,
                       xmin = conf.low,
                       xmax = conf.high,
                       y = y),
                   height = 0.15, colour = colour) +
    geom_line(aes(x = x, y = y, linetype = linetype, group = group),
                 forest_lines) +
    scale_linetype_identity() +
    scale_alpha_identity() +
    scale_x_continuous(breaks = log(forest_breaks),
                       labels = sprintf("%g", forest_breaks),
                       expand = c(0, 0)) +
    geom_text(aes(x = x, label = label, hjust = hjust),
              forest_headings,
              fontface = "bold") +
    geom_text(aes(x = variable_x, label = variable),
              subset(forest_terms, is.na(level_no) | level_no == 1),
              fontface = "bold",
              hjust = 0) +
    geom_text(aes(x = level_x, label = level), hjust = 0, na.rm = TRUE) +
    geom_text(aes(x = n_x, label = n), hjust = 0) +
    geom_text(aes(x = conf_int_x, label = conf_int), hjust = 0) +
    geom_text(aes(x = p_x, label = p), hjust = 1) +
    forest_theme()
  main_plot
}

Sample data and plot

pretty_lung <- lung %>%
  transmute(time,
            status,
            Age = age,
            Sex = factor(sex, labels = c("Male", "Female")),
            ECOG = factor(lung$ph.ecog),
            `Meal Cal` = meal.cal)
lung_cox <- coxph(Surv(time, status) ~ ., pretty_lung)

print(forest_cox(lung_cox))

Cox PH plot

how to stacked bar plot from frequency table?

Try enabling fill in position for stacking:

library(ggplot2)  
#Code
example%>%
  dplyr::filter(drug_name=="drug1") %>% 
  dplyr::select(c(-drug_name,-quit_percent)) %>%
  tidyr::pivot_longer(-reason_quit) %>% 
  ggplot(aes(x = reason_quit, y = value, fill = name)) +
  geom_bar(stat = "identity",position = "fill")+
  scale_y_continuous(labels = scales::percent)

Output:

Sample Image

If you want the scale around 700, you should dodge the bars like this:

#Code 2
example%>%
  dplyr::filter(drug_name=="drug1") %>% 
  dplyr::select(c(-drug_name,-quit_percent)) %>%
  tidyr::pivot_longer(-reason_quit) %>% 
  ggplot(aes(x = reason_quit, y = value, fill = name)) +
  geom_bar(stat = "identity",position = position_dodge())

Output:

Sample Image

And if you want to see a stack bar until 700, you should compute the difference between you main variables like this:

#Code 3
example%>%
  mutate(diff_contuniue_freq=contuniue_freq-quit_freq) %>%
  dplyr::filter(drug_name=="drug1") %>% 
  dplyr::select(c(-drug_name,-quit_percent)) %>%
  tidyr::pivot_longer(-reason_quit) %>%
  filter(name!='contuniue_freq') %>%
  ggplot(aes(x = reason_quit, y = value, fill = name)) +
  geom_bar(stat = "identity",position = 'stack')

Output:

Sample Image

And in order to add percentages, you can do in the same pipeline:

#Code 4
example%>%
  mutate(diff_contuniue_freq=contuniue_freq-quit_freq) %>%
  dplyr::filter(drug_name=="drug1") %>% 
  dplyr::select(c(-drug_name,-quit_percent)) %>%
  tidyr::pivot_longer(-reason_quit) %>%
  filter(name!='contuniue_freq') %>%
  group_by(reason_quit) %>%
  mutate(pos=sum(value),percent=paste0(100*round(value[name=='quit_freq']/pos,2),'%')) %>%
  ggplot(aes(x = reason_quit, y = value, fill = name)) +
  geom_bar(stat = "identity",position = 'stack')+
  geom_text(aes(y=pos,label=percent),vjust=-0.25)

Output:

Sample Image

I'm having trouble using ggplot2 to reproduce a violin plot

I couldn't reproduce your plotting code, as it lacks the column mic. However, I think this is what you're looking for:

# load libraries
library(ggplot2)
library(ggforce)

# make toy data
set.seed(1); a <- data.frame(kind = sample(letters[1:10], 500, TRUE), value = sample(1:20, 500, TRUE))

# plot
ggplot(a, aes(x = kind, y = value, colour = kind))+geom_violin()+geom_sina(size = 2.1)

Of course you can play with the format (the shize of the points in the geom_sina call is the most evident).

Sample Image

EDIT

# redefine the first plot, removing the legend:
p1 <- ggplot(a, aes(x = kind, y = value, colour = kind))+
  geom_violin()+
  geom_sina(size = 1.1)+
  theme(legend.position = "none")

# Define toy data for the lower plot:
library(data.table)
set.seed(1)
Genes <- data.table(gene = sample(LETTERS[1:10], 20, TRUE), 
                    n = sample(1:10, 20, TRUE))

# add a coloring variable
Genes[, coloring := cut(n, 3, labels = 1:3)]

# plot the lower plot
p2 <- ggplot(Genes, aes(n, gene, colour = coloring))+
  geom_point(size = 2.8)+
  theme(axis.title = element_blank(), 
        axis.text.x = element_blank(), 
        legend.position = "none", 
        axis.ticks.x = element_blank(), 
        panel.background = element_blank())

# put both plots in the canvas:
library(patchwork)
p1+
 p2+
 plot_layout(ncol = 1, heights = c(.8, .2))

Which produces:

Sample Image

Save the Out[] table of a pandas dataframe as a figure

Here is a somewhat hackish solution but it gets the job done. You wanted a .pdf but you get a bonus .png. :)

import numpy as np
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt

from PySide.QtGui import QImage
from PySide.QtGui import QPainter
from PySide.QtCore import QSize
from PySide.QtWebKit import QWebPage

arrays = [np.hstack([ ['one']*3, ['two']*3]), ['Dog', 'Bird', 'Cat']*2]
columns = pd.MultiIndex.from_arrays(arrays, names=['foo', 'bar'])
df =pd.DataFrame(np.zeros((3,6)),columns=columns,index=pd.date_range('20000103',periods=3))

h = "<!DOCTYPE html> <html> <body> <p> " + df.to_html() + " </p> </body> </html>";
page = QWebPage()
page.setViewportSize(QSize(5000,5000))

frame = page.mainFrame()
frame.setHtml(h, "text/html")

img = QImage(1000,700, QImage.Format(5))
painter = QPainter(img)
frame.render(painter)
painter.end()
a = img.save("html.png")

pp = PdfPages('html.pdf')
fig = plt.figure(figsize=(8,6),dpi=1080) 
ax = fig.add_subplot(1, 1, 1)
img2 = plt.imread("html.png")
plt.axis('off')
ax.imshow(img2)
pp.savefig()
pp.close()

Edits welcome.

Reproduce Table and Plot from Journal