Plotting Wide Format Data Using R Ggplot

Your data is in wide format, so it's better to convert it to long format before working with ggplot. Here I use tidyr::gather() to do that:

library(tidyr)
library(ggplot2)

df_long <- df %>%
  gather(Year, Sales, -Region)
df_long
#> Region Year Sales
#> 1 A 2016 8758.82
#> 2 B 2016 25559.89
#> 3 C 2016 30848.02
#> 4 D 2016 8696.99
#> 5 E 2016 3621.12
#> 6 F 2016 5468.76
#> 7 A 2015 26521.67
#> 8 B 2015 89544.93
#> 9 C 2015 92825.55
#> 10 D 2015 28916.40
#> 11 E 2015 14004.54
#> 12 F 2015 16618.38
#> 13 A 2014 NA
#> 14 B 2014 NA
#> 15 C 2014 199673.73
#> 16 D 2014 37108.09
#> 17 E 2014 16909.87
#> 18 F 2014 20610.58
#> 19 A 2013 27605.35
#> 20 B 2013 NA
#> 21 C 2013 78794.31
#> 22 D 2013 31824.75
#> 23 E 2013 17990.21
#> 24 F 2013 17307.11
#> 25 A Total Sales 35280.49
#> 26 B Total Sales 115104.82
#> 27 C Total Sales 323347.30
#> 28 D Total Sales 74721.48
#> 29 E Total Sales 34535.53
#> 30 F Total Sales 42697.72
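
One detail worth noting in the output above: the reshaped data still contains the "Total Sales" rows, so they would be plotted on the x axis as if they were a year. A minimal base R sketch of one way to handle this, assuming the totals are not wanted in the plot. The small data frame below only copies a few rows from the output above (regions A and B), and df_years is a hypothetical name:

```r
# A few rows copied from the df_long output above (regions A and B only)
df_long <- data.frame(
  Region = rep(c("A", "B"), times = 3),
  Year   = rep(c("2016", "2015", "Total Sales"), each = 2),
  Sales  = c(8758.82, 25559.89, 26521.67, 89544.93, 35280.49, 115104.82)
)

# Drop the total rows so they are not treated as a year
df_years <- subset(df_long, Year != "Total Sales")

# With only real years left, Year can be stored as a number
df_years$Year <- as.integer(df_years$Year)
df_years
```

With Year stored as a number, ggplot places the years on a continuous axis in chronological order.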

Plot: specify color = Region and group = Region inside aes() so ggplot knows how to pick colors and which points to connect with lines.

ggplot(df_long, aes(x = Year, y = Sales, color = Region, group = Region)) +
  geom_point() +
  geom_line() +
  scale_color_brewer(palette = 'Dark2') +
  theme_classic(base_size = 12)
#> Warning: Removed 3 rows containing missing values (geom_point).
#> Warning: Removed 2 rows containing missing values (geom_path).

You can also use facet_grid() to draw each region in its own panel:

ggplot(df_long, aes(x = Year, y = Sales, group = Region)) +
  geom_point() +
  geom_line() +
  facet_grid(Region ~ ., scales = 'free_y') +
  theme_bw(base_size = 12)
#> Warning: Removed 3 rows containing missing values (geom_point).
#> Warning: Removed 2 rows containing missing values (geom_path).

Created on 2018-10-12 by the reprex package (v0.2.1.9000)

ggplot: Why do I have to transform the data into the long format?

It's hard to say for sure that this is impossible (for example, someone could write a wrapper package for ggplot that would do this automatically for you), but there's no obvious solution like this.

Hadley Wickham, the author of ggplot, has built the entire "tidyverse" ecosystem on the concept of tidy data, which is essentially data in long format. The basic reason for working with long-format data is that the same data can be represented by many wide formats, but the long format is typically unique. For example, suppose you have data representing revenues by year, country, and industrial sector. In a wide format, do columns represent year, country, sector, or some combination? In the tidyverse/ggplot world, you can simply specify which variable you want to use as the grouping variable. With a wide-format-oriented tool (such as base R's matplot), you would first reshape your data so that the columns represented the grouping variable (say, years), then plot it.
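
To make the point about multiple wide layouts concrete, here is a small base R sketch. The revenue figures, country codes, and object names are invented for illustration only. The same long data is reshaped into two different wide matrices, and matplot() draws one line per column, so the layout you choose decides the grouping:

```r
# Hypothetical long-format data: one row per (year, country) observation
long <- expand.grid(year = 2019:2021, country = c("DE", "FR"))
long$revenue <- c(10, 12, 15, 7, 9, 8)

# Wide layout 1: rows are years, columns are countries
wide_by_country <- with(long, tapply(revenue, list(year, country), sum))

# Wide layout 2: rows are countries, columns are years
wide_by_year <- t(wide_by_country)

# matplot() draws one line per column, here one line per country
matplot(2019:2021, wide_by_country, type = "l",
        xlab = "year", ylab = "revenue")
```

Both matrices hold exactly the same six numbers. Only the long version records year, country, and revenue as explicit variables that a plotting function can be pointed at by name.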

Wickham and co-workers built tools like gather (or pivot_longer in newer versions of the tidyverse) to make conversion to long format easy, and a wide variety of other tools to work with long ("tidy") data.
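
As a rough illustration of the two functions mentioned above, the sketch below reshapes the same small table with gather() and with pivot_longer(). The wide data frame is a hypothetical stand-in (its values are borrowed from the sales example earlier on this page), and the names old_way and new_way are made up for the comparison:

```r
library(tidyr)

# Hypothetical wide table: one row per region, one column per year
df <- data.frame(
  Region = c("A", "B", "C"),
  `2016` = c(8758.82, 25559.89, 30848.02),
  `2015` = c(26521.67, 89544.93, 92825.55),
  check.names = FALSE  # keep "2016" and "2015" as column names
)

# Older interface
old_way <- gather(df, Year, Sales, -Region)

# Newer interface: the same reshape with named arguments
new_way <- pivot_longer(df, cols = -Region,
                        names_to = "Year", values_to = "Sales")

# The row order differs, so sort both before comparing
old_sorted <- old_way[order(old_way$Region, old_way$Year), ]
new_sorted <- as.data.frame(new_way)
new_sorted <- new_sorted[order(new_sorted$Region, new_sorted$Year), ]
all(old_sorted$Sales == new_sorted$Sales)
```

The two results contain the same rows. gather() stacks column by column, while pivot_longer() works row by row, which is why the sort is needed before comparing.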

You could write wrappers around ggplot that would do the conversion ...

How to efficiently draw lots of graphs in R from data in a wide format?

EDIT: A significantly revised answer is provided now that the needs have been clarified.

The problem presents several common issues, each of which is addressed in other posts. However, perhaps this suggestion allows for a one-stop solution to these common issues.

My first suggestion is to reformat the data into a "long" format. There are many resources describing this and packages to help. Many users embrace the "tidyverse" set of tools and I'll leave that to others. I'll demonstrate a simple approach using base functions. I don't recommend the reshape() function in the stats package. I find it to be useful for repeated measures with time as one of the variables but find it rather complicated for other data.

A large fake data set will be generated in the "wide" format with demographic data (id, sex, weight, age, group) and 18 variables named "v01", "v02", ..., "v18", each drawn as random integers between 400 and 600 and then scaled by a small random divisor.

# Set random number generator and number of "individuals" in fake data
set.seed(1234) # to ensure reproducibility
N <- 936 # number of "individuals" in the fake data

# Create typical fake demographic data and divide the age into 4 groups
id <- factor(sample(1e4:9e4, N, replace = FALSE))
age <- rpois(N, 36)
sex <- sample(c("F","M"), N, replace = TRUE)
weight <- 16 * log(age)
group <- cut(age, breaks = c(12, 32, 36, 40, 62))

Generate 18 fake values for each individual for the wide format and then create the fake "wide" data.frame.

# 18 variable measurements for wide format
V <- replicate(18, sample(400:600, N, replace = TRUE), simplify = FALSE)
names(V) <- sprintf("v%02d", 1:18)

# Add a little variation to the fake data
adj <- sample(1:6, 18, replace = TRUE)
V <- Map("/", V, adj) # divide each value by the number in 'adj'
V <- lapply(V, round, 1) # simplify

# Create data.frame with variable data in wide format
vars <- as.data.frame(V)
names(vars)

# Assemble demographic and variable data into a typical "wide" data set
wide <- data.frame(id, sex, weight, age, group, vars)
names(wide)
head(wide)

In the "wide" format, each row corresponds to a unique individual with demographic information and 18 values for 18 variables. This is going to be changed into the "long" format with each value represented by a row. The new "long" data frame will have two new variables: one holding the data (values) and a factor indicating the original variable from which each value came (ind). Typically they get renamed but I will simply work with the default names here.

The simple base function stack() will be used to stack the variables into a single vector. In contrast to cbind(), the data.frame() function will recycle values only when the longer length is an exact multiple of the shorter one. The following code takes advantage of this property to build the "long" data.frame: the demographic columns are repeated once for each stacked variable.

# Identify those variables to be stacked (they all start with 'v')
sel <- grepl("^v", names(wide))
long <- data.frame(wide[!sel], stack(wide[sel]))
head(long)
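
To see what stack() and the recycling in data.frame() do, it can help to run the same two lines on a tiny table first. This is only an illustrative sketch; wide_small and long_small are hypothetical names, and the values are made up:

```r
# Tiny hypothetical wide table: two identifier columns, two measurement columns
wide_small <- data.frame(id = c("a", "b"), sex = c("F", "M"),
                         v01 = c(1, 2), v02 = c(3, 4))

# Same selection rule as above: measurement columns start with 'v'
sel <- grepl("^v", names(wide_small))

# stack() returns a data frame with columns 'values' and 'ind'
stacked <- stack(wide_small[sel])

# The 2 identifier rows are recycled to match the 4 stacked rows
long_small <- data.frame(wide_small[!sel], stacked)
long_small
```

Each identifier row appears once per measurement column, and ind records which column every value came from.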

My second suggestion is to use one of the "apply" functions to create a list of ggplot objects. By storing the plots in this variable, you have the option of plotting them with different formats without running the plotting code each time.

The code creates a plot for each of the 18 different variables, which are identified by the new variable ind. I changed boundary = 500 to bins = 10 since I don't know what your actual data looks like. I also added a "caption" to each plot identifying the original variable.

library(ggplot2) # to use ggplot...
plotList <- lapply(levels(long$ind), function(i) {
  ggplot(data = subset(long, ind == i), aes(x = values)) +
    geom_histogram(bins = 10) +
    facet_wrap(~ group, nrow = 2) +
    labs(caption = paste("Variable", i))
})
names(plotList) <- levels(long$ind) # name the list elements for convenience

Now to examine each of the 18 plots (this may not work in RStudio):

opar <- par(ask = TRUE)
plotList # This is the same as print(plotList)
par(opar) # turn off the 'ask' option

To save the plots to file, the advice of Imo is good. But it would be wise to take control of the size and nature of the file output. I suggest you look at the help files for pdf() and dev.print(). The last part of this answer shows one possibility with the pdf() function using a for loop to generate single page plots.

for (v in levels(long$ind)) {
  fname <- paste(v, "pdf", sep = ".")
  fname <- file.path("~", fname) # change this to specify a directory
  pdf(fname, width = 6.5, height = 7, paper = "letter")
  print(plotList[[v]])
  dev.off()
}

And just to add another possible approach, here's a solution with lattice showing 6 groups of variables per plot. (Personally, I'm a fan of this simpler approach.)

library(lattice)
idx <- split(levels(long$ind), gl(3, 6, 18))
opar <- par(ask = TRUE)
for (i in idx)
  plot(histogram(~values | group + ind, data = long,
                 subset = ind %in% i, as.table = TRUE))
par(opar)

Bar graphs in a long data format

Another option is to set stat = "summary" and fun = "mean" in geom_bar(), so that ggplot computes the group means for you:

library(data.table)
library(dplyr)
library(ggplot2)
set.seed(7)
dv1 = runif(n = 100, min = 1, max = 7)
dv2 = runif(n = 100, min = 1, max = 7)
dv3 = runif(n = 100, min = 1, max = 7)
country <- rep(c("India", "US", "Poland"), length.out = 100)

df <- data.frame(country, dv1, dv2, dv3)

df$casenum <- seq.int(nrow(df))

df2 <- df %>% select(casenum, country, dv1, dv2, dv3)

df.melt <- data.table::melt(setDT(df2), id = 1L,
                            measure = list(c(3,4,5)),
                            value.name = c("dv"))
df.melt2 <- df2 %>%
  select(casenum, country)

df.melt.final <- dplyr::left_join(df.melt, df.melt2, by="casenum")

ggplot(df.melt.final, aes(fill=variable, y=dv, x=country)) +
  geom_bar(position="dodge", stat = "summary", fun = "mean")

# These are the means shown in the plot
df.melt.final %>%
group_by(country, variable) %>%
summarise(dv = mean(dv))
#> `summarise()` has grouped output by 'country'. You can override using the
#> `.groups` argument.
#> # A tibble: 9 × 3
#> # Groups: country [3]
#> country variable dv
#> <chr> <fct> <dbl>
#> 1 India dv1 4.18
#> 2 India dv2 3.97
#> 3 India dv3 4.34
#> 4 Poland dv1 4.14
#> 5 Poland dv2 4.25
#> 6 Poland dv3 4.28
#> 7 US dv1 3.84
#> 8 US dv2 4.66
#> 9 US dv3 3.66

Created on 2022-08-26 with reprex v2.0.2
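
The melt-then-join step above can probably be shortened by keeping country as an identifier column during the reshape. The sketch below swaps in tidyr::pivot_longer() for data.table::melt(), so it is an alternative rather than the answer's own method. The column names follow the example above, while df_long and means are hypothetical names:

```r
library(tidyr)
library(ggplot2)

set.seed(7)
df <- data.frame(
  casenum = 1:100,
  country = rep(c("India", "US", "Poland"), length.out = 100),
  dv1 = runif(n = 100, min = 1, max = 7),
  dv2 = runif(n = 100, min = 1, max = 7),
  dv3 = runif(n = 100, min = 1, max = 7)
)

# Reshape only the dv columns; casenum and country stay as identifiers,
# so no join is needed afterwards
df_long <- pivot_longer(df, cols = c(dv1, dv2, dv3),
                        names_to = "variable", values_to = "dv")

# Bar heights are computed by ggplot as the mean of dv in each group
p <- ggplot(df_long, aes(fill = variable, y = dv, x = country)) +
  geom_bar(position = "dodge", stat = "summary", fun = "mean")

# Cross-check: one mean per country and variable
means <- aggregate(dv ~ country + variable,
                   data = as.data.frame(df_long), FUN = mean)
```

Printing p draws the chart, and means should contain one row for each of the nine country and variable combinations.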

Wide format causing bar plot data in ggplot to be doubled

To expand on the illuminating answer from Gregor Thomas, here's an example of how to pivot your data and plot it:

df %>%
  pivot_longer(
    -Covid,
    values_to = "fraction",
    names_to = c("sex", "type"),
    names_sep = "_"
  ) %>%
  ggplot(aes(x = sex, y = fraction, fill = Covid)) +
  geom_col(position = "dodge")

Here pivot_longer() takes the sex information embedded in the column names of your original data.frame and makes it available to ggplot as a variable, so you can map it to an aesthetic and have the plot respond to it.
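
Since the original data.frame is not shown, the sketch below uses a hypothetical stand-in to show what the names_sep = "_" split produces. The column names male_frac and female_frac and all the values are assumptions made for illustration:

```r
library(tidyr)

# Hypothetical stand-in for the original `df`; names and values are invented
df <- data.frame(
  Covid       = c("yes", "no"),
  male_frac   = c(0.40, 0.60),
  female_frac = c(0.55, 0.45)
)

# Each column name is split at "_" into a `sex` part and a `type` part
df_long <- pivot_longer(
  df,
  -Covid,
  values_to = "fraction",
  names_to = c("sex", "type"),
  names_sep = "_"
)
df_long
```

The result has one row per Covid and sex combination, with sex now an ordinary column that can be mapped to the x axis.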

UPDATE:

A 'tidy' solution with more manual control over aesthetics of each bar to achieve desired appearance:

df %>%
  pivot_longer(
    -Covid,
    values_to = "fraction",
    names_to = c("sex", "type"),
    names_sep = "_"
  ) %>%
  arrange(desc(Covid)) %>%
  ggplot(aes(x = sex, y = fraction, group = Covid)) +
  geom_col(position = "identity", aes(width = rep(c(0.25, 0.15), each = 2), fill = letters[1:4]), alpha = 1) +
  scale_fill_manual(values = c(lightest_accent, light_accent, dark, lightest)) +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = .25),
                     labels = function(y) paste0(round(y * 100, 0), "%"),
                     expand = expansion(mult = c(0, 0))) +
  theme(legend.position = "none") +
  coord_flip()

Note that the arrange() call sorts the rows into the desired plotting order: rows that come last in the data.frame are drawn last and therefore end up on top. The width and fill values have to be set manually to match that order.

R - Order and Plot with long format data

The problem was with the data types. Originally the column classes were:

       Id     Types      Stat
"numeric"  "factor"  "matrix"

The dummy dataframe below was working fine:

df <- data.frame(id = rep(1:10, 10), type = rep(paste0("T", 1:10), each = 10), stat = rnorm(100), stringsAsFactors = TRUE)

which had the following class:

sapply(df, class)
#>        id      type      stat
#> "integer"  "factor" "numeric"

So it is just a question of converting the columns to the classes above.
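
A minimal sketch of such a conversion, assuming the matrix column came from something like scale(), which returns a one-column matrix. The data frame df_bad and its values are hypothetical, since the original data is not shown:

```r
# Hypothetical data frame whose Stat column is a one-column matrix
df_bad <- data.frame(Id = c(1, 2, 3, 4),
                     Types = factor(c("T1", "T1", "T2", "T2")))
df_bad$Stat <- scale(c(2.5, 3.1, 4.0, 1.7))  # scale() returns a matrix
is.matrix(df_bad$Stat)

# Convert each column to the class that worked in the dummy example
df_fixed <- df_bad
df_fixed$Id   <- as.integer(df_fixed$Id)
df_fixed$Stat <- as.numeric(df_fixed$Stat)   # drops the matrix dimensions
sapply(df_fixed, class)
```

as.numeric() strips the dim attribute, so Stat becomes a plain numeric vector that can be ordered and plotted like any other column.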


