How to Apply Histogram on Dependent Data in R

How to apply histogram on dependent data in R?

Your data looks like this:

> data.female
        N11.1 N22.1 N33.1 N44.1 N21.1 N31.1 N32.1
Sinus     1.0   0.0   0.0   0.0   0.0   0.0  12.0
Arr/AHB   1.0   0.0   0.0   0.1   0.0   0.0  20.9
> data.male
        N11.1 N22.1 N33.1 N44.1 N21.1 N31.1 N32.1
Sinus     1.0   0.0   0.0   0.0   0.0   0.0   4.0
Arr/AHB   1.0   0.0   0.0   0.0   0.0   0.0  24.0

and you want to draw a histogram of each row across multiple columns, as demonstrated below.

1. Histogram for each row, with the Sinus and Arr/AHB groups separated

Sinus and Arr/AHB need a common identifier, so we create a new ID column for them. The lattice package then does the plotting.

require(lattice)
Sinus<-c(1,0,0,0,0,0,12)
ArrAHB<-c(1,0,0,0.1,0,0,20.9)
Labels<-c("N11.1","N22.1","N33.1","N44.1","N21.1","N31.1","N32.1")
ID<-c("Sinus","Arr/AHB")
data.female<-data.frame(Sinus,ArrAHB,row.names=Labels)
data.female<-as.data.frame(t(data.female))
data.female$ID<-ID

barchart(N11.1+N22.1+N33.1+N44.1+N21.1+N31.1+N32.1 ~ ID,
         data=data.female,
         auto.key=list(space='right')
         )

Sample Image

and, for comparison, this is the chart for the male data:

Sample Image

1.2. Your factor data must be converted to numeric vectors, or better: read your original files directly into numeric vectors, not factors!

Your input data is malformed: the columns are stored as factors, which is a problem here. That is probably the result of misusing read.csv, such as omitting the argument na.strings="." or reading some malformed elements. More:

"Sometimes when a data frame is read directly from a file, a column you’d thought would produce a numeric vector instead produces a factor. This is caused by a non-numeric value in the column, often a missing value encoded in a special way like . or -. To remedy the situation, coerce the vector from a factor to a character vector, and then from a character to a double vector. (Be sure to check for missing values after this process.) Of course, a much better plan is to discover what caused the problem in the first place and fix that; using the na.strings argument to read.csv() is often a good place to start."
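The effect of na.strings can be seen in a small sketch. The CSV text below is made up for illustration (it is not the original file), and stringsAsFactors = TRUE is set explicitly as an assumption, because recent R versions would otherwise read the broken column as character rather than factor.

```r
# Hypothetical CSV text in which "." marks a missing value
csv_text <- "N11.1,N32.1\n1.0,12.0\n.,20.9\n"

# Without na.strings, the "." makes the first column non-numeric
# (a factor here, because stringsAsFactors = TRUE is requested)
bad <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(bad$N11.1)

# Declaring "." as a missing-value code keeps the column numeric
good <- read.csv(text = csv_text, na.strings = ".")
class(good$N11.1)
is.na(good$N11.1)
```

If the real file uses a different missing-value code, such as "-", that string would go into na.strings instead.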

To use this malformed data, the factor elements must be turned into numeric values. The class command reveals the mistake made when the original data was read into R:

> class(data.female$N22.1)
[1] "factor"
> as.double(as.character(data.female$N22.1))
[1] 0 0

where as.double(as.character(...)) lets us manipulate the data object correctly again. So the code

require(lattice)
data.female <- structure(list(N11.1 = structure(c(3L, 3L), .Label = c("", "0.0", "1.0", "N11"), class = "factor"),
                              N22.1 = structure(c(2L, 2L), .Label = c("", "0.0", "2.0", "N22"), class = "factor"),
                              N33.1 = structure(c(2L, 2L), .Label = c("", "0.0", "N33"), class = "factor"),
                              N44.1 = structure(2:3, .Label = c("", "0.0", "0.1", "0.2", "N44"), class = "factor"),
                              N21.1 = structure(c(2L, 2L), .Label = c("", "0.0", "N21"), class = "factor"),
                              N31.1 = structure(c(2L, 2L), .Label = c("", "0.0", "N31"), class = "factor"),
                              N32.1 = structure(c(5L, 7L), .Label = c("", "0.0", "10.8", "11.0", "12.0", "17.0", "20.9", "22.8", "24.0", "3.0", "4.0", "44.0", "N32"),
                                                class = "factor")), .Names = c("N11.1", "N22.1", "N33.1", "N44.1", "N21.1", "N31.1", "N32.1"),
                         row.names = c("Sinus", "Arr/AHB"), class = "data.frame")
data.female$ID<-c("Sinus","Arr/AHB")
data.female<-as.data.frame(data.female)

f<-function(x) as.double(as.character(x))   # convert factor columns to numeric vectors

barchart(f(N11.1)+f(N22.1)+f(N33.1)+f(N44.1)+f(N21.1)+f(N31.1)+f(N32.1) ~ ID,
         data=data.female,
         auto.key=list(space='right')
         )

where the function f converts factors to numeric vectors. Strictly speaking, factors are themselves a special kind of vector: integer codes with a class attribute and a levels attribute.

Sample Image

Note that you need to adjust the legend yourself.
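The inner as.character() step matters because calling as.double() directly on a factor returns the internal integer codes rather than the printed values. A minimal sketch, using a made-up factor x and a made-up data frame df (neither is taken from the data above):

```r
# Hypothetical factor, similar to what a mis-read numeric column looks like
x <- factor(c("0.0", "12.0", "20.9"))

codes  <- as.double(x)                # internal level codes, not the values
values <- as.double(as.character(x))  # the numbers that were actually read

# Convert every factor column of a data frame in one step
f <- function(col) as.double(as.character(col))
df <- data.frame(a = factor(c("1.0", "0.0")), b = factor(c("12.0", "20.9")))
df[] <- lapply(df, f)
str(df)
```

Converting the columns once, up front, avoids wrapping every term of the barchart formula in f() and also gives cleaner legend labels.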

2. Barchart again showing proportions

The input data is now in a readable format (not the output of some CSV file). Note that the values in N32.1 are far larger than the values in any other column.

require(lattice)
Sinus<-c(1,0,0,0,0,0,12)
ArrAHB<-c(1,0,0,0.1,0,0,20.9)
Labels<-c("N11.1","N22.2","N33.1","N44.1","N21.1","N31.1","N32.1")
ID<-c("Sinus","Arr/AHB")
data.female<-data.frame(Sinus,ArrAHB,row.names=Labels)
data.female<-t(data.female)

barchart(data.female,auto.key=list(space='right'))

> data.female
       N11.1 N22.2 N33.1 N44.1 N21.1 N31.1 N32.1
Sinus      1     0     0   0.0     0     0  12.0
ArrAHB     1     0     0   0.1     0     0  20.9

Sample Image
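Because N32.1 dominates the raw values, one option is to plot row-wise proportions instead. The following is only a sketch of that idea, not part of the original answer: it uses base R's prop.table() on a matrix built from the same numbers, and the object names m and m.prop are introduced here for illustration.

```r
library(lattice)

Sinus  <- c(1, 0, 0, 0, 0, 0, 12)
ArrAHB <- c(1, 0, 0, 0.1, 0, 0, 20.9)
Labels <- c("N11.1", "N22.1", "N33.1", "N44.1", "N21.1", "N31.1", "N32.1")

# Two rows (Sinus, ArrAHB), one column per label
m <- rbind(Sinus = Sinus, ArrAHB = ArrAHB)
colnames(m) <- Labels

# Row-wise proportions: each row now sums to 1
m.prop <- prop.table(m, margin = 1)

# Bar chart of the proportions, one bar per row
p <- barchart(m.prop, auto.key = list(space = "right"))
```

Typing p at the console (or calling print(p)) draws the chart.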

Histogram of one column depending on the relative frequency of another column (R)

Here's my solution:

library("ggplot2")
library("dplyr")
library("magrittr")
library("tidyr")

df <- data.frame(x = c(1,1,2,3,2,1), y = c(43,54,54,22,22,43))

#Creating a counter that will keep track
#of how many of each number in y exist for each x category
df$n <- 1
df %<>% #Assignment pipe from magrittr: overwrites 'df' with the result!
  group_by(x, y) %>% #Ordinary forward pipe
  tally(n) %>%
  mutate(n = round(n/sum(n), 2)) #Share of each y within its x group (a proportion, 0 to 1)

#Plotting
df %>% 
  ggplot(aes(fill = as.factor(y), y = n, x = x)) + 
  geom_bar(position = "fill", stat = "identity") + 
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Percentage contribution from each y category") + 
  #Adding the percentage values as labels
  geom_text(aes(label = paste0(n*100,"%")), position = position_stack(vjust = 0.5), size = 2)

Note: position="fill" in geom_bar() rescales each bar to a total height of 1, and scale_y_continuous(labels = scales::percent) is what formats the y-axis values as percentages.
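The same relative frequencies can also be computed without the helper column and the assignment pipe. This is a sketch of an alternative, not the answer's own method; it relies on dplyr's count(), and the names shares and prop are introduced here for illustration.

```r
library(dplyr)

df <- data.frame(x = c(1, 1, 2, 3, 2, 1), y = c(43, 54, 54, 22, 22, 43))

shares <- df %>%
  count(x, y) %>%                 # one row per (x, y) pair with its frequency n
  group_by(x) %>%
  mutate(prop = n / sum(n)) %>%   # relative frequency of y within each x
  ungroup()

shares
```

The prop column can then be mapped to y in the ggplot call in place of n.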

Draw histograms per row over multiple columns in R

If you use ggplot2 you don't need a loop; you can plot them all at once. You also need to reshape your data from wide format to long format. The melt function from the reshape2 package does that.

library(reshape2)
new.df<-melt(HEIrank11,id.vars="HEI.ID")
names(new.df)=c("HEI.ID","Year","Rank")

In the plot below, substring(Year, 2) just gets rid of the leading X in each year.

library(ggplot2)
# geom_col() draws bars from precomputed y values (same as geom_bar(stat="identity"))
ggplot(new.df, aes(x=HEI.ID,y=Rank,fill=substring(Year,2)))+
   geom_col(position="dodge")

Sample Image
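To see what the reshaping step does, here is a sketch on a tiny made-up table. HEIrank.toy is a hypothetical stand-in for HEIrank11, whose real contents are not shown here; the column layout (an ID column plus one X-prefixed column per year) is an assumption based on the code above.

```r
library(reshape2)

# Hypothetical wide table: one row per institution, one column per year
HEIrank.toy <- data.frame(HEI.ID = c("A", "B"),
                          X2007 = c(1, 2),
                          X2008 = c(2, 1))

# Wide to long: one row per (institution, year) pair
new.df <- melt(HEIrank.toy, id.vars = "HEI.ID")
names(new.df) <- c("HEI.ID", "Year", "Rank")

# Year holds "X2007" and "X2008"; substring(..., 2) drops the leading "X"
new.df$YearLabel <- substring(new.df$Year, 2)
new.df
```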

How do I generate a histogram for each column of my table?

If you combine the tidyr and ggplot2 packages, you can use facet_wrap to make a quick set of histograms of each variable in your data.frame.

You need to reshape your data to long form with tidyr::gather, so you have key and value columns like such:

library(tidyr)
library(ggplot2)
# or `library(tidyverse)`

mtcars %>% gather() %>% head()
#>   key value
#> 1 mpg  21.0
#> 2 mpg  21.0
#> 3 mpg  22.8
#> 4 mpg  21.4
#> 5 mpg  18.7
#> 6 mpg  18.1

Using this as our data, we can map value as our x variable, and use facet_wrap to separate by the key column:

ggplot(gather(mtcars), aes(value)) + 
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x')

Sample Image

The scales = 'free_x' is necessary unless your data is all of a similar scale.

You can replace bins = 10 with anything that evaluates to a number, which may allow you to set them somewhat individually with some creativity. Alternatively, you can set binwidth, which may be more practical, depending on what your data looks like. Regardless, binning will take some finesse.
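In current tidyr releases gather() is marked as superseded in favour of pivot_longer(). The sketch below assumes tidyr 1.0.0 or later and reproduces the same key and value columns; the names long and p are introduced here for illustration.

```r
library(tidyr)
library(ggplot2)

# Stack every column of mtcars into key/value pairs
long <- pivot_longer(mtcars, cols = everything(),
                     names_to = "key", values_to = "value")

# Same faceted histograms as with gather()
p <- ggplot(long, aes(value)) +
  geom_histogram(bins = 10) +
  facet_wrap(~key, scales = "free_x")
```

The row order of long differs from the gather() result, but that does not affect the histograms.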

R- split histogram according to factor level

You can use the ggplot2 package:

library(ggplot2)

ggplot(data,aes(x=diff))+geom_histogram()+facet_grid(~type)+theme_bw()

Sample Image

You can also put them on the same plot by "dodging" them:

ggplot(data,aes(x=diff,group=type,fill=type))+
  geom_histogram(position="dodge",binwidth=0.25)+theme_bw()

Sample Image

If you want them to overlap, the position has to be position="identity"

ggplot(data,aes(x=diff,group=type,fill=type))+
  geom_histogram(position="identity",alpha=0.5,binwidth=0.25)+theme_bw()

Sample Image

If you want them to look like the first (faceted) plot but without the border between the panels, you have to hack it a little:

data$diff[data$type==1] <- data$diff[data$type==1] + 6

ggplot(data,aes(x=diff,group=type,fill=type))+
  geom_histogram(position="identity",alpha=0.5,binwidth=0.25)+theme_bw()+
  scale_x_continuous(breaks=c(-2:2,4:8),labels=c(-2:2,-2:2))

Sample Image
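The calls above assume a data frame named data with a numeric column diff and a grouping column type. To try them out, a simulated data set along these lines could be used; the values, the seed, and the group means are arbitrary assumptions made for this sketch.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
data <- data.frame(
  diff = c(rnorm(100, mean = 0), rnorm(100, mean = 0.5)),
  type = factor(rep(c(1, 2), each = 100))  # factor, so fill = type is discrete
)

# One panel per level of type
p.facet <- ggplot(data, aes(x = diff)) +
  geom_histogram(binwidth = 0.25) +
  facet_grid(~type) + theme_bw()

# Both groups in one panel, drawn semi-transparently on top of each other
p.overlap <- ggplot(data, aes(x = diff, group = type, fill = type)) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 0.25) +
  theme_bw()
```

Typing p.facet or p.overlap at the console draws the corresponding plot.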

How can I create a histogram for all variables in a data set with minimal effort in R?

There may be three broad approaches:

Commands from packages such as hist.data.frame()
Looping over variables or similar macro constructs
Stacking variables and using facets

Packages

Other commands available that may be helpful:

library(ggplot2) # provides the mpg example data
library(plyr)
library(psych)
multi.hist(mpg) #error, not numeric
multi.hist(mpg[,sapply(mpg, is.numeric)])

or perhaps multhist from plotrix, which I haven't explored. Neither of them offers the flexibility I was looking for.

Loops

When I was an R beginner, everyone advised me to stay away from loops. So I did, but perhaps a loop is worth a try here. Any suggestions are very welcome. Perhaps you could comment on how to combine the graphs into one file.
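One possible loop, offered only as a sketch: it uses base R's hist() on the numeric columns of mpg and collects all panels in a single PDF page via par(mfrow = ...). The output file name and the 2 x 3 grid are arbitrary choices made here.

```r
library(ggplot2)  # only needed for the mpg example data

# Names of the numeric columns of mpg
num_vars <- names(mpg)[sapply(mpg, is.numeric)]

# Hypothetical output location; change as needed
out_file <- file.path(tempdir(), "mpg-histograms-loop.pdf")

pdf(out_file, width = 9, height = 6)
op <- par(mfrow = c(2, 3))          # 2 x 3 grid of panels on one page
for (v in num_vars) {
  hist(mpg[[v]], main = v, xlab = v)
}
par(op)                             # restore the previous layout
dev.off()                           # close the file so it is written to disk
```

With more variables than grid cells, the pdf device simply starts a new page, so all histograms still end up in one file.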

Stacking

My first suspicion was that stacking variables might get out of hand. However, it might be the best strategy for a reasonable set of variables.

One example I came up with uses the melt function.

library(ggplot2)  # mpg data, qplot(), ggsave()
library(plyr)     # mutate()
library(reshape2) # melt()
mpgid <- mutate(mpg, id=as.numeric(rownames(mpg)))
mpgstack <- melt(mpgid, id="id")
pp <- qplot(value, data=mpgstack) + facet_wrap(~variable, scales="free")
# pp + stat_bin(geom="text", aes(label=..count.., vjust=-1))
ggsave("mpg-histograms.pdf", pp, scale=2)

(As you can see I tried to put value labels on the bars for more information density, but that didn't go so well. The labels on the x-axis are also less than ideal.)

No solution here is perfect and there won't be a one-size-fits-all command. But perhaps we can get closer to making it easy to explore a new data set.
