Getting Both Column Counts and Proportions in the Same Table in R

Here is one approach. You still need a second step, but it comes before the tabular() call, so the result is still a tabular object.

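library(tables)

n <- 100
x <- sample(letters[1:3], n, TRUE)
y <- sample(letters[1:3], n, TRUE)
# convert to factors so tabular() treats x and y as classifying variables
d <- data.frame(x = x, y = y, stringsAsFactors = TRUE)
# weight each row by 1 / (size of its x group), so the weights sum to 1 within each x
d$z <- 1 / ave(rep(1, n), d$x, FUN = sum)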

(t1 <- tabular(x~y*Heading()*z*((n=length) + (p=sum)), d))
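To see why summing the weight column z yields row proportions, here is a small base R sketch (it does not need the tables package). The fixed seed and the names group_totals, p_from_z and p_from_tab are assumptions made for illustration only.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
n <- 100
d <- data.frame(x = sample(letters[1:3], n, TRUE),
                y = sample(letters[1:3], n, TRUE),
                stringsAsFactors = TRUE)

# each row gets weight 1 / (number of rows sharing its x level)
d$z <- 1 / ave(rep(1, n), d$x, FUN = sum)

# the weights add up to 1 inside every x group
group_totals <- tapply(d$z, d$x, sum)

# summing the weights per (x, y) cell gives the row proportions ...
p_from_z <- xtabs(z ~ x + y, data = d)
# ... which is what prop.table() computes from the plain counts
p_from_tab <- prop.table(table(d$x, d$y), 1)

print(round(p_from_z, 3))
```

So inside tabular(), length counts the rows in a cell while sum adds their weights, which is the cell's share of its x row.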

Two by two table with count and percentage in R

library(dplyr)
library(tidyr)  # spread() comes from tidyr
df %>% group_by(Gender, OnAntibiotic) %>% mutate(n = n()) %>%
  group_by(OnAntibiotic) %>% distinct(OnAntibiotic, Gender, n) %>%
  mutate(Per = n / sum(n), np = paste0(n, " (", round(Per * 100, 2), " %)")) %>%
  select(-n, -Per) %>% spread(OnAntibiotic, np)

# A tibble: 2 x 3
  Gender No       Yes
  <fct>  <chr>    <chr>
1 Female 3 (60 %) 8 (57.14 %)
2 Male   2 (40 %) 6 (42.86 %)
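If you would rather avoid extra packages, the same "count (percent)" layout can be sketched in base R with table() and prop.table(). The data frame below is hypothetical, built only to reproduce the counts in the output above.

```r
# hypothetical data reproducing the counts shown above
df <- data.frame(
  Gender       = rep(c("Female", "Male", "Female", "Male"), times = c(3, 2, 8, 6)),
  OnAntibiotic = rep(c("No", "No", "Yes", "Yes"), times = c(3, 2, 8, 6))
)

counts <- table(df$Gender, df$OnAntibiotic)
# margin = 2: percentages are taken within each OnAntibiotic column
pct <- round(100 * prop.table(counts, margin = 2), 2)

# glue counts and percentages together, keeping the table's row/column names
out <- matrix(paste0(counts, " (", pct, " %)"),
              nrow = nrow(counts), dimnames = dimnames(counts))
print(out, quote = FALSE)
```

Switching to margin = 1 would give percentages within each Gender row instead.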

calculating the proportion of count variable per group in data.table in R

If you are looking for each row's percentage of its group total, you can do:

library(data.table)
mydata[, prop := count/sum(count) * 100, by = .(startYear, groupSize)]

#       groupSize gender startYear count       prop
# 1: intermediate      F      2014  7546 55.9958445
# 2:        small      F      2014  3500 31.3395415
# 3: intermediate      M      2014  5930 44.0041555
# 4:        small      M      2014  7668 68.6604585
# 5:         huge      F      2014 18114 56.7125861
# 6:         huge      M      2014 13826 43.2874139
# 7:        large      F      2014 11943 54.2222828
# 8:        large      M      2014 10083 45.7777172
#....
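For comparison, the same per-group percentage can be sketched in base R with ave(). The data frame below reuses only the first four rows of the output above, so it is an illustrative subset rather than the full data.

```r
# subset of the rows shown above, for illustration only
mydata <- data.frame(
  groupSize = c("intermediate", "small", "intermediate", "small"),
  gender    = c("F", "F", "M", "M"),
  startYear = 2014,
  count     = c(7546, 3500, 5930, 7668)
)

# ave() returns the group sum repeated for every row of the group,
# so dividing by it gives each row's share of its (startYear, groupSize) total
mydata$prop <- with(mydata,
                    100 * count / ave(count, startYear, groupSize, FUN = sum))
print(mydata)
```

Within each startYear and groupSize pair the prop values add up to 100, which is a quick sanity check on the grouping.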

Tidy way to convert numeric columns from counts to proportions

Rephrase to the following:

library(dplyr)

df %>%
  mutate_if(is.numeric, ~ . / rowSums(select(df, where(is.numeric))))

Output:

  id         x         y
1  A 0.3333333 0.6666667
2  B 0.3333333 0.6666667
3  C 0.3333333 0.6666667
4  D 0.3333333 0.6666667

Edit: If you want an answer that doesn't use any additional packages besides dplyr and base, and that can be piped more easily, here's one other (hacky) solution:

df %>%
  group_by(id) %>%
  mutate(sum = as.character(rowSums(select(cur_data(), where(is.numeric))))) %>%
  summarise_if(is.numeric, ~ . / as.numeric(sum))

The usual dplyr ways of referring to the current data within a function (e.g. cur_data) don't seem to play nicely with rowSums in my original phrasing, so I took a slightly different approach here. There is likely a better way of doing this though, so I'm open to suggestions.
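As a point of comparison, the row-wise conversion can also be sketched in plain base R, which sidesteps the question of how to refer to the current data inside a pipe. The data frame here is hypothetical, chosen so that the result matches the output shown earlier.

```r
# hypothetical input: y is always twice x, so every row is 1/3 and 2/3
df <- data.frame(id = c("A", "B", "C", "D"),
                 x  = c(1, 2, 3, 4),
                 y  = c(2, 4, 6, 8))

# logical index of the numeric columns (id is left untouched)
num <- vapply(df, is.numeric, logical(1))

# dividing a data frame by a vector of length nrow(df) divides row by row
df[num] <- df[num] / rowSums(df[num])
print(df)
```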

convert data frame of counts to proportions in R

Probably something along these lines:

df[, -1] <- lapply( df[ , -1], function(x) x/sum(x, na.rm=TRUE) )

If it were a matrix you could have just used prop.table(mat). In this case however you need to limit to working only on the numeric columns (by excluding the first one).
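A minimal sketch of the matrix case, assuming made-up counts (the numbers and dimnames below are not from the question). Note that prop.table(mat) with no margin divides by the grand total, so column proportions need margin = 2.

```r
# hypothetical counts: rows are states, columns are years
mat <- matrix(c(2, 5, 50, 25,
                4, 8, 40, 24),
              ncol = 2,
              dimnames = list(c("Alaska", "Iowa", "Nevada", "Ohio"),
                              c("y1970", "y1980")))

# margin = 2 divides every entry by its column total
col_props <- prop.table(mat, margin = 2)

# the same thing written out with sweep(), to show what prop.table does
same <- sweep(mat, 2, colSums(mat), "/")

print(round(col_props, 4))
```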

Furthermore I think you need to exclude the "total" row:

> my.data[-5, -1] <- lapply( my.data[ -5 , -1], function(x){ x/sum(x, na.rm=TRUE)} )
> my.data[ -5 , ]
    state      y1970      y1980      y1990      y2000
1  Alaska 0.02325581 0.03076923         NA 0.02941176
2    Iowa 0.05813953 0.10256410 0.21428571 0.16806723
3  Nevada 0.58139535 0.51282051 0.71428571 0.42016807
4    Ohio 0.29069767 0.30769231         NA 0.33613445
6 Wyoming 0.04651163 0.04615385 0.07142857 0.04621849

-------------

Alternate approach:

> my.data[,-1] <-lapply( my.data[  , -1], function(x){ x/x[5] } )
> my.data
    state      y1970      y1980      y1990      y2000
1  Alaska 0.02325581 0.03076923         NA 0.02941176
2    Iowa 0.05813953 0.10256410 0.13953488 0.16806723
3  Nevada 0.58139535 0.51282051 0.46511628 0.42016807
4    Ohio 0.29069767 0.30769231         NA 0.33613445
5   total 1.00000000 1.00000000 1.00000000 1.00000000
6 Wyoming 0.04651163 0.04615385 0.04651163 0.04621849

This shows what prop.table returns when there are missing values, first applied to the whole matrix (no margin) and then to rows and columns separately, for a very simple matrix:

> prop.table( matrix( c( 1,2,NA, 3),2) )
     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA
> prop.table( matrix( c( 1,2,NA, 3),2), 1 )
     [,1] [,2]
[1,]   NA   NA
[2,]  0.4  0.6
> prop.table( matrix( c( 1,2,NA, 3),2), 2 )
          [,1] [,2]
[1,] 0.3333333   NA
[2,] 0.6666667   NA

How to Calculate Percentage Based On Other Row

This is the beginning of a solution:

library(dplyr)

Year <- rep(2000, 6)
State <- c(rep("VA", 4), rep("MA", 2))
Age <- c("<44", "44+", "44+", "<44", "<44", "44+")
Pop <- c(150, 350, 500, 200, 100, 100)

df <- data.frame(State = State, Age = Age, Pop = Pop, Year= Year)

df %>% filter(Age != "Total") %>% group_by(Year, State) %>% summarize(Pop44 = sum(Pop[Age=="<44"]) / sum(Pop))

You don't have to filter out the "Total" category here, since this data has none, but it's usually not a good idea to store a "total" category as rows alongside the other categories (better to have a separate column for it).
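Here is a rough base R sketch of the "column for the total" idea, using the same data as above. The column names StateTotal and Share and the object under44 are made up for this illustration.

```r
df <- data.frame(State = c(rep("VA", 4), rep("MA", 2)),
                 Age   = c("<44", "44+", "44+", "<44", "<44", "44+"),
                 Pop   = c(150, 350, 500, 200, 100, 100),
                 Year  = rep(2000, 6))

# store the state/year total as a column instead of a "Total" row
df$StateTotal <- ave(df$Pop, df$Year, df$State, FUN = sum)

# each row's share of its state/year total
df$Share <- df$Pop / df$StateTotal

# add up the shares of the "<44" rows per year and state
under44 <- aggregate(Share ~ Year + State, data = df[df$Age == "<44", ], FUN = sum)
print(under44)
```

For VA this gives (150 + 200) / 1200 and for MA 100 / 200, the same values the summarize() call above computes.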

Calculating count and proportion of a certain value for a number of variables subsetted by other variables

You don't have to convert columns to factors. In fact, data.table recommends avoiding factors wherever possible, as it'll also improve speed. However, I'll illustrate how you can convert to factor much more easily for the future.

sd_cols = c("Feature1", "Feature2", "Feature3")
DT[, c(sd_cols) := lapply(.SD, as.factor), .SDcols=sd_cols]

Okay, now on to the solution. We'll need CJ here because you also want the combinations that are absent from the data, so we have to generate those first.

uvals = c("no", "yes")
setkey(DT, Feature1, Feature2, Feature3)
DTn = DT[CJ(uvals, uvals, uvals), allow.cartesian=TRUE]

The allow.cartesian=TRUE is necessary because the join will result in more rows than max(nrow(x), nrow(i)) in a join x[i]. Read this post for more on allow.cartesian.
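If the cross-join step is unclear, this is roughly the same idea sketched in base R: build every combination with expand.grid() and left-join the data onto it, so that absent combinations show up as rows of NA. The data frame dat and its values are hypothetical and use only two feature columns to keep it short.

```r
# hypothetical data: the combinations no/no and yes/no never occur
dat <- data.frame(Feature1 = c("yes", "yes", "no"),
                  Feature2 = c("yes", "yes", "yes"),
                  Var1     = c("yes", "no", "yes"))

uvals <- c("no", "yes")
# every combination of the feature values, present in the data or not
grid <- expand.grid(Feature1 = uvals, Feature2 = uvals, stringsAsFactors = FALSE)

# all.x = TRUE keeps combinations that have no match; their Var1 is NA
full <- merge(grid, dat, by = c("Feature1", "Feature2"), all.x = TRUE)

# count the "yes" values per combination; na.rm = TRUE turns absent ones into 0
counts <- tapply(full$Var1 == "yes",
                 list(full$Feature1, full$Feature2),
                 sum, na.rm = TRUE)
print(counts)
```

The result has more rows than the grid whenever a combination occurs several times in the data, which is the situation allow.cartesian=TRUE permits in the data.table join.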

Now that we've all the combinations, we can group/aggregate them to obtain the results in the fashion you require.

ans = DTn[, { tmp1 = sum(Var1 == "yes", na.rm=TRUE);
              tmp2 = sum(Var2 == "yes", na.rm=TRUE);
              list(Var1.count = tmp1,
                   Var1.prop = tmp1/.N,
                   Var2.count = tmp2,
                   Var2.prop = tmp2/.N)  # a proportion, like Var1.prop
            }, by=key(DT)]

#    Feature1 Feature2 Feature3 Var1.count Var1.prop Var2.count Var2.prop
# 1:       no       no       no          0 0.0000000          1         1
# 2:       no       no      yes          0 0.0000000          0         0
# 3:       no      yes       no          0 0.0000000          0         0
# 4:       no      yes      yes          1 1.0000000          1         1
# 5:      yes       no       no          0 0.0000000          0         0
# 6:      yes       no      yes          0 0.0000000          0         0
# 7:      yes      yes       no          0 0.0000000          0         0
# 8:      yes      yes      yes          2 0.6666667          3         1

If you really need NA instead of 0 for the absent combinations, I think you can tweak this to get there.


Following the OP's follow-up question in the comments and the edit, once you have DTn:

vars = c("Var1", "Var2")
ans = DTn[, c(N=.N, lapply(.SD, function(x) sum(x=="yes", na.rm=TRUE))),
by=key(DTn), .SDcols=vars]
N = ans$N
ans[, N := NULL]
ans[, c(paste(vars, "prop", sep=".")) := .SD/N, .SDcols=vars]
setnames(ans, vars, paste(vars, "count", sep="."))

ans
#    Feature1 Feature2 Feature3 Var1.count Var2.count Var1.prop Var2.prop
# 1:       no       no       no          0          1 0.0000000         1
# 2:       no       no      yes          0          0 0.0000000         0
# 3:       no      yes       no          0          0 0.0000000         0
# 4:       no      yes      yes          1          1 1.0000000         1
# 5:      yes       no       no          0          0 0.0000000         0
# 6:      yes       no      yes          0          0 0.0000000         0
# 7:      yes      yes       no          0          0 0.0000000         0
# 8:      yes      yes      yes          2          3 0.6666667         1

Get the row (or column)-wise tabularized counts (as in table()) of a matrix

We convert the matrix from wide to long using melt from library(reshape2) and then tabulate it with table().

library(reshape2)
table(melt(m)[3:2])
#      Var2
# value 1 2 3
#     a 1 1 3
#     b 3 1 0
#     c 0 2 0
#     d 0 0 1

If we need the proportion, we can use prop.table and change the margin accordingly.

prop.table(table(melt(m)[3:2]),1)
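If you don't want to load reshape2, roughly the same table can be sketched in base R by pairing each value with its column index. The matrix m below is hypothetical, constructed only so that the counts match the output above.

```r
# hypothetical 4 x 3 character matrix matching the counts shown above
m <- matrix(c("a", "b", "b", "b",
              "a", "b", "c", "c",
              "a", "a", "a", "d"),
            ncol = 3)

# c(m) flattens the values column by column; col(m) gives each value's column index
tab <- table(value = c(m), Var2 = c(col(m)))

# margin = 1: proportions within each value (row); use 2 for within each column
row_props <- prop.table(tab, 1)

print(tab)
print(row_props)
```

Using row(m) in place of col(m) would tabulate the values per row of the matrix instead.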

Another convenient function is mtabulate from library(qdapTools):

library(qdapTools)
t(mtabulate(as.data.frame(m)))

