Obtaining Separate Summary Statistics by Categorical Variable with Stargazer Package

Solution

library(stargazer)
library(dplyr)
library(tidyr)

ToothGrowth %>%
  group_by(supp) %>%
  mutate(id = 1:n()) %>%                    # row id within each group
  ungroup() %>%
  gather(temp, val, len, dose) %>%          # long format: one row per value
  unite(temp1, supp, temp, sep = '_') %>%   # e.g. "OJ_len", "VC_dose"
  spread(temp1, val) %>%                    # one column per group/variable pair
  select(-id) %>%
  as.data.frame() %>%
  stargazer(type = 'text')

Result

=========================================
Statistic N   Mean  St. Dev.  Min   Max
-----------------------------------------
OJ_dose   30 1.167   0.634   0.500 2.000
OJ_len    30 20.663  6.606   8.200 30.900
VC_dose   30 1.167   0.634   0.500 2.000
VC_len    30 16.963  8.266   4.200 33.900
-----------------------------------------

Explanation

This addresses the problem the OP raised in a comment on the original answer: "What I really want is a single table with summary statistics separated by a categorical variable instead of creating separate tables." The easiest way I found to do that with stargazer is to build a new data frame with one column per group and variable, using a gather(), unite(), spread() strategy. The only trick is that spread() needs each row to be uniquely identified, so create an id within each group first and drop it before calling stargazer().
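
The same reshaping can be sketched with pivot_longer() and pivot_wider(), which tidyr offers as the newer replacements for gather() and spread(). This is an alternative under that assumption, not the original answer's code; the names `wide`, `variable` and `val` are made up for illustration.

```r
library(dplyr)
library(tidyr)

wide <- ToothGrowth %>%
  group_by(supp) %>%
  mutate(id = row_number()) %>%        # unique row id within each group
  ungroup() %>%
  pivot_longer(c(len, dose), names_to = "variable", values_to = "val") %>%
  pivot_wider(id_cols = id,
              names_from = c(supp, variable),  # joined with "_" by default
              values_from = val) %>%
  select(-id) %>%
  as.data.frame()                      # stargazer expects a plain data frame

names(wide)

# The result can then be passed on as before:
# stargazer::stargazer(wide, type = "text")
```

Passing both `supp` and `variable` to `names_from` replaces the separate unite() step.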

Summary statistics for each category of categorical variables in R

I'm assuming you want each categorical variable handled separately rather than in combination.
You could start with

library(SmartEDA)
library(purrr)
map(c("gender","education" ),
~ExpCustomStat(demographics,
Cvar=.x,
Nvar=c("pandl_r2","pandl_r3") ,
stat = c('Count','Prop','mean','min','P0.25','median','p0.75','max'))
)

Here Nvar lists the numeric variables to summarise, and the first argument to map() lists the categorical variables. If you want all the results stacked, rename the first column to a generic name before stacking, like so:

library(dplyr)
map_dfr(c("gender","education" ),
~ExpCustomStat(demographics,
Cvar=.x,
Nvar=c("pandl_r2","pandl_r3") ,
stat = c('Count','Prop','mean','min','P0.25','median','p0.75','max')) |>
rename_at(1, \(x)"var") |> mutate(catname = .x) |> relocate(catname)
)
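
If SmartEDA is not available, a similar stacked table can be sketched with dplyr and purrr alone. This swaps in plain group_by()/summarise() for ExpCustomStat(), so the column names differ from SmartEDA's output. Because `demographics` is not shown in the question, the sketch uses mtcars as a stand-in, with `cyl` and `gear` playing the role of the categorical variables and `mpg` the numeric one.

```r
library(dplyr)
library(purrr)

# Stand-in for the question's categorical columns (assumption)
cat_vars <- c("cyl", "gear")

stacked <- map_dfr(cat_vars, function(v) {
  mtcars %>%
    group_by(level = as.character(.data[[v]])) %>%  # generic name for the category column
    summarise(
      n      = n(),
      mean   = mean(mpg),
      min    = min(mpg),
      median = median(mpg),
      max    = max(mpg),
      .groups = "drop"
    ) %>%
    mutate(catname = v) %>%   # record which categorical variable this block describes
    relocate(catname)
})

stacked
```

Converting the grouping column to character keeps the stacked `level` column a single type even when the categorical variables have different types.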

How to create a summary statistics table with two groups using stargazer?

Not entirely sure what your desired output is but does this help?

library(dplyr)
library(tidyr)

mtcars %>%
  group_by(am) %>%
  summarise(mpg = mean(mpg), disp = mean(disp), hp = mean(hp)) %>%
  gather(key = "variable", "value", mpg, disp, hp) %>%
  spread(am, value) %>%
  group_by(variable) %>%
  mutate(difference = `1` - `0`)

## Source: local data frame [3 x 4]
## Groups: variable [3]
##
## variable `0` `1` difference
## <chr> <dbl> <dbl> <dbl>
## 1 disp 290.37895 143.53077 -146.848178
## 2 hp 160.26316 126.84615 -33.417004
## 3 mpg 17.14737 24.39231 7.244939

R: Summary statistics for groups / subsets within panel data - code and layout

You can use this dplyr/tidy pipeline:

library(tidyverse)

dt %>%
group_by(Rating) %>%
summarize(mean_Revenue = mean(Revenue),
mean_Costs = mean(Costs),
mean_Age = mean(Age),
Observations=n()
) %>%
pivot_longer(cols = !Rating) %>%
pivot_wider(id_cols = "name",names_from = Rating,values_from = value,names_glue = "Rating{.name}") %>%
mutate(`Anova F-Test (p-value)` = c(sapply(dt %>% select(Revenue:Age), function(y) anova(lm(y~dt$Rating))$`Pr(>F)`[[1]]),NA)) %>%
left_join(
dt %>%
pivot_longer(cols=Revenue:Age) %>%
group_by(name = paste0("mean_",name)) %>%
summarize(Total_means=mean(value))
)

Output:

  name         Rating1 Rating2 Rating3 Rating4 Rating5 `Anova F-Test (p-value)` Total_means
  <chr>          <dbl>   <dbl>   <dbl>   <dbl>   <dbl>                    <dbl>       <dbl>
1 mean_Revenue     200   400       250     300     200                    0.742       289.
2 mean_Costs        45    26.7      40      30      20                    0.196        33.3
3 mean_Age           2     3         4       4       2                    0.552         3
4 Observations       2     3         2       1       1                   NA            NA

Updated 4/22/22

  • The original answer did not limit the ANOVA to Ratings 1 and 5; this version does.

# small function to get anova
get_anova <-function(y,rating, ratings=c(1,5)) {
y_ = y[rating %in% ratings]
x_ = rating[rating %in% ratings]
anova(lm(y_~x_))$`Pr(>F)`[[1]]
}

dt %>%
group_by(Rating) %>%
summarize(mean_Revenue = mean(Revenue),
mean_Costs = mean(Costs),
mean_Age = mean(Age),
Observations=n()
) %>%
pivot_longer(cols = !Rating) %>%
pivot_wider(id_cols = "name",names_from = Rating,values_from = value,names_glue = "Rating{.name}") %>%
mutate(anova = c(sapply(dt %>% select(Revenue:Age), function(y) get_anova(y,rating=dt$Rating)),NA)) %>%
left_join(
dt %>%
pivot_longer(cols=Revenue:Age) %>%
group_by(name = paste0("mean_",name)) %>%
summarize(Total_means=mean(value))
)
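
As a quick check of what get_anova() does, the sketch below runs it on a small made-up data frame (the `dt` here is hypothetical, not the question's data) and compares the result with an ANOVA fitted on the rows for Ratings 1 and 5 only.

```r
# Same helper as above, repeated so the example is self-contained
get_anova <- function(y, rating, ratings = c(1, 5)) {
  y_ <- y[rating %in% ratings]
  x_ <- rating[rating %in% ratings]
  anova(lm(y_ ~ x_))$`Pr(>F)`[[1]]
}

# Hypothetical data: three rating groups, one numeric outcome
dt <- data.frame(
  Rating  = c(1, 1, 1, 3, 3, 5, 5, 5),
  Revenue = c(200, 220, 210, 400, 380, 260, 300, 280)
)

p_1_vs_5 <- get_anova(dt$Revenue, dt$Rating)

# Reference: fit the same model on the subset directly
sub     <- dt[dt$Rating %in% c(1, 5), ]
p_check <- anova(lm(Revenue ~ Rating, data = sub))$`Pr(>F)`[[1]]

all.equal(p_1_vs_5, p_check)
```

The Rating 3 rows do not affect the p-value, which is the behaviour the update was meant to introduce.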

Analysing a data frame that contains a time series using stargazer

You can either use split + lapply from base R:

library(stargazer)

lapply(split(df, df$year), stargazer, type = "text")

or by:

by(df, df$year, stargazer, type = 'text')

Result:

===============================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------------------
Population 10 9,083,988.000 7,541,970.000 491,723 21,759,420
Distance..km. 10 5,637.500 2,385.941 2,211 9,500
year 10 2,008.000 0.000 2,008 2,008
---------------------------------------------------------------

===============================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------------------
Population 10 9,361,404.000 7,798,880.000 496,963 22,549,547
Distance..km. 10 5,637.500 2,385.941 2,211 9,500
year 10 2,009.000 0.000 2,009 2,009
---------------------------------------------------------------

===============================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------------------
Population 10 9,645,370.000 8,065,676.000 502,384 23,369,131
Distance..km. 10 5,637.500 2,385.941 2,211 9,500
year 10 2,010.000 0.000 2,010 2,010
---------------------------------------------------------------
df$year: 2008
[1] ""
[2] "==============================================================="
[3] "Statistic N Mean St. Dev. Min Max "
[4] "---------------------------------------------------------------"
[5] "Population 10 9,083,988.000 7,541,970.000 491,723 21,759,420"
[6] "Distance..km. 10 5,637.500 2,385.941 2,211 9,500 "
[7] "year 10 2,008.000 0.000 2,008 2,008 "
[8] "---------------------------------------------------------------"
--------------------------------------------------------------------------
df$year: 2009
[1] ""
[2] "==============================================================="
[3] "Statistic N Mean St. Dev. Min Max "
[4] "---------------------------------------------------------------"
[5] "Population 10 9,361,404.000 7,798,880.000 496,963 22,549,547"
[6] "Distance..km. 10 5,637.500 2,385.941 2,211 9,500 "
[7] "year 10 2,009.000 0.000 2,009 2,009 "
[8] "---------------------------------------------------------------"
--------------------------------------------------------------------------
df$year: 2010
[1] ""
[2] "==============================================================="
[3] "Statistic N Mean St. Dev. Min Max "
[4] "---------------------------------------------------------------"
[5] "Population 10 9,645,370.000 8,065,676.000 502,384 23,369,131"
[6] "Distance..km. 10 5,637.500 2,385.941 2,211 9,500 "
[7] "year 10 2,010.000 0.000 2,010 2,010 "
[8] "---------------------------------------------------------------"

The disadvantage of these two methods is that they print the tables twice (once from stargazer itself, and once from the return value of lapply/by). To get around this, you can use walk from purrr to call stargazer only for its side effects:

library(dplyr)
library(purrr)

df %>%
split(.$year) %>%
walk(~ stargazer(., type = "text"))

Result:

===============================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------------------
Population 10 9,083,988.000 7,541,970.000 491,723 21,759,420
Distance..km. 10 5,637.500 2,385.941 2,211 9,500
year 10 2,008.000 0.000 2,008 2,008
---------------------------------------------------------------

===============================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------------------
Population 10 9,361,404.000 7,798,880.000 496,963 22,549,547
Distance..km. 10 5,637.500 2,385.941 2,211 9,500
year 10 2,009.000 0.000 2,009 2,009
---------------------------------------------------------------

===============================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------------------
Population 10 9,645,370.000 8,065,676.000 502,384 23,369,131
Distance..km. 10 5,637.500 2,385.941 2,211 9,500
year 10 2,010.000 0.000 2,010 2,010
---------------------------------------------------------------

Note:

All of the methods above also work for LaTeX output (type = "latex"). I only set type = "text" for demonstration purposes.
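
One possible refinement, sketched here on a made-up data frame rather than the question's `df`: purrr's iwalk() also passes the name of each split piece, which can be used as the table title so that each table is labelled with its year. The column values below are invented for illustration.

```r
library(stargazer)
library(purrr)

# Hypothetical panel: two variables observed over three years
df <- data.frame(
  year       = rep(2008:2010, each = 4),
  Population = c(120, 340, 560, 780, 130, 350, 570, 790, 140, 360, 580, 800),
  Distance   = rep(c(2211, 4500, 6300, 9500), times = 3)
)

# Drop the year column before splitting: it is constant within each piece
by_year <- split(df[c("Population", "Distance")], df$year)

# iwalk() passes each piece as .x and its name (the year) as .y
iwalk(by_year, ~ stargazer(.x, type = "text", title = paste("Year", .y)))
```

Dropping `year` before splitting also removes the uninformative row with a standard deviation of zero from each table.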

Create and Export a Summary Statistics Table

Your data is a tibble, which stargazer doesn't support. Convert it to a plain data frame and it works.

library(stargazer)

data_stuct <- data.frame(data_stuct)

stargazer(data_stuct[c("BNBClose", "BTCClose", "ADAClose", "LINKClose",
"DODGEClose")],type="text",title="Summary Statistics", out="table1.txt")

#Summary Statistics
#========================================================================
#Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
#------------------------------------------------------------------------
#BNBClose 10 1.568 0.219 1.217 1.411 1.659 1.965
#BTCClose 10 4,507.324 220.613 4,229.360 4,339.010 4,731.635 4,826.480
#ADAClose 10 0.022 0.002 0.019 0.021 0.022 0.026
#LINKClose 10 0.408 0.043 0.346 0.385 0.440 0.476
#DODGEClose 10 0.001 0.00004 0.001 0.001 0.001 0.001
#------------------------------------------------------------------------
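
To see the difference between the two classes, here is a minimal sketch with a made-up tibble (the prices are invented, and `prices` and `prices_df` are illustrative names). as.data.frame() drops the tibble classes, which is what the data.frame() call above achieves as well.

```r
library(tibble)
library(stargazer)

# Hypothetical closing prices stored as a tibble
prices <- tibble(
  BTCClose = c(4229.36, 4339.01, 4507.32, 4731.64, 4826.48),
  BNBClose = c(1.217, 1.411, 1.568, 1.659, 1.965)
)

class(prices)      # "tbl_df" "tbl" "data.frame"

prices_df <- as.data.frame(prices)
class(prices_df)   # "data.frame"

stargazer(prices_df, type = "text", title = "Summary Statistics")
```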

