Obtaining Separate Summary Statistics by Categorical Variable with Stargazer Package
Solution
library(stargazer)
library(dplyr)
library(tidyr)
ToothGrowth %>%
  group_by(supp) %>%
  mutate(id = 1:n()) %>%
  ungroup() %>%
  gather(temp, val, len, dose) %>%
  unite(temp1, supp, temp, sep = '_') %>%
  spread(temp1, val) %>%
  select(-id) %>%
  as.data.frame() %>%
  stargazer(type = 'text')
Result
=========================================
Statistic N Mean St. Dev. Min Max
-----------------------------------------
OJ_dose 30 1.167 0.634 0.500 2.000
OJ_len 30 20.663 6.606 8.200 30.900
VC_dose 30 1.167 0.634 0.500 2.000
VC_len 30 16.963 8.266 4.200 33.900
-----------------------------------------
Explanation
This gets rid of the problem mentioned by the OP in a comment to the original answer, "What I really want is a single table with summary statistics separated by a categorical variable instead of creating separate tables." The easiest way I saw to do that with stargazer was to create a new data frame that had variables for each group's observations using a gather(), unite(), spread() strategy. The only trick to it is to avoid duplicate identifiers by creating unique identifiers by group and dropping that variable before calling stargazer().
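For readers on a newer tidyr, the same reshaping can probably be written with pivot_longer() and pivot_wider(), which supersede gather() and spread(). This is only a sketch of the idea rather than the original answer's code; the names wide, variable and val are my own. It stops at the data frame, which can then be passed to stargazer() as above.

```r
library(dplyr)
library(tidyr)

wide <- ToothGrowth %>%
  group_by(supp) %>%
  mutate(id = row_number()) %>%   # unique row id within each supp group
  ungroup() %>%
  # stack len and dose into one value column
  pivot_longer(c(len, dose), names_to = "variable", values_to = "val") %>%
  # one column per supp/variable pair, e.g. OJ_len (names are joined with "_")
  pivot_wider(id_cols = id, names_from = c(supp, variable), values_from = val) %>%
  select(-id) %>%                 # drop the helper id before summarising
  as.data.frame()

names(wide)
# stargazer::stargazer(wide, type = "text")  # assumes stargazer is installed
```

The column order may differ from the spread() version, since pivot_wider() keeps columns in order of first appearance instead of sorting them.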
Summary statistics for each category of categorical variables in R
I'm assuming you want each categorical variable handled separately rather than in combination.
You could start with
library(SmartEDA)
library(purrr)
map(c("gender", "education"),
    ~ExpCustomStat(demographics,
                   Cvar = .x,
                   Nvar = c("pandl_r2", "pandl_r3"),
                   stat = c('Count','Prop','mean','min','P0.25','median','p0.75','max'))
)
where Nvar lists the numeric variables to summarise, and the first argument to map() lists the categorical variables. If you want all the results stacked, you have to rename the first column to a generic name before stacking, like so:
library(dplyr)
map_dfr(c("gender", "education"),
        ~ExpCustomStat(demographics,
                       Cvar = .x,
                       Nvar = c("pandl_r2", "pandl_r3"),
                       stat = c('Count','Prop','mean','min','P0.25','median','p0.75','max')) |>
          rename_at(1, \(x) "var") |> mutate(catname = .x) |> relocate(catname)
)
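If you just want to see the stacking idea without SmartEDA, here is a rough dplyr-only sketch on the built-in mtcars data. It is not a replacement for ExpCustomStat(); the column names level and catname and the choice of cyl and am as the categorical variables are assumptions for illustration.

```r
library(dplyr)
library(purrr)

stacked <- map_dfr(c("cyl", "am"), function(cat_var) {
  mtcars %>%
    # generic column name so the per-variable results can be row-bound
    group_by(level = as.character(.data[[cat_var]])) %>%
    summarise(n = n(),
              mean_mpg = mean(mpg),
              median_mpg = median(mpg),
              .groups = "drop") %>%
    # record which categorical variable each block of rows came from
    mutate(catname = cat_var, .before = 1)
})

stacked
```

Converting the levels to character avoids type clashes when the categorical variables have different types.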
How to create a summary statistics table with two groups using stargazer?
Not entirely sure what your desired output is but does this help?
mtcars %>%
  group_by(am) %>%
  summarise(mpg = mean(mpg), disp = mean(disp), hp = mean(hp)) %>%
  gather(key = "variable", "value", mpg, disp, hp) %>%
  spread(am, value) %>%
  group_by(variable) %>%
  mutate(difference = `1` - `0`)
## Source: local data frame [3 x 4]
## Groups: variable [3]
##
## variable `0` `1` difference
## <chr> <dbl> <dbl> <dbl>
## 1 disp 290.37895 143.53077 -146.848178
## 2 hp 160.26316 126.84615 -33.417004
## 3 mpg 17.14737 24.39231 7.244939
R: Summary statistics for groups / subsets within panel data - code and layout
You can use this dplyr/tidy pipeline:
library(tidyverse)
dt %>%
  group_by(Rating) %>%
  summarize(mean_Revenue = mean(Revenue),
            mean_Costs = mean(Costs),
            mean_Age = mean(Age),
            Observations = n()
  ) %>%
  pivot_longer(cols = !Rating) %>%
  pivot_wider(id_cols = "name", names_from = Rating, values_from = value, names_glue = "Rating{.name}") %>%
  mutate(`Anova F-Test (p-value)` = c(sapply(dt %>% select(Revenue:Age), function(y) anova(lm(y ~ dt$Rating))$`Pr(>F)`[[1]]), NA)) %>%
  left_join(
    dt %>%
      pivot_longer(cols = Revenue:Age) %>%
      group_by(name = paste0("mean_", name)) %>%
      summarize(Total_means = mean(value))
  )
Output:
name Rating1 Rating2 Rating3 Rating4 Rating5 `Anova F-Test (p-value)` Total_means
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mean_Revenue 200 400 250 300 200 0.742 289.
2 mean_Costs 45 26.7 40 30 20 0.196 33.3
3 mean_Age 2 3 4 4 2 0.552 3
4 Observations 2 3 2 1 1 NA NA
Updated 4/22/22
- Original answer did not limit the anova to Ratings 1 and 5
# small function to get the anova p-value, restricted to the given ratings
get_anova <- function(y, rating, ratings = c(1, 5)) {
  y_ = y[rating %in% ratings]
  x_ = rating[rating %in% ratings]
  anova(lm(y_ ~ x_))$`Pr(>F)`[[1]]
}
dt %>%
  group_by(Rating) %>%
  summarize(mean_Revenue = mean(Revenue),
            mean_Costs = mean(Costs),
            mean_Age = mean(Age),
            Observations = n()
  ) %>%
  pivot_longer(cols = !Rating) %>%
  pivot_wider(id_cols = "name", names_from = Rating, values_from = value, names_glue = "Rating{.name}") %>%
  mutate(anova = c(sapply(dt %>% select(Revenue:Age), function(y) get_anova(y, rating = dt$Rating)), NA)) %>%
  left_join(
    dt %>%
      pivot_longer(cols = Revenue:Age) %>%
      group_by(name = paste0("mean_", name)) %>%
      summarize(Total_means = mean(value))
  )
Analysing a data frame that contains a time series using stargazer
You can either use split + lapply from base R:
library(stargazer)
lapply(split(df, df$year), stargazer, type = "text")
or by:
by(df, df$year, stargazer, type = 'text')
Result:
===============================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------------------
Population 10 9,083,988.000 7,541,970.000 491,723 21,759,420
Distance..km. 10 5,637.500 2,385.941 2,211 9,500
year 10 2,008.000 0.000 2,008 2,008
---------------------------------------------------------------
===============================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------------------
Population 10 9,361,404.000 7,798,880.000 496,963 22,549,547
Distance..km. 10 5,637.500 2,385.941 2,211 9,500
year 10 2,009.000 0.000 2,009 2,009
---------------------------------------------------------------
===============================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------------------
Population 10 9,645,370.000 8,065,676.000 502,384 23,369,131
Distance..km. 10 5,637.500 2,385.941 2,211 9,500
year 10 2,010.000 0.000 2,010 2,010
---------------------------------------------------------------
df$year: 2008
[1] ""
[2] "==============================================================="
[3] "Statistic N Mean St. Dev. Min Max "
[4] "---------------------------------------------------------------"
[5] "Population 10 9,083,988.000 7,541,970.000 491,723 21,759,420"
[6] "Distance..km. 10 5,637.500 2,385.941 2,211 9,500 "
[7] "year 10 2,008.000 0.000 2,008 2,008 "
[8] "---------------------------------------------------------------"
--------------------------------------------------------------------------
df$year: 2009
[1] ""
[2] "==============================================================="
[3] "Statistic N Mean St. Dev. Min Max "
[4] "---------------------------------------------------------------"
[5] "Population 10 9,361,404.000 7,798,880.000 496,963 22,549,547"
[6] "Distance..km. 10 5,637.500 2,385.941 2,211 9,500 "
[7] "year 10 2,009.000 0.000 2,009 2,009 "
[8] "---------------------------------------------------------------"
--------------------------------------------------------------------------
df$year: 2010
[1] ""
[2] "==============================================================="
[3] "Statistic N Mean St. Dev. Min Max "
[4] "---------------------------------------------------------------"
[5] "Population 10 9,645,370.000 8,065,676.000 502,384 23,369,131"
[6] "Distance..km. 10 5,637.500 2,385.941 2,211 9,500 "
[7] "year 10 2,010.000 0.000 2,010 2,010 "
[8] "---------------------------------------------------------------"
The disadvantage of these two methods is that they print out the tables twice (once from the stargazer output, and again from lapply/by). To get around this, you can use walk from purrr to call stargazer only for its side effects:
library(dplyr)
library(purrr)
df %>%
  split(.$year) %>%
  walk(~ stargazer(., type = "text"))
Result:
===============================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------------------
Population 10 9,083,988.000 7,541,970.000 491,723 21,759,420
Distance..km. 10 5,637.500 2,385.941 2,211 9,500
year 10 2,008.000 0.000 2,008 2,008
---------------------------------------------------------------
===============================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------------------
Population 10 9,361,404.000 7,798,880.000 496,963 22,549,547
Distance..km. 10 5,637.500 2,385.941 2,211 9,500
year 10 2,009.000 0.000 2,009 2,009
---------------------------------------------------------------
===============================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------------------
Population 10 9,645,370.000 8,065,676.000 502,384 23,369,131
Distance..km. 10 5,637.500 2,385.941 2,211 9,500
year 10 2,010.000 0.000 2,010 2,010
---------------------------------------------------------------
Note:
All methods above work for LaTeX output (type = "latex"). I only set type = "text" for demonstrative purposes.
Create and Export a Summary Statistics Table
You have a tibble, and stargazer doesn't support it. If you convert it to a data frame, it works.
library(stargazer)
data_stuct <- data.frame(data_stuct)
stargazer(data_stuct[c("BNBClose", "BTCClose", "ADAClose", "LINKClose",
"DODGEClose")],type="text",title="Summary Statistics", out="table1.txt")
#Summary Statistics
#========================================================================
#Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
#------------------------------------------------------------------------
#BNBClose 10 1.568 0.219 1.217 1.411 1.659 1.965
#BTCClose 10 4,507.324 220.613 4,229.360 4,339.010 4,731.635 4,826.480
#ADAClose 10 0.022 0.002 0.019 0.021 0.022 0.026
#LINKClose 10 0.408 0.043 0.346 0.385 0.440 0.476
#DODGEClose 10 0.001 0.00004 0.001 0.001 0.001 0.001
#------------------------------------------------------------------------
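To check whether an object is a tibble before handing it to stargazer, you can look at its class. A small sketch with made-up data (prices and its columns are hypothetical, not the OP's data):

```r
library(tibble)

# hypothetical toy data standing in for the OP's price table
prices <- tibble(a = c(1.2, 1.5, 1.9), b = c(10, 12, 11))

class(prices)      # includes "tbl_df", so this is a tibble

# as.data.frame() drops the tibble classes and leaves a plain data frame
prices_df <- as.data.frame(prices)
class(prices_df)
```

as.data.frame() does the same job here as the data.frame() call above, but it does not alter column names.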