Looping Through T.Tests for Data Frame Subsets in R

Looping through t.tests for data frame subsets in r

Your example data is not sufficient to actually carry out t-tests on all subgroups. For that reason, I take the iris dataset, which contains 3 species of plants: Setosa, Versicolor, and Virginica. These are my groups. You will have to adjust your code accordingly. Below I show how to test one groups versus all other groups, one group versus each other group, and all combinations of individual groups.

One group versus all other groups combined:

First, let's say I want to compare Versicolor and Virginica to Setosa, i.e. Setosa is my group 1 to which all other groups should be compared. An easy way to achieve what you want is the following:

sapply(names(iris)[-ncol(iris)], function(x){
             t.test(iris[iris$Species=="setosa", x], 
                    iris[iris$Species!="setosa", x])$p.value 
                    })
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
7.709331e-32 1.035396e-13 1.746188e-69 1.347804e-60

Here, I have supplied the names of the different variables in the dataset names(iris) - exlcuding the column indicating the grouping variable [-ncol(iris)] (since it is the last column) - as vector to sapply, which passes the corresponding names as arguments to the function that I have defined.

One group versus each of the other groups:

In case you want to make groupwise comparisons for all groups, the following may be helpful: First, create a dataframe of all group x variable combinations that you are going to do, excluding the grouping variable itself and the reference group, of course. This can be achieved by:

comps <- expand.grid(unique(iris$Species)[-1], # excluding Setosa as reference group
                     names(iris)[-ncol(iris)] # excluding group column
                     )
head(comps)
        Var1         Var2
1 versicolor Sepal.Length
2  virginica Sepal.Length
3 versicolor  Sepal.Width
4  virginica  Sepal.Width
5 versicolor Petal.Length
6  virginica Petal.Length

Here, Var1 are the different species, and Var2 the different variables for which comparisons are to be done. The reference group 1 or Setosa is implicit in this case. Now, I can use apply to create the tests. I do this by using each row of comps as argument with two elements, the first of which indicates which group's turn it is, and the second argument indicates which variable should be compared. These will be used to subset the original dataframe.

comps$pval <- apply(comps, 1, function(x) {
    t.test(iris[iris$Species=="setosa", x[2]], iris[iris$Species==x[1], x[2]])$p.value 
    } )

where group 1 aka Setosa is hard-coded in the function. This gives me a dataframe with p-values for all combinations (with Setosa as reference group) so that they are easy to look up:

head(comps)
        Var1         Var2         pval
1 versicolor Sepal.Length 3.746743e-17
2  virginica Sepal.Length 3.966867e-25
3 versicolor  Sepal.Width 2.484228e-15
4  virginica  Sepal.Width 4.570771e-09
5 versicolor Petal.Length 9.934433e-46
6  virginica Petal.Length 9.269628e-50

All combinations of groups:

You can expand the above easily to produce a dataframe that contains p-values of t-tests for each combination of groups. One approach would be:

comps <- expand.grid(unique(iris$Species), unique(iris$Species), names(iris)[-ncol(iris)])

This now has three columns. The first two are the groups, and the third the variable:

head(comps)
        Var1       Var2         Var3
1     setosa     setosa Sepal.Length
2 versicolor     setosa Sepal.Length
3  virginica     setosa Sepal.Length
4     setosa versicolor Sepal.Length
5 versicolor versicolor Sepal.Length
6  virginica versicolor Sepal.Length

You can use this to carry out the tests:

comps$pval <- apply(comps, 1, function(x) {
  t.test(iris[iris$Species==x[1], x[3]], iris[iris$Species==x[2], x[3]])$p.value 
} )

I get an error message: what should I do?

t.test may throw out an error message if the sample size is too small or if the values are constant for one group. This is problematic since it might only occur for specific groups, and you may not know in advance which one it is. Yet the error will disrupt the entire function call to apply, and you will not be able to see any results.

A way to circumvent this and to identify the problematic groups is to wrap the function t.test around dplyr::failwith (see also ?tryCatch). To show how this works, consider the following:

smalln <- data.frame(a=1, b=2)
t.test(smalln$a, smalln$b)
> Error in t.test.default(smalln$a, smalln$b) : not enough 'x' observations

failproof.t <- failwith(default="Some default of your liking", t.test, quiet = T)
failproof.t(smalln$a, smalln$b)
[1] "Some default of your liking"

That way, whenever t.test would throw out an error, you get a character as a result instead and the computation continues with other groups. Needless to say, you could also set default to a number, or anything else. It does not have to be a character.

Statistical disclaimer:
Having said all of this, note that conducting a several t-tests is not necessarily good statistical practice. You may want to adjust your p-values to account for multiple testing, or you may want to use alternative test procedures that conducts joint tests.

How to write a loop to run the t-test of a data frame?

Here's a simple solution, which doesn't require additional packages:

lapply(testData[-1], function(x) t.test(x ~ testData$Label))

Here testData[-1] refers to all columns of testData but the first one (which contains the labels). Negative indexing is used for excluding data.

Nested loop - analysis of one variable by subsetting on two others variables

Comments on your code

First, about your loop, it can't fill in the data frame because you're calling the wrong index. For example:

for(j in 1:3){
  for(i in 1:4){
    results[j] <- something[j]
  }
}

In that case, j will only loop between 1 and 3, rewriting the previous result at each occurrence of the inner loop (in other words, you will write 3 times something in results[1], 3 times in results[2], ...). What you want to do is along those lines:

for(j in 0:2){
  for(i in 0:3){
    results[j*3 + i + 1] <- something[j]
  }
}

so that when i=j=0, you write in result[1], when i=1,j=0, you write in results[2], ..., when i=0,j=1 you write in results[4], ..., when i=3,j=2 you write in results[12]. This could be enough to make the loop do what you want.

In addition, there are two small things that are not best practice but shouldn't affect the results: I think all your as.vector() are not useful and have no effect, and adding rows to a data frame during a loop is not a great idea.

For the second one, the idea is that a data frame is usually stored in a consecutive range in memory (same for a vector or matrix). When you add a row, you need to append something to where the data frame is already stored, if there is no space the whole data frame will get copied, which is slow and inefficient. When using a for loop, you always want to initialize your results variables with the right length:

N <- 12 #the length you want
results <- data.frame(n = rep(NA, N),
                      ameans = rep(NA, N),
                      CIameanslower = rep(NA, N),
                      CIameansupper = rep(NA, N))
# or an easier equivalent way:
results <- matrix(NA, nrow=N, ncol=4)
results <- as.data.frame(results)
names(results) <- c("n", "ameans", "CIameanslower", "CIameansupper")

But in R, that is rarely a concern since we can usually vectorize the operations.

How to vectorize

You can do everything with base R, but why not use the best tools available: here it'll be much easier with the tidyverse (in particular the package dplyr).

library(tidyverse)

Now we can transform the original data frame.

graphdata1 %>%
  group_by(Gl, CS) %>%
  summarize(mean_RC = mean(RC),
            sd_RC = sd(RC),
            n = n())

So we easily have the mean, sd, and number of observations; you could add any summary statistic here.
But you want to do a t test. If I understand correctly, you want a one-sample test, comparing the mean in your sample to 0. You could try simply adding it in summarize:

graphdata1 %>%
  group_by(Gl, CS) %>%
  summarize(mean_RC = mean(RC),
            sd_RC = sd(RC),
            n = n(),
            t_test = t.test(RC))
# Error: Problem with `summarise()` input `t_test`.
# x Input `t_test` must be a vector, not a `htest` object.
# i Input `t_test` is `t.test(RC)`.
# i The error occurred in group 1: Gl = "c", CS = "1".

It doesn't work. But look at the error message: the test worked, but you can't just put the result of a test in a data frame. A magic trick is to use a "list-column": one of the columns of our data frame will be a list, that can contain anything, even whole test results.

graphdata1 %>%
  group_by(Gl, CS) %>%
  summarize(mean_RC = mean(RC),
            sd_RC = sd(RC),
            n = n(),
            res = list(t.test(RC)),
            .groups="drop")

I also added .groups="drop" to avoid having a grouping afterwards that may affect subsequent operations.

All we're left to do is extract the values of interest from the test results that are stored. There is again a trick: we need to specify that we want to do the computations row by row and not column by column, with rowwise().

graphdata1 %>%
  group_by(Gl, CS) %>%
  summarize(mean_RC = mean(RC),
            sd_RC = sd(RC),
            n = n(),
            res = list(t.test(RC)),
            .groups="drop") %>%
  rowwise() %>%
  mutate(lower.ci = res$conf.int[1],
         upper.ci = res$conf.int[2])

And we're done! We can use select() to remove the columns that are not interesting anymore and rename and order the ones to keep, and arrange() to sort the rows by 1 or more variables.

graphdata1 %>%
  group_by(Gl, CS) %>%
  summarize(mean_RC = mean(RC),
            sd_RC = sd(RC),
            n = n(),
            res = list(t.test(RC)),
            .groups="drop") %>%
  rowwise() %>%
  mutate(lower.ci = res$conf.int[1],
         upper.ci = res$conf.int[2]) %>%
  select(Gl, CS, mean_RC,
         conf_low = lower.ci, conf_high = upper.ci) %>%
  arrange(rev(Gl), CS)
#     Gl    CS    mean_RC conf_low conf_high
#    <fct> <fct>   <dbl>    <dbl>     <dbl>
# 1  a     1        213.     181.      245.
# 2  a     2        225.     190.      260.
# 3  a     3        257.     229.      285.
# 4  a     4        221.     184.      257.
# 5  b     1        242.     214.      270.
# 6  b     2        255.     222.      288.
# 7  b     3        225.     196.      255.
# 8  b     4        236.     207.      264.
# 9  c     1        248.     218.      277.
# 10 c     2        257.     224.      291.
# 11 c     3        258.     226.      289.
# 12 c     4        245.     207.      283.

How to loop though columns and subset the dataset according to the same value

Note that i and j in your code is a string, but actually you wanted to extract that column, like

for(j in race){
  for (i in race){

    df_1 <- subset(x, x[,i] == 1)
    df_2 <- subset(x, x[,j] == 1)
    print(paste(i, j, sep = " "))
    print(t.test(df_1$DV, df_2$DV) )


  }
}

With regarding to a better way of looping, it seems the dummy variable White, Latino, Black and Asian is mutually exclusive, therefore, perhaps we could rearrange data into

      race  DV
   ------------
1    Black   1
2    White   2
3   Latino   3
4    Black   4
5    Asian   5

and invoke t.test with formula, like

# generate synthetic data
rnd.race <- sample(1:4, 50, replace=T)
x <- data.frame(
  White = as.integer(rnd.race == 1),
  Latino = as.integer(rnd.race == 2),
  Black = as.integer(rnd.race == 3),
  Asian = as.integer(rnd.race == 4),
  DV = seq(1: length(rep(0:1, 50)))
)

race <- c("White", "Latino", "Black", "Asian")

# rearrange data, gather columns of dummy variables
x.cleaned = data.frame(
  race = race[apply(x[,1:4], 1, which.max)],
  DV = x$DV
)

t.test( DV ~ race, data=x.cleaned, race %in% c("White", "Black"))

# 
#     Welch Two Sample t-test
# 
# data:  DV by race
# t = -0.91517, df = 42.923, p-value = 0.3652
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -25.241536   9.483961
# sample estimates:
# mean in group Black mean in group White 
#            47.66667            55.54545 
#

The eensy benefit of using t.test with formula is its readability. For example, in the report of t.test, instead of mean in group x and mean in group y, it will say mean in group Black, mean in group White, and the formula itself states the variable at which we are testing covariant against.

To run t-test iteratively across all pairs, we could

run.test = function(race.pair) {
    list(t.test(DV ~ race, data=x.cleaned, race %in% race.pair) )
}

combn(race, 2, FUN = run.test)

# [[1]]
# 
#     Welch Two Sample t-test
# 
# data:  DV by race
# t = -0.30892, df = 41.997, p-value = 0.7589
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -21.22870  15.59233
# sample estimates:
# mean in group Latino  mean in group White 
#             52.72727             55.54545 
# 
# 
# [[2]]
# 
#     Welch Two Sample t-test
# 
# data:  DV by race
# t = -0.91517, df = 42.923, p-value = 0.3652
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -25.241536   9.483961
# sample estimates:
# mean in group Black mean in group White 
#            47.66667            55.54545 
# 
# ...

where combn(x, m, FUN = NULL, simplify = TRUE, ...) is a builtin to generate all combinations of the elements of x taken m at a time. For a more generate case using outer, see @askrun's answer.

Finally, IMHO, perhaps ANOVA is more widely recognized than t-test when comparing means between three or more groups (may also suggest why it is "inconvenient" to use t-test iteratively over pairs of groups).

With x.cleaned, we can easily use ANOVA in R, like:

aov.out = aov(DV ~ race, data=x.cleaned)
summary(aov.out)

Note that after one-way ANOVA (test if some of the group means are different), we may also run Post Hoc tests (like TukeyHSD(aov.out)) to find out specific pairs of group has different means. A few tests of assumptions are also de rigueur in a formal report. Here is a lecture notes related to this. And this is a related question on Cross-Validated (where further questions on which test to choose could be answered).

Writing loop or function for various t-tests

How about:

df_1_08 <- data.frame(year = 2008, a = runif(100, 0, 100), b = runif(100, 0, 100), c = runif(100, 0, 100), d = runif(100, 0, 100))
df_1_09 <- data.frame(year = 2009, a = runif(100, 0, 100), b = runif(100, 0, 100), c = runif(100, 0, 100), d = runif(100, 0, 100))
df_1_10 <- data.frame(year = 2010, a = runif(100, 0, 100), b = runif(100, 0, 100), c = runif(100, 0, 100), d = runif(100, 0, 100))
df_2_08 <- data.frame(year = 2008, a = runif(100, 0, 100), b = runif(100, 0, 100), c = runif(100, 0, 100), d = runif(100, 0, 100))
df_2_09 <- data.frame(year = 2009, a = runif(100, 0, 100), b = runif(100, 0, 100), c = runif(100, 0, 100), d = runif(100, 0, 100))
df_2_10 <- data.frame(year = 2010, a = runif(100, 0, 100), b = runif(100, 0, 100), c = runif(100, 0, 100), d = runif(100, 0, 100))

dfs_1.names <- ls()[grep("df_1", ls())]
dfs_2.names <- ls()[grep("df_2", ls())]
dfs_1.list <-lapply(dfs_1.names, get)
dfs_2.list <- lapply(dfs_2.names, get)

#in case you want to try the matrix
dfs_1.mtrx <- do.call("rbind",dfs_1.list)
dfs_2.mtrx <- do.call("rbind",dfs_2.list)

years <- intersect(unique(dfs_1.mtrx[,"year"]),unique(dfs_2.mtrx[,"year"]))
# [1] 2008 2009 2010
columns <- intersect(colnames(dfs_1.mtrx[,-1]),colnames(dfs_2.mtrx[,-1]))
# [1] "a" "b" "c" "d"

df.ttest <- as.data.frame(matrix(NA, ncol = 8, nrow = length(years)*length(columns)))
colnames(df.ttest) <- c("year","column","tstat","p.value","degreesf","low.conf","up.conf","data.name")
count = 0
for(i in 1:length(years)){
  for(j in columns){
    ttest <- t.test(dfs_1.list[[i]][j], dfs_2.list[[i]][j])
    ttest$data.name <- paste(paste0("df_1_",years[i]-2000,"$",j),"and",
                             paste0("df_2_",years[i]-2000,"$",j))
    count <- count + 1
    df.ttest[count, "year"]     <- years[i]
    df.ttest[count, "column"]   <- j
    df.ttest[count, "tstat"]    <- ttest$statistic
    df.ttest[count, "p.value"]  <- ttest$p.value
    df.ttest[count, "degreesf"] <- ttest$parameter
    df.ttest[count, "low.conf"] <- ttest$conf.int[1]
    df.ttest[count, "up.conf"]  <- ttest$conf.int[2]   
    df.ttest[count, "data.name"] <- ttest$data.name
  }
}
df.ttest

Which looks like:

   year column      tstat    p.value degreesf    low.conf   up.conf               data.name
1  2008      a  1.0607688 0.29008725 197.9914  -3.7038792 12.327117   df_1_8$a and df_2_8$a
2  2008      b  0.3311722 0.74086573 197.3689  -6.6956039  9.398291   df_1_8$b and df_2_8$b
3  2008      c  1.0410813 0.29910773 197.9405  -3.7582835 12.164152   df_1_8$c and df_2_8$c
4  2008      d  1.2623350 0.20834791 193.4532  -2.9384999 13.387911   df_1_8$d and df_2_8$d
5  2009      a -0.5764091 0.56500626 194.1686 -10.1442158  5.555762   df_1_9$a and df_2_9$a
6  2009      b -1.5222524 0.12954190 197.9248 -14.4317793  1.857603   df_1_9$b and df_2_9$b
7  2009      c -0.1744245 0.86171283 195.0217  -8.6590932  7.251902   df_1_9$c and df_2_9$c
8  2009      d  0.0839337 0.93319409 197.6654  -7.5768817  8.250526   df_1_9$d and df_2_9$d
9  2010      a  1.9125742 0.05724768 197.7406  -0.2353887 15.378495 df_1_10$a and df_2_10$a
10 2010      b  0.9024489 0.36792603 196.0224  -4.0977460 11.011904 df_1_10$b and df_2_10$b
11 2010      c -0.9735756 0.33145768 197.5899 -12.2641333  4.157135 df_1_10$c and df_2_10$c
12 2010      d  0.8721498 0.38418378 197.8601  -4.5311820 11.717207 df_1_10$d and df_2_10$d

Subset data frame within a for loop

Don't use assign use a list instead!

# for loop approach
results = list()
for(nm in names(data)[-1]) { # omit the first column
  results[[nm]] = data[data[[nm]] %in% "Y", "Column I want", drop = FALSE]
}

# lapply approach
results = lapply(data[-1], function(col) data[col %in% "Y", "Column I want", drop = FALSE])

The drop = FALSE arguments makes sure you get 1-column data frames, not vectors, as the result.

As for the issue in your approach, names[i] is just a string, so you're testing if, say, "var2" == "Y", which is false.

How to create a loop which creates multiple subset dataframes from a larger data frame?

Your code works fine. Just remove list so you create a vector of color names and not a list. If you only want distinct values, use unique.

mydata <- data.frame(x = c(1,2,3), y = c('a','b','c'), z = c('red','red','yellow'))

colors <- unique(mydata$z)

for (i in 1:length(colors)) {
    assign(paste0("mydata_",i), subset(mydata, z == colors[[i]]))
    }

Iterative t-tests in dataframe with lapply

How about this:

Since no example given, I am going to use mtcars as dataset which is present in R environment.

mtcars_subst <- mtcars[,c("mpg", "drat", "hp", "am")] #Sample data
#Final code for running t.test
lapply(mtcars_subst[,c(1,2,3)], function(x)t.test(x ~ am, data=mtcars_subst))
 # iteratively running t.test on column mpg, drat and hp basis am as the category

Updated after clarification:

Logic:

Filtering the cyl values on 4 and 6 , There are in total of three categories in cyl (4,6 and 8) , to select only 4,6. I have used %in% here. Either you filter within lapply or subset the required data outside then use lapply.

mtcars_subst <- mtcars[,c("mpg", "drat", "hp", "cyl")]
lapply(mtcars_subst[mtcars_subst$cyl %in% c(4,6),c(1,2,3)], function(x)t.test(x ~ cyl, data=mtcars_subst[mtcars_subst$cyl %in% c(4,6),])) #comparing certain 2 groups

Looping Through T.Tests for Data Frame Subsets in R