Add Row to a Data Frame With Total Sum for Each Column


See adorn_totals() from the janitor package:

library(janitor)
x %>%
  adorn_totals("row")

#> Language Files   LOC
#>      C++  4009 15328
#>     Java   210   876
#>   Python    35   200
#>    Total  4254 16404

The numeric columns remain of class numeric.

Disclaimer: I created this package, including adorn_totals() which is made for precisely this task.

add row to a data frame that calculates sums of all numeric columns

The janitor package has this ready to go:


library(janitor)

df %>%
  adorn_totals("row", fill = "Total")

 segment subSegment var.1 var.2 var.3 var.4
    seg1    subseg1   100   200    50    60
    seg1    subseg2    20    30    50    50
    seg2    subseg1    30    30    40    35
    seg2    subseg2    50    70    20    53
    seg3    subseg1    40    30    30    42
    seg3    subseg2    40   140    40    20
   Total      Total   280   500   230   260

add column total to new row in data frame R

You could accomplish this in several ways, including some that are newer and more "tidy", but when the solution is this straightforward in base R, I prefer such an approach:

rbind(data, colSums(data[1:3, ]), colSums(data))
    A  B  C  D
V1  4  1  1  3
V2  4  2  2  2
V3  4  3  4  4
V4  4  4  4  1
5  12  6  7  9
6  16 10 11 10

If you'd like the row names to match your desired output example, here is one option:

data           <- rbind(data, colSums(data[1:3, ]), colSums(data))
rownames(data) <- c("V1", "V2", "V3", "V4", "V1:V3Sum", "V1:V4Sum")
          A  B  C  D
V1        4  1  1  3
V2        4  2  2  2
V3        4  3  4  4
V4        4  4  4  1
V1:V3Sum 12  6  7  9
V1:V4Sum 16 10 11 10

RELATIVE FREQUENCIES

You ask for a few more rows to reflect summary stats (relative frequencies). I believe this is what you wanted:

rbind(data,
      data[1, ] / data[5, ],
      data[2, ] / data[5, ],
      data[3, ] / data[5, ],
      data[4, ] / data[5, ],
      data[5, ] / data[6, ])
                  A          B          C          D
V1        4.0000000  1.0000000  1.0000000  3.0000000
V2        4.0000000  2.0000000  2.0000000  2.0000000
V3        4.0000000  3.0000000  4.0000000  4.0000000
V4        4.0000000  4.0000000  4.0000000  1.0000000
V1:V3Sum 12.0000000  6.0000000  7.0000000  9.0000000
V1:V4Sum 16.0000000 10.0000000 11.0000000 10.0000000
V11       0.3333333  0.1666667  0.1428571  0.3333333
V21       0.3333333  0.3333333  0.2857143  0.2222222
V31       0.3333333  0.5000000  0.5714286  0.4444444
V41       0.3333333  0.6666667  0.5714286  0.1111111
V1:V3Sum1 0.7500000  0.6000000  0.6363636  0.9000000
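For anyone wanting the same row-over-sums division in pandas, here is a sketch of the equivalent (values taken from the table above; the variable names are my own):

```python
import pandas as pd

# Values from the table above: rows V1..V4.
df = pd.DataFrame(
    [[4, 1, 1, 3],
     [4, 2, 2, 2],
     [4, 3, 4, 4],
     [4, 4, 4, 1]],
    columns=list("ABCD"), index=["V1", "V2", "V3", "V4"],
)

sums_13 = df.iloc[0:3].sum()   # column sums of V1..V3 (the V1:V3Sum row)
rel = df.div(sums_13, axis=1)  # each row divided by those sums
```

This reproduces the relative frequencies shown above, e.g. `rel.loc["V1", "A"]` is 4/12 ≈ 0.3333333.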

How to sum columns and rows in a wide R dataframe?

Another way is to first summarize and then bind_rows:

library(dplyr)

mydata %>%
  ungroup() %>%
  mutate(Total = rowSums(across(where(is.numeric)))) %>%
  bind_rows(summarize(., description.y = "Total", across(where(is.numeric), sum)))

Output

# A tibble: 3 x 13
  description.y  `2011`   `2012`   `2013`   `2014`   `2015`   `2016`   `2017`   `2018`   `2019`   `2020`  `2021`    Total
  <chr>           <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
1 X1             13186.   14988.   26289.   15238.   15987.   16325.   16595.   20236.   20627.   19554.  10426.  189450.
2 X2             82444.  103399.   86098.   88540.  113145.  113196.  122168.  120058.  130700.  136464.  56392. 1152605.
3 Total          95630.  118387.  112387.  103778.  129132.  129521.  138762.  140294.  151326.  156018.  66819. 1342055.
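The same row-and-column totals can be sketched in pandas for comparison (the small frame below is a hypothetical stand-in for mydata, with only two year columns):

```python
import pandas as pd

# Hypothetical stand-in for mydata, in the same wide shape.
df = pd.DataFrame({
    "description.y": ["X1", "X2"],
    "2011": [10.0, 20.0],
    "2012": [1.0, 2.0],
})

# Row totals across the numeric columns.
df["Total"] = df.sum(axis=1, numeric_only=True)

# Column totals appended as a final "Total" row.
totals = df.sum(numeric_only=True)
totals["description.y"] = "Total"
df.loc[len(df)] = totals
```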

Pandas add total row at the end of each unique column value

First reset the index of the dataframe, then group the dataframe on Type and aggregate: sum the Count column and take the max of the index column. Assign an Item column whose value is "Total" and make index the index again. Finally, pd.concat the frame with the original dataframe df and sort the index to maintain the order.

frame = df.reset_index()\
          .groupby('Type', as_index=False)\
          .agg({'Count': 'sum', 'index': 'max'})\
          .assign(Item='Total').set_index('index')

pd.concat([df, frame]).sort_index(ignore_index=True)
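As a self-contained sketch, here is the same approach run against a sample frame reconstructed from the output shown further below:

```python
import pandas as pd

# Sample data reconstructed from the printed result below.
df = pd.DataFrame({
    "Item": [1, 2, 3, 4, 5, 6, 7],
    "Type": ["Dog", "Dog", "Cat", "Cat", "Bird", "Bird", "Bird"],
    "Location": ["USA", "UK", "JAPAN", "UK", "CHINA", "SPAIN", "UAE"],
    "Count": [10, 20, 30, 40, 50, 60, 70],
})

# Per-Type total rows, positioned at the last index of each group.
frame = df.reset_index()\
          .groupby('Type', as_index=False)\
          .agg({'Count': 'sum', 'index': 'max'})\
          .assign(Item='Total').set_index('index')

out = pd.concat([df, frame]).sort_index(ignore_index=True)
```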

Another approach you might want to try (might be faster than the above one):

def summarize():
    for k, g in df.groupby('Type', sort=False):
        # DataFrame.append was deprecated and later removed, so build
        # the total row as a one-row frame and concatenate instead.
        total = pd.DataFrame([{'Item': 'Total', 'Type': k,
                               'Location': '', 'Count': g['Count'].sum()}])
        yield pd.concat([g, total], ignore_index=True)


pd.concat(summarize(), ignore_index=True)

which results:

    Item  Type Location  Count
0      1   Dog      USA     10
1      2   Dog       UK     20
2  Total   Dog              30
3      3   Cat    JAPAN     30
4      4   Cat       UK     40
5  Total   Cat              70
6      5  Bird    CHINA     50
7      6  Bird    SPAIN     60
8      7  Bird      UAE     70
9  Total  Bird             180

Sum and find the average of all the values in a data frame row based upon one of the data frame's column values in R

It's unclear exactly what you're trying to achieve, and the sample data is too sparse to help disambiguate, but here are my two guesses:

Averages Of Each Marker Within Each Population

This interpretation is most consistent with your sample output, in which each population (cluster) appears only once, as if the data were aggregated by population.

It is very straightforward in R to group data and then summarize it with aggregate functions.

Solution 1.1: dplyr

Here's a solution with the dplyr package, which is syntactically intuitive:

library(dplyr)

data_clean %>%
  # Overwrite the 'NA' column with the cluster labels.
  mutate(`NA` = metaClustering_perCell) %>%
  # Group by cluster labels...
  group_by(`NA`) %>%
  # ...and summarize the average of each marker (column).
  summarize(across(everything(), mean))

Solution 1.2: data.table

Here's a solution with data.table, which offers even better performance.

library(data.table)

as.data.table(data_clean)[,
  # Overwrite the 'NA' column with the cluster labels.
  ("NA") := metaClustering_perCell
][,
  # Summarize the average of each marker (column), as grouped by cluster.
  lapply(.SD, mean), by = `NA`
]

Result

Let the values for data_clean and metaClustering_perCell be as sampled in your question.

While the first result (1.1) will be a tibble and the second (1.2) a data.table, each will contain the following data:

          NA   EGFP.A CD43.PE.A CD45.PE.Vio770.A CD235a_41a.APC.A APC.Vio770.A
1 Population 278.6667  390.2222         384.8889         426.7778     417.3333
9 Population 328.0000  358.0000         430.0000         265.0000       0.0000
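The group-then-average step itself is not R-specific; here is a sketch of the same idea in pandas (the cluster labels and marker columns below are illustrative stand-ins, not the question's data):

```python
import pandas as pd

# Hypothetical markers grouped by a cluster label.
df = pd.DataFrame({
    "cluster": [1, 1, 9],
    "EGFP.A": [326.0, 314.0, 328.0],
    "CD43.PE.A": [435.0, 402.0, 358.0],
})

# Average of every marker column within each cluster.
means = df.groupby("cluster", as_index=False).mean()
```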

Cumulative Averages As Of Each Observation

This interpretation is most consistent with your algorithm, which seems to calculate its metrics (average, etc.) on a running basis, for each observation (row).

R also facilitates cumulative averages, sums, and so forth. It is far more efficient to leverage vectorized operations than to compute these metrics iteratively (with loops, the *apply() family, etc.) for each row.

Solution 2.1: dplyr

Serendipitously, dplyr already has its own cummean() function.

library(dplyr)

data_clean %>%
  # Overwrite the 'NA' column with the cluster labels.
  mutate(`NA` = metaClustering_perCell) %>%
  # Group by cluster labels...
  group_by(`NA`) %>%
  # ...and overwrite each marker (column) with its running average.
  mutate(across(everything(), cummean)) %>%
  ungroup()

Solution 2.2: data.table

With data.table we can improvise our own (anonymous) function

function(x) {
  cumsum(x) / seq_along(x)
}

which divides the running sum by the running count, to calculate the cumulative mean along a vector (column). We could also import dplyr and use cummean in place of our function.

library(data.table)

as.data.table(data_clean)[,
  # Overwrite the 'NA' column with the cluster labels.
  ("NA") := metaClustering_perCell
][,
  # Overwrite each marker (column) with its running average, as grouped by cluster.
  lapply(.SD, function(x) cumsum(x) / seq_along(x)), by = `NA`
]
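The running-sum-over-running-count identity is language-agnostic; as a quick NumPy sketch, using sample values chosen so the running means match the first marker column of the result below:

```python
import numpy as np

# Running mean as cumulative sum divided by running count,
# mirroring cumsum(x) / seq_along(x) above.
x = np.array([326.0, 314.0, 341.0])
running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)
```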

Result

Let the values for data_clean and metaClustering_perCell be as sampled in your question.

While the first result (2.1) will be a tibble and the second (2.2) a data.table, each will contain the following data:

          NA   EGFP.A CD43.PE.A CD45.PE.Vio770.A CD235a_41a.APC.A APC.Vio770.A
1 Population 326.0000  435.0000         399.0000         412.0000     447.0000
1 Population 320.0000  418.5000         392.0000         515.0000     469.0000
1 Population 327.0000  435.3333         387.6667         423.0000     460.6667
1 Population 245.2500  397.2500         400.2500         457.7500     454.7500
1 Population 235.8000  378.4000         397.0000         498.4000     459.2000
1 Population 245.6667  377.1667         386.0000         447.5000     437.3333
1 Population 257.0000  386.4286         388.2857         435.4286     439.5714
1 Population 263.5000  383.5000         388.7500         429.1250     425.3750
1 Population 278.6667  390.2222         384.8889         426.7778     417.3333
9 Population 328.0000  358.0000         430.0000         265.0000       0.0000

Pandas dataframe total row

Update June 2022

DataFrame.append is now deprecated (and was removed in pandas 2.0). You could use pd.concat instead, but it's probably easier to use df.loc['Total'] = df.sum(numeric_only=True), as Kevin Zhu commented. Or, better still, don't modify the data frame in place and keep your data separate from your summary statistics!


Append a totals row with

df.append(df.sum(numeric_only=True), ignore_index=True)

The numeric_only conversion is necessary only if you have a column of strings or objects.

It's a bit of a fragile solution, so I'd recommend sticking to operations on the dataframe, though. E.g.

baz = 2*df['qux'].sum() + 3*df['bar'].sum()
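The df.loc['Total'] approach recommended in the update above, as a minimal runnable sketch (the bar/qux columns echo the example; name is a hypothetical string column added to show why numeric_only matters):

```python
import pandas as pd

df = pd.DataFrame({"bar": [1.0, 2.0], "qux": [3.0, 4.0], "name": ["a", "b"]})

# Append a totals row; numeric_only=True skips the string column,
# which becomes NaN in the new row.
df.loc["Total"] = df.sum(numeric_only=True)
```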

