Add Row to a Data Frame With Total Sum for Each Column


See adorn_totals() from the janitor package:

library(janitor)
x %>%
  adorn_totals("row")

#> Language Files   LOC
#>      C++  4009 15328
#>     Java   210   876
#>   Python    35   200
#>    Total  4254 16404

The numeric columns remain of class numeric.

Disclaimer: I created this package, including adorn_totals() which is made for precisely this task.

add row to a data frame that calculates sums of all numeric columns

The janitor package has this ready to go:


library(janitor)

df %>%
  adorn_totals("row", fill = "Total")

 segment subSegment var.1 var.2 var.3 var.4
    seg1    subseg1   100   200    50    60
    seg1    subseg2    20    30    50    50
    seg2    subseg1    30    30    40    35
    seg2    subseg2    50    70    20    53
    seg3    subseg1    40    30    30    42
    seg3    subseg2    40   140    40    20
   Total      Total   280   500   230   260

add column total to new row in data frame R

You could accomplish this in several ways, including some that are newer and more "tidy", but when the solution is this straightforward in base R, I prefer such an approach:

rbind(data, colSums(data[1:3, ]), colSums(data))
    A  B  C  D
V1  4  1  1  3
V2  4  2  2  2
V3  4  3  4  4
V4  4  4  4  1
5  12  6  7  9
6  16 10 11 10

If you'd like the row names to match your desired output example, here is one option:

data           <- rbind(data, colSums(data[1:3, ]), colSums(data))
rownames(data) <- c("V1", "V2", "V3", "V4", "V1:V3Sum", "V1:V4Sum")
          A  B  C  D
V1        4  1  1  3
V2        4  2  2  2
V3        4  3  4  4
V4        4  4  4  1
V1:V3Sum 12  6  7  9
V1:V4Sum 16 10 11 10

RELATIVE FREQUENCIES

You ask for a few more rows to reflect summary stats (relative frequencies). I believe this is what you wanted:

rbind(data,
      data[1, ] / data[5, ],
      data[2, ] / data[5, ],
      data[3, ] / data[5, ],
      data[4, ] / data[5, ],
      data[5, ] / data[6, ])
                  A          B          C          D
V1        4.0000000  1.0000000  1.0000000  3.0000000
V2        4.0000000  2.0000000  2.0000000  2.0000000
V3        4.0000000  3.0000000  4.0000000  4.0000000
V4        4.0000000  4.0000000  4.0000000  1.0000000
V1:V3Sum 12.0000000  6.0000000  7.0000000  9.0000000
V1:V4Sum 16.0000000 10.0000000 11.0000000 10.0000000
V11       0.3333333  0.1666667  0.1428571  0.3333333
V21       0.3333333  0.3333333  0.2857143  0.2222222
V31       0.3333333  0.5000000  0.5714286  0.4444444
V41       0.3333333  0.6666667  0.5714286  0.1111111
V1:V3Sum1 0.7500000  0.6000000  0.6363636  0.9000000
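For anyone wanting the same row-over-sums division in pandas, here is a sketch of the equivalent (values taken from the table above; the variable names are my own):

```python
import pandas as pd

# Values from the table above: rows V1..V4.
df = pd.DataFrame(
    [[4, 1, 1, 3],
     [4, 2, 2, 2],
     [4, 3, 4, 4],
     [4, 4, 4, 1]],
    columns=list("ABCD"), index=["V1", "V2", "V3", "V4"],
)

sums_13 = df.iloc[0:3].sum()   # column sums of V1..V3 (the V1:V3Sum row)
rel = df.div(sums_13, axis=1)  # each row divided by those sums
```

This reproduces the relative frequencies shown above, e.g. `rel.loc["V1", "A"]` is 4/12 ≈ 0.3333333.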

How to sum columns and rows in a wide R dataframe?

Another way is to first summarize and then bind_rows:

library(dplyr)

mydata %>%
  ungroup() %>%
  mutate(Total = rowSums(across(where(is.numeric)))) %>%
  bind_rows(summarize(., description.y = "Total", across(where(is.numeric), sum)))

Output

# A tibble: 3 x 13
  description.y  `2011`   `2012`   `2013`   `2014`   `2015`   `2016`   `2017`   `2018`   `2019`   `2020`  `2021`    Total
  <chr>           <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
1 X1             13186.   14988.   26289.   15238.   15987.   16325.   16595.   20236.   20627.   19554.  10426.  189450.
2 X2             82444.  103399.   86098.   88540.  113145.  113196.  122168.  120058.  130700.  136464.  56392. 1152605.
3 Total          95630.  118387.  112387.  103778.  129132.  129521.  138762.  140294.  151326.  156018.  66819. 1342055.
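The same row-and-column totals can be sketched in pandas for comparison (the small frame below is a hypothetical stand-in for mydata, with only two year columns):

```python
import pandas as pd

# Hypothetical stand-in for mydata, in the same wide shape.
df = pd.DataFrame({
    "description.y": ["X1", "X2"],
    "2011": [10.0, 20.0],
    "2012": [1.0, 2.0],
})

# Row totals across the numeric columns.
df["Total"] = df.sum(axis=1, numeric_only=True)

# Column totals appended as a final "Total" row.
totals = df.sum(numeric_only=True)
totals["description.y"] = "Total"
df.loc[len(df)] = totals
```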

Pandas add total row at the end of each unique column value

First reset the index of the dataframe, then group the dataframe on Type and aggregate: sum the Count column and take the max of the index column. Assign an Item column whose value is "Total" and make index the index again. Finally, pd.concat the frame with the original dataframe df and sort the index to maintain the order.

frame = df.reset_index()\
          .groupby('Type', as_index=False)\
          .agg({'Count': 'sum', 'index': 'max'})\
          .assign(Item='Total').set_index('index')

pd.concat([df, frame]).sort_index(ignore_index=True)
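As a self-contained sketch, here is the same approach run against a sample frame reconstructed from the output shown further below:

```python
import pandas as pd

# Sample data reconstructed from the printed result below.
df = pd.DataFrame({
    "Item": [1, 2, 3, 4, 5, 6, 7],
    "Type": ["Dog", "Dog", "Cat", "Cat", "Bird", "Bird", "Bird"],
    "Location": ["USA", "UK", "JAPAN", "UK", "CHINA", "SPAIN", "UAE"],
    "Count": [10, 20, 30, 40, 50, 60, 70],
})

# Per-Type total rows, positioned at the last index of each group.
frame = df.reset_index()\
          .groupby('Type', as_index=False)\
          .agg({'Count': 'sum', 'index': 'max'})\
          .assign(Item='Total').set_index('index')

out = pd.concat([df, frame]).sort_index(ignore_index=True)
```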

Another approach you might want to try (might be faster than the above one):

def summarize():
    for k, g in df.groupby('Type', sort=False):
        # DataFrame.append was deprecated and later removed, so build
        # the total row as a one-row frame and concatenate instead.
        total = pd.DataFrame([{'Item': 'Total', 'Type': k,
                               'Location': '', 'Count': g['Count'].sum()}])
        yield pd.concat([g, total], ignore_index=True)


pd.concat(summarize(), ignore_index=True)

which results:

    Item  Type Location  Count
0      1   Dog      USA     10
1      2   Dog       UK     20
2  Total   Dog              30
3      3   Cat    JAPAN     30
4      4   Cat       UK     40
5  Total   Cat              70
6      5  Bird    CHINA     50
7      6  Bird    SPAIN     60
8      7  Bird      UAE     70
9  Total  Bird             180

Sum and find the average of all the values in a data frame row based upon one of the data frame's column values in R

It's unclear exactly what you're trying to achieve, and the sample data is too sparse to help disambiguate, but here are my two guesses:

Averages Of Each Marker Within Each Population

This interpretation is most consistent with your sample output, in which each population (cluster) appears only once, as if the data were aggregated by population.

It is very straightforward in R to group data and then summarize it with aggregate functions.

Solution 1.1: dplyr

Here's a solution with the dplyr package, which is syntactically intuitive:

library(dplyr)

data_clean %>%
  # Overwrite the 'NA' column with the cluster labels.
  mutate(`NA` = metaClustering_perCell) %>%
  # Group by cluster labels...
  group_by(`NA`) %>%
  # ...and summarize the average of each marker (column).
  summarize(across(everything(), mean))

Solution 1.2: data.table

Here's a solution with data.table, which offers even better performance.

library(data.table)

as.data.table(data_clean)[,
  # Overwrite the 'NA' column with the cluster labels.
  ("NA") := metaClustering_perCell
][,
  # Summarize the average of each marker (column), as grouped by cluster.
  lapply(.SD, mean), by = `NA`
]

Result

Let the values for data_clean and metaClustering_perCell be as sampled in your question.

While the first result (1.1) will be a tibble and the second (1.2) a data.table, each will contain the following data:

          NA   EGFP.A CD43.PE.A CD45.PE.Vio770.A CD235a_41a.APC.A APC.Vio770.A
1 Population 278.6667  390.2222         384.8889         426.7778     417.3333
9 Population 328.0000  358.0000         430.0000         265.0000       0.0000
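The group-then-average step itself is not R-specific; here is a sketch of the same idea in pandas (the cluster labels and marker columns below are illustrative stand-ins, not the question's data):

```python
import pandas as pd

# Hypothetical markers grouped by a cluster label.
df = pd.DataFrame({
    "cluster": [1, 1, 9],
    "EGFP.A": [326.0, 314.0, 328.0],
    "CD43.PE.A": [435.0, 402.0, 358.0],
})

# Average of every marker column within each cluster.
means = df.groupby("cluster", as_index=False).mean()
```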

Cumulative Averages As Of Each Observation

This interpretation is most consistent with your algorithm, which seems to calculate its metrics (average, etc.) on a running basis, for each observation (row).

R also facilitates cumulative averages, sums, and so forth. It is far more efficient to leverage vectorized operations than to compute these metrics iteratively (with loops, the *apply() family, etc.) for each row.

Solution 2.1: dplyr

Serendipitously, dplyr already has its own cummean() function.

library(dplyr)

data_clean %>%
  # Overwrite the 'NA' column with the cluster labels.
  mutate(`NA` = metaClustering_perCell) %>%
  # Group by cluster labels...
  group_by(`NA`) %>%
  # ...and overwrite each marker (column) with its running average.
  mutate(across(everything(), cummean)) %>%
  ungroup()

Solution 2.2: data.table

With data.table we can improvise our own (anonymous) function

function(x) {
  cumsum(x) / seq_along(x)
}

which divides the running sum by the running count, to calculate the cumulative mean along a vector (column). We could also import dplyr and use cummean in place of our function.

library(data.table)

as.data.table(data_clean)[,
  # Overwrite the 'NA' column with the cluster labels.
  ("NA") := metaClustering_perCell
][,
  # Overwrite each marker (column) with its running average, as grouped by cluster.
  lapply(.SD, function(x) cumsum(x) / seq_along(x)), by = `NA`
]
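The running-sum-over-running-count identity is language-agnostic; as a quick NumPy sketch, using sample values chosen so the running means match the first marker column of the result below:

```python
import numpy as np

# Running mean as cumulative sum divided by running count,
# mirroring cumsum(x) / seq_along(x) above.
x = np.array([326.0, 314.0, 341.0])
running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)
```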

Result

Let the values for data_clean and metaClustering_perCell be as sampled in your question.

While the first result (2.1) will be a tibble and the second (2.2) a data.table, each will contain the following data:

          NA   EGFP.A CD43.PE.A CD45.PE.Vio770.A CD235a_41a.APC.A APC.Vio770.A
1 Population 326.0000  435.0000         399.0000         412.0000     447.0000
1 Population 320.0000  418.5000         392.0000         515.0000     469.0000
1 Population 327.0000  435.3333         387.6667         423.0000     460.6667
1 Population 245.2500  397.2500         400.2500         457.7500     454.7500
1 Population 235.8000  378.4000         397.0000         498.4000     459.2000
1 Population 245.6667  377.1667         386.0000         447.5000     437.3333
1 Population 257.0000  386.4286         388.2857         435.4286     439.5714
1 Population 263.5000  383.5000         388.7500         429.1250     425.3750
1 Population 278.6667  390.2222         384.8889         426.7778     417.3333
9 Population 328.0000  358.0000         430.0000         265.0000       0.0000

Pandas dataframe total row

Update June 2022

DataFrame.append is now deprecated (and was removed in pandas 2.0). You could use pd.concat instead, but it's probably easier to use df.loc['Total'] = df.sum(numeric_only=True), as Kevin Zhu commented. Or, better still, don't modify the data frame in place and keep your data separate from your summary statistics!


Append a totals row with

df.append(df.sum(numeric_only=True), ignore_index=True)

The numeric_only conversion is necessary only if you have a column of strings or objects.

It's a bit of a fragile solution, so I'd recommend sticking to operations on the dataframe, though. E.g.

baz = 2*df['qux'].sum() + 3*df['bar'].sum()
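The df.loc['Total'] approach recommended in the update above, as a minimal runnable sketch (the bar/qux columns echo the example; name is a hypothetical string column added to show why numeric_only matters):

```python
import pandas as pd

df = pd.DataFrame({"bar": [1.0, 2.0], "qux": [3.0, 4.0], "name": ["a", "b"]})

# Append a totals row; numeric_only=True skips the string column,
# which becomes NaN in the new row.
df.loc["Total"] = df.sum(numeric_only=True)
```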

