Row-Wise Sum of Values Grouped by Columns with Same Name

Row-wise sum of values grouped by columns with same name

We can transpose dat, calculate rowsum per group (the column names of the original dat), then transpose the result back to the original structure.

t(rowsum(t(dat), group = colnames(dat), na.rm = TRUE))
# A C G T
#1 1 0 1 0
#2 4 0 6 0
#3 0 1 0 1
#4 2 0 1 0
#5 1 0 1 0
#6 0 1 0 1
#7 0 1 0 1
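For comparison, the same transpose-then-aggregate trick can be sketched in pandas (the frame `dat` and its values here are made up for illustration, not taken from the question):

```python
import pandas as pd

# Illustrative data with duplicated column names
dat = pd.DataFrame([[1, 0, 1, 0],
                    [2, 2, 3, 3]],
                   columns=["A", "G", "A", "G"])

# Transpose, group the rows by their (duplicated) labels, sum, transpose back
summed = dat.T.groupby(level=0).sum().T
print(summed)
#    A  G
# 0  2  0
# 1  5  5
```

Note that `.sum()` skips NaN by default, which matches the `na.rm = TRUE` behavior in the R call.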

Sum row-wise values that are grouped by column name but keep all columns in R?

You can try ave as below (with the help of col() and row() to build the groupings)

> ave(myMat, colnames(myMat)[col(myMat)], row(myMat), FUN = sum)
x y x y
[1,] 1 3 1 3
[2,] 5 9 5 9
[3,] 4 13 4 13
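An analogous keep-all-columns version can be sketched in pandas (illustrative data mirroring the matrix above): `transform("sum")` broadcasts each group's sum back onto every member column, so the duplicated columns all survive.

```python
import pandas as pd

# Hypothetical matrix with duplicated column names
myMat = pd.DataFrame([[0, 1, 1, 2],
                      [2, 4, 3, 5],
                      [1, 6, 3, 7]],
                     columns=["x", "y", "x", "y"])

# Group the transposed rows by label and broadcast the group sums back
out = myMat.T.groupby(level=0).transform("sum").T
print(out)
# row values: [1, 3, 1, 3], [5, 9, 5, 9], [4, 13, 4, 13]
```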

How do I sum all observations row-wise for all columns that have the same name?

We can split the dataset by its column names with split.default, then apply rowSums to each subset in the resulting list of datasets with the same name.

do.call(cbind, lapply(split.default(dfN, names(dfN)), rowSums, na.rm = TRUE))
# BRANT HALDIMAND.NORFOLK HALTON HAMILTON MUSKOKA NIAGARA PEEL TORONTO YORK
#[1,] 2943 2167 11874 11155 951 8598 32593 60779 23852
#[2,] 2822 2053 11817 11850 1040 8745 34110 67638 24678
#[3,] 2943 2179 11579 12051 1051 8853 34387 70146 24566

Or, as @thelatemail mentioned, if we need a data.frame output, wrap the list output in data.frame:

data.frame(lapply(split.default(dfN, names(dfN)), rowSums, na.rm = TRUE)) 

Or using tidyverse

library(tidyverse)
dfN %>%
  split.default(names(.)) %>%
  map_df(reduce, `+`)
# A tibble: 3 x 9
# BRANT HALDIMAND.NORFOLK HALTON HAMILTON MUSKOKA NIAGARA PEEL TORONTO YORK
# <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 2943 2167 11874 11155 951 8598 32593 60779 23852
#2 2822 2053 11817 11850 1040 8745 34110 67638 24678
#3 2943 2179 11579 12051 1051 8853 34387 70146 24566

data

dfN <- structure(list(TORONTO = c(20855L, 23281L, 24130L), HALTON = c(4011L, 
3997L, 3900L), PEEL = c(11178L, 11770L, 11810L), YORK = c(8138L,
8417L, 8306L), BRANT = c(996L, 961L, 972L), HALDIMAND.NORFOLK = c(739L,
684L, 732L), HAMILTON = c(3835L, 4095L, 4168L), MUSKOKA = c(305L,
343L, 334L), NIAGARA = c(2923L, 2970L, 2985L), TORONTO = c(39924L,
44357L, 46016L), HALTON = c(7863L, 7820L, 7679L), PEEL = c(21415L,
22340L, 22577L), YORK = c(15714L, 16261L, 16260L), BRANT = c(1947L,
1861L, 1971L), HALDIMAND.NORFOLK = c(1428L, 1369L, 1447L), HAMILTON = c(7320L,
7755L, 7883L), MUSKOKA = c(646L, 697L, 717L), NIAGARA = c(5675L,
5775L, 5868L)), class = "data.frame", row.names = c(NA, -3L))

How to perform Row wise sum based on column condition and add Class Wise specific value as Column?

Try with groupby:

df["Cluster_Sum"] = df.groupby("Cluster")["Value"].transform("sum")

>>> df
    Cluster  Class  Value  Cluster_Sum
0         0     10      1            9
1         0     11      1            9
2         0     14      3            9
3         0     18      1            9
4         0     26      1            9
5         0     29      1            9
6         0     30      1            9
7         1      0      2           12
8         1     19      1           12
9         1     20      1           12
10        1     21      2           12
11        1     36      1           12
12        1     26      1           12
13        1     27      1           12
14        1     37      2           12
15        1     33      1           12
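A minimal, self-contained version of the same idea (with made-up data, not the question's):

```python
import pandas as pd

df = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1],
    "Class":   [1, 3, 1, 2, 1],
    "Value":   [1, 3, 1, 2, 1],
})

# transform("sum") returns a Series aligned to df's index,
# so every row receives its own cluster's total
df["Cluster_Sum"] = df.groupby("Cluster")["Value"].transform("sum")
print(df["Cluster_Sum"].tolist())  # [5, 5, 5, 3, 3]
```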

R sum row values based on column name

Here's a for loop approach. I use stringr but we could just as easily use base regex functions to keep it dependency-free.

library(stringr)
name_stems = unique(str_replace(names(data)[-1], "_.*", ""))
result = data[, "date", drop = FALSE]
for (i in seq_along(name_stems)) {
  result[[name_stems[i]]] =
    rowSums(data[
      str_detect(
        names(data),
        pattern = paste0(name_stems[i], "_")
      )
    ])
}

result
# # A tibble: 3 × 3
# date x1 x2
# <chr> <dbl> <dbl>
# 1 1/1/2018 9 12
# 2 2/1/2018 399 132
# 3 3/1/2018 999 932
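The same stem-extraction idea can be sketched in pandas (a hypothetical frame mirroring the question's shape; the `_a`/`_b` suffixes are made up):

```python
import pandas as pd

data = pd.DataFrame({
    "date": ["1/1/2018", "2/1/2018", "3/1/2018"],
    "x1_a": [4, 199, 499], "x1_b": [5, 200, 500],
    "x2_a": [6, 66, 466],  "x2_b": [6, 66, 466],
})

# Strip everything from the first underscore onward to get the stems,
# then sum the value columns grouped by stem
vals = data.drop(columns="date")
stems = vals.columns.str.replace(r"_.*", "", regex=True)
result = pd.concat([data[["date"]], vals.T.groupby(list(stems)).sum().T], axis=1)
print(result)
```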

Group by month, sum rows based in column, and keep the other columns

IIUC, you need a single groupby. You also need to convert your "revenue" column to numeric.

df['date'] = pd.to_datetime(df['date'], dayfirst=True)

group = df['date'].dt.strftime('%b')

(df.assign(revenue=pd.to_numeric(df['revenue'].str.replace(',', '.')))
.groupby([group, 'name', 'type'])
.agg('sum')
.reset_index()
)

Output:

   date name      type      size  revenue
0   Apr    A     Basic   6908746      0.1
1   Dec    A  Standard    248753      0.4
2   Dec    B   Premium     82346      0.5
3   Feb    A     Basic   3356943      0.1
4   Feb    D   Premium  12049667      2.1
5   Feb    E  Standard       486      0.9
6   Jan    C     Basic   3589749      0.4
7   Mar    A     Basic     28586      0.2
8   Mar    B     Basic   8734684      0.1
9   Mar    D   Premium       192      0.7
10  Oct    F     Basic     23847      0.3

Note that the above is aggregating months of different years into the same group. If you want to keep years separate, use a period:

group = df['date'].dt.to_period('M')

Output:

      date name      type      size  revenue
0  2021-01    C     Basic   3589749      0.4
1  2021-02    A     Basic   3356943      0.1
2  2021-04    A     Basic   6908746      0.1
3  2021-10    F     Basic     23847      0.3
4  2021-12    A  Standard    248753      0.4
5  2021-12    B   Premium     82346      0.5
6  2022-02    D   Premium  12049667      2.1
7  2022-02    E  Standard       486      0.9
8  2022-03    A     Basic     28586      0.2
9  2022-03    B     Basic   8734684      0.1
10 2022-03    D   Premium       192      0.7
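A small illustration of the difference between the two groupers (made-up data):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2021-02-01", "2022-02-15", "2021-12-05"]),
    "revenue": [1.0, 2.0, 3.0],
})

# Month names collapse the same month across years...
by_name = df.groupby(df["date"].dt.strftime("%b"))["revenue"].sum()
# ...while periods keep each year-month separate
by_period = df.groupby(df["date"].dt.to_period("M"))["revenue"].sum()

print(by_name.to_dict())  # Feb 2021 and Feb 2022 are merged
print(len(by_period))     # 3 groups: 2021-02, 2021-12, 2022-02
```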

Row-wise sum of paired values in 2 identically labelled lists

  1. Since you say that they should be added "by the shared Time value", we cannot use bind_cols, which would require every Time value to align perfectly and exist in both lists. (If that is always the case, then just ignore this point ... and pray that your assumption always holds.) I suggest a merge/join operation.

    For discussion on merge/join operations, see How to join (merge) data frames (inner, outer, left, right) and What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?.

  2. Further, joining on floating-point values runs into a fundamental problem in programming (not just R): testing for perfect equality of floating-point numbers. To work around this, we need to decide on a precision level at which two numbers are considered effectively the same. (R often gets floating-point comparisons intuitively right, but there is no clear indication when it does not.)

    For more discussion of floating-point equality, test (0.10 + 0.05) == 0.15 (it returns FALSE), then read Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754.

  3. Doing a merge/join based on common columns will be much simpler if we first pivot the data from the current wide format to a long format. (This relies on all of them being the same class, numeric in this case. If there is mixed-class, then this does not work.)

    For discussion on pivoting, see Reshaping data.frame from wide to long format (wide-to-long) and the reverse, Reshape multiple value columns to wide format (long to wide).
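Point 2 is easy to reproduce in any language; here is a quick Python illustration of the pitfall and of the fixed-precision workaround this answer uses via sprintf("%0.03f", ...):

```python
import math

# The classic floating-point pitfall: the sum is not exactly 0.15
print((0.10 + 0.05) == 0.15)   # False
print(0.10 + 0.05)             # 0.15000000000000002

# Compare at a chosen precision instead, via a tolerance...
print(math.isclose(0.10 + 0.05, 0.15, abs_tol=1e-3))  # True

# ...or by formatting to a fixed number of decimals, the same idea
# as sprintf("%0.03f", Time) in the R code
print(f"{0.10 + 0.05:.3f}" == f"{0.15:.3f}")          # True
```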

I'll create a similarly-shaped Sis2, changing the numbers, and changing one of the Time values just a little to show what happens.

set.seed(42)
Sis2 <- Sis1 %>%
  mutate(across(starts_with("Nuc"), ~ runif(n(), 50, 100)))
Sis2$Time[3] <- Sis2$Time[3] + 1e-5
Sis2
# Time Nuc_11 Nuc_15 Nuc_16
# 1 835.20 95.74030 86.82942 96.73361
# 2 839.84 96.85377 56.73333 62.77144
# 3 844.48 64.30698 82.84961 73.11464
# 4 849.12 91.52238 85.25324 97.00073
# 5 853.76 82.08728 72.88709 98.91132
# 6 858.40 75.95480 85.95561 55.87437

Although they look identical when printed, one of the Time values is in fact different and will not join properly.

Sis1$Time
# [1] 835.20 839.84 844.48 849.12 853.76 858.40
Sis2$Time
# [1] 835.20 839.84 844.48 849.12 853.76 858.40

Sis1$Time == Sis2$Time
# [1] TRUE TRUE FALSE TRUE TRUE TRUE

Assuming that 0.001 is sufficient precision to treat two Time values as effectively identical, here is code that works:

library(dplyr)
# library(tidyr) # pivot_*
full_join(
  Sis1 %>%
    mutate(Time_chr = sprintf("%0.03f", Time)) %>%
    tidyr::pivot_longer(-c(Time, Time_chr)),
  Sis2 %>%
    mutate(Time_chr = sprintf("%0.03f", Time)) %>%
    tidyr::pivot_longer(-c(Time, Time_chr)),
  by = c("Time_chr", "name")
) %>%
  transmute(
    Time = coalesce(Time.x, Time.y),
    name,
    values = rowSums(cbind(value.x, value.y), na.rm = TRUE)
  ) %>%
  tidyr::pivot_wider(id_cols = Time, names_from = "name", values_from = "values") %>%
  as.data.frame()
# Time Nuc_11 Nuc_15 Nuc_16
# 1 835.20 166.7470 187.9607 186.2512
# 2 839.84 200.3492 203.0143 140.2053
# 3 844.48 207.0985 168.8299 156.4784
# 4 849.12 174.0169 175.3819 166.4995
# 5 853.76 145.7154 150.8000 203.3399
# 6 858.40 141.9953 196.4379 164.0018

(The use of %>% as.data.frame() is purely to show all digits and rows without tibble's abbreviated view; it is not required. Also ... I'm generally not a fan of nesting %>% pipes inside function calls like that, but I think it shows the steps clearly; feel free to use temporary variables for the pivoted data if you prefer.)

Because we're doing a full-join, if there were any mismatches then we would see NA values. Such as this, without the use of converting Time to a string:

full_join(
  tidyr::pivot_longer(Sis1, -Time),
  tidyr::pivot_longer(Sis2, -Time),
  by = c("Time", "name")
) %>%
  transmute(
    Time,
    name,
    values = rowSums(cbind(value.x, value.y), na.rm = TRUE)
  ) %>%
  tidyr::pivot_wider(id_cols = Time, names_from = "name", values_from = "values") %>%
  as.data.frame()
# Time Nuc_11 Nuc_15 Nuc_16
# 1 835.20 166.74700 187.96075 186.25124
# 2 839.84 200.34921 203.01432 140.20529
# 3 844.48 142.79152 85.98028 83.36374
# 4 849.12 174.01691 175.38195 166.49954
# 5 853.76 145.71543 150.79999 203.33989
# 6 858.40 141.99531 196.43786 164.00176
# 7 844.48 64.30698 82.84961 73.11464

Notice that 844.48 in row 3 and its similar-looking "844.48" in row 7 have not been added, due to the 1e-5 difference I added. You can get R to show them with more precision (see ?options for the 'digits' and 'scipen' arguments).
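For reference, the same rounded-key join can be sketched in pandas (hypothetical two-row frames; `s1`, `s2`, and the values are illustrative): build a fixed-precision string key, outer-merge on it, then sum the paired columns NaN-safely.

```python
import pandas as pd

s1 = pd.DataFrame({"Time": [835.20, 844.48], "Nuc_11": [100.0, 80.0]})
s2 = pd.DataFrame({"Time": [835.20, 844.48 + 1e-5], "Nuc_11": [50.0, 60.0]})

# Fixed-precision key: a 1e-5 wobble disappears at 3 decimals
for s in (s1, s2):
    s["Time_chr"] = s["Time"].map(lambda t: f"{t:.3f}")

merged = s1.merge(s2, on="Time_chr", how="outer", suffixes=("_1", "_2"))
# sum(axis=1) skips NaN, like rowSums(..., na.rm = TRUE)
merged["Nuc_11"] = merged[["Nuc_11_1", "Nuc_11_2"]].sum(axis=1)
print(merged[["Time_chr", "Nuc_11"]])
```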


