How to Group by Two Columns in R

Group by two columns and summarise multiple columns

We can use summarise with across from dplyr version >= 1.0.0:

library(dplyr)
df %>%
  group_by(State, Date) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)), .groups = 'drop')
# A tibble: 6 x 4
#  State Date       Female  Male
#  <chr> <chr>       <int> <int>
#1 Cali  05/06/2005      3     2
#2 Cali  10/06/2005      4     3
#3 NY    11/06/2005     10     5
#4 NY    12/06/2005     11     6
#5 Texas 01/01/2004      5     3
#6 Texas 02/01/2004      5     4

Or using aggregate from base R

aggregate(.~ State + Date, df, sum, na.rm = TRUE)

data

df <- structure(list(State = c("Texas", "Texas", "Texas", "Cali", "Cali", 
"Cali", "Cali", "NY", "NY"), Female = c(2L, 3L, 5L, 1L, 2L, 3L,
1L, 10L, 11L), Male = c(2L, 1L, 4L, 1L, 1L, 1L, 2L, 5L, 6L),
Date = c("01/01/2004", "01/01/2004", "02/01/2004", "05/06/2005",
"05/06/2005", "10/06/2005", "10/06/2005", "11/06/2005", "12/06/2005"
)), class = "data.frame", row.names = c(NA, -9L))
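For anyone working in pandas rather than dplyr, an equivalent sketch of the same two-column grouped sum, using the same data:

```python
import pandas as pd

# Same data as the R example above
df = pd.DataFrame({
    "State": ["Texas", "Texas", "Texas", "Cali", "Cali", "Cali", "Cali", "NY", "NY"],
    "Female": [2, 3, 5, 1, 2, 3, 1, 10, 11],
    "Male": [2, 1, 4, 1, 1, 1, 2, 5, 6],
    "Date": ["01/01/2004", "01/01/2004", "02/01/2004", "05/06/2005",
             "05/06/2005", "10/06/2005", "10/06/2005", "11/06/2005", "12/06/2005"],
})

# group by two columns, then sum the remaining numeric columns
out = df.groupby(["State", "Date"], as_index=False)[["Female", "Male"]].sum()
print(out)
```

As in dplyr, the result has one row per (State, Date) combination.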

How to group by multiple columns in R?

In base R you could do:

ftable(prop.table(table(my_data[-1]), c(1, 3)), col.vars = c("Gender", "Employment_status"))

     Gender                   F                              M
     Employment_status Employed Other Unemployed Employed Other Unemployed
Year
2016                        1.0   0.0        0.0      0.5   0.5        0.0
2017                        1.0   0.0        0.0      0.5   0.5        0.0
2018                        1.0   0.0        0.0      0.5   0.5        0.0
2019                        1.0   0.0        0.0      1.0   0.0        0.0
2020                        0.0   0.0        1.0      1.0   0.0        0.0
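A comparable table of proportions can be sketched in pandas with pd.crosstab and normalize='index'; the data below is made up to mirror the shape of the output above:

```python
import pandas as pd

# Hypothetical data shaped like the ftable output above
my_data = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2017, 2017, 2017],
    "Employment_status": ["Employed", "Employed", "Other",
                          "Employed", "Employed", "Other"],
    "Gender": ["F", "M", "M", "F", "M", "M"],
})

# Proportions of employment status within each Year x Gender combination
tab = pd.crosstab([my_data.Year, my_data.Gender],
                  my_data.Employment_status, normalize="index")
print(tab)
```

Each (Year, Gender) row sums to 1, matching the margins used in prop.table above.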

R: Group by and Apply a general function to two columns

Update based on real-life example:

You can do a direct approach like this:

library(tidyverse)
library(InfoTrad)
dat %>%
  group_by(ticker, date) %>%
  summarize(PIN = YZ(as.data.frame(cur_data()))$PIN)

# A tibble: 4 x 3
# Groups:   ticker [2]
  ticker  date      PIN
  <chr>  <dbl>    <dbl>
1 A          1 1.05e-17
2 A          2 1.56e- 1
3 B          1 1.12e- 8
4 B          2 7.07e- 9

The difficulty here was that the YZ function only accepts true data frames, not tibbles, and that it returns several values, not just PIN.

You could theoretically wrap this up into your own function and run it as I've shown in the example below, but maybe this way already does the trick.

I also don't expect this to run much faster than a for loop. The YZ function seems to have more-than-linear runtime, so passing larger amounts of data will still take some time. You can start with a small set of data, then repeatedly increase its size by a factor of maybe 10 and check how fast it runs.


In your example, you can do:

my_function <- function(data) {
  data %>%
    summarize(rv = sum(ret, vol))
}

library(tidyverse)
df %>%
  group_by(ticker, date) %>%
  my_function()

# A tibble: 4 x 3
# Groups:   ticker [2]
  ticker  date    rv
  <chr>  <dbl> <dbl>
1 A          1     7
2 A          2     5
3 B          1    10
4 B          2    11

But as mentioned in my comment, I'm not sure if this general example helps in your real-life use case.

It might also be that you don't need to create your own function because a built-in one already exists. As in the example, you are better off summarizing directly instead of wrapping it in a function.
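For comparison, a rough pandas sketch of the same grouped summary; the data is made up here (chosen so the rv column matches the output above), since the original df is not shown:

```python
import pandas as pd

# Hypothetical data (the original df is not shown); values chosen so that
# the rv column matches the output in the answer above
df = pd.DataFrame({
    "ticker": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "date":   [1, 1, 2, 2, 1, 1, 2, 2],
    "ret":    [1, 2, 1, 1, 3, 2, 4, 3],
    "vol":    [2, 2, 2, 1, 2, 3, 2, 2],
})

# rv = sum of ret and vol within each (ticker, date) group,
# mirroring summarize(rv = sum(ret, vol)) in dplyr
out = (df.groupby(["ticker", "date"])[["ret", "vol"]]
         .apply(lambda g: g["ret"].sum() + g["vol"].sum())
         .rename("rv")
         .reset_index())
print(out)
```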

Pandas finding and replacing outliers based on a group of two columns

First, identify the outliers. This code flags any values that are more than one standard deviation away from the overall mean:

outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index

Then you can determine the median of each group:

medians = df.groupby('group')['value'].median()

Finally, locate the outliers and replace with the medians:

df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].to_list()

Putting it all together:

import pandas as pd

index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
s = pd.Series(['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'], index=index)
t = pd.Series(['2022-06-28', '2022-06-28', '2022-06-28', '2022-06-27', '2022-06-27', '2022-06-27',
               '2022-06-28', '2022-06-28', '2022-06-28', '2022-06-27', '2022-06-27', '2022-06-27'],
              index=index)
r = pd.Series([1, 2, 1, 2, 3, 10, 2, 3, 2, 3, 4, 20], index=index)
df = pd.DataFrame(s, columns=['group'])
df['date'] = t
df['value'] = r
outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index
medians = df.groupby('group')['value'].median()
df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].values

Output:

   group        date  value
0      A  2022-06-28      1
1      A  2022-06-28      2
2      A  2022-06-28      1
3      A  2022-06-27      2
4      A  2022-06-27      3
5      A  2022-06-27      2
6      B  2022-06-28      2
7      B  2022-06-28      3
8      B  2022-06-28      2
9      B  2022-06-27      3
10     B  2022-06-27      4
11     B  2022-06-27      3
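Note that the mean and standard deviation above are computed over the whole value column. If outliers should instead be judged relative to each group, a per-group variant could look like this (same example data):

```python
import pandas as pd

df = pd.DataFrame({
    "group": list("AAAAAABBBBBB"),
    "value": [1, 2, 1, 2, 3, 10, 2, 3, 2, 3, 4, 20],
})

# deviation of each value from its own group's mean
g = df.groupby("group")["value"]
deviation = (df["value"] - g.transform("mean")).abs()
outliers = df.index[deviation > g.transform("std")]

# replace per-group outliers with the group median
df.loc[outliers, "value"] = g.transform("median").loc[outliers]
print(df)
```

With this data the result is the same (10 and 20 are replaced), but the two approaches can diverge when the groups have very different spreads.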

Group by two columns and create a new column based on a conditional subtraction in Python

If I understand correctly, you can mask column c before using groupby.transform('first'):

df['e'] = df['c'] - (df['c'].where(df['b'].eq(2))
                            .groupby([df['a'], df['d']])
                            .transform('first')
                            .convert_dtypes())

output:

    a  b  c  d   e
0   R  1  2  p  -2
1   R  2  4  p   0
2   R  3  6  p   2
3   R  4  8  p   4
4   R  1  6  o   3
5   R  2  3  o   0
6   R  3  1  o  -2
7   R  4  2  o  -1
8   S  0  5  n   4
9   S  1  4  n   3
10  S  2  1  n   0
11  S  3  3  n   2
12  S  0  3  g  -6
13  S  1  2  g  -7
14  S  2  9  g   0
15  S  3  7  g  -2
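A self-contained version of the same approach, reconstructing df from the first four columns of the output above:

```python
import pandas as pd

# df reconstructed from columns a, b, c, d of the output shown above
df = pd.DataFrame({
    "a": list("RRRRRRRR") + list("SSSSSSSS"),
    "b": [1, 2, 3, 4, 1, 2, 3, 4, 0, 1, 2, 3, 0, 1, 2, 3],
    "c": [2, 4, 6, 8, 6, 3, 1, 2, 5, 4, 1, 3, 3, 2, 9, 7],
    "d": list("pppp" + "oooo" + "nnnn" + "gggg"),
})

# within each (a, d) group, subtract the c value of the row where b == 2
df["e"] = df["c"] - (df["c"].where(df["b"].eq(2))
                            .groupby([df["a"], df["d"]])
                            .transform("first")
                            .convert_dtypes())
print(df)
```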

Match values in two columns within group in R

Please note: I am not sure if you made a typo in your dataframe for row ABC-LKJ 210-31.

You can use the OR operator | so that 1 is returned for both rows where the values match within each group. You can use the following code:

df <- data.frame(
  col1 = c("ABC", "FGC", "ZHU", "IIN", "OIL", "ABC", "LKJ", "SNM", "QQA"),
  col2 = c("DLK", "CBN", "ABC", "ZHU", "HSC", "LJK", "QQA", "KDL", "CBV"),
  attr = c("100-30", "100-30", "100-30", "21-667", "21-667", "210-31", "201-31", "201-31", "201-31")
)

library(dplyr)
df %>%
  group_by(attr) %>%
  mutate(flag = ifelse(col1 %in% col2 | col2 %in% col1, 1, 0))
#> # A tibble: 9 × 4
#> # Groups:   attr [4]
#>   col1  col2  attr    flag
#>   <chr> <chr> <chr>  <dbl>
#> 1 ABC   DLK   100-30     1
#> 2 FGC   CBN   100-30     0
#> 3 ZHU   ABC   100-30     1
#> 4 IIN   ZHU   21-667     0
#> 5 OIL   HSC   21-667     0
#> 6 ABC   LJK   210-31     0
#> 7 LKJ   QQA   201-31     1
#> 8 SNM   KDL   201-31     0
#> 9 QQA   CBV   201-31     1

Created on 2022-07-08 by the reprex package (v2.0.1)
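The same per-group matching can be sketched in pandas (same data as above); groupby.apply here plays the role of the grouped mutate:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["ABC", "FGC", "ZHU", "IIN", "OIL", "ABC", "LKJ", "SNM", "QQA"],
    "col2": ["DLK", "CBN", "ABC", "ZHU", "HSC", "LJK", "QQA", "KDL", "CBV"],
    "attr": ["100-30", "100-30", "100-30", "21-667", "21-667",
             "210-31", "201-31", "201-31", "201-31"],
})

# flag rows where col1 appears in col2 (or vice versa) within the same attr group
def flag_group(g):
    g = g.copy()
    g["flag"] = (g["col1"].isin(g["col2"]) | g["col2"].isin(g["col1"])).astype(int)
    return g

out = df.groupby("attr", group_keys=False).apply(flag_group)
print(out.sort_index())
```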

How to group by factor levels from two columns and output new column that shows sum of each level in R?

Instead of grouping by 'RawDate', group by 'ID' and 'YEAR' and take the sum of a logical vector:

library(dplyr)
complete_df %>%
  group_by(ID, YEAR) %>%
  mutate(TotalWon = sum(Renewal == 'WON'), TotalLost = sum(Renewal == 'LOST'))

If we need a summarised output, use summarise instead of mutate.
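The pandas counterpart of summing a logical vector per group is groupby.transform; the data below is hypothetical, since complete_df is not shown:

```python
import pandas as pd

# Hypothetical data (complete_df is not shown): one row per renewal event
complete_df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "YEAR": [2020, 2020, 2020, 2021, 2021],
    "Renewal": ["WON", "LOST", "WON", "LOST", "LOST"],
})

# Summing a boolean vector counts how often the condition holds per group,
# mirroring mutate(TotalWon = sum(Renewal == 'WON'), ...)
g = complete_df.groupby(["ID", "YEAR"])["Renewal"]
complete_df["TotalWon"] = g.transform(lambda s: (s == "WON").sum())
complete_df["TotalLost"] = g.transform(lambda s: (s == "LOST").sum())
print(complete_df)
```

As with mutate vs summarise, swap transform for agg if one row per group is wanted.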

Subset groups in a data.table using conditions on two columns

With data.table:

DT[, .SD[any(x == 3 & y == 6)], by = group]

    group    id     y     x
   <char> <int> <num> <num>
1:      B     5     6     3
2:      B     6     8     3
3:      B     7    14     3
4:      B     8    19     2
5:      C     9    10     2
6:      C    10     9     3
7:      C    11     6     3
8:      C    12     8     3

Another possibly faster option:

DT[, if (any(x == 3 & y == 6)) .SD, by=group]
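In pandas, the analogous subset-groups-by-condition pattern is groupby.filter; the data below is made up to include the rows shown above plus a group A that fails the condition:

```python
import pandas as pd

# Hypothetical data: groups B and C contain a row with x == 3 and y == 6,
# group A does not
df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B", "B", "C", "C", "C", "C"],
    "id":    [1, 2, 5, 6, 7, 8, 9, 10, 11, 12],
    "y":     [4, 5, 6, 8, 14, 19, 10, 9, 6, 8],
    "x":     [1, 2, 3, 3, 3, 2, 2, 3, 3, 3],
})

# keep only the groups that contain at least one row with x == 3 and y == 6
out = df.groupby("group").filter(lambda g: ((g["x"] == 3) & (g["y"] == 6)).any())
print(out)
```

Like the data.table idiom, this keeps every row of a qualifying group, not just the matching rows.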

