How to Group by Two Columns in R

Group by two columns and summarise multiple columns

We can use summarise with across from dplyr version >= 1.0.0:

library(dplyr)
df %>%
  group_by(State, Date) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)), .groups = 'drop')
# A tibble: 6 x 4
#  State Date       Female  Male
#  <chr> <chr>       <int> <int>
#1 Cali  05/06/2005      3     2
#2 Cali  10/06/2005      4     3
#3 NY    11/06/2005     10     5
#4 NY    12/06/2005     11     6
#5 Texas 01/01/2004      5     3
#6 Texas 02/01/2004      5     4

Or using aggregate from base R

aggregate(.~ State + Date, df, sum, na.rm = TRUE)

data

df <- structure(list(State = c("Texas", "Texas", "Texas", "Cali", "Cali", 
"Cali", "Cali", "NY", "NY"), Female = c(2L, 3L, 5L, 1L, 2L, 3L,
1L, 10L, 11L), Male = c(2L, 1L, 4L, 1L, 1L, 1L, 2L, 5L, 6L),
Date = c("01/01/2004", "01/01/2004", "02/01/2004", "05/06/2005",
"05/06/2005", "10/06/2005", "10/06/2005", "11/06/2005", "12/06/2005"
)), class = "data.frame", row.names = c(NA, -9L))
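For anyone working in pandas rather than dplyr, an equivalent sketch of the same two-column grouped sum, using the same data:

```python
import pandas as pd

# Same data as the R example above
df = pd.DataFrame({
    "State": ["Texas", "Texas", "Texas", "Cali", "Cali", "Cali", "Cali", "NY", "NY"],
    "Female": [2, 3, 5, 1, 2, 3, 1, 10, 11],
    "Male": [2, 1, 4, 1, 1, 1, 2, 5, 6],
    "Date": ["01/01/2004", "01/01/2004", "02/01/2004", "05/06/2005",
             "05/06/2005", "10/06/2005", "10/06/2005", "11/06/2005", "12/06/2005"],
})

# group by two columns, then sum the remaining numeric columns
out = df.groupby(["State", "Date"], as_index=False)[["Female", "Male"]].sum()
print(out)
```

As in dplyr, the result has one row per (State, Date) combination.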

How to group by multiple columns in R?

In base R you could do:

ftable(prop.table(table(my_data[-1]), c(1, 3)), col.vars = c("Gender", "Employment_status"))

     Gender                   F                              M
     Employment_status Employed Other Unemployed Employed Other Unemployed
Year
2016                        1.0   0.0        0.0      0.5   0.5        0.0
2017                        1.0   0.0        0.0      0.5   0.5        0.0
2018                        1.0   0.0        0.0      0.5   0.5        0.0
2019                        1.0   0.0        0.0      1.0   0.0        0.0
2020                        0.0   0.0        1.0      1.0   0.0        0.0
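A comparable table of proportions can be sketched in pandas with pd.crosstab and normalize='index'; the data below is made up to mirror the shape of the output above:

```python
import pandas as pd

# Hypothetical data shaped like the ftable output above
my_data = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2017, 2017, 2017],
    "Employment_status": ["Employed", "Employed", "Other",
                          "Employed", "Employed", "Other"],
    "Gender": ["F", "M", "M", "F", "M", "M"],
})

# Proportions of employment status within each Year x Gender combination
tab = pd.crosstab([my_data.Year, my_data.Gender],
                  my_data.Employment_status, normalize="index")
print(tab)
```

Each (Year, Gender) row sums to 1, matching the margins used in prop.table above.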

R: Group by and Apply a general function to two columns

Update based on real-life example:

You can do a direct approach like this:

library(tidyverse)
library(InfoTrad)
dat %>%
  group_by(ticker, date) %>%
  summarize(PIN = YZ(as.data.frame(cur_data()))$PIN)

# A tibble: 4 x 3
# Groups:   ticker [2]
  ticker  date      PIN
  <chr>  <dbl>    <dbl>
1 A          1 1.05e-17
2 A          2 1.56e- 1
3 B          1 1.12e- 8
4 B          2 7.07e- 9

The difficulty here was that the YZ function only accepts true data frames, not tibbles, and that it returns several values, not just PIN.

You could theoretically wrap this up into your own function and run it as I've shown in the example below, but maybe this way already does the trick.

I also don't expect this to run much faster than a for loop. The YZ function seems to have more-than-linear runtime, so passing larger amounts of data will still take some time. You can start with a small set of data, then repeatedly increase its size by a factor of maybe 10 and check how fast it runs.


In your example, you can do:

my_function <- function(data) {
  data %>%
    summarize(rv = sum(ret, vol))
}

library(tidyverse)
df %>%
  group_by(ticker, date) %>%
  my_function()

# A tibble: 4 x 3
# Groups:   ticker [2]
  ticker  date    rv
  <chr>  <dbl> <dbl>
1 A          1     7
2 A          2     5
3 B          1    10
4 B          2    11

But as mentioned in my comment, I'm not sure if this general example helps in your real-life use case.

It might also be that you don't need to create your own function because a built-in one already exists. As in the example, you are better off summarizing directly instead of wrapping it in a function.
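For comparison, a rough pandas sketch of the same grouped summary; the data is made up here (chosen so the rv column matches the output above), since the original df is not shown:

```python
import pandas as pd

# Hypothetical data (the original df is not shown); values chosen so that
# the rv column matches the output in the answer above
df = pd.DataFrame({
    "ticker": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "date":   [1, 1, 2, 2, 1, 1, 2, 2],
    "ret":    [1, 2, 1, 1, 3, 2, 4, 3],
    "vol":    [2, 2, 2, 1, 2, 3, 2, 2],
})

# rv = sum of ret and vol within each (ticker, date) group,
# mirroring summarize(rv = sum(ret, vol)) in dplyr
out = (df.groupby(["ticker", "date"])[["ret", "vol"]]
         .apply(lambda g: g["ret"].sum() + g["vol"].sum())
         .rename("rv")
         .reset_index())
print(out)
```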

Pandas finding and replacing outliers based on a group of two columns

First, identify the outliers. This code flags any values that are more than one standard deviation away from the overall mean:

outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index

Then you can determine the median of each group:

medians = df.groupby('group')['value'].median()

Finally, locate the outliers and replace with the medians:

df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].to_list()

Putting it all together:

import pandas as pd

index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
s = pd.Series(['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'], index=index)
t = pd.Series(['2022-06-28', '2022-06-28', '2022-06-28', '2022-06-27', '2022-06-27', '2022-06-27',
               '2022-06-28', '2022-06-28', '2022-06-28', '2022-06-27', '2022-06-27', '2022-06-27'],
              index=index)
r = pd.Series([1, 2, 1, 2, 3, 10, 2, 3, 2, 3, 4, 20], index=index)
df = pd.DataFrame(s, columns=['group'])
df['date'] = t
df['value'] = r
outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index
medians = df.groupby('group')['value'].median()
df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].values

Output:

   group        date  value
0      A  2022-06-28      1
1      A  2022-06-28      2
2      A  2022-06-28      1
3      A  2022-06-27      2
4      A  2022-06-27      3
5      A  2022-06-27      2
6      B  2022-06-28      2
7      B  2022-06-28      3
8      B  2022-06-28      2
9      B  2022-06-27      3
10     B  2022-06-27      4
11     B  2022-06-27      3
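Note that the mean and standard deviation above are computed over the whole value column. If outliers should instead be judged relative to each group, a per-group variant could look like this (same example data):

```python
import pandas as pd

df = pd.DataFrame({
    "group": list("AAAAAABBBBBB"),
    "value": [1, 2, 1, 2, 3, 10, 2, 3, 2, 3, 4, 20],
})

# deviation of each value from its own group's mean
g = df.groupby("group")["value"]
deviation = (df["value"] - g.transform("mean")).abs()
outliers = df.index[deviation > g.transform("std")]

# replace per-group outliers with the group median
df.loc[outliers, "value"] = g.transform("median").loc[outliers]
print(df)
```

With this data the result is the same (10 and 20 are replaced), but the two approaches can diverge when the groups have very different spreads.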

Group by two columns and create a new column based on a conditional subtraction in Python

If I understand correctly, you can mask column c before using groupby.transform('first'):

df['e'] = df['c'] - (df['c'].where(df['b'].eq(2))
                            .groupby([df['a'], df['d']])
                            .transform('first')
                            .convert_dtypes())

output:

    a  b  c  d   e
0   R  1  2  p  -2
1   R  2  4  p   0
2   R  3  6  p   2
3   R  4  8  p   4
4   R  1  6  o   3
5   R  2  3  o   0
6   R  3  1  o  -2
7   R  4  2  o  -1
8   S  0  5  n   4
9   S  1  4  n   3
10  S  2  1  n   0
11  S  3  3  n   2
12  S  0  3  g  -6
13  S  1  2  g  -7
14  S  2  9  g   0
15  S  3  7  g  -2
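A self-contained version of the same approach, reconstructing df from the first four columns of the output above:

```python
import pandas as pd

# df reconstructed from columns a, b, c, d of the output shown above
df = pd.DataFrame({
    "a": list("RRRRRRRR") + list("SSSSSSSS"),
    "b": [1, 2, 3, 4, 1, 2, 3, 4, 0, 1, 2, 3, 0, 1, 2, 3],
    "c": [2, 4, 6, 8, 6, 3, 1, 2, 5, 4, 1, 3, 3, 2, 9, 7],
    "d": list("pppp" + "oooo" + "nnnn" + "gggg"),
})

# within each (a, d) group, subtract the c value of the row where b == 2
df["e"] = df["c"] - (df["c"].where(df["b"].eq(2))
                            .groupby([df["a"], df["d"]])
                            .transform("first")
                            .convert_dtypes())
print(df)
```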

Match values in two columns within group in R

Please note: I am not sure if you made a typo in your dataframe for row ABC-LKJ 210-31.

You can use the OR operator | so that 1 is returned for both rows where the values match within each group. You can use the following code:

df <- data.frame(
  col1 = c("ABC", "FGC", "ZHU", "IIN", "OIL", "ABC", "LKJ", "SNM", "QQA"),
  col2 = c("DLK", "CBN", "ABC", "ZHU", "HSC", "LJK", "QQA", "KDL", "CBV"),
  attr = c("100-30", "100-30", "100-30", "21-667", "21-667", "210-31", "201-31", "201-31", "201-31")
)

library(dplyr)
df %>%
  group_by(attr) %>%
  mutate(flag = ifelse(col1 %in% col2 | col2 %in% col1, 1, 0))
#> # A tibble: 9 × 4
#> # Groups:   attr [4]
#>   col1  col2  attr    flag
#>   <chr> <chr> <chr>  <dbl>
#> 1 ABC   DLK   100-30     1
#> 2 FGC   CBN   100-30     0
#> 3 ZHU   ABC   100-30     1
#> 4 IIN   ZHU   21-667     0
#> 5 OIL   HSC   21-667     0
#> 6 ABC   LJK   210-31     0
#> 7 LKJ   QQA   201-31     1
#> 8 SNM   KDL   201-31     0
#> 9 QQA   CBV   201-31     1

Created on 2022-07-08 by the reprex package (v2.0.1)
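The same per-group matching can be sketched in pandas (same data as above); groupby.apply here plays the role of the grouped mutate:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["ABC", "FGC", "ZHU", "IIN", "OIL", "ABC", "LKJ", "SNM", "QQA"],
    "col2": ["DLK", "CBN", "ABC", "ZHU", "HSC", "LJK", "QQA", "KDL", "CBV"],
    "attr": ["100-30", "100-30", "100-30", "21-667", "21-667",
             "210-31", "201-31", "201-31", "201-31"],
})

# flag rows where col1 appears in col2 (or vice versa) within the same attr group
def flag_group(g):
    g = g.copy()
    g["flag"] = (g["col1"].isin(g["col2"]) | g["col2"].isin(g["col1"])).astype(int)
    return g

out = df.groupby("attr", group_keys=False).apply(flag_group)
print(out.sort_index())
```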

How to group by factor levels from two columns and output new column that shows sum of each level in R?

Instead of grouping by 'RawDate', group by 'ID' and 'YEAR' and take the sum of a logical vector:

library(dplyr)
complete_df %>%
  group_by(ID, YEAR) %>%
  mutate(TotalWon = sum(Renewal == 'WON'), TotalLost = sum(Renewal == 'LOST'))

If we need a summarised output, use summarise instead of mutate.
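The pandas counterpart of summing a logical vector per group is groupby.transform; the data below is hypothetical, since complete_df is not shown:

```python
import pandas as pd

# Hypothetical data (complete_df is not shown): one row per renewal event
complete_df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "YEAR": [2020, 2020, 2020, 2021, 2021],
    "Renewal": ["WON", "LOST", "WON", "LOST", "LOST"],
})

# Summing a boolean vector counts how often the condition holds per group,
# mirroring mutate(TotalWon = sum(Renewal == 'WON'), ...)
g = complete_df.groupby(["ID", "YEAR"])["Renewal"]
complete_df["TotalWon"] = g.transform(lambda s: (s == "WON").sum())
complete_df["TotalLost"] = g.transform(lambda s: (s == "LOST").sum())
print(complete_df)
```

As with mutate vs summarise, swap transform for agg if one row per group is wanted.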

Subset groups in a data.table using conditions on two columns

With data.table:

DT[, .SD[any(x == 3 & y == 6)], by = group]

    group    id     y     x
   <char> <int> <num> <num>
1:      B     5     6     3
2:      B     6     8     3
3:      B     7    14     3
4:      B     8    19     2
5:      C     9    10     2
6:      C    10     9     3
7:      C    11     6     3
8:      C    12     8     3

Another possibly faster option:

DT[, if (any(x == 3 & y == 6)) .SD, by=group]
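In pandas, the analogous subset-groups-by-condition pattern is groupby.filter; the data below is made up to include the rows shown above plus a group A that fails the condition:

```python
import pandas as pd

# Hypothetical data: groups B and C contain a row with x == 3 and y == 6,
# group A does not
df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B", "B", "C", "C", "C", "C"],
    "id":    [1, 2, 5, 6, 7, 8, 9, 10, 11, 12],
    "y":     [4, 5, 6, 8, 14, 19, 10, 9, 6, 8],
    "x":     [1, 2, 3, 3, 3, 2, 2, 3, 3, 3],
})

# keep only the groups that contain at least one row with x == 3 and y == 6
out = df.groupby("group").filter(lambda g: ((g["x"] == 3) & (g["y"] == 6)).any())
print(out)
```

Like the data.table idiom, this keeps every row of a qualifying group, not just the matching rows.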

