How to Group by All But One Column

How to group by all but one column?

You can do this using standard evaluation (group_by_ instead of group_by):

# Fake data
set.seed(492)
dat = data.frame(value = rnorm(1000),
                 g1 = sample(LETTERS, 1000, replace = TRUE),
                 g2 = sample(letters, 1000, replace = TRUE),
                 g3 = sample(1:10, 1000, replace = TRUE),
                 other = sample(c("red", "green", "black"), 1000, replace = TRUE))

dat %>%
  group_by_(.dots = names(dat)[-grep("value", names(dat))]) %>%
  summarise(meanValue = mean(value))

       g1     g2    g3  other   meanValue
   <fctr> <fctr> <int> <fctr>       <dbl>
1       A      a     2  green  0.89281475
2       A      b     2    red -0.03558775
3       A      b     5  black -1.79184218
4       A      c    10  black  0.17518610
5       A      e     5  black  0.25830392
...

See this vignette for more on standard vs. non-standard evaluation in dplyr.

UPDATE for dplyr 0.7.0

To address @ÖmerAn's comment: It looks like group_by_at is the way to go in dplyr 0.7.0 (someone please correct me if I'm wrong about this). For example:

dat %>%
  group_by_at(setdiff(names(dat), "value")) %>%
  summarise(meanValue = mean(value))

# Groups:   g1, g2, g3 [?]
       g1     g2    g3  other   meanValue
   <fctr> <fctr> <int> <fctr>       <dbl>
 1      A      a     2  green  0.89281475
 2      A      b     2    red -0.03558775
 3      A      b     5  black -1.79184218
 4      A      c    10  black  0.17518610
 5      A      e     5  black  0.25830392
 6      A      e     5    red -0.81879788
 7      A      e     7  green  0.30836054
 8      A      f     2  green  0.05537047
 9      A      g     1  black  1.00156405
10      A      g    10  black  1.26884303
# ... with 949 more rows

Let's confirm both methods give the same output (in dplyr 0.7.0):

new = dat %>%
  group_by_at(setdiff(names(dat), "value")) %>%
  summarise(meanValue = mean(value))

old = dat %>%
  group_by_(.dots = names(dat)[-grep("value", names(dat))]) %>%
  summarise(meanValue = mean(value))

identical(old, new)
# [1] TRUE

How to group a data frame by all columns except one (the data frame has more than 50 columns)

We can use dplyr::across in group_by:

library(dplyr)

df %>%
  group_by(across(-logg_overall_assess_current)) %>%
  summarise(mean = mean(logg_overall_assess_current, na.rm = TRUE))

#> `summarise()` has grouped output by 'Bedroom', 'TypeDwel1.2.Duplex',
#> 'TypeDwelApartment.Condo', 'TypeDwelOther', 'TypeDwelTownhouse', 'Age_new.
#> 70', 'Age_new0.1', 'Age_new16.40', 'Age_new2.5', 'Age_new41.70',
#> 'Age_new6.15', 'LandFreehold', 'LandLeasehold.prepaid', 'LandOthers',
#> 'cluster_K_mean.1', 'cluster_K_mean.2', 'cluster_K_mean.3', 'cluster_K_mean.
#> 4', 'Sold.Year.2018', 'Sold.Year.2019', 'Sold.Year.2020', 'S.A.Cambie',
#> 'S.A.Champlain.Heights', 'S.A.Coal.Harbour', 'S.A.Collingwood.VE',
#> 'S.A.Downtown.VE', 'S.A.Downtown.VW', 'S.A.Dunbar', 'S.A.Fairview.VW',
#> 'S.A.False.Creek', 'S.A.Fraser.VE', 'S.A.Fraserview.VE', 'S.A.Grandview.VE',
#> 'S.A.Grandview.Woodland', 'S.A.Hastings', 'S.A.Hastings.Sunrise',
#> 'S.A.Kerrisdale', 'S.A.Killarney.VE', 'S.A.Kitsilano', 'S.A.Knight',
#> 'S.A.Main', 'S.A.Marpole', 'S.A.Mount.Pleasant.VE', 'S.A.Mount.Pleasant.VW',
#> 'S.A.Oakridge.VW', 'S.A.Point.Grey', 'S.A.Quilchena', 'S.A.Renfrew.VE',
#> 'S.A.S.W..Marine', 'S.A.Shaughnessy', 'S.A.South.Cambie', 'S.A.South.Granville',
#> 'S.A.South.Marine', 'S.A.South.Vancouver', 'S.A.Strathcona',
#> 'S.A.University.VW', 'S.A.Victoria.VE', 'S.A.West.End.VW', 'S.A.Yaletown'. You
#> can override using the `.groups` argument.
#> # A tibble: 2 x 61
#> # Groups: Bedroom, TypeDwel1.2.Duplex, TypeDwelApartment.Condo, ... [2]
#>   Bedroom TypeDwel1.2.Duplex TypeDwelApartment.C… TypeDwelOther TypeDwelTownhou…
#>     <dbl>              <dbl>                <dbl>         <dbl>            <dbl>
#> 1       1                  0                    1             0               0
#> 2       2                  0                    1             0               0
#> # … with 56 more variables: the grouping columns listed in the message above,
#> #   plus S.A.Other <dbl> and mean <dbl>

Created on 2022-01-15 by the reprex package (v0.3.0)

How to group by all except one column

You can use

Select Name, Title, Date from Books group by Name, Title, Date;

or

Select distinct Name, Title, Date from Books;

for the same purpose. It won't make any difference, since the dates are the same.

If the dates are not the same, you can use any aggregate function to achieve the same result, for example:

Select Name, Title, max(Date) -- or any aggregate function based on your logic
from Books
group by Name, Title;
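For comparison, here is a rough pandas sketch of the same idea, using a hypothetical `books` frame: group by every column except `Date`, then take the maximum `Date` per group.

```python
import pandas as pd

# Hypothetical Books data where (Name, Title) pairs repeat with different dates
books = pd.DataFrame({
    "Name":  ["A", "A", "B"],
    "Title": ["X", "X", "Y"],
    "Date":  ["2021-01-01", "2021-02-01", "2021-03-01"],
})

# Group by every column except Date, then aggregate Date --
# the pandas counterpart of SELECT Name, Title, MAX(Date) ... GROUP BY Name, Title
result = books.groupby(["Name", "Title"], as_index=False)["Date"].max()
```

As in the SQL version, any other aggregate (`min`, `first`, etc.) can be swapped in to match your logic.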

Grouping by all columns except one to count distinct values on the remaining column

One other way to do what you want would be to select distinct rows first, then apply the windowed COUNT() function:

WITH distinctrows AS (
    SELECT DISTINCT
        Column1,
        Column2,
        Column3,
        Column4
    FROM TableA
)
SELECT
    Column1,
    Column2,
    Column3,
    Column4,
    COUNT(Column4) OVER (PARTITION BY Column1, Column2, Column3) AS Column5
FROM distinctrows;
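The same two-step logic can be sketched in pandas with toy data (column names here are hypothetical, mirroring the SQL): deduplicate first, then a group-wise `transform("count")` plays the role of the windowed COUNT.

```python
import pandas as pd

# Hypothetical TableA containing one duplicated row
df = pd.DataFrame({
    "Column1": ["a", "a", "a", "b"],
    "Column2": [1, 1, 1, 2],
    "Column3": ["x", "x", "x", "y"],
    "Column4": ["p", "p", "q", "r"],
})

# Step 1: distinct rows, as in the CTE
distinct = df.drop_duplicates().copy()

# Step 2: count per (Column1, Column2, Column3) partition; because the rows
# are already distinct, this counts distinct Column4 values per group
distinct["Column5"] = (
    distinct.groupby(["Column1", "Column2", "Column3"])["Column4"]
            .transform("count")
)
```

`transform` keeps the original row shape, just like a window function does in SQL.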

Select multiple columns from a table, but group by one

I use this trick to group by one column when I have a multiple-column selection:

SELECT MAX(id) AS id,
       Nume,
       MAX(intrare) AS intrare,
       MAX(iesire) AS iesire,
       MAX(intrare - iesire) AS stoc,
       MAX(data) AS data
FROM Produse
GROUP BY Nume
ORDER BY Nume

This works, but note that the aggregated values may come from different source rows, so a result row is not guaranteed to correspond to any single row in the table.

Select all columns with GROUP BY one column

distinct on (PostgreSQL)

select distinct on (key) *
from t
order by key, name

Notice that the ORDER BY clause determines which row wins the ties.
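A rough pandas analogue of DISTINCT ON, with a hypothetical frame `t`: sort by the same columns as the ORDER BY so the preferred row comes first in each group, then keep the first row per key.

```python
import pandas as pd

# Hypothetical table t; "name" is the tie-breaking column
t = pd.DataFrame({
    "key":  ["a", "a", "b"],
    "name": ["y", "x", "z"],
    "val":  [2, 1, 3],
})

# Sort, then keep the first row per key --
# mirrors SELECT DISTINCT ON (key) * FROM t ORDER BY key, name
result = t.sort_values(["key", "name"]).drop_duplicates("key", keep="first")
```

Just as in SQL, the sort order decides which row survives for each key.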

Pandas Groupby and Sum Only One Column

The only way to do this would be to include C in your groupby (the groupby function can accept a list).

Give this a try:

df.groupby(['A','C'])['B'].sum()

One other thing to note, if you need to work with df after the aggregation you can also use the as_index=False option to return a dataframe object. This one gave me problems when I was first working with Pandas. Example:

df.groupby(['A','C'], as_index=False)['B'].sum()
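This also generalizes to the "group by all but one column" case from earlier: build the key list programmatically instead of writing it out. A sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["x", "x", "y"],
    "C": [1, 1, 2],
    "B": [10, 20, 30],
})

# Grouping keys = every column except B; then sum B per group
keys = [c for c in df.columns if c != "B"]
result = df.groupby(keys, as_index=False)["B"].sum()
```

This is handy when the frame has many columns and listing them by hand is impractical.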

How to use multiple columns but only one column in GROUP BY and ORDER BY

I think this is what you want:

SELECT
    r.id_rekrutmen,
    r.judul_rekrutmen,
    MAX(rw.tanggal_rekrutmen) AS tanggal_rekrutmen
FROM rekrutmen r
INNER JOIN rekrutmen_waktu rw
    ON r.id_rekrutmen = rw.id_rekrutmen
GROUP BY
    r.id_rekrutmen,
    r.judul_rekrutmen;
How do I summarise all columns except one(s) I specify?

Edit:

Modified versions of the two methods below for dplyr version >= 1, since summarise_at is superseded

df %>%
  summarise(across(where(is.numeric) & !Registered, sum))

df %>%
  summarise(across(-Registered, sum))

Original Answer:

I would use summarise_at, and build a logical vector that is TRUE for numeric columns other than Registered, then select those positions with which():

df %>%
  summarise_at(which(sapply(df, is.numeric) & names(df) != 'Registered'), sum)

If you wanted to just summarise all but one column you could do

df %>%
  summarise_at(vars(-Registered), sum)

but in this case you have to check if it's numeric also.

Notes:

  • factors are technically numeric, so if you want to exclude non-numeric columns and factors, replace sapply(df, is.numeric) with sapply(df, function(x) is.numeric(x) & !is.factor(x))

  • If your data is big I think it is faster to use sapply(df[1,], is.numeric) instead of sapply(df, is.numeric). (Someone please correct me if I'm wrong)
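For comparison, the same "sum every numeric column except one" idea can be sketched in pandas (hypothetical frame with a `Registered` column): drop the excluded column, keep only numeric columns, then sum.

```python
import pandas as pd

df = pd.DataFrame({
    "Registered": [1, 2, 3],
    "score": [10.0, 20.0, 30.0],
    "n":     [1, 2, 3],
    "label": ["a", "b", "c"],   # non-numeric, dropped by select_dtypes
})

# Sum every numeric column except Registered
result = df.drop(columns="Registered").select_dtypes("number").sum()
```

`select_dtypes("number")` handles the "check if it's numeric" step that the dplyr answer does with `where(is.numeric)`.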


