Grouped Operations That Result in Length Not Equal to 1 or Length of Group in Dplyr

grouped operations that result in length not equal to 1 or length of group in dplyr

In dplyr version 0.2 you could do this using the do operator:

> df %>% group_by(b) %>% do(data.frame(a = rep(.$a[1], 3)))
#Source: local data frame [6 x 2]
#Groups: b
#
#  b a
#1 1 1
#2 1 1
#3 1 1
#4 2 2
#5 2 2
#6 2 2
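
In more recent dplyr releases the same expansion can be written without do. The following is a minimal sketch, assuming dplyr 1.1.0 or later (where reframe() is available) and a hypothetical two-row df chosen to reproduce the output above; the df from the question is not shown here.

```r
library(dplyr)

# Hypothetical input: the question's df is not shown, so this one is an
# assumption chosen to match the printed result above.
df <- data.frame(a = c(1, 2), b = c(1, 2))

# reframe() lets each group return any number of rows (here 3 per group).
out <- df %>%
  group_by(b) %>%
  reframe(a = rep(a[1], 3))

out
```

Unlike do, reframe() always returns an ungrouped data frame, so add another group_by() afterwards if later steps need the grouping.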

Error: incompatible size when mutating in dplyr

Just expanding on @allistaire's comment.

The conditions you specified are the cause of the error, specifically tail(which(backward>0),1). The given code can also be simplified to get rid of the spread() call.

You can try:

dff <- df%>%
  group_by(group)%>%
  mutate(flip=as.numeric(flip),direc=ifelse(c(0,diff(ff))<0,"backward","forward"))%>%
  arrange(group)%>%
  mutate(c1=ff[head(which(direc=="forward" & flip > 0),1)])

It seems you want to identify, for each group, the inflection points where the direction changes. If so, please clarify exactly how flip is related. Alternatively, if you change flip <- c(c(0,0,1,1,1,1),c(1,1,0,0,0,0)) to flip <- c(c(0,0,1,1,1,1),c(1,1,0,1,1,1)), so that flip marks the change in direction of ff, you can use:

dff <- df%>%
  group_by(group)%>%
  mutate(flip=as.numeric(flip),direc=ifelse(c(0,diff(ff))<0,"backward","forward"))%>%
  arrange(group)%>%
  mutate(c1=ff[head(which(direc=="forward" & flip > 0),1)]) %>%
  mutate(c2=ff[tail(which(direc=="backward"& flip >0),1)])

which gives:

Source: local data frame [12 x 6]
Groups: group [2]

      ff  flip  group    direc    c1    c2
   <dbl> <dbl> <fctr>    <chr> <dbl> <dbl>
1    0.0     0      1  forward   0.2  -0.2
2    0.1     0      1  forward   0.2  -0.2
3    0.2     1      1  forward   0.2  -0.2
4    0.0     1      1 backward   0.2  -0.2
5   -0.1     1      1 backward   0.2  -0.2
6   -0.2     1      1 backward   0.2  -0.2
7    0.0     1      2  forward   0.0  -0.2
8    0.1     1      2  forward   0.0  -0.2
9    0.2     0      2  forward   0.0  -0.2
10   0.0     1      2 backward   0.0  -0.2
11  -0.1     1      2 backward   0.0  -0.2
12  -0.2     1      2 backward   0.0  -0.2
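
One common way this kind of size error arises is that mutate() needs each result to have length 1 or the length of the group, while an index such as head(which(...), 1) has length 0 in a group where no row matches. As a minimal sketch (the toy data below is hypothetical, not the question's df), indexing with [1] instead returns NA for such a group:

```r
library(dplyr)

# Hypothetical toy data: group "g2" has no row with flip > 0.
toy <- data.frame(
  group = c("g1", "g1", "g2", "g2"),
  ff    = c(0.1, 0.2, 0.3, 0.4),
  flip  = c(0, 1, 0, 0)
)

res <- toy %>%
  group_by(group) %>%
  # which(...)[1] is NA when nothing matches, so ff[NA] is a length-1 NA
  # rather than a zero-length vector.
  mutate(c1 = ff[which(flip > 0)[1]]) %>%
  ungroup()

res
```

Whether NA is the right value for a group without a match depends on the analysis; the point is only that the result has a length mutate() accepts.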

can I switch the grouping variable in a single dplyr statement?

Try this:

> df %>% group_by(b) %>% mutate(c = cumsum(a)) %>%
+        group_by(c) %>% mutate(d = cumsum(a))
Source: local data frame [4 x 4]
Groups: c

  a b c d
1 1 1 1 1
2 1 2 1 2
3 2 1 3 2
4 2 2 3 4

Update

With newer versions of dplyr, use %>% rather than %.%. An ungroup() between the two group_by() calls is no longer needed (as per David Arenburg's comment).
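
To check that the second group_by() really replaces the first grouping, the sketch below rebuilds df from the a and b columns of the printed output above (an assumption, since the question's df is not shown) and inspects the result with group_vars():

```r
library(dplyr)

# Input reconstructed from the a and b columns printed above.
df <- data.frame(a = c(1, 1, 2, 2), b = c(1, 2, 1, 2))

out <- df %>%
  group_by(b) %>% mutate(c = cumsum(a)) %>%
  group_by(c) %>% mutate(d = cumsum(a))

# By default group_by() overrides any existing grouping,
# so only "c" remains as a grouping variable.
group_vars(out)
```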

Use group size (`group_size`) in `summarise` in `dplyr`

You can probably use n() to get the number of rows in each group:

library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(zz = sum(am)/n())

#    cyl    zz
#  <dbl> <dbl>
#1  4.00 0.727
#2  6.00 0.429
#3  8.00 0.143

dplyr - filter by group size

You can do it more concisely with n():

library(dplyr)
dat %>% group_by(cat) %>% filter(n() == 5)
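
Since dat is not shown, here is a small self-contained sketch with a hypothetical dat in which one category has 5 rows and the other has 3. Inside filter(), n() is evaluated per group, so every row of a group is kept or dropped together:

```r
library(dplyr)

# Hypothetical data: "x" has 5 rows, "y" has 3.
dat <- data.frame(
  cat = rep(c("x", "y"), times = c(5, 3)),
  val = 1:8
)

kept <- dat %>%
  group_by(cat) %>%
  filter(n() == 5) %>%   # keep only groups with exactly 5 rows
  ungroup()

kept
```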

dplyr repetition within %>% operator

To get the same result as cbind, we can use do. As @DavidArenburg mentioned, summarise outputs a single value/row per group combination, whereas mutate returns output with the same number of rows as the input. Here, however, we are doing a different operation, which can be done within do. In the code, . signifies the dataset for the current group. If we wanted to extract the 'id' column from dt4, we could use either dt4$id or dt4[['id']]; inside do, replace dt4 with . (that is, .$id).

library(dplyr)
dt4 %>% 
    group_by(id) %>%
    do(data.frame(id=.$id, v1=rep(.$dayweek, .$n)))
#Source: local data frame [63 x 2]
#Groups: id

#  id       v1
#1   1   Friday
#2   1   Friday
#3   1   Friday
#4   1   Monday
#5   1   Monday
#6   1   Monday
#7   1 Saturday
#8   1 Saturday
#9   1 Saturday
#10  1   Sunday
#.. ..      ...

Or another option based on @Frank's comments would be to specify the row index generated from rep inside slice and select the columns that we need to keep.

dt4 %>%
     slice(rep(1:n(),n)) %>%
     select(-n)
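
A runnable sketch of the slice approach, using a hypothetical dt4 (the real one is not shown) with a count column n that says how often each row should be repeated:

```r
library(dplyr)

# Hypothetical stand-in for dt4: n holds the repeat count per row.
dt4 <- data.frame(
  id      = c(1, 1, 2),
  dayweek = c("Mon", "Tue", "Wed"),
  n       = c(2, 1, 3)
)

expanded <- dt4 %>%
  # n() is the number of rows; the bare n is the count column.
  slice(rep(seq_len(n()), times = n)) %>%
  select(-n)

expanded
```

Each row index is repeated n times, so the result has sum(dt4$n) rows in the original row order.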

When to use Do function in dplyr

The comments under the question discuss that in many cases you can find an alternative in dplyr or associated packages that avoid the use of do and the examples in the question are of that sort; however, to answer the question directly rather than via alternatives:

Differences between using do and not using it

Within the context of data frames, the key differences between using do and not using do are:

No automatic insertion of dot. The code within do will not have dot automatically inserted as the first argument. For example, instead of the do(summarise(Mean_2014 = mean(Y2014))) code in the question, one would have to write do(summarise(., Mean_2014 = mean(Y2014))) with an explicit dot. This is a consequence of do, rather than summarise, being the right-hand-side function of %>%. This is important to understand so that we insert the dot when needed. If the purpose were simply to avoid automatic insertion of dot into the first argument, we could alternatively use braces to get that effect: whatever %>% { myfun(arg1, arg2) } would also not automatically insert dot as the first argument of the myfun call.

Respecting group_by. Only functions specifically written to respect group_by will do so. There are two issues here. (1) Only such functions will be run once for each group. mutate, summarize and do are examples of functions that run once per group (there are others too). (2) Even if the function is run once for each group, there is the question of how dot is handled. We focus on two cases (not a complete list): (i) if do is not used, then a dot used within a function call inside an argument expression refers to the entire input, ignoring group_by. Presumably this is a consequence of magrittr's dot substitution rules, which know nothing about group_by. On the other hand, (ii) within do, dot always refers to the rows of the current group. For example, compare the output of these two and note that dot refers to 3 rows in the first case, where do is used, and to all 6 rows in the second, where it is not. This is despite the fact that summarize respects group_by in that it runs once per group.
```
BOD$g <- c(1, 1, 1, 2, 2, 2)
BOD %>% group_by(g) %>% do(summarize(., nr = nrow(.)))
## # A tibble: 2 x 2
## # Groups: g [2]
##       g    nr
##   <dbl> <int>
## 1  1.00     3
## 2  2.00     3

BOD %>% group_by(g) %>% summarize(nr = nrow(.))
## # A tibble: 2 x 2
##       g    nr
##   <dbl> <int>
## 1  1.00     6
## 2  2.00     6
```

See ?do for more information.
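
In recent dplyr versions do is marked as superseded. As a hedged sketch of one alternative (assuming dplyr 0.8.1 or later, where group_modify() is available), the per-group row count from the comparison above can be obtained either with group_modify(), whose .x is the current group's rows, or simply with n(). The copy named bod is introduced here only to avoid modifying the built-in BOD.

```r
library(dplyr)

bod <- BOD                       # copy of the built-in 6-row dataset
bod$g <- c(1, 1, 1, 2, 2, 2)

# .x is the data for the current group only, so nrow(.x) is 3 in each group.
per_group <- bod %>%
  group_by(g) %>%
  group_modify(~ summarize(.x, nr = nrow(.x)))

# n() gives the same per-group count without any dot at all.
with_n <- bod %>%
  group_by(g) %>%
  summarize(nr = n())

per_group
with_n
```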

Code from Question

Now we go through the code in the question. As mydata was never defined in the question we use the first line of code below to define it to facilitate concrete examples.

mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)

mydata %>% 
       filter(Index %in% c("A", "C", "I")) %>% 
       group_by(Index) %>% 
       do(head(., 2))

## # A tibble: 6 x 2
## # Groups: Index [3]
##   Index  Y2014
##   <fctr> <dbl>
## 1 A       1.00
## 2 A       1.00
## 3 C       1.00
## 4 C       1.00
## 5 I       1.00
## 6 I       1.00

The code above produces 2 rows for each of the 3 groups giving 6 rows. Had we omitted do then it would disregard group_by and produce only two rows with dot being regarded as the entire 9 rows of input, not just each group at a time. (In this particular case dplyr provides its own alternative to head that avoids these problems but for sake of illustrating the general point we stick to the code in the question.)
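
For reference, a sketch of the dplyr-native alternative to head mentioned above, assuming dplyr 1.0.0 or later for slice_head() (in older versions slice(1:2) plays the same role). It respects group_by without do:

```r
library(dplyr)

mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)

first_two <- mydata %>%
  group_by(Index) %>%
  slice_head(n = 2)   # first 2 rows of each group

first_two
```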

The following code from the question generates an error because dot insertion is not done within do and so what ought to be the first argument of summarize, i.e. the data frame input, is missing:

mydata %>% 
       group_by(Index) %>% 
       do(summarise(Mean_2014 = mean(Y2014)))
## Error in mean(Y2014) : object 'Y2014' not found

If we remove the do in the above code, as in the last line of code in the question, then it works, since the dot insertion is performed. Alternatively, if we add the dot, do(summarise(., Mean_2014 = mean(Y2014))), it would also work, although do really seems superfluous in this case: summarize already respects group_by, so there is no need to wrap it in do.

mydata %>% 
       group_by(Index) %>% 
       summarise(Mean_2014 = mean(Y2014))

## # A tibble: 3 x 2
##   Index  Mean_2014
##   <fctr>     <dbl>
## 1 A           1.00
## 2 C           1.00
## 3 I           1.00

Create a column in R to compare values within a group and flag as greater than (1), less than (0) or equal (2)

df %>%
  group_by(Round) %>%
  mutate( Flag1 = replace(rank(Score) - 1, length(unique(Score)) == 1, 2))

  Round Team  Score  Flag Flag1
  <int> <chr> <int> <int> <dbl>
1     1 Team1     4     0     0
2     1 Team2     8     1     1
3     2 Team1     9     1     1
4     2 Team2     2     0     0
5     3 Team1     6     2     2
6     3 Team2     6     2     2
7     4 Team1    14     1     1
8     4 Team2     9     0     0
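
The one-liner above relies on rank(Score) - 1 giving 0 and 1 within a two-team round, and on replace() overwriting every value with 2 when the round has a single distinct score. A more explicit sketch of the same logic, with the input rebuilt from the Round, Team and Score columns printed above and assuming exactly two teams per round:

```r
library(dplyr)

# Input reconstructed from the printed table above.
df <- data.frame(
  Round = rep(1:4, each = 2),
  Team  = rep(c("Team1", "Team2"), times = 4),
  Score = c(4, 8, 9, 2, 6, 6, 14, 9)
)

flagged <- df %>%
  group_by(Round) %>%
  mutate(Flag1 = if (n_distinct(Score) == 1) {
    2                               # tie: a single value, recycled over the group
  } else {
    as.numeric(Score == max(Score)) # 1 for the higher score, 0 for the lower
  }) %>%
  ungroup()

flagged
```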

Sample from a data frame using group-specific sample sizes

The easiest thing I can think of is a map2 solution using purrr.

library(dplyr)
library(purrr)

df %>% 
  group_split(group) %>% 
  map2_dfr(c(4, 5), ~ slice_sample(.x, n = .y))

# A tibble: 9 x 2
  group   value
  <chr>   <dbl>
1 A     -0.687 
2 A      1.56  
3 A      0.0705
4 A      1.72  
5 B     -0.560 
6 B      0.461 
7 B      0.129 
8 B      0.0705
9 B     -0.230

One caution is that you need to know the order of the split. I think group_split() orders the pieces by the sorted values of the grouping variable (as if it were a factor). A way around that is to adapt the code like this and look up n from a named vector:

group_slice_n <- c(A = 4, B = 5)

df %>% 
  split(.$group) %>% 
  imap_dfr(~ slice_sample(.x, n = group_slice_n[.y]))
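
To make the name-based lookup concrete, here is a self-contained sketch with a hypothetical df (10 rows per group). The sizes vector is deliberately written in a different order from the split, and [[.y]] is used so that a plain unnamed number is passed to slice_sample():

```r
library(dplyr)
library(purrr)

set.seed(42)

# Hypothetical data: two groups of 10 rows each.
df <- data.frame(
  group = rep(c("A", "B"), each = 10),
  value = rnorm(20)
)

# Sizes are looked up by group name, so their order does not matter.
group_slice_n <- c(B = 5, A = 4)

sampled <- df %>%
  split(.$group) %>%
  imap_dfr(~ slice_sample(.x, n = group_slice_n[[.y]]))

table(sampled$group)
```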

R - use group_by() and mutate() in dplyr to apply function that returns a vector the length of groups

How about making use of nest instead:

foo %>%
    group_by(fac) %>%
    nest() %>%
    mutate(mahal = map(data, ~mahalanobis(
        .x,
        center = colMeans(.x, na.rm = T),
        cov = cov(.x, use = "pairwise.complete.obs")))) %>%
    unnest()
## A tibble: 10 x 4
#   fac   mahal      x       y
#   <fct> <dbl>  <dbl>   <dbl>
# 1 A     1.02   -6.26  15.1
# 2 A     0.120   1.84   3.90
# 3 A     2.81   -8.36  -6.21
# 4 A     2.84   16.0  -22.1
# 5 A     1.21    3.30  11.2
# 6 B     2.15   -8.20  -0.449
# 7 B     2.86    4.87  -0.162
# 8 B     1.23    7.38   9.44
# 9 B     0.675   5.76   8.21
#10 B     1.08   -3.05   5.94

Here you avoid an explicit "x", "y" filter of the form temp <- x[, c("x", "y")], as you nest the relevant columns after grouping by fac. Applying mahalanobis is then straightforward.
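
For comparison, when only two known columns are involved, a plain grouped mutate can also return a vector the length of each group. This is a sketch of that alternative rather than the nest approach above; the foo below is an assumption that follows the sample data generated later in this answer, without z.

```r
library(dplyr)

set.seed(1)
foo <- data.frame(
  x   = rnorm(10, 0, 10),
  y   = rnorm(10, 0, 10),
  fac = rep(c("A", "B"), each = 5)
)

res <- foo %>%
  group_by(fac) %>%
  mutate(mahal = {
    m <- cbind(x, y)                       # matrix of this group's rows only
    mahalanobis(m, colMeans(m), cov(m))    # one distance per row of the group
  }) %>%
  ungroup()

res
```

Inside a grouped mutate, x and y are already restricted to the current group, so the result has exactly the length mutate expects.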

Update

To respond to your comment, here is a purrr option. Since it's easy to lose track of what's going on, let's go step by step:

Generate sample data with one additional column.

set.seed(1)
foo <- data.frame(
    x = rnorm(10, 0, 10),
    y = rnorm(10, 0, 10),
    z = rnorm(10, 0, 10),
    fac = c(rep("A", 5), rep("B", 5)))

We now store the columns which define the subset of the data to be used for the calculation of the Mahalanobis distance in a list:
```
cols <- list(cols1 = c("x", "y"), cols2 = c("y", "z"))
```
So we will calculate the Mahalanobis distance (per fac) for the subset of data in columns x+y and then separately for y+z. The names of cols will be used as the column names of the two distance vectors.

Now for the actual purrr command:

imap_dfc(cols, ~nest(foo %>% group_by(fac), .x, .key = !!.y) %>% select(!!.y)) %>%
    mutate_all(function(lst) map(lst, ~mahalanobis(
        .x,
        center = colMeans(.x, na.rm = T),
        cov = cov(.x, use = "pairwise.complete.obs")))) %>%
    unnest() %>%
    bind_cols(foo, .)
#           x           y           z fac     cols1     cols2
#1  -6.264538  15.1178117   9.1897737   A 1.0197542 1.3608052
#2   1.836433   3.8984324   7.8213630   A 0.1199607 1.1141352
#3  -8.356286  -6.2124058   0.7456498   A 2.8059562 1.5099574
#4  15.952808 -22.1469989 -19.8935170   A 2.8401953 3.0675228
#5   3.295078  11.2493092   6.1982575   A 1.2141337 0.9475794
#6  -8.204684  -0.4493361  -0.5612874   B 2.1517055 1.2284793
#7   4.874291  -0.1619026  -1.5579551   B 2.8626501 1.1724828
#8   7.383247   9.4383621 -14.7075238   B 1.2271316 2.5723023
#9   5.757814   8.2122120  -4.7815006   B 0.6746788 0.6939081
#10 -3.053884   5.9390132   4.1794156   B 1.0838341 2.3328276

In short, we

loop over entries in cols,
nest data in foo per fac based on columns defined in cols,
apply mahalanobis on the nested and grouped data generating as many distance columns with nested data as we have entries in cols (i.e. subsets), and
finally unnest the distance data and column-bind it to the original foo data.
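
The same four steps can also be sketched in base R, which may be easier to follow. This is an illustrative alternative and not the purrr pipeline above; dist_by_group is a hypothetical helper introduced only for this sketch.

```r
set.seed(1)
foo <- data.frame(
  x = rnorm(10, 0, 10),
  y = rnorm(10, 0, 10),
  z = rnorm(10, 0, 10),
  fac = c(rep("A", 5), rep("B", 5)))

cols <- list(cols1 = c("x", "y"), cols2 = c("y", "z"))

# Hypothetical helper: Mahalanobis distance of each row to the centre of its
# own group, using only the given columns.
dist_by_group <- function(data, columns, group) {
  out <- numeric(nrow(data))
  for (i in split(seq_len(nrow(data)), data[[group]])) {
    m <- as.matrix(data[i, columns])
    out[i] <- mahalanobis(m, colMeans(m), cov(m))
  }
  out
}

# Loop over the entries of cols, then column-bind the distances to foo.
dists <- lapply(cols, dist_by_group, data = foo, group = "fac")
res <- cbind(foo, as.data.frame(dists))
res
```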
