Using Dplyr for Frequency Counts of Interactions, Must Include Zero Counts

Here's a simple option, using data.table instead:

library(data.table)

dt <- as.data.table(your_df)
setkey(dt, id, date)

# data.table 1.9.3+: CJ() builds every id/date combination; joining it back
# with by = .EACHI counts the matching rows per combination (0 when none)
dt[CJ(unique(id), unique(date)), .N, by = .EACHI]
#           id       date N
#  1: Andrew13 2006-08-03 0
#  2: Andrew13 2007-09-11 1
#  3: Andrew13 2008-06-12 0
#  4: Andrew13 2008-10-11 0
#  5: Andrew13 2009-07-03 0
#  6:   John12 2006-08-03 1
#  7:   John12 2007-09-11 0
#  8:   John12 2008-06-12 0
#  9:   John12 2008-10-11 0
# 10:   John12 2009-07-03 0
# 11:  Lisa825 2006-08-03 0
# 12:  Lisa825 2007-09-11 0
# 13:  Lisa825 2008-06-12 0
# 14:  Lisa825 2008-10-11 0
# 15:  Lisa825 2009-07-03 1
# 16:  Tom2993 2006-08-03 0
# 17:  Tom2993 2007-09-11 0
# 18:  Tom2993 2008-06-12 1
# 19:  Tom2993 2008-10-11 1
# 20:  Tom2993 2009-07-03 0

In versions 1.9.2 and earlier, the equivalent expression omits the explicit by:

dt[CJ(unique(id), unique(date)), .N]

The idea is to create all possible pairs of id and date (which is what the CJ part does) and then join that back to the data, counting occurrences.
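
Since the question asks for dplyr, here is a minimal sketch of the same idea with dplyr and tidyr (assuming your_df has id and date columns, as above): count the observed pairs, then use tidyr::complete to add the missing pairs with a zero count.

library(dplyr)
library(tidyr)

your_df %>%
  count(id, date) %>%                        # count the pairs that occur
  complete(id, date, fill = list(n = 0L))    # add missing pairs with n = 0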

Relative frequencies / proportions with dplyr

Try this:

mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))

#   am gear  n      freq
# 1  0    3 15 0.7894737
# 2  0    4  4 0.2105263
# 3  1    4  8 0.6153846
# 4  1    5  5 0.3846154

From the dplyr vignette:

When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset.

Thus, after the summarise, the last grouping variable specified in group_by, 'gear', is peeled off. In the mutate step, the data is therefore grouped by the remaining grouping variable(s), here 'am'. You can check the grouping at each step with groups().

The outcome of the peeling of course depends on the order of the grouping variables in the group_by call. You may wish to add an explicit group_by(am) after the summarise to make your code clearer, as in the sketch below.
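
A minimal sketch of that explicit version, using the same mtcars example as above (count() is shorthand for group_by() followed by summarise(n = n())):

library(dplyr)

mtcars %>%
  count(am, gear) %>%        # n per am/gear combination
  group_by(am) %>%           # state the grouping for the proportions explicitly
  mutate(freq = n / sum(n))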

For rounding and prettification, please refer to the nice answer by @Tyler Rinker.

Is there a way to show the zero-counts by using dplyr on sample data?

You can do a left join:

library(dplyr)

numbofrunsperside %>%
  left_join(
    sampledata_hit_counts,
    by = c("StartPos", "Direction"),
    suffix = c("_runs", "_hits")
  ) %>%
  mutate(
    p_test = ifelse(is.na(n_hits), 0, n_hits) / n_runs   # NA means zero hits
  ) %>%
  pull(p_test)
#[1] 0.2000000 0.0000000 0.0000000 0.1666667 0.0000000 0.0000000 0.3333333 0.1428571 0.0000000 0.1250000 0.1666667 0.5000000 0.2000000
#[14] 0.4000000 0.1666667 0.0000000 0.0000000 0.3333333 0.5000000 0.0000000
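
The ifelse is needed because a left join returns NA for StartPos/Direction combinations with no matching hits, and those NAs stand for zero. dplyr's coalesce() expresses the same thing a bit more compactly (a sketch with the same assumed column names as above):

library(dplyr)

numbofrunsperside %>%
  left_join(sampledata_hit_counts,
            by = c("StartPos", "Direction"),
            suffix = c("_runs", "_hits")) %>%
  mutate(p_test = coalesce(n_hits, 0) / n_runs) %>%   # coalesce: NA -> 0
  pull(p_test)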

Create data of frequency of interactions between variables using R

Using data.table, you can probably do something like:

library(data.table)

# convert into a data.table
setDT(B1)

# create interactions between animals in the same location & month
ans <- B1[, if (.N > 1L) transpose(combn(unique(Animal), 2L, simplify=FALSE)),
          by=.(Location, Month)]

# change column names to the desired column names
setnames(ans, paste0("V", 1L:2L), paste0("Animal", 1L:2L))

# sort animals so that A, B and B, A count as the same pair
ans[, paste0("Animal", 1L:2L) := .(pmin(Animal1, Animal2), pmax(Animal1, Animal2))]

# count the number of interactions per pair
ans[, .(NumInteract=.N), by=c(paste0("Animal", 1L:2L))]

Output:

   Animal1 Animal2 NumInteract
1:       A       B           1
2:       A       D           1
3:       B       D           3
4:       C       D           2
5:       A       C           1
6:       D       E           1
7:       B       C           1
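
For comparison, a dplyr sketch of the same pairing via a self-join (assuming B1 has Animal, Location and Month columns as above; the relationship argument requires dplyr 1.1.0+):

library(dplyr)

# one row per animal per location/month
animals <- distinct(B1, Location, Month, Animal)

inner_join(animals, animals, by = c("Location", "Month"),
           relationship = "many-to-many") %>%
  filter(Animal.x < Animal.y) %>%     # keep each unordered pair once
  count(Animal1 = Animal.x, Animal2 = Animal.y, name = "NumInteract")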

count frequency by year with dplyr (conditional count)

Here is another tidyverse method. Put simply, we pivot the dataframe from long to wide and then summarize twice. The first summarization discards everything but the "A" tools; the second condenses the result into unique bins identified by each toolA total and produces a count.

library(dplyr)
library(tidyr)

df %>%
  mutate(value = +(Tool == "A")) %>%
  pivot_wider(names_from = Year, values_fill = 0L) %>%
  group_by(ID) %>%
  summarize(across(-Tool, sum)) %>%
  group_by(toolA = rowSums(across(-ID))) %>%
  summarize(count = n(), across(-c(ID, count), sum))

Output

# A tibble: 4 x 5
  toolA count `2000` `2001` `2002`
  <dbl> <int>  <int>  <int>  <int>
1     0     1      0      0      0
2     1     2      1      0      1
3     2     1      0      1      1
4     3     1      1      1      1
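
If you only need the toolA totals and their counts (not the per-year breakdown), a shorter sketch under the same assumptions (columns ID, Tool and Year, with at most one row per ID and Year):

library(dplyr)

df %>%
  group_by(ID) %>%
  summarize(toolA = sum(Tool == "A")) %>%   # number of years each ID used tool A
  count(toolA, name = "count")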

Using R - frequency counts with variable binwidths and factors

The following snippet should do what you want. I loaded your sample data into df.

library("dplyr")
df %>% group_by(sample.type, leaf.side, canopy, treatment) %>%
dplyr::select(Feret) %>%
do(data.frame(table(cut(.$Feret, breaks=bins, include.lowest=T))))

I refer you to the dplyr documentation. In short, x %>% f is f(x) and x %>% f(a) is f(x, a).
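
For example, these pairs of calls are equivalent (%>% is re-exported by dplyr):

library(dplyr)

sqrt(16)               # 4
16 %>% sqrt            # 4     -- x %>% f is f(x)
round(3.14159, 2)      # 3.14
3.14159 %>% round(2)   # 3.14  -- x %>% f(a) is f(x, a)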

Note that dplyr::select is just select, but I have run into namespace issues so many times that now I always specify the package.

table(cut(df$Feret, breaks=bins)) is just a nicer way to do what you did with hist. With cut, you create a factor variable (remember to add include.lowest=TRUE if your values can reach the lowest break), and with table, you count the frequency of each level.
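
A toy illustration of the cut/table combination (hypothetical values and breaks):

x <- c(0.02, 0.05, 0.2, 5)
bins <- c(0.01, 0.03, 0.1, 0.3, 1, 3, 10)
table(cut(x, breaks = bins, include.lowest = TRUE))
# (0.01,0.03]  (0.03,0.1]   (0.1,0.3]     (0.3,1]       (1,3]      (3,10]
#           1           1           1           0           0           1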

This gives:

   sample.type leaf.side canopy treatment        Var1 Freq
 1      flower     upper    top     green (0.01,0.03]    0
 2      flower     upper    top     green  (0.03,0.1]    6
 3      flower     upper    top     green   (0.1,0.3]    1
 4      flower     upper    top     green     (0.3,1]    0
 5      flower     upper    top     green       (1,3]    1
 6      flower     upper    top     green      (3,10]    3
 7      flower     upper    top     white (0.01,0.03]    4
 8      flower     upper    top     white  (0.03,0.1]    4
 9      flower     upper    top     white   (0.1,0.3]    0
10      flower     upper    top     white     (0.3,1]    0
11      flower     upper    top     white       (1,3]    0
12      flower     upper    top     white      (3,10]    3
13        leaf     lower bottom     white (0.01,0.03]    5
14        leaf     lower bottom     white  (0.03,0.1]    4
15        leaf     lower bottom     white   (0.1,0.3]    1
16        leaf     lower bottom     white     (0.3,1]    1
17        leaf     lower bottom     white       (1,3]    0
18        leaf     lower bottom     white      (3,10]    0
19        leaf     lower    top      grey (0.01,0.03]   10
20        leaf     lower    top      grey  (0.03,0.1]    1
21        leaf     lower    top      grey   (0.1,0.3]    0
22        leaf     lower    top      grey     (0.3,1]    0
23        leaf     lower    top      grey       (1,3]    0
24        leaf     lower    top      grey      (3,10]    0
25        leaf     upper bottom     white (0.01,0.03]    4
26        leaf     upper bottom     white  (0.03,0.1]    6
27        leaf     upper bottom     white   (0.1,0.3]    1
28        leaf     upper bottom     white     (0.3,1]    0
29        leaf     upper bottom     white       (1,3]    0
30        leaf     upper bottom     white      (3,10]    0
31        leaf     upper    top      blue (0.01,0.03]   10
32        leaf     upper    top      blue  (0.03,0.1]    0
33        leaf     upper    top      blue   (0.1,0.3]    0
34        leaf     upper    top      blue     (0.3,1]    0
35        leaf     upper    top      blue       (1,3]    1
36        leaf     upper    top      blue      (3,10]    0

(Actually, it doesn't print quite like this, since the result is a tbl, but you can use print.data.frame to print a tbl the old way.)
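
For instance (result is a hypothetical name for the tbl produced by the pipeline above):

print.data.frame(result)   # prints every row in the classic data.frame style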

From here it should be straightforward to extract the info you want.

How to get frequency counts using column breaks by row?

One more solution, based on base R's rle:

library(dplyr)

dat %>%
  group_by(name) %>%
  # count the separate runs of 1s in srvc_inv per name
  summarise(ever_inv = length(with(rle(srvc_inv), lengths[values == 1])))

# A tibble: 1 x 2
  name  ever_inv
  <fct>    <int>
1 Bob          2
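
To see how the rle piece works, here is a toy vector (hypothetical srvc_inv values): rle encodes the vector as runs, lengths[values == 1] keeps only the runs of 1s, and length counts those runs.

srvc_inv <- c(0, 1, 1, 0, 0, 1, 0)
rle(srvc_inv)
# Run Length Encoding
#   lengths: int [1:5] 1 2 2 1 1
#   values : num [1:5] 0 1 0 1 0
length(with(rle(srvc_inv), lengths[values == 1]))
# [1] 2    (two separate runs of 1s)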

Cross tabulation of co-occurring pairs of variables

Since the columns are binary (1 or 0), you can also do this by multiplying the columns together, which yields 1 only where both columns are 1, and then summing:

out <- sapply(df, function(x) colSums(df * x))
diag(out) <- NA
out
#       var.1 var.2 var.3 var.4
# var.1    NA     1     1     1
# var.2     1    NA     2     1
# var.3     1     2    NA     2
# var.4     1     1     2    NA

Or, using matrix multiplication:

out <- t(df) %*% as.matrix(df)
diag(out) <- NA
out

#       var.1 var.2 var.3 var.4
# var.1    NA     1     1     1
# var.2     1    NA     2     1
# var.3     1     2    NA     2
# var.4     1     1     2    NA
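
Base R's crossprod() computes the same t(x) %*% x product in one step:

out <- crossprod(as.matrix(df))   # same result as t(df) %*% as.matrix(df)
diag(out) <- NA
out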

