Cumulative Count of Unique Values in R

Cumulative count of unique values over time

Give the countries an ID number based on first appearance, and then the cumulative count is the same as the cumulative max of that ID:

mydf = mydf[order(mydf$Year, mydf$Country), ]
mydf$country_id = as.integer(factor(mydf$Country, levels = unique(mydf$Country)))
mydf$cum_n_country = cummax(mydf$country_id)

If years are repeated, you'll need to aggregate/summarize the max cum_n_country by year.

library(dplyr)
library(ggplot2)
mydf %>%
  group_by(Year) %>%
  summarize(cum_n_country = max(cum_n_country)) %>%
  ggplot(aes(x = Year, y = cum_n_country)) + 
  geom_line()
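To make the ID/cummax idea concrete, here is a minimal base R sketch on made-up data. The rows and values are hypothetical; only the column names follow the snippet above, and aggregate() stands in for the dplyr summarize step.

```r
# Hypothetical toy data: five observations, three distinct countries.
mydf <- data.frame(
  Year    = c(2001, 2000, 2002, 2001, 2002),
  Country = c("FR", "DE", "DE", "DE", "IT"),
  stringsAsFactors = FALSE
)

# Sort by time so that "first appearance" means "earliest year".
mydf <- mydf[order(mydf$Year, mydf$Country), ]

# IDs follow order of first appearance: DE = 1, FR = 2, IT = 3.
mydf$country_id <- as.integer(factor(mydf$Country, levels = unique(mydf$Country)))

# The running maximum of the ID is the number of countries seen so far.
mydf$cum_n_country <- cummax(mydf$country_id)   # 1 1 2 2 3

# Years are repeated here, so keep the largest running count per year.
by_year <- aggregate(cum_n_country ~ Year, data = mydf, FUN = max)
print(by_year)
```

On this data the yearly counts come out as 1, 2 and 3 for 2000, 2001 and 2002.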

R: Calculating cumulative number of unique entries

Here's another solution with dplyr:

library(dplyr)

test %>%
  mutate(cum_unique_entries = cumsum(!duplicated(entries))) %>%
  group_by(exp) %>%
  slice(n()) %>%
  select(-entries)

Or, using summarise() instead of slice():

test %>%
  mutate(cum_unique_entries = cumsum(!duplicated(entries))) %>%
  group_by(exp) %>%
  summarise(cum_unique_entries = last(cum_unique_entries))

Result:

# A tibble: 4 x 2
     exp cum_unique_entries
  <fctr>              <int>
1   exp1                  4
2   exp2                  6
3   exp3                  7
4   exp4                  9

Note:

First compute the cumulative sum of all non-duplicates, cumsum(!duplicated(entries)), over the whole data frame. Then group_by exp and take the last value in each group: that number is the cumulative count of unique entries up to and including that group. This relies on the rows being ordered by exp.
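The intermediate steps described in the note can be inspected one at a time. The following is a small base R sketch on invented data (the values in `test` are hypothetical), with tapply() standing in for the grouped "take the last value" step.

```r
# Hypothetical toy data, already ordered by exp.
test <- data.frame(
  exp     = c("exp1", "exp1", "exp2", "exp2", "exp3"),
  entries = c("a", "b", "a", "c", "c"),
  stringsAsFactors = FALSE
)

# Step 1: flag the first occurrence of every entry.
is_new <- !duplicated(test$entries)   # TRUE TRUE FALSE TRUE FALSE

# Step 2: running total of first occurrences.
running <- cumsum(is_new)             # 1 2 2 3 3

# Step 3: the last running total within each exp.
last_per_exp <- tapply(running, test$exp, function(x) x[length(x)])
print(last_per_exp)                   # exp1 = 2, exp2 = 3, exp3 = 3
```

exp3 adds no new entry here, so its cumulative count stays at 3.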

Cumulative count of unique values per group

Another possibility using ave:

df$obs <- with(df, ave(elig_end_date, names,
                       FUN = function(x) cumsum(!duplicated(x))))

#    names date_of_claim elig_end_date obs
# 1    tom    2010-01-01    2010-07-01   1
# 2    tom    2010-05-04    2010-07-01   1
# 3    tom    2010-06-01    2014-01-01   2
# 4    tom    2010-10-10    2014-01-01   2
# 5   mary    2010-03-01    2014-06-14   1
# 6   mary    2010-05-01    2014-06-14   1
# 7   mary    2010-08-01    2014-06-14   1
# 8   mary    2010-11-01    2014-06-14   1
# 9   mary    2011-01-01    2014-06-14   1
# 10  john    2010-03-27    2011-03-01   1
# 11  john    2010-07-01    2011-03-01   1
# 12  john    2010-11-01    2011-03-01   1
# 13  john    2011-02-01    2011-03-01   1
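One caveat worth checking on your own data: ave() writes the results back into a copy of its first argument, so the type of that column carries over to the output. The sketch below uses a shortened, hypothetical version of the data with the dates stored as character strings, and shows one possible workaround (recoding the dates as integers first).

```r
# Hypothetical subset of the data; dates are plain character strings here.
df <- data.frame(
  names = c("tom", "tom", "tom", "mary", "mary"),
  elig_end_date = c("2010-07-01", "2010-07-01", "2014-01-01",
                    "2014-06-14", "2014-06-14"),
  stringsAsFactors = FALSE
)

# Passing the character column directly gives counts stored as character.
obs_chr <- ave(df$elig_end_date, df$names,
               FUN = function(x) cumsum(!duplicated(x)))

# Recoding the dates as integer codes first keeps the result an integer vector.
date_code <- as.integer(factor(df$elig_end_date))
df$obs <- ave(date_code, df$names,
              FUN = function(x) cumsum(!duplicated(x)))
print(df)   # obs is 1 1 2 for tom and 1 1 for mary
```

The counts themselves are the same either way; only the storage type differs.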

Counting the cumulative sum of unique values in a vector

We can use count() from dplyr, which returns the number of occurrences of each distinct value:

library(tidyverse)
count(tibble(v1 = vector), v1) %>%
   pull(n)
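For comparison, a base R sketch of the same tally, plus the running count of distinct values in case that is what is wanted instead. The vector `x` is a made-up stand-in for the vector in the answer.

```r
# Hypothetical input vector.
x <- c("b", "a", "b", "c", "a", "b")

# Occurrences of each distinct value, in sorted order of the values (a, b, c).
counts <- table(x)
as.vector(counts)                            # 2 3 1

# Running number of distinct values seen so far, in the original order.
running_distinct <- cumsum(!duplicated(x))   # 1 2 2 3 3 3
```

table() answers "how often does each value occur", while cumsum(!duplicated(x)) answers "how many different values have appeared up to this position".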

Cumulative sum of unique values based on multiple criteria

This could help, without the need for a join.

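library(dplyr)

df %>% arrange(Country, Site, species, Year) %>%
  filter(Year > 1980) %>%
  group_by(Site, species) %>%
  # nYear: number of distinct years in each Site/species group
  mutate(nYear = length(unique(Year))) %>%
  # rowid() comes from data.table; it numbers the rows within each group
  mutate(spsum = data.table::rowid(species))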

# A tibble: 30 x 6
# Groups:   Site, species [5]
   Country Site  species  Year nYear spsum
   <chr>   <chr>   <int> <int> <int> <int>
 1 A       F           1  1981     6     1
 2 A       F           1  1986     6     2
 3 A       F           1  1991     6     3
 4 A       F           1  1996     6     4
 5 A       F           1  2001     6     5
 6 A       F           1  2006     6     6
 7 B       G           2  1982     6     1
 8 B       G           2  1987     6     2
 9 B       G           2  1992     6     3
10 B       G           2  1997     6     4
# ... with 20 more rows

Cumulative sum of unique events for each year

One dplyr option could be:

df %>%
    group_by(id) %>%
    mutate(cum_sum = cumsum(!duplicated(event))) %>%
    group_by(id, year) %>%
    summarise(cum_sum = max(cum_sum))

  id     year cum_sum
  <chr> <dbl>   <int>
1 1      1900       3
2 1      1901       3
3 1      1902       5
4 2      1900       1
5 2      1901       3
6 3      1900       1
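The two-stage logic (running distinct count within each id, then one value per id and year) can also be sketched in base R. The data below are invented and assumed to be sorted by year within each id. Passing row indices to ave() is just one way to do this, not necessarily how the answer's author would write it.

```r
# Hypothetical toy data, sorted by year within id.
df <- data.frame(
  id    = c("1", "1", "1", "1", "2", "2"),
  year  = c(1900, 1900, 1901, 1902, 1900, 1901),
  event = c("a", "b", "a", "c", "a", "b"),
  stringsAsFactors = FALSE
)

# Stage 1: running count of distinct events, restarted for each id.
# ave() receives row indices, so the result is always an integer vector.
df$cum_sum <- ave(seq_along(df$event), df$id,
                  FUN = function(i) cumsum(!duplicated(df$event[i])))

# Stage 2: collapse to one row per id and year, keeping the largest count.
out <- aggregate(cum_sum ~ id + year, data = df, FUN = max)
out <- out[order(out$id, out$year), ]
print(out)
```

For id 1 the counts per year are 2, 2 and 3: event "a" in 1901 was already seen in 1900, so the total does not increase that year.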

Cumulative count of each value

The dplyr way:

library(dplyr)

foo <- data.frame(id=c(1, 2, 3, 2, 2, 1, 2, 3))
foo <- foo %>% group_by(id) %>% mutate(count=row_number())
foo

# A tibble: 8 x 2
# Groups:   id [3]
     id count
  <dbl> <int>
1     1     1
2     2     1
3     3     1
4     2     2
5     2     3
6     1     2
7     2     4
8     3     2

That ends up grouped by id. If you want it not grouped, add %>% ungroup().
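If dplyr is not available, the same running count can be sketched in base R with ave(). This is an alternative to the answer above rather than part of it; the result is an ordinary, ungrouped data frame.

```r
foo <- data.frame(id = c(1, 2, 3, 2, 2, 1, 2, 3))

# Within each id, number the rows 1, 2, 3, ... in order of appearance.
foo$count <- ave(seq_along(foo$id), foo$id, FUN = seq_along)
foo$count   # 1 1 1 2 3 2 4 2
```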

dplyr Running count of unique entries

This seems to give the result you are after:

df %>%
  group_by(subjectID) %>% 
  mutate(
    n_tot = row_number(),
    n_case = cumsum(!duplicated(caseID))
  )

We use duplicated to see if the case ID is new or not, and then use cumsum() to get a running count of new cases.