# How to Count the Number of Unique Values by Group

## Count unique values using pandas groupby

I think you can use `SeriesGroupBy.nunique`:

```python
print(df.groupby('param')['group'].nunique())
param
a    2
b    1
Name: group, dtype: int64
```

Another solution uses `unique`, then creates a new `DataFrame` with `DataFrame.from_records`, reshapes it to a `Series` with `stack`, and finally applies `value_counts`:

```python
a = df[df.param.notnull()].groupby('group')['param'].unique()
print(pd.DataFrame.from_records(a.values.tolist()).stack().value_counts())
a    2
b    1
dtype: int64
```
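As a self-contained sketch of the first approach, assuming invented sample data with the question's `param` and `group` columns:

```python
import pandas as pd

# Hypothetical data: 'param' a appears in two groups, b in one
df = pd.DataFrame({
    'group': ['g1', 'g1', 'g2', 'g2', 'g3'],
    'param': ['a', 'a', 'a', 'b', None],
})

# Distinct 'group' values per 'param'; rows with a null param are excluded
counts = df.groupby('param')['group'].nunique()
print(counts)  # a -> 2, b -> 1
```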

## Add count of unique / distinct values by group to the original data

Using `ave` (since you ask for it specifically):

```r
within(df, {
  count <- ave(type, color, FUN=function(x) length(unique(x)))
})
```

Make sure that `type` is a character vector and not a factor.

Since you also say your data is huge and that speed/performance may therefore be a factor, I'd suggest a `data.table` solution as well.

```r
require(data.table)
setDT(df)[, count := uniqueN(type), by = color]  # v1.9.6+

# if you don't want df to be modified by reference
ans = as.data.table(df)[, count := uniqueN(type), by = color]
```

`uniqueN` was implemented in `v1.9.6` and is a faster equivalent of `length(unique(.))`. In addition, it also works with data.frames/data.tables.

Other solutions:

Using plyr:

```r
require(plyr)
ddply(df, .(color), mutate, count = length(unique(type)))
```

Using `aggregate`:

```r
agg <- aggregate(data=df, type ~ color, function(x) length(unique(x)))
merge(df, agg, by="color", all=TRUE)
```

## Count unique values per groups with Pandas

You need `nunique`:

```python
df = df.groupby('domain')['ID'].nunique()
print(df)
domain
'facebook.com'    1
'google.com'      1
'twitter.com'     2
'vk.com'          3
Name: ID, dtype: int64
```

If you need to `strip` the `'` characters:

```python
df = df.ID.groupby([df.domain.str.strip("'")]).nunique()
print(df)
domain
facebook.com    1
google.com      1
twitter.com     2
vk.com          3
Name: ID, dtype: int64
```

Or as Jon Clements commented:

```python
df.groupby(df.domain.str.strip("'"))['ID'].nunique()
```

You can retain the column name like this:

```python
df = df.groupby(by='domain', as_index=False).agg({'ID': pd.Series.nunique})
print(df)
    domain  ID
0       fb   1
1      ggl   1
2  twitter   2
3       vk   3
```

The difference is that `nunique()` returns a Series and `agg()` returns a DataFrame.
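To make the return-type difference concrete, here is a minimal sketch on invented data:

```python
import pandas as pd

# Invented sample: two distinct IDs for vk.com, one for twitter.com
df = pd.DataFrame({'domain': ['vk.com', 'vk.com', 'twitter.com'],
                   'ID': [1, 2, 2]})

s = df.groupby('domain')['ID'].nunique()  # a pandas Series indexed by domain
frame = df.groupby('domain', as_index=False).agg({'ID': pd.Series.nunique})  # a DataFrame

print(type(s).__name__, type(frame).__name__)  # Series DataFrame
```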

## How to count the number of unique values by group?

I think you've got this all wrong. There is no need for either `plyr` or `<-` when using `data.table`.

Recent versions of data.table, v >= 1.9.6, have a new function `uniqueN()` just for that.

```r
library(data.table) ## >= v1.9.6
setDT(d)[, .(count = uniqueN(color)), by = ID]
#    ID count
# 1:  A     3
# 2:  B     2
```

If you want to create a new column with the counts, use the `:=` operator:

```r
setDT(d)[, count := uniqueN(color), by = ID]
```

Or with `dplyr` use the `n_distinct` function

```r
library(dplyr)
d %>%
  group_by(ID) %>%
  summarise(count = n_distinct(color))
# Source: local data table [2 x 2]
#
#   ID count
# 1  A     3
# 2  B     2
```

Or (if you want a new column) use `mutate` instead of `summarise`:

```r
d %>%
  group_by(ID) %>%
  mutate(count = n_distinct(color))
```

## Count distinct values depending on group

You would use `count(distinct)`:

```sql
select "group", count(distinct id)
from t
group by "group";
```

Note that `group` is a very poor name for a column because it is a SQL keyword. Hopefully the real column name is something more reasonable.

## Selecting COUNT(*) with DISTINCT

Count all the DISTINCT program names by program type and push number

```sql
SELECT COUNT(DISTINCT program_name) AS Count,
       program_type AS [Type]
FROM cm_production
WHERE push_number = @push_number
GROUP BY program_type
```

`DISTINCT COUNT(*)` will return a row for each unique count. What you want is `COUNT(DISTINCT <expression>)`, which evaluates the expression for each row in a group and returns the number of unique, non-null values.
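The same `COUNT(DISTINCT ...)` pattern can be tried with Python's built-in `sqlite3`; the table name matches the answer above, but the rows are invented for illustration:

```python
import sqlite3

# In-memory table with invented rows to illustrate COUNT(DISTINCT ...)
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE cm_production "
             "(program_name TEXT, program_type TEXT, push_number INTEGER)")
conn.executemany("INSERT INTO cm_production VALUES (?, ?, ?)", [
    ('p1', 'batch', 1), ('p1', 'batch', 1),   # duplicate name, counted once
    ('p2', 'batch', 1), ('p3', 'online', 1),
])

rows = conn.execute(
    "SELECT program_type, COUNT(DISTINCT program_name) "
    "FROM cm_production WHERE push_number = ? GROUP BY program_type",
    (1,),
).fetchall()
print(dict(rows))  # {'batch': 2, 'online': 1}
```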

## Count the number of unique values per group

Looks like you want `transform` + `nunique`;

```python
df['a_b_3'] = df.groupby('_a')['_b'].transform('nunique')

df
   _a  _b  a_b_3
0   1   3      3
1   1   4      3
2   1   5      3
3   2   3      1
4   2   3      1
5   3   3      2
6   3   9      2
```

This is effectively `groupby` + `nunique` + `map`:

```python
v = df.groupby('_a')['_b'].nunique()
df['a_b_3'] = df['_a'].map(v)

df
   _a  _b  a_b_3
0   1   3      3
1   1   4      3
2   1   5      3
3   2   3      1
4   2   3      1
5   3   3      2
6   3   9      2
```
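Both routes can be checked side by side on a small frame built from the question's values:

```python
import pandas as pd

df = pd.DataFrame({'_a': [1, 1, 1, 2, 2, 3, 3],
                   '_b': [3, 4, 5, 3, 3, 3, 9]})

# transform broadcasts the per-group count back onto the original rows
df['via_transform'] = df.groupby('_a')['_b'].transform('nunique')

# equivalent: aggregate per group, then map the result onto the group keys
v = df.groupby('_a')['_b'].nunique()
df['via_map'] = df['_a'].map(v)

print(df['via_transform'].tolist())  # [3, 3, 3, 1, 1, 2, 2]
```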

## How to count number of unique groups missing information in a groupby?

To count unique IDs, check where the values are null, then take the `max` within each [ID, method] group to flag whether any value is missing for that [ID, method]. Then sum over `method` to get the number of unique IDs missing something.

```python
(df[['var_1', 'var_2']].isnull()
   .groupby([df['ID'], df['method']]).max()
   .sum(level='method'))
```

```
        var_1  var_2
method
AB          1      0
CD          1      0
BC          1      0
DE          0      1
```
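Note that `.sum(level=...)` has been removed in recent pandas versions; a sketch of the same idea using `groupby(level=...)` instead, on invented data:

```python
import pandas as pd
import numpy as np

# Invented sample: ID 1 is missing var_1, ID 3 is missing var_2
df = pd.DataFrame({
    'ID':     [1, 1, 2, 3],
    'method': ['AB', 'AB', 'AB', 'DE'],
    'var_1':  [np.nan, 1.0, 2.0, 3.0],
    'var_2':  [1.0, 2.0, 3.0, np.nan],
})

out = (df[['var_1', 'var_2']].isnull()
         .groupby([df['ID'], df['method']]).max()   # any null per (ID, method)?
         .groupby(level='method').sum())            # IDs with a missing value, per method
print(out)
```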

## Python group by and count distinct values in a column and create delimited list

You can use `str.len` in your code:

```python
df3 = (df.groupby('company')['product']
         .apply(lambda x: list(x.unique()))
         .reset_index()
         .assign(count=lambda d: d['product'].str.len())  ## added line
      )
```

output:

```
     company            product  count
0     Amazon           [E-comm]      1
1   Facebook     [Social Media]      1
2     Google  [Search, Android]      2
3  Microsoft        [OS, X-box]      2
```
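Put together as a runnable sketch, with a small frame reconstructed from the output above:

```python
import pandas as pd

# Sample reconstructed from the expected output
df = pd.DataFrame({
    'company': ['Google', 'Google', 'Amazon', 'Microsoft', 'Microsoft'],
    'product': ['Search', 'Android', 'E-comm', 'OS', 'X-box'],
})

df3 = (df.groupby('company')['product']
         .apply(lambda x: list(x.unique()))
         .reset_index()
         .assign(count=lambda d: d['product'].str.len()))  # .str.len() gives list lengths
print(df3)
```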

## R - Count unique/distinct values in two columns together per group

You can subset the data with `cur_data()` and `unlist` it to get a vector, then use `n_distinct` to count the number of unique values.

```r
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Count = n_distinct(unlist(select(cur_data(),
                    Party, Party2013)), na.rm = TRUE)) %>%
  ungroup
#     ID  Wave Party Party2013 Count
#  <int> <int> <chr> <chr>     <int>
#1     1     1 A     A             2
#2     1     2 A     NA            2
#3     1     3 B     NA            2
#4     1     4 B     NA            2
#5     2     1 A     C             3
#6     2     2 B     NA            3
#7     2     3 B     NA            3
#8     2     4 B     NA            3
```

**data**

It is easier to help if you provide data in a reproducible format:

```r
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Wave = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
  Party = c("A", "A", "B", "B", "A", "B", "B", "B"),
  Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA)),
  class = "data.frame", row.names = c(NA, -8L))
```