How to Count the Number of Unique Values by Group

Count unique values using pandas groupby

I think you can use SeriesGroupBy.nunique:

print(df.groupby('param')['group'].nunique())
param
a    2
b    1
Name: group, dtype: int64
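
The question's df isn't shown here; a minimal frame that reproduces this output (values invented for illustration) would be:

import pandas as pd

# hypothetical data: param 'a' spans two groups, 'b' only one;
# the None row is dropped automatically by groupby
df = pd.DataFrame({'group': ['g1', 'g2', 'g3', 'g1'],
                   'param': ['a', 'a', 'b', None]})
print(df.groupby('param')['group'].nunique())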

Another solution: take the unique values per group, build a new DataFrame with DataFrame.from_records, reshape it to a Series with stack, and finally apply value_counts:

a = df[df.param.notnull()].groupby('group')['param'].unique()
print(pd.DataFrame.from_records(a.values.tolist()).stack().value_counts())
a    2
b    1
dtype: int64

Add count of unique / distinct values by group to the original data

Using ave (since you ask for it specifically):

within(df, { count <- ave(type, color, FUN=function(x) length(unique(x)))})

Make sure that type is a character vector, not a factor.


Since you also say your data is huge and that speed/performance may therefore be a factor, I'd suggest a data.table solution as well.

require(data.table)
setDT(df)[, count := uniqueN(type), by = color] # v1.9.6+
# if you don't want df to be modified by reference
ans = as.data.table(df)[, count := uniqueN(type), by = color]

uniqueN() was implemented in v1.9.6 and is a faster equivalent of length(unique(.)). It also works on data.frames/data.tables.


Other solutions:

Using plyr:

require(plyr)
ddply(df, .(color), mutate, count = length(unique(type)))

Using aggregate:

agg <- aggregate(data=df, type ~ color, function(x) length(unique(x)))
merge(df, agg, by="color", all=TRUE)

Count unique values per groups with Pandas

You need nunique:

df = df.groupby('domain')['ID'].nunique()

print(df)
domain
'facebook.com'    1
'google.com'      1
'twitter.com'     2
'vk.com'          3
Name: ID, dtype: int64

If you need to strip ' characters:

df = df.ID.groupby([df.domain.str.strip("'")]).nunique()
print(df)
domain
facebook.com    1
google.com      1
twitter.com     2
vk.com          3
Name: ID, dtype: int64

Or as Jon Clements commented:

df.groupby(df.domain.str.strip("'"))['ID'].nunique()

You can retain the column name like this:

df = df.groupby(by='domain', as_index=False).agg({'ID': pd.Series.nunique})
print(df)
    domain  ID
0       fb   1
1      ggl   1
2  twitter   2
3       vk   3

The difference is that nunique() returns a Series, while agg() returns a DataFrame.
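
A quick sketch (with made-up data) showing the two return types, and that reset_index() turns the Series result into the same DataFrame:

import pandas as pd

df = pd.DataFrame({'domain': ['fb', 'fb', 'vk'], 'ID': [1, 2, 3]})

s = df.groupby('domain')['ID'].nunique()                    # Series, indexed by domain
out = df.groupby('domain', as_index=False).agg({'ID': pd.Series.nunique})  # DataFrame

print(type(s).__name__, type(out).__name__)  # Series DataFrame
print(s.reset_index().equals(out))           # True: same table either way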

How to count the number of unique values by group?

I think you've got it all wrong here. There is no need for either plyr or <- when using data.table.

Recent versions of data.table, v >= 1.9.6, have a new function uniqueN() just for that.

library(data.table) ## >= v1.9.6
setDT(d)[, .(count = uniqueN(color)), by = ID]
#    ID count
# 1:  A     3
# 2:  B     2

If you want to create a new column with the counts, use the := operator

setDT(d)[, count := uniqueN(color), by = ID]

Or with dplyr use the n_distinct function

library(dplyr)
d %>%
  group_by(ID) %>%
  summarise(count = n_distinct(color))
# Source: local data table [2 x 2]
#
#   ID count
# 1  A     3
# 2  B     2

Or (if you want a new column) use mutate instead of summarise

d %>%
  group_by(ID) %>%
  mutate(count = n_distinct(color))

Count distinct values depending on group

You would use count(distinct):

select "group", count(distinct id)
from t
group by "group";

Note that group is a very poor name for a column because it is a SQL keyword. Hopefully the real column name is something more reasonable.

Selecting COUNT(*) with DISTINCT

Count all the DISTINCT program names by program type and push number

SELECT COUNT(DISTINCT program_name) AS Count,
program_type AS [Type]
FROM cm_production
WHERE push_number=@push_number
GROUP BY program_type

DISTINCT COUNT(*) will return a row for each unique count. What you want is COUNT(DISTINCT <expression>), which evaluates the expression for each row in a group and returns the number of unique, non-null values.
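
To see the difference concretely, here is a self-contained sketch using Python's sqlite3 with invented rows (the table and column names follow the query above):

import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE cm_production (program_name TEXT, program_type TEXT, push_number INT)")
con.executemany("INSERT INTO cm_production VALUES (?, ?, ?)",
                [('p1', 'A', 1), ('p1', 'A', 1), ('p2', 'A', 1), ('p3', 'B', 1)])

# the duplicate 'p1' rows collapse to one: COUNT(DISTINCT ...) counts unique names
for row in con.execute("""SELECT program_type, COUNT(DISTINCT program_name)
                          FROM cm_production
                          WHERE push_number = 1
                          GROUP BY program_type"""):
    print(row)  # ('A', 2) and ('B', 1)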

Count the number of unique values per group

Looks like you want transform with nunique:

df['a_b_3'] = df.groupby('_a')['_b'].transform('nunique')
df
   _a  _b  a_b_3
0   1   3      3
1   1   4      3
2   1   5      3
3   2   3      1
4   2   3      1
5   3   3      2
6   3   9      2

This is effectively groupby + nunique + map:

v = df.groupby('_a')['_b'].nunique()
df['a_b_3'] = df['_a'].map(v)

df
   _a  _b  a_b_3
0   1   3      3
1   1   4      3
2   1   5      3
3   2   3      1
4   2   3      1
5   3   3      2
6   3   9      2

How to count number of unique groups missing information in a groupby?

To count unique IDs, check where values are null, then take the max within each [ID, method] group to flag whether anything is missing for that pair. Then sum over method to get the number of unique IDs missing something.

(df[['var_1', 'var_2']].isnull()
   .groupby([df['ID'], df['method']]).max()
   .sum(level='method'))

        var_1  var_2
method
AB          1      0
CD          1      0
BC          1      0
DE          0      1
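
A runnable sketch of the same pipeline on invented data; note that DataFrame.sum(level=...) was removed in pandas 2.0, so the final step is written with groupby(level=...) here:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID':     ['x1', 'x1', 'x2', 'x3'],
    'method': ['AB', 'AB', 'AB', 'DE'],
    'var_1':  [np.nan, 1.0, 2.0, 3.0],
    'var_2':  [5.0, 6.0, 7.0, np.nan],
})

# one row per (ID, method): True if that ID is ever missing the variable
flags = df[['var_1', 'var_2']].isnull().groupby([df['ID'], df['method']]).max()
# number of unique IDs missing something, per method
print(flags.groupby(level='method').sum())
#         var_1  var_2
# method
# AB          1      0
# DE          0      1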

Python group by and count distinct values in a column and create delimited list

You can use str.len in your code:

df3 = (df.groupby('company')['product']
         .apply(lambda x: list(x.unique()))
         .reset_index()
         .assign(count=lambda d: d['product'].str.len())  ## added line
      )

output:

     company            product  count
0     Amazon           [E-comm]      1
1   Facebook     [Social Media]      1
2     Google  [Search, Android]      2
3  Microsoft        [OS, X-box]      2
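
As an alternative, named aggregation can compute the count directly instead of measuring the list afterwards; assuming product has no missing values, 'nunique' matches the length of the unique list:

df3 = df.groupby('company', as_index=False).agg(
    product=('product', lambda s: list(s.unique())),  # delimited list of uniques
    count=('product', 'nunique'),                     # distinct count in one pass
)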

R - Count unique/distinct values in two columns together per group

You can subset the data from cur_data() and unlist it to get a vector, then use n_distinct() to count the number of unique values.

library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(Count = n_distinct(unlist(select(cur_data(), Party, Party2013)),
                            na.rm = TRUE)) %>%
  ungroup


#      ID  Wave Party Party2013 Count
#   <int> <int> <chr> <chr>     <int>
# 1     1     1 A     A             2
# 2     1     2 A     NA            2
# 3     1     3 B     NA            2
# 4     1     4 B     NA            2
# 5     2     1 A     C             3
# 6     2     2 B     NA            3
# 7     2     3 B     NA            3
# 8     2     4 B     NA            3

data

It is easier to help if you provide data in a reproducible format:

df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Wave = c(1L, 
2L, 3L, 4L, 1L, 2L, 3L, 4L), Party = c("A", "A", "B", "B", "A",
"B", "B", "B"), Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA
)), class = "data.frame", row.names = c(NA, -8L))
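
For comparison, a pandas sketch of the same idea: reshape both columns into one long column, then count distinct non-null values per ID (the frame mirrors the R data above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 2],
    'Party': ['A', 'A', 'B', 'B', 'A', 'B', 'B', 'B'],
    'Party2013': ['A', np.nan, np.nan, np.nan, 'C', np.nan, np.nan, np.nan],
})

# melt both columns into one; nunique() ignores NaN by default
counts = (df.melt(id_vars='ID', value_vars=['Party', 'Party2013'])
            .groupby('ID')['value'].nunique())
df['Count'] = df['ID'].map(counts)  # Count: 2 for ID 1, 3 for ID 2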

