Count unique values using pandas groupby
I think you can use SeriesGroupBy.nunique:
print (df.groupby('param')['group'].nunique())
param
a 2
b 1
Name: group, dtype: int64
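For a self-contained illustration of the call above, here is a minimal sketch. The sample frame is hypothetical (the question's data is not shown here); I just picked values that give the same counts as the printed output.

```python
import pandas as pd

# Hypothetical sample data; the original question's frame is not reproduced here.
df = pd.DataFrame({
    "param": ["a", "a", "b", None],
    "group": [1, 2, 1, 3],
})

# Count distinct 'group' values within each 'param'.
# Rows whose 'param' is missing are dropped by groupby by default.
unique_groups = df.groupby("param")["group"].nunique()
print(unique_groups)
```

Note that nunique ignores missing values inside each group by default; pass dropna=False if you want NaN counted as its own value.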
Another solution uses unique, then creates a new DataFrame with DataFrame.from_records, reshapes it to a Series with stack, and finally applies value_counts:
a = df[df.param.notnull()].groupby('group')['param'].unique()
print (pd.DataFrame.from_records(a.values.tolist()).stack().value_counts())
a 2
b 1
dtype: int64
Count unique values per groups with Pandas
You need nunique:
df = df.groupby('domain')['ID'].nunique()
print (df)
domain
'facebook.com' 1
'google.com' 1
'twitter.com' 2
'vk.com' 3
Name: ID, dtype: int64
If you need to strip the ' characters:
df = df.ID.groupby([df.domain.str.strip("'")]).nunique()
print (df)
domain
facebook.com 1
google.com 1
twitter.com 2
vk.com 3
Name: ID, dtype: int64
Or as Jon Clements commented:
df.groupby(df.domain.str.strip("'"))['ID'].nunique()
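To see the quote stripping in isolation, here is a small sketch. The IDs and domains are made up for illustration; only the column names ID and domain come from the answer above.

```python
import pandas as pd

# Hypothetical data with stray single quotes around the domain names.
df = pd.DataFrame({
    "ID": [123, 123, 456, 789, 789, 456],
    "domain": ["'vk.com'", "'vk.com'", "'twitter.com'",
               "'vk.com'", "'twitter.com'", "'vk.com'"],
})

# Group by the cleaned-up domain (a Series, not a column name),
# then count distinct IDs per domain.
stripped = df["domain"].str.strip("'")
per_domain = df.groupby(stripped)["ID"].nunique()
print(per_domain)
```

Grouping by a Series this way leaves the original domain column untouched, which is handy if you only need the clean labels in the result.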
You can retain the column name like this:
df = df.groupby(by='domain', as_index=False).agg({'ID': pd.Series.nunique})
print(df)
domain ID
0 fb 1
1 ggl 1
2 twitter 2
3 vk 3
The difference is that nunique() returns a Series and agg() returns a DataFrame.
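A quick way to check that difference yourself is sketched below, again on made-up data. I pass the string "nunique" to agg here, which I believe is equivalent to pd.Series.nunique for this purpose.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "domain": ["vk.com", "vk.com", "twitter.com", "vk.com", "twitter.com"],
    "ID": [123, 456, 789, 321, 789],
})

# Selecting the column and calling nunique() gives a Series indexed by domain.
as_series = df.groupby("domain")["ID"].nunique()

# as_index=False plus agg() keeps 'domain' as a regular column in a DataFrame.
as_frame = df.groupby("domain", as_index=False).agg({"ID": "nunique"})

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```

If you already have the Series, calling reset_index() on it should get you to a similar two-column frame.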
Add count of unique / distinct values by group to the original data
Using ave (since you ask for it specifically):
within(df, { count <- ave(type, color, FUN=function(x) length(unique(x)))})
Make sure that type is a character vector and not a factor.
Since you also say your data is huge and that speed/performance may therefore be a factor, I'd suggest a data.table solution as well.
require(data.table)
setDT(df)[, count := uniqueN(type), by = color] # v1.9.6+
# if you don't want df to be modified by reference
ans = as.data.table(df)[, count := uniqueN(type), by = color]
uniqueN was implemented in v1.9.6 and is a faster equivalent of length(unique(.)). In addition, it also works with data.frames/data.tables.
Other solutions:
Using plyr:
require(plyr)
ddply(df, .(color), mutate, count = length(unique(type)))
Using aggregate:
agg <- aggregate(data=df, type ~ color, function(x) length(unique(x)))
merge(df, agg, by="color", all=TRUE)
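The answers above are all in R. If you are doing the same thing in pandas, a rough analogue (my assumption, not something from the original answers) is groupby(...).transform("nunique"), which broadcasts the per-group distinct count back onto every row. The column names color and type mirror the R example; the data is made up.

```python
import pandas as pd

# Hypothetical data using the same column names as the R example.
df = pd.DataFrame({
    "color": ["red", "red", "red", "blue", "blue"],
    "type": ["x", "y", "x", "z", "z"],
})

# transform returns a result aligned with the original rows,
# so it can be assigned straight back as a new column.
df["count"] = df.groupby("color")["type"].transform("nunique")
print(df)
```

Unlike the aggregate-then-merge route, this keeps the original row order and needs no join.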
Python group by and count distinct values in a column and create delimited list
You can use str.len in your code:
df3 = (df.groupby('company')['product']
         .apply(lambda x: list(x.unique()))
         .reset_index()
         .assign(count=lambda d: d['product'].str.len())  ## added line
      )
output:
company product count
0 Amazon [E-comm] 1
1 Facebook [Social Media] 1
2 Google [Search, Android] 2
3 Microsoft [OS, X-box] 2
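Here is a runnable version of the same idea. The input rows are my guess at data that would produce the output above (the question's frame is not shown), and the product_str column is an extra I added to illustrate the "delimited list" part of the title.

```python
import pandas as pd

# Hypothetical input, chosen to match the output shown above.
df = pd.DataFrame({
    "company": ["Amazon", "Facebook", "Google", "Google", "Google",
                "Microsoft", "Microsoft"],
    "product": ["E-comm", "Social Media", "Search", "Android", "Search",
                "OS", "X-box"],
})

df3 = (
    df.groupby("company")["product"]
      .apply(lambda x: list(x.unique()))                # distinct products as a list
      .reset_index()
      .assign(
          count=lambda d: d["product"].str.len(),        # length of each list
          product_str=lambda d: d["product"].str.join(", "),  # delimited string
      )
)
print(df3)
```

str.len and str.join work here because each cell of the product column holds a Python list.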
Counting unique / distinct values by group in a data frame
This should do the trick:
ddply(myvec,~name,summarise,number_of_distinct_orders=length(unique(order_no)))
This requires the plyr package.
R group_by and count distinct values in dataframe column with condition, using mutate
Since c is unique, you can approach it from the other way: count the number of c values that show up in val.
df %>%
group_by(id) %>%
mutate(distinctValues = sum(c %in% val))
# # A tibble: 14 x 3
# # Groups: id [6]
# id val distinctValues
# <dbl> <dbl> <int>
# 1 1 100 0
# 2 1 100 0
# 3 2 200 1
# 4 2 300 1
# 5 3 400 0
# 6 4 500 1
# 7 4 500 1
# 8 5 500 1
# 9 5 600 1
# 10 5 600 1
# 11 6 200 2
# 12 6 200 2
# 13 6 300 2
# 14 6 500 2
You could also use distinctValues = sum(unique(val) %in% c) if that seems clearer. It might be a tad less efficient, but not enough to matter unless your data is massive.
How to count the number of unique values by group?
I think you've got it all wrong here. There is no need for either plyr or <- when using data.table.
Recent versions of data.table (v >= 1.9.6) have a new function, uniqueN(), just for that.
library(data.table) ## >= v1.9.6
setDT(d)[, .(count = uniqueN(color)), by = ID]
# ID count
# 1: A 3
# 2: B 2
If you want to create a new column with the counts, use the := operator:
setDT(d)[, count := uniqueN(color), by = ID]
Or with dplyr, use the n_distinct function:
library(dplyr)
d %>%
group_by(ID) %>%
summarise(count = n_distinct(color))
# Source: local data table [2 x 2]
#
# ID count
# 1 A 3
# 2 B 2
Or (if you want a new column) use mutate instead of summarise:
d %>%
group_by(ID) %>%
mutate(count = n_distinct(color))
R - Count unique/distinct values in two columns together per group
You can subset the data from cur_data() and unlist the data to get a vector. Use n_distinct to count the number of unique values.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Count = n_distinct(unlist(select(cur_data(),
Party, Party2013)), na.rm = TRUE)) %>%
ungroup
# ID Wave Party Party2013 Count
# <int> <int> <chr> <chr> <int>
#1 1 1 A A 2
#2 1 2 A NA 2
#3 1 3 B NA 2
#4 1 4 B NA 2
#5 2 1 A C 3
#6 2 2 B NA 3
#7 2 3 B NA 3
#8 2 4 B NA 3
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Wave = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), Party = c("A", "A", "B", "B", "A",
"B", "B", "B"), Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA
)), class = "data.frame", row.names = c(NA, -8L))
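For comparison, here is how I would sketch the same count in pandas, using the same eight rows as the data above. This is an analogue of my own, not part of the original answer: it melts the two party columns into one long column so that nunique sees them together.

```python
import pandas as pd

# Same rows as the R data above; None stands in for NA.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "Wave": [1, 2, 3, 4, 1, 2, 3, 4],
    "Party": ["A", "A", "B", "B", "A", "B", "B", "B"],
    "Party2013": ["A", None, None, None, "C", None, None, None],
})

# Stack both party columns into a single 'value' column per ID.
long_form = df.melt(id_vars="ID", value_vars=["Party", "Party2013"])

# nunique skips missing values by default, like na.rm = TRUE in the R code.
per_id = long_form.groupby("ID")["value"].nunique()

# Map the per-ID count back onto the original rows.
df["Count"] = df["ID"].map(per_id)
print(df)
```

The counts come out as 2 for ID 1 and 3 for ID 2, matching the R output.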
pandas dataframe group by multiple columns and count distinct values
Combine value_counts with apply to do it per column:
df.apply(pd.Series.value_counts)
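To make the shape of that result concrete, here is a small sketch on a made-up two-column frame. It passes pd.Series.value_counts as the function, which is the method form of the same operation. The last line is my own addition for the case where you only want the number of distinct values per column rather than a tally of each value.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "c1": ["a", "b", "a"],
    "c2": ["a", "a", "c"],
})

# One value_counts per column; rows are the union of all values seen,
# and a value absent from a column shows up as NaN there.
per_column = df.apply(pd.Series.value_counts)
print(per_column.fillna(0))

# If only the number of distinct values per column is needed:
distinct_per_column = df.nunique()
print(distinct_per_column)
```

Because missing combinations become NaN, the counts are floats; fillna(0).astype(int) is one way to get whole numbers back.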