Count unique values using pandas groupby
I think you can use SeriesGroupBy.nunique:
print(df.groupby('param')['group'].nunique())
param
a 2
b 1
Name: group, dtype: int64
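The original frame isn't shown, so here is a minimal, self-contained sketch with assumed values that reproduces the output above:

```python
import pandas as pd

# Hypothetical data: 'param' groups with overlapping 'group' labels
df = pd.DataFrame({'param': ['a', 'a', 'a', 'b', 'b'],
                   'group': ['x', 'y', 'x', 'z', 'z']})

# nunique counts the distinct 'group' values within each 'param'
counts = df.groupby('param')['group'].nunique()
print(counts)
# param
# a    2
# b    1
# Name: group, dtype: int64
```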
Another solution: get the unique values with unique, create a new DataFrame with DataFrame.from_records, reshape it to a Series with stack, and finally apply value_counts:
a = df[df.param.notnull()].groupby('group')['param'].unique()
print(pd.DataFrame.from_records(a.values.tolist()).stack().value_counts())
a 2
b 1
dtype: int64
Add count of unique / distinct values by group to the original data
Using ave (since you ask for it specifically):
within(df, {count <- ave(type, color, FUN = function(x) length(unique(x)))})
Make sure that type is a character vector and not a factor.
Since you also say your data is huge and that speed/performance may therefore be a factor, I'd suggest a data.table solution as well.
require(data.table)
setDT(df)[, count := uniqueN(type), by = color] # v1.9.6+
# if you don't want df to be modified by reference
ans = as.data.table(df)[, count := uniqueN(type), by = color]
uniqueN was implemented in v1.9.6 and is a faster equivalent of length(unique(.)). It also works with data.frames/data.tables.
Other solutions:
Using plyr:
require(plyr)
ddply(df, .(color), mutate, count = length(unique(type)))
Using aggregate:
agg <- aggregate(type ~ color, data = df, FUN = function(x) length(unique(x)))
merge(df, agg, by = "color", all = TRUE)
Count unique values per groups with Pandas
You need nunique:
df = df.groupby('domain')['ID'].nunique()
print(df)
domain
'facebook.com' 1
'google.com' 1
'twitter.com' 2
'vk.com' 3
Name: ID, dtype: int64
If you need to strip the ' characters:
df = df.ID.groupby([df.domain.str.strip("'")]).nunique()
print(df)
domain
facebook.com 1
google.com 1
twitter.com 2
vk.com 3
Name: ID, dtype: int64
Or as Jon Clements commented:
df.groupby(df.domain.str.strip("'"))['ID'].nunique()
You can retain the column name like this:
df = df.groupby(by='domain', as_index=False).agg({'ID': pd.Series.nunique})
print(df)
domain ID
0 fb 1
1 ggl 1
2 twitter 2
3 vk 3
The difference is that nunique() returns a Series and agg() returns a DataFrame.
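A small sketch of that difference, using assumed toy data rather than the original frame:

```python
import pandas as pd

df = pd.DataFrame({'domain': ['fb', 'fb', 'vk'],
                   'ID': [1, 2, 1]})

# nunique() on the grouped column returns a Series, with 'domain' in the index
s = df.groupby('domain')['ID'].nunique()

# agg() with as_index=False returns a DataFrame and keeps 'domain' as a column
f = df.groupby('domain', as_index=False).agg({'ID': pd.Series.nunique})

print(type(s).__name__, type(f).__name__)  # Series DataFrame
```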
How to count the number of unique values by group?
I think you've got it all wrong here. There is no need for either plyr or <- when using data.table.
Recent versions of data.table, v >= 1.9.6, have a new function uniqueN() just for that.
library(data.table) ## >= v1.9.6
setDT(d)[, .(count = uniqueN(color)), by = ID]
# ID count
# 1: A 3
# 2: B 2
If you want to create a new column with the counts, use the := operator:
setDT(d)[, count := uniqueN(color), by = ID]
Or with dplyr, use the n_distinct function:
library(dplyr)
d %>%
group_by(ID) %>%
summarise(count = n_distinct(color))
# Source: local data table [2 x 2]
#
# ID count
# 1 A 3
# 2 B 2
Or (if you want a new column) use mutate instead of summarise:
d %>%
group_by(ID) %>%
mutate(count = n_distinct(color))
Count distinct values depending on group
You would use count(distinct):
select "group", count(distinct id)
from t
group by "group";
Note that group
is a very poor name for a column because it is a SQL keyword. Hopefully the real column name is something more reasonable.
Selecting COUNT(*) with DISTINCT
Count all the DISTINCT program names by program type and push number
SELECT COUNT(DISTINCT program_name) AS Count,
program_type AS [Type]
FROM cm_production
WHERE push_number=@push_number
GROUP BY program_type
DISTINCT COUNT(*) will return a row for each unique count. What you want is COUNT(DISTINCT <expression>): it evaluates the expression for each row in a group and returns the number of unique, non-null values.
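The behavior is easy to check with an in-memory SQLite database (the table and values here are assumptions, not the original cm_production data):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE t (grp TEXT, id INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [('a', 1), ('a', 1), ('a', 2), ('b', 3)])

# COUNT(DISTINCT id) counts the unique non-null ids within each group
rows = con.execute(
    "SELECT grp, COUNT(DISTINCT id) FROM t GROUP BY grp ORDER BY grp"
).fetchall()
print(rows)  # [('a', 2), ('b', 1)]
```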
Count the number of unique values per group
Looks like you want transform + nunique:
df['a_b_3'] = df.groupby('_a')['_b'].transform('nunique')
df
_a _b a_b_3
0 1 3 3
1 1 4 3
2 1 5 3
3 2 3 1
4 2 3 1
5 3 3 2
6 3 9 2
This is effectively groupby + nunique + map:
v = df.groupby('_a')['_b'].nunique()
df['a_b_3'] = df['_a'].map(v)
df
_a _b a_b_3
0 1 3 3
1 1 4 3
2 1 5 3
3 2 3 1
4 2 3 1
5 3 3 2
6 3 9 2
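The two routes above can be checked against each other on the same toy data (values taken from the answer's output):

```python
import pandas as pd

df = pd.DataFrame({'_a': [1, 1, 1, 2, 2, 3, 3],
                   '_b': [3, 4, 5, 3, 3, 3, 9]})

# transform broadcasts the per-group count straight back to the rows
via_transform = df.groupby('_a')['_b'].transform('nunique')

# the groupby + map route computes the counts first, then looks them up per row
via_map = df['_a'].map(df.groupby('_a')['_b'].nunique())

print(via_transform.tolist())  # [3, 3, 3, 1, 1, 2, 2]
```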
How to count number of unique groups missing information in a groupby?
To count unique IDs, check where values are null, then take the max within each [ID, method] group to flag any missing value in that group. Then sum over method to get the number of unique IDs missing something.
(df[['var_1', 'var_2']].isnull()
 .groupby([df['ID'], df['method']]).max()
 .sum(level='method'))
var_1 var_2
method
AB 1 0
CD 1 0
BC 1 0
DE 0 1
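A reproducible sketch of the same idea, on assumed data since the original frame isn't shown. Note that `.sum(level=...)` was removed in pandas 2.0, so this version uses the equivalent `.groupby(level=...).sum()`:

```python
import numpy as np
import pandas as pd

# Hypothetical data: each row is one [ID, method] observation
df = pd.DataFrame({'ID': [1, 1, 2, 3],
                   'method': ['AB', 'AB', 'AB', 'DE'],
                   'var_1': [np.nan, 1.0, 2.0, 3.0],
                   'var_2': [1.0, 2.0, 3.0, np.nan]})

# max over [ID, method] flags IDs with any missing value;
# summing over method then counts the flagged IDs per method
out = (df[['var_1', 'var_2']].isnull()
         .groupby([df['ID'], df['method']]).max()
         .groupby(level='method').sum())
print(out)
```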
Python group by and count distinct values in a column and create delimited list
You can use str.len in your code:
df3 = (df.groupby('company')['product']
.apply(lambda x: list(x.unique()))
.reset_index()
.assign(count=lambda d: d['product'].str.len()) ## added line
)
output:
company product count
0 Amazon [E-comm] 1
1 Facebook [Social Media] 1
2 Google [Search, Android] 2
3 Microsoft [OS, X-box] 2
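str.len works because the grouped column holds Python lists, and the .str accessor returns the length of each list. A minimal sketch with assumed data:

```python
import pandas as pd

# Toy frame mirroring the answer's output
df = pd.DataFrame({'company': ['Google', 'Google', 'Amazon'],
                   'product': ['Search', 'Android', 'E-comm']})

df3 = (df.groupby('company')['product']
         .apply(lambda x: list(x.unique()))   # delimited list per company
         .reset_index()
         .assign(count=lambda d: d['product'].str.len()))  # list length per row
print(df3)
```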
R - Count unique/distinct values in two columns together per group
You can subset the data with cur_data() and unlist it to get a vector, then use n_distinct to count the number of unique values.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Count = n_distinct(unlist(select(cur_data(),
Party, Party2013)), na.rm = TRUE)) %>%
ungroup
# ID Wave Party Party2013 Count
# <int> <int> <chr> <chr> <int>
#1 1 1 A A 2
#2 1 2 A NA 2
#3 1 3 B NA 2
#4 1 4 B NA 2
#5 2 1 A C 3
#6 2 2 B NA 3
#7 2 3 B NA 3
#8 2 4 B NA 3
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Wave = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), Party = c("A", "A", "B", "B", "A",
"B", "B", "B"), Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA
)), class = "data.frame", row.names = c(NA, -8L))