Compare each element in groupby() group to the unique values in that group and get the location of equality
Use factorize inside GroupBy.transform:
df['order1']=df.groupby(['subject'])['date'].transform(lambda x: pd.factorize(x)[0]) + 1
print (df)
subject date order order1
0 A 01.01.2020 1 1
1 A 01.01.2020 1 1
2 A 02.01.2020 2 2
3 B 01.01.2020 1 1
4 B 02.01.2020 2 2
5 B 02.01.2020 2 2
Or you can use GroupBy.rank, but it is necessary to convert the date column to datetimes first:
df['order2']=df.groupby(['subject'])['date'].rank(method='dense')
print (df)
subject date order order1
0 A 2020-01-01 1 1.0
1 A 2020-01-01 1 1.0
2 A 2020-02-01 2 2.0
3 B 2020-01-01 1 1.0
4 B 2020-02-01 2 2.0
5 B 2020-02-01 2 2.0
The two solutions differ when the datetimes are out of order:
print (df)
subject date order (disregarding temporal order of date)
0 A 2020-01-01 1
1 A 2020-03-01 2 <- changed datetime for sample
2 A 2020-02-01 3
3 B 2020-01-01 1
4 B 2020-02-01 2
5 B 2020-02-01 2
df['order1']=df.groupby(['subject'])['date'].transform(lambda x: pd.factorize(x)[0]) + 1
df['order2']=df.groupby(['subject'])['date'].rank(method='dense')
print (df)
subject date order order1 order2
0 A 2020-01-01 1 1 1.0
1 A 2020-03-01 1 2 3.0
2 A 2020-02-01 2 3 2.0
3 B 2020-01-01 1 1 1.0
4 B 2020-02-01 2 2 2.0
5 B 2020-02-01 2 2 2.0
In summary: use the first method (factorize) if you don't care about the temporal order of date being reflected in the order output, or the second method (rank) if the temporal order matters and should be reflected in the order output.
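A self-contained sketch of both approaches, with the sample frame reconstructed from the tables above:

```python
import pandas as pd

# Sample data reconstructed from the tables above
df = pd.DataFrame({
    'subject': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': ['01.01.2020', '01.01.2020', '02.01.2020',
             '01.01.2020', '02.01.2020', '02.01.2020'],
})

# factorize numbers the groups by order of first appearance
df['order1'] = df.groupby('subject')['date'].transform(
    lambda x: pd.factorize(x)[0]) + 1

# rank needs real datetimes so the ordering is temporal, not lexical
df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y')
df['order2'] = df.groupby('subject')['date'].rank(method='dense')

print(df)
```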
How to get unique values from multiple columns in a pandas groupby
You can do it with apply:
import numpy as np
g = df.groupby('c')[['l1','l2']].apply(lambda x: list(np.unique(x)))
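A runnable sketch with a small hypothetical frame (only the column names c, l1, l2 come from the snippet above; the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame; only the column names come from the snippet above
df = pd.DataFrame({
    'c':  ['x', 'x', 'y'],
    'l1': [1, 2, 1],
    'l2': [2, 3, 4],
})

# np.unique flattens both columns of each group and deduplicates (sorted)
g = df.groupby('c')[['l1', 'l2']].apply(lambda x: list(np.unique(x)))
print(g)
```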
Count unique values using pandas groupby
I think you can use SeriesGroupBy.nunique:
print (df.groupby('param')['group'].nunique())
param
a 2
b 1
Name: group, dtype: int64
Another solution: take the unique values per group with unique, build a new DataFrame with DataFrame.from_records, reshape it to a Series with stack, and finally count with value_counts:
a = df[df.param.notnull()].groupby('group')['param'].unique()
print (pd.DataFrame.from_records(a.values.tolist()).stack().value_counts())
a 2
b 1
dtype: int64
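The nunique approach can be reproduced on a minimal frame reconstructed from the output above ('a' appears in two groups, 'b' in only one):

```python
import pandas as pd

# Reconstructed sample: param 'a' occurs in two groups, 'b' in one
df = pd.DataFrame({
    'group': ['g1', 'g1', 'g2'],
    'param': ['a', 'b', 'a'],
})

# For each param, count how many distinct groups it appears in
counts = df.groupby('param')['group'].nunique()
print(counts)
```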
How do i get only the new unique values per group?
IIUC, you can use
(~df['user'].duplicated()).groupby(df['Month']).sum()
Demo:
>>> df
Month user
0 2 Michael
1 2 Michael
2 3 Lea
3 3 Michael
>>> (~df['user'].duplicated()).groupby(df['Month']).sum()
Month
2 1
3 1
I'm assuming that the 'Month'
column is sorted, otherwise the duplicated
trick won't work.
edit: your exact output can be produced with
(~df['user'].duplicated()).groupby(df['Month']).sum().reset_index().rename({'user': 'Unique_Count_New_Users'}, axis=1)
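Putting the pieces together with the frame from the demo above:

```python
import pandas as pd

# Frame from the demo above; 'Month' must be sorted for this trick to work
df = pd.DataFrame({'Month': [2, 2, 3, 3],
                   'user': ['Michael', 'Michael', 'Lea', 'Michael']})

# True only the first time a user is ever seen; summing per month
# counts the users that are new in that month
new_users = (~df['user'].duplicated()).groupby(df['Month']).sum()
print(new_users)

# Same result reshaped to match the desired output column name
out = (new_users.reset_index()
                .rename({'user': 'Unique_Count_New_Users'}, axis=1))
print(out)
```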
Group values by unique elements
First of all, (I assume) this is your vector
a <- c("A110","A110","A110","B220","B220","C330","D440","D440","D440","D440","D440","D440","E550")
As for possible solutions, here are a few (can't find a good dupe right now):
as.integer(factor(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5
Or
cumsum(!duplicated(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5
Or
match(a, unique(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5
rle will also work similarly in your specific scenario:
with(rle(a), rep(seq_along(values), lengths))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5
Or (which is practically the same)
data.table::rleid(a)
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5
Be advised, though, that each of the four solutions behaves differently in certain scenarios; consider the following vector:
a <- c("B110","B110","B110","A220","A220","C330","D440","D440","B110","B110","E550")
And the results of the 4 different solutions:
1.
as.integer(factor(a))
# [1] 2 2 2 1 1 3 4 4 2 2 5
The factor solution begins with 2 because a is unsorted: factor sorts its levels alphabetically, so the first values can get a higher integer representation. Hence, this solution is only valid if your vector is sorted; don't use it otherwise.
2.
cumsum(!duplicated(a))
# [1] 1 1 1 2 2 3 4 4 4 4 5
This cumsum/duplicated solution is thrown off because "B110" was already present at the beginning, so it grouped "D440","D440","B110","B110" into the same group.
3.
match(a, unique(a))
# [1] 1 1 1 2 2 3 4 4 1 1 5
This match/unique solution put ones at the end because it is sensitive to "B110" showing up in more than one run (because of unique), grouping all of its occurrences into the same group regardless of where they appear.
4.
with(rle(a), rep(seq_along(values), lengths))
# [1] 1 1 1 2 2 3 4 4 5 5 6
This solution only cares about runs, so separate runs of "B110" were grouped into different groups.
GroupBy and count the unique elements in a List
var list = new List<string> { "Foo1", "Foo2", "Foo3", "Foo2", "Foo3", "Foo3", "Foo1", "Foo1" };
var grouped = list
.GroupBy(s => s)
.Select(group => new { Word = group.Key, Count = group.Count() });
how to group rows in an unique row for unique column values?
We could group by 'a' and 'c', then summarise the unique elements of 'b' into a single string:
library(dplyr)
df %>%
group_by(a, c) %>%
summarise(b = sprintf('[%s]', toString(unique(b))), .groups = 'drop') %>%
select(names(df))
-output
# A tibble: 3 x 3
# a b c
# <chr> <chr> <dbl>
#1 A1 [a, b, c] 1
#2 A2 [d, e] 1
#3 A3 [f] 1
Or if the 'c' values are also changing, use across
df %>%
group_by(a) %>%
summarise(across(everything(), ~ sprintf('[%s]',
toString(unique(.)))), .groups = 'drop')
Or if we need a list
df %>%
group_by(a) %>%
summarise(across(everything(), ~ list(unique(.))
), .groups = 'drop')
Or using glue
df %>%
group_by(a, c) %>%
summarise(b = glue::glue('[{toString(unique(b))}]'), .groups = 'drop')
-output
# A tibble: 3 x 3
# a c b
#* <chr> <dbl> <glue>
#1 A1 1 [a, b, c]
#2 A2 1 [d, e]
#3 A3 1 [f]
Python group by and count distinct values in a column and create delimited list
You can use str.len
in your code:
df3 = (df.groupby('company')['product']
.apply(lambda x: list(x.unique()))
.reset_index()
.assign(count=lambda d: d['product'].str.len()) ## added line
)
output:
company product count
0 Amazon [E-comm] 1
1 Facebook [Social Media] 1
2 Google [Search, Android] 2
3 Microsoft [OS, X-box] 2
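A runnable version, with the input frame reconstructed from the output table above:

```python
import pandas as pd

# Input reconstructed from the output table above
df = pd.DataFrame({
    'company': ['Amazon', 'Facebook', 'Google', 'Google',
                'Microsoft', 'Microsoft'],
    'product': ['E-comm', 'Social Media', 'Search', 'Android',
                'OS', 'X-box'],
})

# Collect unique products per company, then .str.len() counts the
# elements of each list (the .str accessor works on list values too)
df3 = (df.groupby('company')['product']
         .apply(lambda x: list(x.unique()))
         .reset_index()
         .assign(count=lambda d: d['product'].str.len()))
print(df3)
```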