unique() for More Than One Variable

How about using unique() itself?

df <- data.frame(yad = c("BARBIE", "BARBIE", "BAKUGAN", "BAKUGAN"),
                 per = c("AYLIK", "AYLIK", "2 AYLIK", "2 AYLIK"),
                 hmm = 1:4)

df
#       yad     per hmm
# 1  BARBIE   AYLIK   1
# 2  BARBIE   AYLIK   2
# 3 BAKUGAN 2 AYLIK   3
# 4 BAKUGAN 2 AYLIK   4

unique(df[c("yad", "per")])
#       yad     per
# 1  BARBIE   AYLIK
# 3 BAKUGAN 2 AYLIK
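For comparison, a pandas sketch of the same operation on the same toy data: selecting the columns of interest and calling DataFrame.drop_duplicates mirrors unique(df[c("yad", "per")]).

```python
import pandas as pd

# same toy data as the R example above
df = pd.DataFrame({
    "yad": ["BARBIE", "BARBIE", "BAKUGAN", "BAKUGAN"],
    "per": ["AYLIK", "AYLIK", "2 AYLIK", "2 AYLIK"],
    "hmm": [1, 2, 3, 4],
})

# keep only distinct (yad, per) combinations, like unique(df[c("yad", "per")])
unique_pairs = df[["yad", "per"]].drop_duplicates()
print(unique_pairs)  # rows 0 and 2: one row per distinct pair
```

As in R, the original row labels (here 0 and 2) are preserved, which makes it easy to trace each distinct combination back to its first occurrence.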

Subset with unique cases, based on multiple columns

You can use the duplicated() function to keep only the rows with a unique combination of the first three columns:

> df[!duplicated(df[1:3]),]
  v1 v2 v3  v4 v5
1  7  1  A 100 98
2  7  2  A  98 97
3  8  1  C  NA 80
6  9  3  C  75 75

To get only the duplicates, check for them in both directions, so that every member of a duplicated group is flagged:

> df[duplicated(df[1:3]) | duplicated(df[1:3], fromLast=TRUE),]
  v1 v2 v3 v4 v5
3  8  1  C NA 80
4  8  1  C 78 75
5  8  1  C 50 62
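A pandas sketch of the same idea, using hypothetical data reconstructed from the R output above: duplicated() with keep="first" (the default) matches R's duplicated(), and keep=False flags every member of a duplicated group, matching the fromLast trick.

```python
import numpy as np
import pandas as pd

# hypothetical data reconstructed from the R output above
df = pd.DataFrame({
    "v1": [7, 7, 8, 8, 8, 9],
    "v2": [1, 2, 1, 1, 1, 3],
    "v3": ["A", "A", "C", "C", "C", "C"],
    "v4": [100, 98, np.nan, 78, 50, 75],
    "v5": [98, 97, 80, 75, 62, 75],
})

# unique combinations of the first three columns, like df[!duplicated(df[1:3]), ]
first_of_each = df[~df.duplicated(["v1", "v2", "v3"])]

# all duplicates: keep=False flags every row in each duplicated group,
# like duplicated(...) | duplicated(..., fromLast=TRUE) in R
all_dupes = df[df.duplicated(["v1", "v2", "v3"], keep=False)]
print(all_dupes)
```

Here keep=False replaces the two-directional check with a single call.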

Group by and count unique values in several columns in R

Here's an approach using dplyr::across(), which is a handy way to apply a function across multiple columns:

my_data <- data.frame(
  city = c(rep("A", 3), rep("B", 3)),
  col1 = 1:6,
  col2 = 0,
  col3 = c(1:3, 4, 4, 4),
  col4 = 1:2
)

library(dplyr)
my_data %>%
  group_by(city) %>%
  summarize(across(col1:col4, n_distinct))

# A tibble: 2 x 5
  city   col1  col2  col3  col4
* <chr> <int> <int> <int> <int>
1 A         3     1     3     2
2 B         3     1     1     2
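The pandas counterpart of across(col1:col4, n_distinct) is GroupBy.nunique, which counts distinct values per column within each group. A sketch with the same toy data:

```python
import pandas as pd

# same toy data as the R example above (col4 = 1:2 recycled to length 6)
my_data = pd.DataFrame({
    "city": ["A"] * 3 + ["B"] * 3,
    "col1": [1, 2, 3, 4, 5, 6],
    "col2": [0] * 6,
    "col3": [1, 2, 3, 4, 4, 4],
    "col4": [1, 2, 1, 2, 1, 2],
})

# one row per city, one distinct-value count per selected column
counts = my_data.groupby("city")[["col1", "col2", "col3", "col4"]].nunique()
print(counts)
```

The result matches the tibble above: city A has 3 distinct col1 values, 1 distinct col2 value, and so on.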

R equivalent of SELECT DISTINCT on two or more fields/variables

unique() works on data frames, so unique(df[c("var1", "var2")]) should be what you want.

Another option is distinct from dplyr package:

df %>% distinct(var1, var2) # or distinct(df, var1, var2)

Note:

For older versions of dplyr (< 0.5.0, released 2016-06-24), distinct() required an additional select() step:

df %>% select(var1, var2) %>% distinct

(or, in the older style, distinct(select(df, var1, var2))).

Count of unique values that occur more than 100 in a data frame

This would solve the problem; for brevity, the example below uses a threshold of 2 instead of 100.


import pandas as pd

# sample dict with repeated items
d = {'drug_name':['hello', 'hello', 'hello', 'hello', 'bye', 'bye']}
df = pd.DataFrame(d)
print(df)
print()

# this gets the unique values with their respective frequency
df_counted = df['drug_name'].value_counts()
print(df_counted)
print()


# keep only the values that occur more than n times (here n = 2)
df_filtered = df_counted[df_counted > 2]
print(df_filtered)


This is the sample dataframe:

  drug_name
0     hello
1     hello
2     hello
3     hello
4       bye
5       bye

These are the unique values counted:

hello    4
bye      2

These are the unique values that occur more than n times (here n = 2):

hello    4
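If you only need how many distinct values clear the threshold, rather than the values themselves, summing the boolean mask is a common pandas idiom. A sketch with the same toy data and threshold:

```python
import pandas as pd

# same toy data as above
df = pd.DataFrame({"drug_name": ["hello"] * 4 + ["bye"] * 2})

counts = df["drug_name"].value_counts()
n = 2  # use 100 for the original question
# True/False per distinct value; summing counts the Trues
num_frequent = (counts > n).sum()
print(num_frequent)  # 1 -- only 'hello' (4 occurrences) exceeds n
```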

dplyr count unique values in two columns without reshaping long

An alternative is c_across(), available since dplyr 1.0.0:

library(dplyr)

d %>%
  group_by(Group) %>%
  mutate(n = n_distinct(c_across(everything())))

# # A tibble: 6 x 4
# # Groups:   Group [3]
#   Group node1 node2     n
#   <chr> <chr> <chr> <int>
# 1 A     a     w         2
# 2 B     b     r         3
# 3 B     b     t         3
# 4 C     c     z         4
# 5 C     c     u         4
# 6 C     c     i         4

Note: everything() inside c_across() excludes the grouping variable (Group), so n_distinct() effectively receives c(node1, node2). To select the variables explicitly, you can also use

  • c_across(node1:node2)
  • c_across(starts_with('node'))
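A rough pandas analogue of n_distinct(c_across(...)), using hypothetical data mirroring the output above: pool the selected columns within each group, count the distinct values, then map the per-group count back onto the rows.

```python
import pandas as pd

# hypothetical data mirroring the R output above
d = pd.DataFrame({
    "Group": ["A", "B", "B", "C", "C", "C"],
    "node1": ["a", "b", "b", "c", "c", "c"],
    "node2": ["w", "r", "t", "z", "u", "i"],
})

def pooled_nunique(g):
    # flatten node1 and node2 into one array and count distinct values
    return pd.unique(g.values.ravel()).size

# one distinct-value count per group, pooled across both columns
n_per_group = d.groupby("Group")[["node1", "node2"]].apply(pooled_nunique)

# broadcast back to each row, like mutate() in the dplyr version
d["n"] = d["Group"].map(n_per_group)
print(d)
```

Like the c_across() version, this counts distinct values across both columns jointly (group B has b, r, t = 3), not per column.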

