unique() for More Than One Variable

How about using unique() itself?

df <- data.frame(yad = c("BARBIE", "BARBIE", "BAKUGAN", "BAKUGAN"),
                 per = c("AYLIK", "AYLIK", "2 AYLIK", "2 AYLIK"),
                 hmm = 1:4)

df
#       yad     per hmm
# 1  BARBIE   AYLIK   1
# 2  BARBIE   AYLIK   2
# 3 BAKUGAN 2 AYLIK   3
# 4 BAKUGAN 2 AYLIK   4

unique(df[c("yad", "per")])
#       yad     per
# 1  BARBIE   AYLIK
# 3 BAKUGAN 2 AYLIK
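For comparison, a pandas sketch of the same operation on the same toy data: selecting the columns of interest and calling DataFrame.drop_duplicates mirrors unique(df[c("yad", "per")]).

```python
import pandas as pd

# same toy data as the R example above
df = pd.DataFrame({
    "yad": ["BARBIE", "BARBIE", "BAKUGAN", "BAKUGAN"],
    "per": ["AYLIK", "AYLIK", "2 AYLIK", "2 AYLIK"],
    "hmm": [1, 2, 3, 4],
})

# keep only distinct (yad, per) combinations, like unique(df[c("yad", "per")])
unique_pairs = df[["yad", "per"]].drop_duplicates()
print(unique_pairs)  # rows 0 and 2: one row per distinct pair
```

As in R, the original row labels (here 0 and 2) are preserved, which makes it easy to trace each distinct combination back to its first occurrence.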

Subset with unique cases, based on multiple columns

You can use the duplicated() function to keep only the rows with a unique combination of the first three columns:

> df[!duplicated(df[1:3]),]
  v1 v2 v3  v4 v5
1  7  1  A 100 98
2  7  2  A  98 97
3  8  1  C  NA 80
6  9  3  C  75 75

To get only the duplicates, check for them in both directions, so that every member of a duplicated group is flagged:

> df[duplicated(df[1:3]) | duplicated(df[1:3], fromLast=TRUE),]
  v1 v2 v3 v4 v5
3  8  1  C NA 80
4  8  1  C 78 75
5  8  1  C 50 62
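A pandas sketch of the same idea, using hypothetical data reconstructed from the R output above: duplicated() with keep="first" (the default) matches R's duplicated(), and keep=False flags every member of a duplicated group, matching the fromLast trick.

```python
import numpy as np
import pandas as pd

# hypothetical data reconstructed from the R output above
df = pd.DataFrame({
    "v1": [7, 7, 8, 8, 8, 9],
    "v2": [1, 2, 1, 1, 1, 3],
    "v3": ["A", "A", "C", "C", "C", "C"],
    "v4": [100, 98, np.nan, 78, 50, 75],
    "v5": [98, 97, 80, 75, 62, 75],
})

# unique combinations of the first three columns, like df[!duplicated(df[1:3]), ]
first_of_each = df[~df.duplicated(["v1", "v2", "v3"])]

# all duplicates: keep=False flags every row in each duplicated group,
# like duplicated(...) | duplicated(..., fromLast=TRUE) in R
all_dupes = df[df.duplicated(["v1", "v2", "v3"], keep=False)]
print(all_dupes)
```

Here keep=False replaces the two-directional check with a single call.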

Group by and count unique values in several columns in R

Here's an approach using dplyr::across(), which is a handy way to apply a function across multiple columns:

my_data <- data.frame(
  city = c(rep("A", 3), rep("B", 3)),
  col1 = 1:6,
  col2 = 0,
  col3 = c(1:3, 4, 4, 4),
  col4 = 1:2
)

library(dplyr)
my_data %>%
  group_by(city) %>%
  summarize(across(col1:col4, n_distinct))

# A tibble: 2 x 5
  city   col1  col2  col3  col4
* <chr> <int> <int> <int> <int>
1 A         3     1     3     2
2 B         3     1     1     2
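The pandas counterpart of across(col1:col4, n_distinct) is GroupBy.nunique, which counts distinct values per column within each group. A sketch with the same toy data:

```python
import pandas as pd

# same toy data as the R example above (col4 = 1:2 recycled to length 6)
my_data = pd.DataFrame({
    "city": ["A"] * 3 + ["B"] * 3,
    "col1": [1, 2, 3, 4, 5, 6],
    "col2": [0] * 6,
    "col3": [1, 2, 3, 4, 4, 4],
    "col4": [1, 2, 1, 2, 1, 2],
})

# one row per city, one distinct-value count per selected column
counts = my_data.groupby("city")[["col1", "col2", "col3", "col4"]].nunique()
print(counts)
```

The result matches the tibble above: city A has 3 distinct col1 values, 1 distinct col2 value, and so on.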

R equivalent of SELECT DISTINCT on two or more fields/variables

unique() works on data frames, so unique(df[c("var1", "var2")]) should be what you want.

Another option is distinct from dplyr package:

df %>% distinct(var1, var2) # or distinct(df, var1, var2)

Note:

For older versions of dplyr (< 0.5.0, released 2016-06-24), distinct() required an additional select() step:

df %>% select(var1, var2) %>% distinct

(or, in the older style, distinct(select(df, var1, var2))).

Count of unique values that occur more than 100 in a data frame

This would solve the problem; for brevity, the example below uses a threshold of 2 instead of 100.


import pandas as pd

# sample dict with repeated items
d = {'drug_name':['hello', 'hello', 'hello', 'hello', 'bye', 'bye']}
df = pd.DataFrame(d)
print(df)
print()

# this gets the unique values with their respective frequency
df_counted = df['drug_name'].value_counts()
print(df_counted)
print()


# keep only the values that occur more than n times (here n = 2)
df_filtered = df_counted[df_counted > 2]
print(df_filtered)


This is the sample dataframe:

  drug_name
0     hello
1     hello
2     hello
3     hello
4       bye
5       bye

These are the unique values counted:

hello    4
bye      2

These are the unique values that occur more than n times (here n = 2):

hello    4
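If you only need how many distinct values clear the threshold, rather than the values themselves, summing the boolean mask is a common pandas idiom. A sketch with the same toy data and threshold:

```python
import pandas as pd

# same toy data as above
df = pd.DataFrame({"drug_name": ["hello"] * 4 + ["bye"] * 2})

counts = df["drug_name"].value_counts()
n = 2  # use 100 for the original question
# True/False per distinct value; summing counts the Trues
num_frequent = (counts > n).sum()
print(num_frequent)  # 1 -- only 'hello' (4 occurrences) exceeds n
```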

dplyr count unique values in two columns without reshaping long

An alternative is c_across(), available since dplyr 1.0.0:

library(dplyr)

d %>%
  group_by(Group) %>%
  mutate(n = n_distinct(c_across(everything())))

# # A tibble: 6 x 4
# # Groups:   Group [3]
#   Group node1 node2     n
#   <chr> <chr> <chr> <int>
# 1 A     a     w         2
# 2 B     b     r         3
# 3 B     b     t         3
# 4 C     c     z         4
# 5 C     c     u         4
# 6 C     c     i         4

Note: everything() inside c_across() excludes the grouping variable (Group), so n_distinct() effectively receives c(node1, node2). To select the variables explicitly, you can also use

  • c_across(node1:node2)
  • c_across(starts_with('node'))
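A rough pandas analogue of n_distinct(c_across(...)), using hypothetical data mirroring the output above: pool the selected columns within each group, count the distinct values, then map the per-group count back onto the rows.

```python
import pandas as pd

# hypothetical data mirroring the R output above
d = pd.DataFrame({
    "Group": ["A", "B", "B", "C", "C", "C"],
    "node1": ["a", "b", "b", "c", "c", "c"],
    "node2": ["w", "r", "t", "z", "u", "i"],
})

def pooled_nunique(g):
    # flatten node1 and node2 into one array and count distinct values
    return pd.unique(g.values.ravel()).size

# one distinct-value count per group, pooled across both columns
n_per_group = d.groupby("Group")[["node1", "node2"]].apply(pooled_nunique)

# broadcast back to each row, like mutate() in the dplyr version
d["n"] = d["Group"].map(n_per_group)
print(d)
```

Like the c_across() version, this counts distinct values across both columns jointly (group B has b, r, t = 3), not per column.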

