unique() for more than one variable
How about using unique() itself?
df <- data.frame(yad = c("BARBIE", "BARBIE", "BAKUGAN", "BAKUGAN"),
                 per = c("AYLIK", "AYLIK", "2 AYLIK", "2 AYLIK"),
                 hmm = 1:4)
df
#       yad     per hmm
# 1  BARBIE   AYLIK   1
# 2  BARBIE   AYLIK   2
# 3 BAKUGAN 2 AYLIK   3
# 4 BAKUGAN 2 AYLIK   4
unique(df[c("yad", "per")])
#       yad     per
# 1  BARBIE   AYLIK
# 3 BAKUGAN 2 AYLIK
Subset with unique cases, based on multiple columns
You can use the duplicated() function to find the unique combinations:
> df[!duplicated(df[1:3]),]
  v1 v2 v3  v4 v5
1  7  1  A 100 98
2  7  2  A  98 97
3  8  1  C  NA 80
6  9  3  C  75 75
To get only the duplicates, you can check it in both directions:
> df[duplicated(df[1:3]) | duplicated(df[1:3], fromLast=TRUE),]
  v1 v2 v3 v4 v5
3  8  1  C NA 80
4  8  1  C 78 75
5  8  1  C 50 62
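The data frame df used above isn't defined in the snippet; a df reconstructed from the printed output (an assumption, since the original definition is missing) would be:

```r
# Reconstructed to match the printed output above -- the original df is not shown
df <- data.frame(
  v1 = c(7, 7, 8, 8, 8, 9),
  v2 = c(1, 2, 1, 1, 1, 3),
  v3 = c("A", "A", "C", "C", "C", "C"),
  v4 = c(100, 98, NA, 78, 50, 75),
  v5 = c(98, 97, 80, 75, 62, 75)
)

# rows with a unique combination of the first three columns
df[!duplicated(df[1:3]), ]
# the rows that are duplicated, checked in both directions
df[duplicated(df[1:3]) | duplicated(df[1:3], fromLast = TRUE), ]
```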
Group by and count unique values in several columns in R
Here's an approach using dplyr::across(), which is a handy way to calculate across multiple columns:
my_data <- data.frame(
  city = c(rep("A", 3), rep("B", 3)),
  col1 = 1:6,
  col2 = 0,
  col3 = c(1:3, 4, 4, 4),
  col4 = 1:2
)
library(dplyr)
my_data %>%
  group_by(city) %>%
  summarize(across(col1:col4, n_distinct))
# A tibble: 2 x 5
  city   col1  col2  col3  col4
* <chr> <int> <int> <int> <int>
1 A         3     1     3     2
2 B         3     1     1     2
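For comparison, the same per-group distinct counts can be computed in base R with aggregate() -- a sketch, no dplyr required:

```r
my_data <- data.frame(
  city = c(rep("A", 3), rep("B", 3)),
  col1 = 1:6,
  col2 = 0,
  col3 = c(1:3, 4, 4, 4),
  col4 = 1:2
)

# count distinct values per column within each city
aggregate(. ~ city, data = my_data, FUN = function(x) length(unique(x)))
```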
R equivalent of SELECT DISTINCT on two or more fields/variables
unique() works on a data.frame, so unique(df[c("var1", "var2")]) should be what you want.
Another option is distinct() from the dplyr package:
df %>% distinct(var1, var2) # or distinct(df, var1, var2)
Note: for older versions of dplyr (< 0.5.0, released 2016-06-24), distinct required an additional select step:
df %>% select(var1, var2) %>% distinct
(or the older style distinct(select(df, var1, var2))).
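If you also need the remaining columns, distinct() has a .keep_all argument that retains all columns, keeping the first row of each combination (a small sketch with made-up data):

```r
library(dplyr)

# made-up example data for illustration
df <- data.frame(var1  = c("x", "x", "y"),
                 var2  = c(1, 1, 2),
                 other = 10:12)

distinct(df, var1, var2)                   # only var1 and var2
distinct(df, var1, var2, .keep_all = TRUE) # all columns, first row per combination
```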
Count of unique values that occur more than 100 times in a data frame
This would solve the problem.
import pandas as pd
# sample dict with repeated items
d = {'drug_name':['hello', 'hello', 'hello', 'hello', 'bye', 'bye']}
df = pd.DataFrame(d)
print(df)
print()
# this gets the unique values with their respective frequency
df_counted = df['drug_name'].value_counts()
print(df_counted)
print()
# keep only values with a count > 2
df_filtered = df_counted[df_counted>2]
print(df_filtered)
This is the sample dataframe:
  drug_name
0     hello
1     hello
2     hello
3     hello
4       bye
5       bye
These are the unique values counted:
hello 4
bye 2
These are the unique values with a count > 2:
hello 4
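The same count-then-filter pattern applies at the threshold from the section title; a minimal sketch with made-up data and a cutoff of 100:

```python
import pandas as pd

# made-up data: 'a' appears 150 times, 'b' only 50
df = pd.DataFrame({'drug_name': ['a'] * 150 + ['b'] * 50})

counts = df['drug_name'].value_counts()   # frequency of each unique value
frequent = counts[counts > 100]           # keep values occurring more than 100 times

print(frequent)       # only 'a' (150) remains
print(len(frequent))  # number of such unique values -> 1
```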
dplyr count unique values in two columns without reshaping long
An alternative is using c_across(), available since dplyr 1.0.0:
library(dplyr)
d %>%
  group_by(Group) %>%
  mutate(n = n_distinct(c_across(everything())))
# # A tibble: 6 x 4
# # Groups:   Group [3]
#   Group node1 node2     n
#   <chr> <chr> <chr> <int>
# 1 A     a     w         2
# 2 B     b     r         3
# 3 B     b     t         3
# 4 C     c     z         4
# 5 C     c     u         4
# 6 C     c     i         4
Note: everything() inside c_across() excludes the grouping variables, i.e. Group, so n_distinct() actually takes c(node1, node2) as input. To specify the variables explicitly, you can also use
c_across(node1:node2)
c_across(starts_with('node'))
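The input d isn't defined in the snippet; a data frame reconstructed from the printed output (an assumption, since the original is missing), plus a summarize() variant for when you only need one row per group:

```r
library(dplyr)

# reconstructed to match the printed output above -- the original d is not shown
d <- data.frame(Group = c("A", "B", "B", "C", "C", "C"),
                node1 = c("a", "b", "b", "c", "c", "c"),
                node2 = c("w", "r", "t", "z", "u", "i"))

# one row per group instead of one per input row
d %>%
  group_by(Group) %>%
  summarize(n = n_distinct(c(node1, node2)))
```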