List Distinct Values in a Vector in R

List distinct values in a vector in R

Do you mean unique()?

R> x = c(1,1,2,3,4,4,4)
R> x
[1] 1 1 2 3 4 4 4
R> unique(x)
[1] 1 2 3 4
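
unique() keeps values in the order of their first appearance; wrap it in sort() if you want the distinct values sorted. A quick illustration with a made-up vector:

R> y <- c(3, 1, 3, 2, 1)
R> unique(y)
[1] 3 1 2
R> sort(unique(y))
[1] 1 2 3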

Count number of distinct values in a vector

Here are a few ideas; they all point towards your solution already being very fast. length(unique(x)) is what I would have used as well:

x <- sample.int(25, 1000, TRUE)

library(microbenchmark)
microbenchmark(length(unique(x)),
               nlevels(factor(x)),
               length(table(x)),
               sum(!duplicated(x)))
# Unit: microseconds
#                expr     min       lq   median       uq      max neval
#   length(unique(x))  24.810  25.9005  27.1350  28.8605   48.854   100
#  nlevels(factor(x)) 367.646 371.6185 380.2025 411.8625 1347.343   100
#    length(table(x)) 505.035 511.3080 530.9490 575.0880 1685.454   100
# sum(!duplicated(x))  24.030  25.7955  27.4275  30.0295   70.446   100
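
As a sanity check, all four expressions agree on the count; for this x (1000 draws from only 25 possible values) the answer will almost certainly be 25:

c(length(unique(x)), nlevels(factor(x)), length(table(x)), sum(!duplicated(x)))
# typically: [1] 25 25 25 25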

Unique values of a dataframe column to list in R

In base R, you can get unique values from site_name and make it a list.

as.list(unique(site_df$site_name))

For example, with the built-in mtcars dataset this results in:

as.list(unique(mtcars$cyl))
#[[1]]
#[1] 6

#[[2]]
#[1] 4

#[[3]]
#[1] 8
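
If you only need a plain vector rather than a list, unique() on its own already gives that:

unique(mtcars$cyl)
#[1] 6 4 8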

Get indexes of unique values in a vector

We can use split():

split(seq_along(filenames), filenames)

#$kisyu2_mst.csv
#[1] 1 3

#$kisyu3_mst.csv
#[1] 2 4 5
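
The filenames vector isn't shown in the question; a hypothetical vector that would reproduce the output above is:

filenames <- c("kisyu2_mst.csv", "kisyu3_mst.csv", "kisyu2_mst.csv",
               "kisyu3_mst.csv", "kisyu3_mst.csv")
split(seq_along(filenames), filenames)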

R: Find unique vectors in list of vectors

We can sort the list elements, apply duplicated() to get a logical index of the unique elements, and subset the list based on that:

list_of_vectors[!duplicated(lapply(list_of_vectors, sort))]
#[[1]]
#[1] "a" "b" "c"

#[[2]]
#[1] "b" "b" "c"

#[[3]]
#[1] "c" "c" "b"

#[[4]]
#[1] "b" "b" "c" "d"

#[[5]]
#NULL
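
The list_of_vectors object isn't shown either; purely for illustration, a hypothetical input consistent with the output above (its fourth element is dropped because, once sorted, it duplicates the first) would be:

list_of_vectors <- list(c("a", "b", "c"),
                        c("b", "b", "c"),
                        c("c", "c", "b"),
                        c("b", "c", "a"),        # duplicates the first element after sorting
                        c("b", "b", "c", "d"),
                        NULL)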

Create a list column of unique values from other columns in a grouped tibble

I think the following code might do what you ask. The trick is to combine the values from the different columns: some of them contain characters, while the other one contains lists. So the first step is to extract the information from both types of columns as vectors (with c_across(starts_with("c")) and unlist(extra)) and combine them into a single vector you can then work on.
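
The exemplar tibble itself is not reproduced in the question. Purely as an assumption, a tibble of the matching shape (two groups of five rows, three character columns, and a list-column extra whose sixth entry holds two values) could be built like this:

library(tibble)

exemplar <- tibble(
  group = rep(c("group1", "group2"), each = 5),
  char1 = letters[1:10],
  char2 = letters[2:11],
  char3 = letters[3:12],
  extra = c(as.list(rep("", 5)),    # length-1 character entries
            list(list("x", "y")),   # one entry that is itself a list of two
            as.list(rep("", 4)))
)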

library(dplyr)

exemplar %>%
  group_by(group) %>%
  mutate(unique = list(                  # make sure the new column is a list-column
    unique(                              # keep only the unique values
      c(                                 # combine results from the two types of columns
        c_across(starts_with("c")),      # first extract the "char" columns into a vector
        unlist(extra)                    # then flatten the "extra" list-column into a vector
      )
    )
  )) %>%
  ungroup()

The result of this command is the following

# A tibble: 10 × 6
   group  char1 char2 char3 extra      unique
   <chr>  <chr> <chr> <chr> <list>     <list>
 1 group1 a     b     c     <chr [1]>  <chr [8]>
 2 group1 b     c     d     <chr [1]>  <chr [8]>
 3 group1 c     d     e     <chr [1]>  <chr [8]>
 4 group1 d     e     f     <chr [1]>  <chr [8]>
 5 group1 e     f     g     <chr [1]>  <chr [8]>
 6 group2 f     g     h     <list [2]> <chr [10]>
 7 group2 g     h     i     <chr [1]>  <chr [10]>
 8 group2 h     i     j     <chr [1]>  <chr [10]>
 9 group2 i     j     k     <chr [1]>  <chr [10]>
10 group2 j     k     l     <chr [1]>  <chr [10]>

For group 1, the result is

[[1]]
[1] "a" "b" "c" "d" "e" "f" "g" ""

And for group 2,

[[1]]
[1] "f" "g" "h" "i" "j" "k" "l" "x" "y" ""

Finding unique values from a list

This solution suggested by Marek is the best answer to the original Q. See below for a discussion of other approaches and why Marek's is the most useful.
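
The list x from Roman's question isn't reproduced here; for a runnable sketch, assume something of this shape (named components, each with the same number of unique values):

x <- list(a = c(1, 2, 3), b = c(2, 3, 4), c = c(4, 5, 6))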

> unique(unlist(x, use.names = FALSE))
[1] 1 2 3 4 5 6


Discussion

A faster solution is to compute unique() on the components of your x first and then do a final unique() on those results. This will only work if the components of the list have the same number of unique values, as they do in both examples below. E.g.:

First your version, then my double unique approach:

> unique(unlist(x))
[1] 1 2 3 4 5 6
> unique.default(sapply(x, unique))
[1] 1 2 3 4 5 6

We have to call unique.default as there is a matrix method for unique that keeps one margin fixed; this is fine as a matrix can be treated as a vector.

Marek, in the comments to this answer, notes that the slow speed of the unlist approach is potentially due to the names on the list. Marek's solution is to use the use.names argument of unlist(), which, when set to FALSE, results in a faster solution than the double unique version above. For the simple x of Roman's post we get

> unique(unlist(x, use.names = FALSE))
[1] 1 2 3 4 5 6

Marek's solution will work even when the number of unique elements differs between components.

Here is a larger example with some timings of all three methods:

## Create a large list (1000 components of length 1000 each)
DF <- as.list(data.frame(matrix(sample(1:10, 1000*1000, replace = TRUE),
                                ncol = 1000)))

Here are the results for the three approaches using DF:

> ## Do the three approaches give the same result:
> all.equal(unique.default(sapply(DF, unique)), unique(unlist(DF)))
[1] TRUE
> all.equal(unique(unlist(DF, use.names = FALSE)), unique(unlist(DF)))
[1] TRUE
> ## Timing Roman's original:
> system.time(replicate(10, unique(unlist(DF))))
user system elapsed
12.884 0.077 12.966
> ## Timing double unique version:
> system.time(replicate(10, unique.default(sapply(DF, unique))))
user system elapsed
0.648 0.000 0.653
> ## timing of Marek's solution:
> system.time(replicate(10, unique(unlist(DF, use.names = FALSE))))
user system elapsed
0.510 0.000 0.512

This shows that the double unique approach, applying unique() to the individual components and then unique() to those smaller sets of unique values, is a lot quicker than the plain unlist version, but that the speed-up is purely due to the names on the list DF. If we tell unlist not to use the names, Marek's solution is marginally quicker than the double unique for this problem. As Marek's solution uses the correct tool properly, and is quicker than the work-around, it is the preferred solution.

The big gotcha with the double unique approach is that it will only work if, as in the two examples here, each component of the input list (DF or x) has the same number of unique values. In such cases sapply simplifies the result to a matrix which allows us to apply unique.default. If the components of the input list have differing numbers of unique values, the double unique solution will fail.
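
A small, made-up illustration of that restriction:

y <- list(a = c(1, 1, 2), b = c(3, 4, 5))
unique(unlist(y, use.names = FALSE))  # still works: 1 2 3 4 5
sapply(y, unique)                     # remains a list, not a matrix, so the
                                      # double-unique work-around no longer applies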

Test if a value is unique in a vector in R

The match operator %in% is very helpful:

!test %in% test[duplicated(test)]
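
For example, with a small made-up vector, the result is TRUE exactly for the values that occur only once:

test <- c(1, 2, 2, 3, 4, 4, 4)
!test %in% test[duplicated(test)]
#[1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE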

