How to Get the Most Frequent Level of a Categorical Variable in R

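# Simulated data: 1000 draws from seven possible values
a <- sample(x = c(19, 71, 98, 139, 146, 185, 191), size = 1000, replace = TRUE)
# Frequency of each value
tt <- table(a)
# Name of the entry with the highest count (the first one if there is a tie)
names(tt[which.max(tt)])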

How to get the most frequent level of a categorical variable in R when the variable has two levels?

Assuming it's ok to infer this from a table, a simple frequency table will do it:

table1 <- table(dataset$sex, dataset$var2)
table1

Substitute your dataset's name and whatever you've called your second variable. The output is a two-way frequency table, and you can read along each row to see the most frequent category for each sex.
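If reading the table by eye isn't practical, the per-row maximum can also be pulled out in code. This is only a minimal sketch: the data frame below is made up for illustration (the column names sex and var2 follow the answer above), and most_frequent is a hypothetical name.

```r
# Hypothetical example data; substitute your own data frame and column names
dataset <- data.frame(
  sex  = c("F", "F", "F", "M", "M", "M", "M"),
  var2 = c("yes", "yes", "no", "no", "no", "yes", "no")
)

table1 <- table(dataset$sex, dataset$var2)

# For each row (each sex), take the column name with the highest count.
# which.max() returns the first maximum, so ties go to the earlier column.
most_frequent <- apply(table1, 1, function(r) names(r)[which.max(r)])
most_frequent
```

The result is a named character vector with one entry per level of sex.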

Getting the most frequent element in a factor in R

Depending on the size of your data and how often you need to do this, you might want to spend some time writing a more efficient function. Under the hood, table() relies on tabulate(), which is much faster when called directly, and that leads to a function like the following:

MaxTable <- function(InVec, mult = FALSE) {
  if (!is.factor(InVec)) InVec <- factor(InVec)
  A <- tabulate(InVec)
  if (isTRUE(mult)) {
    levels(InVec)[A == max(A)]
  } else {
    levels(InVec)[which.max(A)]
  }
}

With mult = TRUE, this function also reports every value that is tied for the maximum count. Compare the following:

mySet <- c("A", "A", "A", "B", "B", "B", "C", "C")
## Your question indicates that you have factors,
## but your sample code is a character vector
mySetF <- factor(mySet) ## Just as an example

## @BrodieG's answer
fun1 <- function(InVec) {
  names(which.max(table(InVec)))
}

## @sgibb's answer
fun2 <- function(InVec) {
  m <- which.max(table(as.character(InVec)))
  as.character(InVec)[m]
}

fun1(mySet)
# [1] "A"
fun2(mySet)
# [1] "A"
MaxTable(mySet)
# [1] "A"
MaxTable(mySet, mult = TRUE)
# [1] "A" "B"

library(microbenchmark)
microbenchmark(fun1(mySet), fun2(mySet), MaxTable(mySet), MaxTable(mySetF))
# Unit: microseconds
#              expr     min       lq   median       uq      max neval
#       fun1(mySet) 291.457 297.1845 302.2080 313.1235 3008.108   100
#       fun2(mySet) 296.388 302.0775 311.3170 321.5260 1367.137   100
#   MaxTable(mySet) 172.463 180.8755 184.8355 189.9700 1947.700   100
#  MaxTable(mySetF)  34.510  38.1545  44.6045  46.6695   95.341   100

For small vectors, MaxTable() is already faster, and the gap is larger still when the input is a factor. How about bigger vectors?

set.seed(1)
medSet <- sample(c(LETTERS, letters), 1e5, TRUE)
medSetF <- factor(medSet)

fun1(medSet)
# [1] "E"
fun2(medSet) ### Wrong Answer!!!
# [1] "D"
MaxTable(medSet)
# [1] "E"

microbenchmark(fun1(medSet), MaxTable(medSet), MaxTable(medSetF))
# Unit: microseconds
#               expr       min        lq     median        uq       max neval
#       fun1(medSet) 14222.846 14350.957 14484.4490 14600.490 34810.174   100
#   MaxTable(medSet)  7787.761  7860.248  7917.3455  8019.068  9762.884   100
#  MaxTable(medSetF)   501.733   529.257   570.0735   587.936  1469.994   100

I've dropped @sgibb's function from the benchmarks (it runs in about the same time as fun1()) since it returns the wrong answer.
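The wrong answer seems to come from how fun2() indexes: which.max() returns a position within the table, but fun2() then uses that position as an index into the original vector. A possible correction is sketched below. The name fun2_fixed and the test vector x are hypothetical, not part of the original answers, and the fix ends up doing essentially the same thing as fun1().

```r
fun2_fixed <- function(InVec) {
  # which.max() returns a named integer: the name is the most frequent value,
  # the value is its position in the table (not in InVec)
  m <- which.max(table(as.character(InVec)))
  names(m)
}

# Made-up vector where the table position and the vector position disagree:
# the table is a = 2, b = 3, c = 1, so m is 2, but x[2] is "a"
x <- c("c", "a", "b", "b", "b", "a")
fun2_fixed(x)
# [1] "b"
```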

One last benchmark, this time on a much larger vector:

set.seed(3)
bigSet <- sample(c(LETTERS, letters), 1e7, TRUE)
bigSetF <- factor(bigSet)
microbenchmark(fun1(bigSet), MaxTable(bigSet), MaxTable(bigSetF), times = 10)
# Unit: milliseconds
#               expr        min         lq     median         uq        max neval
#       fun1(bigSet) 1519.37503 1612.10290 1648.36473 1789.02965 1932.41073    10
#   MaxTable(bigSet)  782.01856  791.86408  834.35764  894.60535 1019.28747    10
#  MaxTable(bigSetF)   48.56459   48.76492   49.25444   49.93911   50.20404    10

Fastest way of determining most frequent factor in a grouped data frame in dplyr

Here's another option with dplyr:

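library(dplyr)

set.seed(123)
z <- data.frame(a = rep(1:50000, 100),
                b = sample(LETTERS, 5000000, replace = TRUE),
                stringsAsFactors = FALSE)

# Count each (a, b) pair, then keep the top-ranked b within each a
a <- z %>% group_by(a, b) %>% summarise(c = n()) %>% filter(row_number(desc(c)) == 1) %>% .$b
# Tabulate b within each a and take the first value with the maximum count
b <- z %>% group_by(a) %>% summarise(c = names(which(table(b) == max(table(b)))[1])) %>% .$c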

We make sure these are equivalent approaches:

identical(a, b)
# [1] TRUE

Update

As mentioned by @docendodiscimus, you could also do:

count(z, a, b) %>% slice(which.max(n))

Here are the results on the benchmark:

library(microbenchmark)
mbm <- microbenchmark(
steven = z %>% group_by(a, b) %>% summarise(c = n()) %>% filter(row_number(desc(c))==1),
phil = z %>% group_by(a) %>% summarise(c = names(which(table(b) == max(table(b)))[1])),
docendo = count(z, a, b) %>% slice(which.max(n)),
times = 10
)

# Unit: seconds
#     expr       min        lq      mean    median        uq       max neval cld
#   steven  4.752168  4.789564  4.815986  4.813686  4.847964  4.875109    10  b
#     phil 15.356051 15.378914 15.467534 15.458844 15.533385 15.606690    10   c
#  docendo  4.586096  4.611401  4.669375  4.688420  4.702352  4.753583    10 a

Create a variable capturing the most frequent occurrence by group

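You can do this using ddply (from the plyr package) and a custom function to pick out the most frequent value:

library(plyr)

myFun <- function(x) {
  tbl <- table(x$v1)
  # Repeat the most frequent value of v1 on every row of the group
  x$freq <- rep(names(tbl)[which.max(tbl)], nrow(x))
  x
}

ddply(df1, .(id), .fun = myFun)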

Note that which.max returns the first occurrence of the maximum value, so in the case of ties the first value in table order wins. See ??which.is.max in the nnet package for an option that breaks ties randomly.
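To make the tie behaviour concrete, here is a small base R sketch that does not need nnet. The vector v and the name tied are made up for illustration, and the random pick is just one possible way to break ties, not what which.is.max does internally.

```r
v <- c("x", "y", "y", "x", "z")
tbl <- table(v)

# First maximum only (what which.max() gives): "x"
names(tbl)[which.max(tbl)]

# All values tied for the maximum count: "x" and "y"
tied <- names(tbl)[tbl == max(tbl)]
tied

# Pick one of the tied values at random;
# set.seed() is only here to make the example reproducible
set.seed(42)
tied[sample.int(length(tied), 1)]
```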

Impute most frequent categorical value in all columns in data frame

We could create an index for the non-numeric columns:

i1 <- !sapply(df, is.numeric)

Create a function for the mode (the most frequent value):

Mode <- function(x) {
  ux <- sort(unique(x))
  ux[which.max(tabulate(match(x, ux)))]
}

and replace the NAs in the non-numeric columns with the most frequent value:

df[i1] <- lapply(df[i1], function(x)
  replace(x, is.na(x), Mode(x[!is.na(x)])))
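Put together on a small example, the steps look like this. The data frame is made up for illustration (its column names are hypothetical). Mode() is repeated so the block runs on its own.

```r
# Hypothetical data frame with missing values in the non-numeric columns
df <- data.frame(
  id     = 1:6,
  colour = c("red", NA, "blue", "red", NA, "red"),
  size   = c("S", "M", "M", NA, "M", "L"),
  stringsAsFactors = FALSE
)

Mode <- function(x) {
  ux <- sort(unique(x))
  ux[which.max(tabulate(match(x, ux)))]
}

# TRUE for colour and size, FALSE for the integer id column
i1 <- !sapply(df, is.numeric)

# Fill each NA with the most frequent non-missing value of its column
df[i1] <- lapply(df[i1], function(x)
  replace(x, is.na(x), Mode(x[!is.na(x)])))
df
```

Here the missing colour entries become "red" and the missing size entry becomes "M". The numeric id column is left untouched.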

Get most frequently occurring factor level in dplyr piping structure

Use table to count the items and then use which.max to find out the most frequent one:

df %>%
  group_by(cat) %>%
  mutate(cat_mode = names(which.max(table(num)))) %>%
  head()

# A tibble: 6 x 3
# Groups:   cat [4]
#   cat      num cat_mode
#   <fctr> <dbl> <chr>
# 1 Q        305 138
# 2 W       34.0 212
# 3 R       53.0 53
# 4 D        395 5
# 5 W        212 212
# 6 Q        417 138
# ...

Return most frequent string value for each group

The key is to start grouping by both a and b to compute the frequencies and then take only the most frequent per group of a, for example like this:

df %>%
  count(a, b) %>%
  slice(which.max(n))

# Source: local data frame [2 x 3]
# Groups: a
#
#   a b n
# 1 1 B 2
# 2 2 B 2

Of course there are other approaches, so this is only one possible "key".
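One caveat: the output above shows the result of count() still grouped by a, which is what older dplyr versions did. As far as I know, in recent versions count() returns its result with the same grouping as its input, so on an ungrouped data frame slice(which.max(n)) would keep a single row overall. A sketch of the same idea with the grouping made explicit, assuming dplyr 1.0.0 or later (for slice_max()) and a made-up data frame with the same column names:

```r
library(dplyr)

# Hypothetical data with the same column names as above
df <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),
  b = c("A", "B", "B", "B", "B", "C"),
  stringsAsFactors = FALSE
)

res <- df %>%
  count(a, b) %>%                              # frequency of each (a, b) pair
  group_by(a) %>%                              # regroup explicitly by a
  slice_max(n, n = 1, with_ties = FALSE) %>%   # keep one top row per group
  ungroup()
res
```

With with_ties = FALSE exactly one row is kept per group. Drop that argument to keep every value tied for the maximum.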

Find the n most common values in a vector

I'm sure this is a duplicate, but the answer is simple:

sort(table(variable), decreasing = TRUE)[1:3]
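One thing to watch: if the vector has fewer than three distinct values, indexing with [1:3] pads the result with NA. A small variation using head() avoids that. The vector below is made up for illustration, and top3 is a hypothetical name.

```r
# Made-up data: a appears 4 times, b 3 times, c twice, d once
variable <- c("a", "b", "a", "c", "a", "b", "d", "c", "a", "b")

# Three most common values with their counts
top3 <- head(sort(table(variable), decreasing = TRUE), 3)
top3
names(top3)
# [1] "a" "b" "c"

# With only two distinct values, head() simply returns both of them
few <- head(sort(table(c("x", "x", "y")), decreasing = TRUE), 3)
few
```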

