Return Most Frequent String Value for Each Group

The key is to start grouping by both a and b to compute the frequencies and then take only the most frequent per group of a, for example like this:

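library(dplyr)
df %>%
count(a, b) %>%
group_by(a) %>% # regroup explicitly: current dplyr returns count() output ungrouped here
slice(which.max(n))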

Source: local data frame [2 x 3]
Groups: a

  a b n
1 1 B 2
2 2 B 2

Of course there are other approaches, so this is only one possible "key".
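
As one of those other approaches, here is a minimal base R sketch of the same two-step idea (count every pair, then keep the top pair per group). The sample data is hypothetical, since the original df is not shown here, and the names counts and top are made up for illustration:

```r
# Hypothetical sample data; the original df is not shown in the text.
df <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),
  b = c("A", "B", "B", "B", "B", "C")
)

# Step 1: frequency of every (a, b) pair, as a long data frame
# with columns a, b and Freq.
counts <- as.data.frame(table(a = df$a, b = df$b), stringsAsFactors = FALSE)

# Step 2: within each value of a, keep the row with the highest frequency.
# which.max() returns the first maximum, so ties yield a single row.
top <- do.call(rbind, lapply(split(counts, counts$a), function(g) {
  g[which.max(g$Freq), ]
}))
top
```

Like slice(which.max(n)), this keeps only one row per group when there is a tie.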

Return most frequent string by group in base R

Continuing from docendo discimus's answer:

library(dplyr)
# library(tidyr)
df %>%
count(a, b) %>%
group_by(a) %>%
filter(n == max(n)) %>%
mutate(r = row_number()) %>%
tidyr::spread(r, b) %>%
select(-n)
# # A tibble: 3 x 3
# # Groups: a [3]
# a `1` `2`
# <fct> <fct> <fct>
# 1 1 A <NA>
# 2 2 B <NA>
# 3 3 A B

And then you just need to rename the columns.

Base R variant:

reshape(do.call(rbind.data.frame, by(df, df$a, function(x) {
tb <- table(x$b)
tb <- tb[ tb == max(tb) ]
data.frame(a = x$a[1], b = names(tb), r = seq_along(tb))
})), timevar = "r", idvar = "a", direction = "wide")
# a b.1 b.2
# 1 1 A <NA>
# 2 2 B <NA>
# 3.1 3 A B

I'll break it down, since not all of it may be intuitive:

The by function returns a list (with a special class, but still just a list). To see what happens inside the function, let's look at a single value of a. I'll skip to a == "3", since that's the one with a tie:

by(df, df$a, function(x) { browser(); 1; })
# Called from: FUN(data[x, , drop = FALSE], ...)
# Browse[1]>
# debug at #1: [1] 1
# Browse[2]>
# Called from: FUN(data[x, , drop = FALSE], ...)
# Browse[1]>
# debug at #1: [1] 1
# Browse[2]>
# Called from: FUN(data[x, , drop = FALSE], ...)
# Browse[1]>
# debug at #1: [1] 1
# Browse[2]>
x
# a b
# 3 3 A
# 6 3 B
# 9 3 A
# 12 3 B
# Browse[2]>
( tb <- table(x$b) )
# A B
# 2 2

Alright, so we now have the count per-b. Realize that there might easily have been more here, say:

# A B C
# 2 2 1

so I'm going to reduce this named vector to just those with the highest value:

# Browse[2]> 
( tb <- tb[ tb == max(tb) ] ) # no change here, but had there been a third value in 'b' ...
# A B
# 2 2

Lastly, we want by to capture a data.frame (that we can later combine). Within each group, a is guaranteed to be a single value (possibly repeated), so x$a[1] is enough; names(tb) holds all the "interesting" values; and r is a helper column for reshape, later:

# Browse[2]> 
data.frame(a = x$a[1], b = names(tb), r = seq_along(tb))
# a b r
# 1 3 A 1
# 2 3 B 2

Now that we explored internally, let's wrap that up.

by(df, df$a, function(x) {
tb <- table(x$b)
tb <- tb[ tb == max(tb) ]
data.frame(a = x$a[1], b = names(tb), r = seq_along(tb))
})
# df$a: 1
# a b r
# 1 1 A 1
# ------------------------------------------------------------
# df$a: 2
# a b r
# 1 2 B 1
# ------------------------------------------------------------
# df$a: 3
# a b r
# 1 3 A 1
# 2 3 B 2

This looks awkward, but if you look under the hood (with dput), you'll see it's just a re-classed list. We can now combine them into a single frame with:

do.call(rbind.data.frame, by(df, df$a, function(x) {
tb <- table(x$b)
tb <- tb[ tb == max(tb) ]
data.frame(a = x$a[1], b = names(tb), r = seq_along(tb))
}))
# a b r
# 1 1 A 1
# 2 2 B 1
# 3.1 3 A 1
# 3.2 3 B 2
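
The claim that by returns "just a re-classed list" can be checked directly. This is a small sketch on hypothetical data (a hand-typed df with the same shape as the example, and a simplified function that returns the group size n instead of the modes):

```r
# Hypothetical data with one tie-free and one repeated group.
df <- data.frame(a = c(1, 2, 3, 3), b = c("A", "B", "A", "B"))

# Each group returns a one-row data.frame, so the result holds data.frames.
res <- by(df, df$a, function(x) data.frame(a = x$a[1], n = nrow(x)))

class(res)     # the special class that drives the dashed printing
is.list(res)   # underneath, it is a list
res[[3]]       # third element: the data.frame for a == 3
```

Because the elements are ordinary data.frames, do.call(rbind.data.frame, ...) can stack them as shown above.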

BTW: in R versions before 4.0.0, both data.frame and rbind.data.frame give you factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE. If you are on an older version and don't want factors, then:

do.call(rbind.data.frame, c(by(df, df$a, function(x) {
tb <- table(x$b)
tb <- tb[ tb == max(tb) ]
data.frame(a = x$a[1], b = names(tb), r = seq_along(tb),
stringsAsFactors = FALSE)
}), stringsAsFactors=FALSE))
# a b r
# 1 1 A 1
# 2 2 B 1
# 3.1 3 A 1
# 3.2 3 B 2

And then the reshaping. I admit that this is the most fragile part of it (at least for me). I'm not a regular reshape user; I tend towards tidyr::spread or data.table::dcast, but this is base R and it works for now. The use of reshape is a tutorial in and of itself, so I won't go into it here. There are numerous attempts to provide more user-friendly reshaping tools (reshape2, tidyr and data.table come to mind, but they are unlikely to be the only ones).
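
Without turning this into a full tutorial, a minimal sketch of the reshape call may help show what idvar and timevar do. The long data frame below is retyped by hand from the combined output above (an assumption for illustration, not produced by the by() pipeline), and the name wide is made up:

```r
# Long format: one row per (a, b) pair; r numbers the pairs within each a.
long <- data.frame(
  a = c(1, 2, 3, 3),
  b = c("A", "B", "A", "B"),
  r = c(1, 1, 1, 2),
  stringsAsFactors = FALSE
)

# idvar picks the column that identifies one output row;
# timevar picks the column whose values become the suffixes of the
# new columns (b.1, b.2). Missing combinations are filled with NA.
wide <- reshape(long, idvar = "a", timevar = "r", direction = "wide")
wide
```

The remaining column (b here) is spread across the new columns, one per distinct value of r.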

How can I select the most frequent text value in each group?

A pivot table could be used, with a limit of 1 row in the Choice field so as to see the "Top 1" value for each Country:

1) Pivot Table

1.1) Fields

  • Row Dimension #1: Country
  • Row Dimension #2: Choice
  • Metric: Record Count

1.2) Sorting

Row #1 (sorts Country, alphabetically):

  • Field: Country
  • Order: Descending
  • Number of rows: Auto

Row #2 (sorts by COUNT, from highest to lowest):

  • Field: Record Count
  • Order: Descending
  • Number of rows: 1

2) Filter

There were 2 NULL values in the Country field, which can be hidden using the filter:

Excludes `Country` Is NULL

3) Hide Metric Column

  • One way to hide the metric column from viewers is to draw a shape, such as a rectangle, over the respective area and match its colour to the background (white in this case).
  • Another approach (the one used here) is to reduce the width of the pivot table, starting from the side of the metric (right); this method ensures that the scroll bar remains visible to users.

Additional Notes

The field names used above follow the question, which uses shortened versions of the names in the data set. The fields were renamed at the data source, so the original names in the data set stay as they are:

  • What country are you from? was renamed to Country

  • Choose one was renamed to Choice

  • Also, the Record Count field is an auto-generated field created by certain connectors, such as Google Sheets and BigQuery; it serves the function of COUNT(Field).

Finding the most frequent value of string using BigQuery

Below is for BigQuery Standard SQL

#standardSQL
SELECT
User_ID,
ARRAY_AGG(Language ORDER BY cnt DESC LIMIT 1)[OFFSET(0)] most_frequent_language
FROM (
SELECT
User_ID,
Language,
COUNT(*) AS cnt
FROM `project.dataset.language`
WHERE Language IS NOT NULL
GROUP BY User_ID, Language
)
GROUP BY User_ID

Most frequent value (mode) by group

Building on David's comments, your solution is the following:

Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}

library(dplyr)
df %>% group_by(a) %>% mutate(c=Mode(b))

Notice, though, that Mode returns only one value on ties (the tied value that appears first), so for the tie when df$a is 3, the mode for b is 1.
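
To make the tie behaviour concrete, here is a small sketch on a hypothetical vector. Mode is the function from above; all_modes is a made-up helper (not part of the original answer) that returns every tied value instead of only the first:

```r
Mode <- function(x) {
  ux <- unique(x)
  # which.max() returns the first maximum, so the earliest-seen value wins ties.
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical variant: keep all values that share the highest count.
all_modes <- function(x) {
  ux <- unique(x)
  counts <- tabulate(match(x, ux))
  ux[counts == max(counts)]
}

x <- c(1, 2, 2, 1, 3)  # hypothetical values; 1 and 2 are tied with two each
Mode(x)       # 1
all_modes(x)  # 1 2
```

Inside mutate(), a function like all_modes would not fit directly, because it can return more than one value per group; it is meant for inspecting ties.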

SQL Group-By most frequent

You can use a correlated subquery:

select distinct t1.PERSON, (
select ATE
from myTable t2
where t2.PERSON = t1.PERSON
group by ATE
order by count(*) desc
limit 1
) as ATE
from myTable t1

If you have ties, this query will pick one of the most eaten items "randomly".

With MySQL 8 or MariaDB 10.2 and later, you can use a CTE (common table expression):

with t1 as (
select PERSON, ATE, count(*) as cnt
from myTable
group by PERSON, ATE
), t2 as (
select PERSON, max(cnt) as cnt
from t1
group by PERSON
)
select *
from t1
natural join t2

On ties this query may return multiple rows per group (PERSON).

Find most frequent value in SQL column

SELECT
<column_name>,
COUNT(<column_name>) AS `value_occurrence`

FROM
<my_table>

GROUP BY
<column_name>

ORDER BY
`value_occurrence` DESC

LIMIT 1;

Replace <column_name> and <my_table>. Increase the LIMIT from 1 to N if you want to see the N most common values of the column.
