Find Most Common Elements in Array with a Group By

Find most common elements in array with a group by

Quick and dirty:

SELECT group_name, color, count(*) AS ct
FROM (
   SELECT group_name, unnest(favorite_colors) AS color
   FROM tbl
   ) sub
GROUP BY 1,2
ORDER BY 1,3 DESC;
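
For reference, everything below assumes a table shaped roughly like this; the definition itself is not in the original answer, only the column names are taken from the queries:

CREATE TABLE tbl (
  group_name      text,
  favorite_colors text[]  -- the array column being unnested
);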

Better with a LATERAL JOIN

In Postgres 9.3 or later this is the cleaner form:

SELECT group_name, color, count(*) AS ct
FROM tbl t, unnest(t.favorite_colors) AS color
GROUP BY 1,2
ORDER BY 1,3 DESC;

The above is shorthand for:

...
FROM tbl t
JOIN LATERAL unnest(t.favorite_colors) AS color ON TRUE
...

Like any other INNER JOIN, it excludes rows without a color (favorite_colors IS NULL), just as the first query did.

To include such rows in the result, use a LEFT JOIN instead:

SELECT group_name, color, count(*) AS ct
FROM tbl t
LEFT JOIN LATERAL unnest(t.favorite_colors) AS color ON TRUE
GROUP BY 1,2
ORDER BY 1,3 DESC;
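
With the LEFT JOIN, a row whose favorite_colors IS NULL now survives as a (group_name, NULL) group with ct = 1, because count(*) counts the null-extended row. If you would rather report such groups with a zero count, a minimal variation (my sketch, not part of the original answer) is to count the joined column instead:

SELECT group_name, color, count(color) AS ct
FROM tbl t
LEFT JOIN LATERAL unnest(t.favorite_colors) AS color ON TRUE
GROUP BY 1,2
ORDER BY 1,3 DESC;

count(color) ignores NULL, so groups without any colors show ct = 0.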

You can easily aggregate the "most common" colors per group in the next step, but you'd need to define "most common colors" first ...

Most common colors

Per the comments, pick colors with more than 3 occurrences.

SELECT t.group_name, color, count(*) AS ct
FROM tbl t, unnest(t.favorite_colors) AS color
GROUP BY 1,2
HAVING count(*) > 3
ORDER BY 1,3 DESC;
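
If "most common" should instead mean the top N colors per group (say, the top 3) rather than an absolute threshold, a window-function sketch (the requirement here is my assumption, not from the original question):

SELECT group_name, color, ct
FROM (
   SELECT group_name, color, count(*) AS ct
        , row_number() OVER (PARTITION BY group_name ORDER BY count(*) DESC) AS rn
   FROM tbl t, unnest(t.favorite_colors) AS color
   GROUP BY 1,2
   ) sub
WHERE rn <= 3
ORDER BY 1,3 DESC;

Window functions run after aggregation in Postgres, so count(*) can appear inside the OVER clause's ORDER BY.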

To aggregate the top colors in an array (in descending order):

SELECT group_name, array_agg(color) AS top_colors
FROM (
   SELECT group_name, color
   FROM tbl t, unnest(t.favorite_colors) AS color
   GROUP BY 1,2
   HAVING count(*) > 3
   ORDER BY 1, count(*) DESC
   ) sub
GROUP BY 1;

-> SQLfiddle demonstrating all.
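
As a variation (mine, not covered by the fiddle above): Postgres also accepts an ORDER BY inside aggregate functions, so the sorting can move into array_agg() itself:

SELECT group_name, array_agg(color ORDER BY ct DESC) AS top_colors
FROM (
   SELECT group_name, color, count(*) AS ct
   FROM tbl t, unnest(t.favorite_colors) AS color
   GROUP BY 1,2
   HAVING count(*) > 3
   ) sub
GROUP BY 1;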

Finding most common elements of column of arrays in Presto

You can use unnest() to flatten the array, and then GROUP BY to group the unique values.

First, a query to generate a data set matching your case; you can replace this part with your own SELECT in the final query:

with dataset AS (
  SELECT ARRAY[
    ARRAY['A','B','C'],
    ARRAY['A','B'],
    ARRAY['A','D']
  ] AS data
)
select dt from dataset
CROSS JOIN UNNEST(data) AS t(dt)

Output:

    dt
-----------
 [A, B, C]
 [A, B]
 [A, D]

Now, in the final query, we first flatten this data to pull every value out of every row, and then group those values to get the unique values and their counts.

FINAL QUERY:

with dataset AS (
  SELECT ARRAY[
    ARRAY['A','B','C'],
    ARRAY['A','B'],
    ARRAY['A','D']
  ] AS data
),
da AS (
  select dt from dataset
  CROSS JOIN UNNEST(data) AS t(dt)
)
select daVal, count(*) AS cnt from da
CROSS JOIN UNNEST(dt) AS t(daVal)
GROUP BY daVal
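
For the sample dataset above the counts follow directly from the data ('A' appears in all three inner arrays; row order of a GROUP BY result is not guaranteed):

daVal | cnt
------+-----
A     | 3
B     | 2
C     | 1
D     | 1

Presto also ships a flatten() function for array(array(T)), which collapses the array-of-arrays in one step and saves the outer CTE; a sketch of the same query using it:

with dataset AS (
  SELECT ARRAY[
    ARRAY['A','B','C'],
    ARRAY['A','B'],
    ARRAY['A','D']
  ] AS data
)
select daVal, count(*) AS cnt from dataset
CROSS JOIN UNNEST(flatten(data)) AS t(daVal)
GROUP BY daVal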

Select last common element in arrays

Here is one approach using conditional aggregation.

select max(x.pid) last_common_id
from commits c
cross join lateral unnest(c.parent_commit_ids) with ordinality x(pid, rn)
where c.id in (1, 2)
group by x.rn
having max(x.pid) filter(where c.id = 1) = max(x.pid) filter(where c.id = 2)
order by rn desc limit 1

This assumes that you want to compare commits 1 and 2. The idea is to unnest each array with its ordinality, group by the element index, and use the HAVING clause to keep only positions where both commits have the same parent id. You can then sort by position descending and take the top match.

Demo on DB Fiddle

Sample data (taken from the comments of the question):

 id | parent_commit_ids
----+-------------------
  1 | {1,4,6,8}
  2 | {1,4,5}

Results:

 last_common_id
----------------
              4

Side note: commit is a SQL keyword, hence not a good choice for a table name; I use commits instead.
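
For reference, a minimal setup script for the demo; the column types are my assumption, only the values appear in the original:

CREATE TABLE commits (
  id                int,
  parent_commit_ids int[]
);

INSERT INTO commits VALUES
  (1, '{1,4,6,8}'),
  (2, '{1,4,5}');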

Algorithm to find most common elements in different arrays

I am assuming that the "distinct elements" do not have to actually be distinct; they can repeat in the final solution. That is, if presented with [1], [2], [1], the obvious answer [1, 2, 1] is allowed, but we'd count it as having 3 distinct elements.

If so, then here is a Python solution:

def find_best_run(first_array, *argv):
    # Initialize: the best run ending at each value of the first array
    # is that single value on its own.
    this_array_best_run = {}
    for x in first_array:
        this_array_best_run[x] = (1, (1,), (x,))

    for this_array in argv:
        # Find the best runs ending at each value in this_array.
        last_array_best_run = this_array_best_run
        this_array_best_run = {}

        for x in this_array:
            for y, pattern in last_array_best_run.items():
                (distinct_count, lengths, elements) = pattern
                if x == y:
                    # Same element continues: extend the current run.
                    lengths = lengths[:-1] + (lengths[-1] + 1,)
                else:
                    # New distinct element: start a new run.
                    distinct_count += 1
                    lengths = lengths + (1,)
                    elements = elements + (x,)

                if x not in this_array_best_run:
                    this_array_best_run[x] = (distinct_count, lengths, elements)
                else:
                    (prev_count, prev_lengths, prev_elements) = this_array_best_run[x]
                    # Minimize distinct_count first, then maximize lengths.
                    if distinct_count < prev_count or (
                            distinct_count == prev_count and prev_lengths < lengths):
                        this_array_best_run[x] = (distinct_count, lengths, elements)

    # Find the best overall run.
    best_count = len(argv) + 10  # Needs to be bigger than any possible answer.
    best_lengths = best_elements = None
    for (distinct_count, lengths, elements) in this_array_best_run.values():
        if distinct_count < best_count:
            best_count = distinct_count
            best_lengths = lengths
            best_elements = elements
        elif distinct_count == best_count and best_lengths < lengths:
            best_lengths = lengths
            best_elements = elements

    # Convert it into a more normal representation.
    answer = []
    for (length, element) in zip(best_lengths, best_elements):
        answer.extend([element] * length)

    return answer


# example
print(find_best_run(
    [1, 4, 8, 10],
    [1, 2, 3, 4, 11, 15],
    [2, 4, 20, 21],
    [2, 30]))  # prints [4, 4, 4, 30]

Here is an explanation. The last_array_best_run and this_array_best_run dictionaries have keys which are elements in the current array, and values which are tuples (distinct_count, lengths, elements). We are trying to minimize distinct_count, then maximize lengths (lengths is a tuple, so this prefers the run with the largest value in the first spot), and we track elements for the end. At each step I construct every possible run that combines a run up to the previous array with this element next in sequence, and keep whichever is best for the current array. When I get to the end I pick the best possible overall run, then turn it into a conventional representation and return it.

If you have N arrays of length M, this should take O(N*M*M) time to run.

How to choose the most common value in a group related to other group in R?

Another dplyr strategy using count and slice:

library(dplyr)

DATA %>%
  group_by(ID) %>%
  count(VAR, CATEGORY) %>%
  slice(which.max(n)) %>%
  select(-n)

     ID VAR   CATEGORY
  <dbl> <chr> <chr>
1     1 A     ANE
2     2 C     BOA
3     3 E     CAT
4     4 F     DOG

How to group by common element in array?

Include graphframes (the latest officially supported Spark version is 2.1, but it should work with 2.2 as well; if you use a newer version, you'll have to build your own with the 2.3 patch), replacing XXX with your Spark version and YYY with your Scala version:

spark.jars.packages  graphframes:graphframes:0.5.0-sparkXXX-s_YYY

Explode the keys to build the edges:

import org.apache.spark.sql.functions._

val df = Seq(
  (Seq("k1", "k2"), "v1"), (Seq("k2"), "v2"),
  (Seq("k3", "k2"), "v3"), (Seq("k4"), "v4")
).toDF("key", "value")

val edges = df.select(
  explode($"key") as "src", $"value" as "dst")

Convert to graphframe:

import org.graphframes._

val gf = GraphFrame.fromEdges(edges)

Set checkpoint directory (if not set):

import org.apache.spark.sql.SparkSession

val path: String = ???
val spark: SparkSession = ???
spark.sparkContext.setCheckpointDir(path)

Find connected components:

val components = gf.connectedComponents.setAlgorithm("graphx").run

Join result with input data:

val result = components
  .where($"id".startsWith("v"))
  .toDF("value", "group")
  .join(df, Seq("value"))

Check result:

result.show

// +-----+------------+--------+
// |value| group| key|
// +-----+------------+--------+
// | v3|489626271744|[k3, k2]|
// | v2|489626271744| [k2]|
// | v4|532575944704| [k4]|
// | v1|489626271744|[k1, k2]|
// +-----+------------+--------+

Finding the most frequent/common element in a collection?

I have to say that:

list.groupBy(identity).mapValues(_.size).maxBy(_._2)._1

Or just:

list.groupBy(identity).maxBy(_._2.size)._1

Doesn't really seem like that much work to me.

If you're worried about the overhead of building up the lists for each value when you only need counts, you could do the following:

list.foldLeft(Map.empty[Int, Int].withDefaultValue(0)) {
  case (m, v) => m.updated(v, m(v) + 1)
}.maxBy(_._2)._1

Or even keep track of the maximum as you go, to avoid the extra traversal at the end:

list.foldLeft(
  (Map.empty[Int, Int].withDefaultValue(0), -1 -> Double.NegativeInfinity)
) {
  case ((m, (maxV, maxCount)), v) =>
    val count = m(v) + 1
    if (count > maxCount) (m.updated(v, count), v -> count)
    else (m.updated(v, count), maxV -> maxCount)
}._2._1

This is obviously much less readable than the one-liners above, though, so I'd recommend sticking with them unless you can show (i.e., with benchmarking, not speculation) that they're a bottleneck in your application.


