What Are the Differences Between Community Detection Algorithms in Igraph

What are the differences between community detection algorithms in igraph?

Here is a short summary about the community detection algorithms currently implemented in igraph:

  • edge.betweenness.community is a hierarchical decomposition process where edges are removed in the decreasing order of their edge betweenness scores (i.e. the number of shortest paths that pass through a given edge). This is motivated by the fact that edges connecting different groups are more likely to be contained in multiple shortest paths simply because in many cases they are the only option to go from one group to another. This method yields good results but is very slow because of the computational complexity of edge betweenness calculations and because the betweenness scores have to be re-calculated after every edge removal. Your graphs with ~700 vertices and ~3500 edges are around the upper size limit of graphs that can feasibly be analyzed with this approach. Another disadvantage is that edge.betweenness.community builds a full dendrogram and does not give you any guidance about where to cut the dendrogram to obtain the final groups, so you'll have to use some other measure to decide that (e.g., the modularity score of the partitions at each level of the dendrogram).

  • fastgreedy.community is another hierarchical approach, but it is bottom-up instead of top-down. It tries to optimize a quality function called modularity in a greedy manner. Initially, every vertex belongs to a separate community, and communities are merged iteratively such that each merge is locally optimal (i.e. yields the largest increase in the current value of modularity). The algorithm stops when it is not possible to increase the modularity any more, so it gives you a grouping as well as a dendrogram. The method is fast and it is the method that is usually tried as a first approximation because it has no parameters to tune. However, it is known to suffer from a resolution limit, i.e. communities below a given size threshold (depending on the number of nodes and edges if I remember correctly) will always be merged with neighboring communities.

  • walktrap.community is an approach based on random walks. The general idea is that if you perform random walks on the graph, then the walks are more likely to stay within the same community because there are only a few edges that lead outside a given community. Walktrap runs short random walks of 3-4-5 steps (depending on one of its parameters) and uses the results of these random walks to merge separate communities in a bottom-up manner like fastgreedy.community. Again, you can use the modularity score to select where to cut the dendrogram. It is a bit slower than the fast greedy approach but also a bit more accurate (according to the original publication).

  • spinglass.community is an approach from statistical physics, based on the so-called Potts model. In this model, each particle (i.e. vertex) can be in one of c spin states, and the interactions between the particles (i.e. the edges of the graph) specify which pairs of vertices would prefer to stay in the same spin state and which ones prefer to have different spin states. The model is then simulated for a given number of steps, and the spin states of the particles in the end define the communities. The consequences are as follows: 1) There will never be more than c communities in the end, although you can set c to as high as 200, which is likely to be enough for your purposes. 2) There may be fewer than c communities in the end, as some of the spin states may become empty. 3) It is not guaranteed that nodes in completely remote (or disconnected) parts of the network have different spin states. This is more likely to be a problem for disconnected graphs only, so I would not worry about that. The method is not particularly fast and not deterministic (because of the simulation itself), but it has a tunable resolution parameter that determines the cluster sizes. A variant of the spinglass method can also take into account negative links (i.e. links whose endpoints prefer to be in different communities).

  • leading.eigenvector.community is a top-down hierarchical approach that optimizes the modularity function again. In each step, the graph is split into two parts in a way that the separation itself yields a significant increase in the modularity. The split is determined by evaluating the leading eigenvector of the so-called modularity matrix, and there is also a stopping condition which prevents tightly connected groups from being split further. Due to the eigenvector calculations involved, it might not work on degenerate graphs where the ARPACK eigenvector solver is unstable. On non-degenerate graphs, it is likely to yield a higher modularity score than the fast greedy method, although it is a bit slower.

  • label.propagation.community is a simple approach in which every node is assigned one of k labels. The method then proceeds iteratively and re-assigns labels to nodes in a way that each node takes the most frequent label of its neighbors in a synchronous manner. The method stops when the label of each node is one of the most frequent labels in its neighborhood. It is very fast but yields different results based on the initial configuration (which is decided randomly), therefore one should run the method a large number of times (say, 1000 times for a graph) and then build a consensus labeling, which could be tedious.

igraph 0.6 will also include the state-of-the-art Infomap community detection algorithm, which is based on information theoretic principles; it tries to build a grouping which provides the shortest description length for a random walk on the graph, where the description length is measured by the expected number of bits per vertex required to encode the path of a random walk.

Anyway, I would probably go with fastgreedy.community or walktrap.community as a first approximation and then evaluate other methods when it turns out that these two are not suitable for a particular problem for some reason.
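
To make that workflow concrete, here is a minimal R sketch on the Zachary karate club graph as a stand-in for your own network (newer igraph releases spell these functions cluster_fast_greedy(), cluster_walktrap() and so on; they are the same algorithms as the dotted names above):

library(igraph)

g <- make_graph("Zachary")          # small demo graph; substitute your own

fc <- cluster_fast_greedy(g)        # fastgreedy.community in older igraph
wc <- cluster_walktrap(g)           # walktrap.community in older igraph

modularity(fc)                      # quality of the default cut
modularity(wc)
sizes(fc)                           # community sizes at the default cut
membership(fc)                      # community label of each vertex

# Both methods return a dendrogram, so you can also cut it yourself
# at a chosen number of groups and score that cut with modularity
m4 <- cut_at(fc, no = 4)
modularity(g, m4)

# Slower / stochastic alternatives if the two above are not good enough
eb <- cluster_edge_betweenness(g)   # edge.betweenness.community
sg <- cluster_spinglass(g)          # spinglass.community (not deterministic)
compare(fc, wc, method = "nmi")     # similarity of two partitions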

R: K Means Clustering vs Community Detection Algorithms (Weighted Correlation Network) - Have I overcomplicated this question?

(First, some background to pin down the nature of the problem from what you describe.) You have two datasets and therefore produce two data structures: Personal_Information and Relationship_Information. You have a set of entities which appear to be unique, since there are no repeated names in Personal_Information. Because you also have connectivity information between these entities, we can treat them as nodes in a network, and their interconnectivity produces a network containing communities that a community detection algorithm can uncover/allocate/detect. So,

  • Personal_Information, describes each person (node)
  • Relationship_Information, describes their connectivity/relationship (edges)

In the example usage you supplied in your code, you appear to be using only the graph data built from Personal_Information (res.cor <- Personal_Information[, c(2:4)] %>% ...) and not Relationship_Information. This means that you are building relationships between the variables of each person, which are intrinsic to them as nodes, rather than using the data they have produced as a result of their connected interactions. Your direction is like saying: "I am going to produce a network between the personality traits of people and ignore their associations with each other, even though I have that data. I am going to look at how those personality features correlate with each other and then see which groups of features have values that follow each other (correlate in groups)."

So finding the correlations between the features of nodes (persons) over multiple people is fine, producing a matrix of that information is also OK, and producing a graph/network from that is also OK. What you obtain from that graph (you call it graph), via fc <- fastgreedy.community(graph), is which groups of variables are co-correlated across people. E.g., var1 and var2 have a strong positive correlation between themselves, but var2 and var3 have a strong negative correlation, so the edge between var2 and var3 will push them into separate communities and also push var1 into a separate community from var3, since var1 is tied strongly to var2 (a close friend). How is this information useful? It can help you understand how the variables exist as groups, so that if a new person has a low value of var2 and you don't know their var1 or var3, you'd expect var1 to be low as well and var3 to be high. If you took the covariance of the person data, you could take the eigenvectors and effectively do PCA, which gives you vectors carrying information of this nature.
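
To make the distinction concrete, here is a hedged R sketch with made-up trait data (the variable names, the number of people and the 0.3 threshold are arbitrary assumptions, not taken from your spreadsheet); it reproduces the kind of analysis your current code performs, i.e. clustering the variables by their correlations:

library(igraph)

# Made-up stand-in for the numeric columns of Personal_Information
set.seed(1)
n <- 50
b1 <- rnorm(n); b2 <- rnorm(n)
traits <- data.frame(
  var1 = b1 + rnorm(n, sd = 0.3),
  var2 = b1 + rnorm(n, sd = 0.3),   # moves together with var1
  var3 = b2 + rnorm(n, sd = 0.3),
  var4 = b2 + rnorm(n, sd = 0.3)    # moves together with var3
)

# Correlations between the *variables*, not between the people
cor_mat <- cor(traits)
adj <- abs(cor_mat)
adj[adj < 0.3] <- 0                 # keep only the stronger correlations
diag(adj) <- 0                      # no self-loops

graph <- graph_from_adjacency_matrix(adj, mode = "undirected", weighted = TRUE)
fc <- cluster_fast_greedy(graph)    # fastgreedy.community in older igraph
membership(fc)                      # which variables co-vary as a group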

But this is not producing information about the network edges you observed/measured in your Relationship_Information data, which describes the community (edge) information rather than the node data. That dataset looks like an adjacency list: a data structure whose first two columns give the source node (col1) and destination node (col2), with the edge weight in col3; if the same pair of names appears with col1 and col2 swapped and the same edge weight, the network has symmetric edges (undirected), otherwise it is directed. Since your data has two edge columns (col3 and col4), you can either produce one network with col1,col2,col3 and another with col1,col2,col4, or... you can produce a network with

  • adj_list1 = col1, col2, (col3 - col4), i.e. using the variable names in that spreadsheet: adj_list1 = name1, name2, (how_much_they_paid_each_other - how_much_they_owe_each_other)
  • or adj_list2 = col1, col2, (col3 / col4), i.e. adj_list2 = name1, name2, (how_much_they_paid_each_other / how_much_they_owe_each_other)

and it is up to you how you define an edge from those values. You want to produce a network from adj_list1 or adj_list2 and then apply community detection to that network. Think of the payments in that dataset as interactions connecting people together, similar to likes and mentions on social media. The community results then give labels for the groups that are economically tied, according to which edge definition you use, and you can apply an algorithm like the Louvain algorithm to do so.
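
A hedged sketch of that step, with invented names and amounts (the columns paid and owed stand in for col3, how_much_they_paid_each_other, and col4, how_much_they_owe_each_other):

library(igraph)

# Invented stand-in for Relationship_Information (an adjacency/edge list)
rel <- data.frame(
  name1 = c("ann", "ann", "bob", "cat", "dan"),
  name2 = c("bob", "cat", "cat", "dan", "ann"),
  paid  = c(10, 7, 5, 8, 6),        # col3
  owed  = c( 2, 3, 1, 2, 4)         # col4
)

# One possible edge definition: the difference of the two amounts.
# Note that cluster_louvain() expects non-negative weights, so check the
# sign of this difference, or use the ratio col3/col4 suggested above.
rel$weight <- rel$paid - rel$owed

g <- graph_from_data_frame(rel[, c("name1", "name2", "weight")], directed = FALSE)

lc <- cluster_louvain(g)            # Louvain on the weighted payment network
membership(lc)                      # which people are economically tied together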

But this does not use both the nodes' data and the edge data (person data and exchange data). They are answering different questions.

Applying K-Means to the node feature data is answering a different question from the community detection algorithm.

  • K-Means: the variable values for each person are not evenly distributed; they are concentrated in K dense regions, with only sparse samples in between. So we have "types" of people.
  • Community detection: ignoring what the people's features are, group people by their interactions to see how many groups there are, i.e. whether people who exchange money do so mostly within a sub-group.

So those questions are independent, as clustering and community detection here use independently collected datasets; the spreadsheets do not depend upon each other, nor does the data. This does not mean that the data carry no overlapping information: the node features can certainly affect the edges. But when presenting it, you have two separate investigations.
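
A short self-contained sketch of the two separate investigations side by side (all names, trait values and payments below are invented for illustration):

library(igraph)
set.seed(2)

# Invented node attributes: three numeric traits for six people
people <- c("ann", "bob", "cat", "dan", "eve", "fred")
traits <- matrix(rnorm(18), nrow = 6, dimnames = list(people, paste0("var", 1:3)))

# Question 1 (K-means on node features): are there K "types" of people?
km <- kmeans(scale(traits), centers = 2)
km$cluster

# Question 2 (community detection on edges): who transacts with whom?
pay <- data.frame(from   = c("ann", "bob", "ann", "dan", "eve", "dan"),
                  to     = c("bob", "cat", "cat", "eve", "fred", "fred"),
                  weight = c(5, 4, 6, 3, 7, 5))
g <- graph_from_data_frame(pay, directed = FALSE, vertices = data.frame(name = people))
membership(cluster_louvain(g))

# The two labelings need not agree: they answer different questions.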

(Another answer above mentions fusion-based approaches that analyze the node data and the edge data together, but it does not appear that this is your question. Are you trying to use both datasets together? If so, the easiest route is an approach with good implementations available, namely graph neural networks; the SGC (simple graph convolutional network) is a good recommendation. Although it sounds intimidating, you would feed it the adjacency matrix made from the payment network together with the node attributes/features. Python's DGL library is great for this, and you can use it unsupervised with scaled data if you want.)

meaning of weights in community detection algorithms

My answer is going to be based on the igraph package in R. The situation is indeed quite confusing and the questions are relevant since, as Newman (2004) says,

Since the publication of that work, the author has been asked a number of times whether an appropriate generalization of the algorithm exists for weighted networks.

In his paper he derives an appropriate generalization of the Newman-Girvan algorithm to weighted networks.

Weights

You are right about the interpretation of weights in the Newman-Girvan algorithm. edge_betweenness uses a formula analogous to that in (Brandes, 2001), where the length of a path is defined as the sum of the weights of its edges. (You may also check the source code but it's quite involved). In ?edge_betweenness and, in particular, ?cluster_edge_betweenness it says

Edge weights are used to calculate weighted edge betweenness. This means that edges are interpreted as distances, not as connection strengths.

The implications are as follows. Let b(e, w) be the edge betweenness of an edge e with weight w. Then it can be shown (I could elaborate if you wish) that

b(e, w) <= b(e, w*) if and only if w >= w*.

That is, the edge betweenness and the weight of e are inversely related. The main idea is that if the weight increases from w to, say, w* >> w, then the shortest paths that were crossing e are likely to be dominated by other paths that do not include e. Hence, larger weight implies (weakly) lower betweenness, and lower betweenness makes it less likely that e will be recognized as an edge connecting two communities. This sounds strange if we see weights as distances: the most "distant" (weakly connected) pairs become the least likely to be separated. On the other hand, if e is within some community and we decrease its weight, then the number of shortest paths through that edge potentially increases and it becomes more likely to be seen as connecting two communities. I am not yet claiming anything about the corresponding modularity scores, though.
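
A tiny R illustration of this inverse relationship, using a 4-cycle so that every pair of opposite vertices has two alternative routes:

library(igraph)

g <- make_ring(4)            # cycle 1-2-3-4-1
E(g)$weight <- 1
edge_betweenness(g)          # weights are read as distances; all edges equal here

E(g)$weight[1] <- 10         # make the edge 1-2 much "longer"
edge_betweenness(g)          # its betweenness drops (to 0 in this small example)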

Now let's suppose that the weights actually correspond to connection strengths. Then the stronger the connection is, the fewer shortest paths go through that edge (because the shortest paths are still computed with the weights read as lengths), the lower its edge betweenness is, and the less likely it is to be removed. So that kind of makes sense.

What is not nice, or rather strange, is that now the length of a path is defined as the sum of its connection strengths. However, we can reinterpret the algorithm then. Suppose that the weights are >> 1 within communities and << 1 between them. Then we can interpret the length of a path as the privacy of this path (e.g., a path within a community would contain lots of close interactions, while the edge connecting two communities is somewhat public, open). Given such an interpretation, the algorithm would look for the least private / the most open paths and compute the corresponding betweenness. Then we would be removing such edges that belong to many most open paths.

So perhaps I made a mistake somewhere, but it looks like it would make more sense to see weights as connection strengths.

Newman (2004) does something related:

...we will consider specifically those networks in which the weights on edges take greater values for vertex pairs that have closer connections or are more similar in some way.

It would seem that this should make sense. However, so as to keep the more natural definition of a shortest path, he writes:

One can define paths on a weighted network by assuming the "length" of an edge to vary inversely with its weight, so that two vertices that are connected twice as strongly will be half as far apart.

That is, the shortest path lengths are now inversely related to the weights. Whereas not inverting the weights seemed to make sense above, with this inverse definition we now have a problem:

To see this, notice that any two vertices that are particularly strongly connected to one another will have a particularly short distance along the edge between them. Geodesic paths will thus, all other things being equal, prefer to flow along such an edge than along another longer edge between two less well connected vertices, and hence closely connected pairs will tend to attract a lot of paths and acquire high betweenness. This means that, as a general rule, we are more likely to remove edges between well connected pairs than we are between poorly connected pairs, and this is the precise opposite of what we would like the algorithm to do.

Which is the result that I described when we see weights as distances. As I mentioned at the beginning of the answer, to deal with this Newman (2004) proposes mapping weighted graphs to unweighted multigraphs and then proceeding very similarly to the standard case. I believe that this multigraph idea can be implemented by setting weighted = NULL while supplying a non-binary (integer) adjacency matrix when defining the graph; see weighted in ?graph_from_adjacency_matrix.
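
For what it is worth, a small sketch of that multigraph reading (the matrix entries are made-up integers standing for connection strengths):

library(igraph)

# Integer "strength" matrix: entry (i, j) = strength of the tie between i and j
A <- matrix(c(0, 3, 1,
              3, 0, 0,
              1, 0, 0), nrow = 3, byrow = TRUE)

# weighted = NULL (the default): entries are read as edge *counts*, which gives
# an unweighted multigraph with 3 parallel edges between vertices 1 and 2
g_multi <- graph_from_adjacency_matrix(A, mode = "undirected", weighted = NULL)
ecount(g_multi)   # 4 edges in total

# weighted = TRUE: one edge per pair, with a weight attribute instead
g_w <- graph_from_adjacency_matrix(A, mode = "undirected", weighted = TRUE)
E(g_w)$weight     # 3 and 1

# One could then run the unweighted Newman-Girvan clustering on g_multi.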

Modularity

First of all, one can use modularity with weighted graphs, as Newman (2004) does; that's not a problem. In general, it's not so obvious how using weights affects modularity as the way to choose the number of communities. I'll perhaps add some examples with R. It seems there should be an improvement over the unweighted case, as Newman (2004) finds, when the interpretation of the weights is in line with the way the algorithm works. Otherwise, I think the graph structure and the weights themselves can matter a lot for how far from the truth we end up.
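
In that spirit, a minimal R example of the weighted modularity computation itself (the graph, weights and partition are made up):

library(igraph)

# A 6-cycle with two heavy triples joined by two light edges
g <- make_ring(6)
E(g)$weight <- c(5, 5, 1, 5, 5, 1)   # edges 1-2, 2-3, 3-4, 4-5, 5-6, 6-1
memb <- c(1, 1, 1, 2, 2, 2)          # put {1,2,3} and {4,5,6} together

modularity(g, memb, weights = rep(1, ecount(g)))   # unweighted (all weights equal)
modularity(g, memb, weights = E(g)$weight)         # weighted version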

References

Newman, M.E., 2004. Analysis of weighted networks. Physical Review E, 70(5).

Brandes, U., 2001. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2), pp. 163-177.

Community detection and comparison in igraph


  1. See the paper referenced in the documentation: http://igraph.sourceforge.net/doc/python/igraph.Graph-class.html#community_multilevel

  2. See http://igraph.sourceforge.net/doc/python/igraph.clustering-module.html#compare_communities

Different result of community detection function in igraph R and Python

The R interface automatically uses the weights from the weight edge attribute when it calculates the community structure and the modularity, while the Python interface doesn't. E.g., in Python:

>>> g = load("test.ncol", directed=False)
>>> g.community_multilevel().q
0.4974489795918367
>>> g.community_multilevel(weights=g.es["weight"]).q
0.2845496103894415
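
For the R side, a short sketch of how to reproduce both numbers (assuming the same test.ncol file from the question; cluster_louvain() is the newer name of multilevel.community()):

library(igraph)

g <- read_graph("test.ncol", format = "ncol", directed = FALSE)

# R picks up the 'weight' edge attribute automatically
modularity(cluster_louvain(g))

# To get the unweighted result that Python gives by default, drop the weights
modularity(cluster_louvain(g, weights = NA))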

Community detection for large, directed graphs

You can use the Leiden algorithm from the Python package leidenalg, which should be even faster than the Louvain algorithm that you mention. The package works on a multitude of different networks, including directed networks, but also multiplex networks and networks with negative links. In addition, it supports a range of different quality functions. It should easily scale to networks with millions of nodes (provided the network fits in memory, of course), with runtimes of usually a couple of minutes at most.

Disclaimer: I am the author of the package (and a number of related publications).


