In this final lesson we will apply tools to visualize and interpret the community structure of our networks.
Because link communities are more appealing for visual analysis (e.g., overlapping communities), we will focus on visualization and interpretation of these clusters, although in principle the same approaches could be applied to the spectral clusters as well.
If you haven’t done so already, please load the R libraries and data from the previous practical session:
# libraries
library(pheatmap)
library(dplyr)
library(reshape2)
library(igraph)
library(linkcomm)
library(kernlab)
# data
GITHUBDIR = "http://scalefreegan.github.io/Teaching/DataIntegration/data/"
load(url(paste(GITHUBDIR, "data.rda", sep = "")))
load(url(paste(GITHUBDIR, "kernel.rda", sep = "")))
load(url(paste(GITHUBDIR, "g_linkcomm.rda", sep = "")))
Link communities can provide a richly detailed decomposition of a complex network into structured (highly connected) communities. However, it can also be more complicated to interpret the meaning of these communities, since a given node can belong to multiple communities. Here we use several tools from the linkcomm R package to identify highly connected nodes and communities in the network and visualize them. These basic tools will serve as a starting point to dive deeper into understanding the structure of these networks.
A first feature that one could assess on the network is node centrality. Oftentimes, one may be interested in nodes that are highly “central” or influential in the network. In the case of link communities, these are the nodes that belong to many communities, where each community is additionally weighted by how dissimilar it is to the other communities to which the node belongs. Formally,
\[ C_{c}(i) = \sum_{i \in j }^N \left(1-\frac{1}{m}\sum_{i \in j \cap k}^m S(j,k)\right) \]
\(N\) is the number of communities to which node \(i\) belongs, \(S(j,k)\) is the similarity between communities \(j\) and \(k\) (the Jaccard coefficient for the number of shared nodes), and \(m\) is the number of communities \(k\) paired with community \(j\) to which node \(i\) also belongs.
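As a minimal sketch of the formula above, the following base-R code computes community centrality by hand for a toy set of three overlapping communities. The list comms and the helpers jaccard and community_centrality are hypothetical names used for illustration only; they are not part of linkcomm. Returning 1 for a node that belongs to a single community is an assumption about the edge case where no paired communities exist.

```r
# Toy communities (hypothetical): each entry lists the nodes in one community
comms <- list(c1 = c("a", "b", "c"),
              c2 = c("a", "c", "d"),
              c3 = c("a", "e"))

# Jaccard coefficient for the number of shared nodes, S(j, k)
jaccard <- function(x, y) length(intersect(x, y)) / length(union(x, y))

# Community centrality of one node, following the formula term by term
community_centrality <- function(node, comms) {
  # communities j to which the node belongs (N of them)
  mine <- comms[vapply(comms, function(cm) node %in% cm, logical(1))]
  # assumption: with 0 or 1 community there is nothing to average over
  if (length(mine) <= 1) return(length(mine))
  total <- 0
  for (j in seq_along(mine)) {
    others <- mine[-j]  # the m communities k paired with j
    mean_sim <- mean(vapply(others,
                            function(k) jaccard(mine[[j]], k),
                            numeric(1)))
    total <- total + (1 - mean_sim)
  }
  total
}

toy_cc <- sapply(c("a", "b", "c"), community_centrality, comms = comms)
print(toy_cc)  # a = 2, b = 1, c = 1
```

Node a belongs to three communities that overlap only partially, so it scores highest; node c belongs to two communities that share half of their nodes, so each contributes only 0.5.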
Compute node centrality. Print the 5 nodes with the highest centrality.
Node centrality is calculated with the function getCommunityCentrality.
What are the most central genes? Why are they so central in our networks?
cc <- getCommunityCentrality(g_linkcomm, type = "commweight" )
head(sort(cc, decreasing = TRUE), 5)
ENSG00000132383 | ENSG00000049541 | ENSG00000095002 | ENSG00000116062 | ENSG00000163918 |
---|---|---|---|---|
60.75352 | 57.59503 | 57.22532 | 51.75811 | 47.34038 |
What are the nodes that are highly central? Perhaps the following provides some insight:
plot(cc, degree(g)[names(cc)], xlab = "Community Centrality", ylab = "Degree")
Compute community modularity. Print the 5 communities with the highest modularity.
Community modularity is calculated with the function getCommunityConnectedness.
Why are values for two of the communities so much higher than the others?
cm <- getCommunityConnectedness(g_linkcomm, conn = "modularity")
head(sort(cm, decreasing = TRUE), 5)
280 | 327 | 244 | 278 | 262 |
---|---|---|---|---|
0.556787 | 0.2585082 | 0.055786 | 0.0529202 | 0.0494077 |
The linkcomm R package provides functionality for visually representing link communities. In particular, it allows one to visualize all communities to which a node jointly belongs using node pies. In this case each node is represented by a pie chart that quantifies the fraction of the node’s edges belonging to each community.
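To make the node-pie idea concrete, here is a small base-R sketch that computes the pie fractions for one node from a toy edge table. The column names node1 and node2 follow the merge used later in this practical; the cluster column name and the helper pie_fractions are assumptions for illustration, and linkcomm's internal computation may differ in detail.

```r
# Toy edge table (hypothetical): one row per edge, with its community id
edges <- data.frame(node1   = c("a", "a", "a", "b", "d"),
                    node2   = c("b", "c", "d", "c", "e"),
                    cluster = c(1, 1, 2, 1, 2),
                    stringsAsFactors = FALSE)

# Fraction of a node's edges that fall in each community
pie_fractions <- function(node, edges) {
  incident <- edges[edges$node1 == node | edges$node2 == node, ]
  prop.table(table(incident$cluster))
}

pie_a <- pie_fractions("a", edges)
print(pie_a)  # community 1: 2/3 of a's edges, community 2: 1/3
```

A node whose pie has a single slice sits entirely inside one community, while a node with several slices bridges communities.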
Before plotting the whole network with this representation, which can be a bit bewildering, let’s focus on a simple subnetwork. To extract this subnetwork we will use another handy function, getClusterRelatedness. This function is particularly useful in the context of the multiscale nature of the link communities. In principle, our original edge similarity dendrogram could have been cut at multiple heights to generate communities at varying scales. The default algorithm settings choose this point to be the height that maximizes the density of the resulting communities. However, we could be justified in merging these communities into larger meta-communities. Let’s do that to find a sub-network to plot.
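The effect of cutting one dendrogram at different heights can be illustrated with a toy example in base R. This sketch uses hclust and cutree on six made-up points rather than the linkcomm edge-similarity dendrogram, so the heights used here are not comparable to the cutat values in the exercise.

```r
# Six made-up points forming three tight pairs; two of the pairs lie close together
x <- c(0, 0.1, 1, 1.1, 5, 5.2)
hc <- hclust(dist(x), method = "average")

# A low cut keeps the three tight pairs separate
fine <- cutree(hc, h = 0.5)
# A higher cut merges the two nearby pairs into one larger group
coarse <- cutree(hc, h = 1.5)

print(table(fine))    # three groups of two points
print(table(coarse))  # one group of four points and one of two
```

Merging communities into meta-communities follows the same logic: the tree is unchanged, only the height at which it is cut moves.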
Compute cluster relatedness. Cut the dendrogram using parameter cutat = 1.5 to define metacommunities.
Plot the metacommunity of size 4 as a sub-network with node.pies = TRUE.
cr = getClusterRelatedness(g_linkcomm)
c_meta = cutDendrogramAt(cr, cutat = 1.5)
c_meta_i = which(sapply(c_meta,length)==4)
plot(g_linkcomm, type = "graph", shownodesin = 0, node.pies = TRUE, vlabel.cex = 0.6, clusterids = c_meta[[c_meta_i]])
Plot the entire link community network with node.pies = TRUE.
How does this network compare to the previous sub-network?
plot(g_linkcomm, type = "graph", shownodesin = 0, node.pies = TRUE, vlabel=F)
For the remainder of the course, Cytoscape will be used to extend our analysis of the integrated kernel network and the communities derived from it by link-community detection.
Export the graph and link community annotations to Cytoscape. For the purpose of this exercise only keep the edges that are assigned to a community by link community clustering.
The community IDs can be accessed through the edges attribute of the linkcomm object, e.g., g_linkcomm$edges
In the Solution below I have also written out the evidence codes for each edge.
mc_edge = as_edgelist(g)
mc_edge = data.frame(node1 = mc_edge[,1], node2 = mc_edge[,2], weight = E(g)$weight)
linkcomm_edgelist = g_linkcomm$edges
towrite = merge(mc_edge,linkcomm_edgelist,by = c("node1","node2"))
towrite = merge(towrite, data, by.x = c("node1","node2"), by.y = c("gene1","gene2"))
write.table(towrite, file = "graph.txt", sep = "\t", quote = F, col.names=T, row.names = F)
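The filtering in the solution above relies on merge() behaving as an inner join by default: edges without a matching row in the community table are dropped. The toy sketch below shows this with made-up data frames (all_edges, comm_edges, and reversed are hypothetical names). It also shows that the match is sensitive to node order, so an edge stored as (b, a) in one table will not match (a, b) in the other.

```r
# Toy full edge list and a community table covering only two of its edges
all_edges <- data.frame(node1  = c("a", "a", "b"),
                        node2  = c("b", "c", "c"),
                        weight = c(0.9, 0.4, 0.7),
                        stringsAsFactors = FALSE)
comm_edges <- data.frame(node1   = c("a", "b"),
                         node2   = c("b", "c"),
                         cluster = c(1, 1),
                         stringsAsFactors = FALSE)

# Inner join: the edge a-c has no community and is dropped
kept <- merge(all_edges, comm_edges, by = c("node1", "node2"))
print(nrow(kept))  # 2

# The same edge written in the opposite order does not match
reversed <- data.frame(node1 = "b", node2 = "a", cluster = 1,
                       stringsAsFactors = FALSE)
lost <- merge(all_edges, reversed, by = c("node1", "node2"))
print(nrow(lost))  # 0
```

If the row count after merging is lower than expected, comparing it with the number of rows in the community table is a quick way to check whether node order is the cause.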
Before opening this network in Cytoscape, you might check whether the link community network (which is a sub-network of the entire kernel matrix) over-represents one of the original sources of information (e.g., contains only evidence from protein-protein interactions).
Is the link community network biased for a particular kind of evidence?
Metric | Full Network (fraction of edges) | Link Community Network (fraction of edges) |
---|---|---|
BP | 0.0769166 | 0.075859 |
CODA80 | 0.0231004 | 0.0464079 |
HIPPO | 0.1211711 | 0.1075413 |
PI | 0.7052801 | 0.6684516 |
TM | 0.0735319 | 0.1017403 |
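A comparison like the one in the table can be sketched in base R as follows. The vectors full_ev and sub_ev are made-up evidence labels, one per edge, and do not reproduce the course data; how the evidence codes are stored in the real data object is not shown here, so treat the layout as an assumption.

```r
# Made-up evidence labels, one per edge (not the course data)
full_ev <- c(rep("PI", 6), rep("BP", 2), rep("TM", 2))  # full network
sub_ev  <- c(rep("PI", 2), rep("TM", 2))                # link community network

# Use the same evidence levels for both networks so the rows line up
lv <- sort(unique(full_ev))
comparison <- data.frame(
  evidence = lv,
  full     = as.numeric(prop.table(table(factor(full_ev, levels = lv)))),
  linkcomm = as.numeric(prop.table(table(factor(sub_ev,  levels = lv))))
)
print(comparison)
```

Fixing the factor levels keeps evidence types that are absent from the sub-network in the table with a fraction of 0 instead of silently dropping them.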
The network is now ready to import into Cytoscape.
Analyze the link-community network. Apply techniques you learned in Matt’s first lecture to extend our analysis of the networks.
Some things to try:
Layout the network with different algorithms
Color each edge according to its link community membership
Inspect edges of related clusters. Did they originate from similar sources of information?
Use BiNGO to compute GO enrichment for obvious subnetworks.
You can find a version I made here
Thanks for participating in the course.