At the end of the previous session, we processed Jean-Karim’s kernels, combining them into a single graph. In this session, we will apply two clustering techniques to the integrated data: spectral clustering and link-community clustering.
As you will recall from the lecture, spectral clustering is a technique for clustering based on transformation by dimensionality reduction.
Before applying spectral clustering to our data, we will derive a simple algorithm for spectral clustering. This will allow us to see more clearly how spectral clustering relates to k-means. We will once again explore the spirals data used during the first tutorial.
The algorithm we will write is similar to the one described by Ng et al. in On Spectral Clustering: Analysis and an algorithm. The R code was obtained and adapted from João Neto.
If it is not already loaded, please load the spirals dataset, which is included in the kernlab package.
# libraries
library(pheatmap)
library(dplyr)
library(reshape2)
library(igraph)
library(linkcomm)
library(kernlab)
# data
data(spirals)
To remind you, spectral clustering appeared to give a more desirable result when clustering data sets that are poorly separated by standard k-means. Below I plot the data and the clustering result for both algorithms.
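If you would like to reproduce that comparison yourself, a minimal sketch is given below; it uses base kmeans() on the raw coordinates and kernlab's specc() with its default Gaussian kernel, so your figure may differ slightly from the original.
# k-means on the raw spiral coordinates vs. spectral clustering (kernlab::specc)
km_raw <- kmeans(spirals, centers=2, nstart=5)
sc_raw <- specc(spirals, centers=2)
par(mfrow=c(1,2))
plot(spirals, col=km_raw$cluster, pch=19, main="k-means")
plot(spirals, col=sc_raw, pch=19, main="spectral clustering")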
In what follows, we will write an algorithm to perform spectral clustering. We will see - perhaps surprisingly - that spectral clustering boils down to performing k-means on the eigenvectors of a graph Laplacian built from a similarity kernel applied to the original data.
To help you write your code, João Neto has provided several pre-rolled functions, including:
s <- function(x1, x2, alpha=1) {
  # Gaussian similarity: decays exponentially with the Euclidean (Frobenius) distance between x1 and x2
  exp(- alpha * norm(as.matrix(x1-x2), type="F"))
}
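# Quick check of what s() computes (my own illustrative calls, not part of the original
# tutorial): distant points get similarities near 0, nearby points near 1
s(c(0, 0), c(3, 4))     # distance 5 apart: exp(-5), roughly 0.0067
s(c(0, 0), c(0, 0.1))   # distance 0.1 apart: exp(-0.1), roughly 0.9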
make.affinity <- function(S, n.neighboors=2) {
  N <- length(S[,1])
  if (n.neighboors >= N) {  # fully connected
    A <- S
  } else {
    A <- matrix(rep(0,N^2), ncol=N)
    for(i in 1:N) { # for each line
      # only connect to those points with larger similarity
      best.similarities <- sort(S[i,], decreasing=TRUE)[1:n.neighboors]
      for (s in best.similarities) {
        j <- which(S[i,] == s)
        A[i,j] <- S[i,j]
        A[j,i] <- S[i,j] # to make an undirected graph, i.e., the matrix becomes symmetric
      }
    }
  }
  A
}
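To build intuition for what make.affinity does, here is a small toy illustration (the 4 x 4 similarity values are made up purely for this example): each point keeps only the edges to its 2 most similar neighbors, and the result is symmetrized.
S_toy <- matrix(c(0.0, 0.9, 0.1, 0.2,
                  0.9, 0.0, 0.3, 0.1,
                  0.1, 0.3, 0.0, 0.8,
                  0.2, 0.1, 0.8, 0.0), ncol=4)
make.affinity(S_toy, 2)   # sparse, symmetric affinity matrix keeping the 2 strongest links per point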
Your algorithm should include the steps listed below, which we will perform one at a time. If you find yourself spending more than 10-15 minutes on this section, please read the solution and move on to the following activities.
- Compute the similarity matrix, S, between all points in the spiral dataset
make.similarity <- function(my.data, similarity) {
  # pairwise similarity between all rows of my.data; the diagonal is set to 0
  N <- nrow(my.data)
  S <- matrix(rep(NA,N^2), ncol=N)
  for(i in 1:N) {
    for(j in 1:N) {
      if (i!=j) {
        S[i,j] <- similarity(my.data[i,], my.data[j,])
      } else {
        S[i,j] <- 0
      }
    }
  }
  S
}
S <- make.similarity(spirals, s)
- Calculate the affinity matrix, A, from S by applying the k-nearest neighbors algorithm
A <- make.affinity(S, 3) # use 3 neighbors (includes self)
- Compute the weighted degree diagonal matrix, D, from A by summing across each row
D <- diag(apply(A, 1, sum))
- Calculate the unnormalized Laplacian, U, by subtracting A from D
U <- D - A
- Compute eigenvectors and eigenvalues of U.
evL <- eigen(U, symmetric=TRUE)
- Perform k-means clustering on the matrix Z, whose columns are the eigenvectors corresponding to the k smallest eigenvalues of U (the smallest eigenvalue is approximately zero and its eigenvector is constant, so it carries no cluster information, but including it does not change the k-means result)
k <- 2
Z <- evL$vectors[,(ncol(evL$vectors)-k+1):ncol(evL$vectors)]
km <- kmeans(Z, centers=k, nstart=5)
Plot the eigenvector matrix (Z) as well as the original data. Are the two spiral clusters correctly identified? What does the data look like in the transformed space?
par(mfrow=c(1,2))
plot(Z, col=km$cluster, pch = 19, xlab = "Eigenvector 1", ylab = "Eigenvector 2")
plot(spirals, col=km$cluster, pch = 19, xlab = "", ylab = "")
You may have noticed a parameter or two lurking around, for example the n.neighboors parameter in the k-nearest neighbors function. These parameters would need to be adjusted depending on the data or estimated automatically. In addition, it should be noted that there are several implementations of spectral clustering, using any of a number of different Laplacians - you have just implemented a relatively simple version for the sake of clarity. For more detailed information, please refer to Ng et al., On Spectral Clustering: Analysis and an algorithm.
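If you would like to see one of those variants, below is a brief sketch in the spirit of Ng et al. that uses the symmetric normalized Laplacian instead of the unnormalized one. It reuses A, D and k from above and additionally row-normalizes the eigenvector matrix before k-means, as the paper prescribes. This is my own adaptation for illustration, not part of the original tutorial.
# symmetric normalized Laplacian: L_sym = I - D^(-1/2) A D^(-1/2)
D_inv_sqrt <- diag(1 / sqrt(diag(D)))
L_sym <- diag(nrow(A)) - D_inv_sqrt %*% A %*% D_inv_sqrt
ev_sym <- eigen(L_sym, symmetric=TRUE)
# eigenvectors for the k smallest eigenvalues (eigen() returns eigenvalues in decreasing order)
Z_sym <- ev_sym$vectors[,(ncol(ev_sym$vectors)-k+1):ncol(ev_sym$vectors)]
# row-normalize each point to unit length before k-means (Ng et al., step 4)
Z_sym <- Z_sym / sqrt(rowSums(Z_sym^2))
km_sym <- kmeans(Z_sym, centers=k, nstart=5)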
Load the combined kernel and graph produced in the preceding practical
GITHUBDIR = "http://scalefreegan.github.io/Teaching/DataIntegration/data/"
load(url(paste(GITHUBDIR, "kernel.rda", sep = "")))
Before clustering the full dataset - which can be hard to visualize and evaluate - we will apply spectral clustering on a representative subset of the graph so that we can visualize its behavior.
I have selected 4 vertices with fairly high degree (degree = 20) from the matrix, mc, and selected all of their first neighbors. The sub-matrix is called mc2.
# node degree: number of non-zero entries in each row of the kernel matrix
degree = apply(mc, 1, function(i) sum(i > 0))
# take the first 4 vertices with degree 20, plus all of their first neighbors
n_i = which(degree == 20)[1:4]
n = unique(c(n_i, which(mc[n_i, ] > 0, arr.ind = T)[, 2]))
# restrict the kernel matrix to these vertices
mc2 = mc[n, n]
Make an igraph network from the sub-matrix mc2.
Plot the graph. How many clusters do you expect based on this visualization?
g_small = graph_from_adjacency_matrix(mc2, mode = "undirected", weighted = TRUE, diag = FALSE)
g_layout = layout.fruchterman.reingold(g_small)
plot.igraph(g_small, vertex.label = NA, layout = g_layout)
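Besides eyeballing the layout, one quick objective check (my own addition, not part of the original practical) is to count the connected components of the sub-graph; a sensible clustering will not merge disconnected components, so this suggests a minimum number of clusters.
components(g_small)$no   # number of connected components in the sub-graph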
Perform spectral clustering on the sub-matrix mc2. Remember that the matrix is a kernel. Therefore, you should pass it to the corresponding functions using the function as.kernelMatrix().
Spectral clustering can be performed with the function specc from the kernlab package. Choose the number of clusters (centers) to be equal to the number you estimated above. Make an annotation data.frame, v_small, that contains a single column denoting the cluster membership of each gene. The values should be encoded as factors. The rownames of this data.frame should correspond to the rownames of mc2.
mc2_specc = specc(as.kernelMatrix(mc2), centers = 3)   # spectral clustering with 3 centers
v_small = as.data.frame(factor(mc2_specc))             # one-column data.frame of cluster assignments
rownames(v_small) = rownames(mc2)
Make an igraph network from the kernel matrix, mc2. Plot the network, coloring each node by cluster membership as contained in the data.frame, v_small.
How do the clusters correspond to those you predicted before clustering?
g_small = graph_from_adjacency_matrix(mc2, mode = "undirected", weighted = TRUE, diag = FALSE)
plot.igraph(g_small,vertex.label=NA,vertex.color=v_small[names(V(g_small)), ], layout = g_layout)
Plot a heatmap representation of the kernel matrix, mc2. Annotate each column (gene) with its corresponding cluster membership. A nice package for plotting heatmaps is the pheatmap package, which allows you to annotate both the rows and the columns of a matrix. Set the diagonal of mc2 to 0, so that the low-weighted edges are better represented.
Are there any differences you notice between this representation and the previous graph-based visualization?
diag(mc2) = 0
pheatmap(mc2[rownames(v_small)[order(v_small[,1])],rownames(v_small)[order(v_small[,1])]], annotation = v_small, show_rownames = F, show_colnames = F, cluster_rows = F, cluster_cols = F)
Having evaluated the behavior of spectral clustering on a small example, we will now apply it to the full data set.
Perform spectral clustering on the full kernel matrix, mc. As before, remember to pass it through as.kernelMatrix(). For this dataset, set the parameter centers = 17. As before, make an annotation data.frame, v, to store each of the cluster assignments.
Plot the graph, coloring the nodes according to their cluster membership
mc_specc = specc(as.kernelMatrix(mc), centers = 17)
v = as.data.frame(factor(mc_specc))
rownames(v) = rownames(mc)
# build a palette with enough distinct colors for 17 clusters
colors = c(RColorBrewer::brewer.pal(9,"Set1"), RColorBrewer::brewer.pal(9,"Set3"), RColorBrewer::brewer.pal(8,"Set2"))
plot.igraph(g, vertex.label = NA, vertex.color = v[names(V(g)), ], edge.curved = T, vertex.size = 7, palette = colors)
You will notice that I have provided you with the number of clusters (or centers) to detect in the dataset. In the following lecture and tutorial you will see how I selected this value.
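In the meantime, a quick sanity check (my own addition, not part of the original practical) is to tabulate the cluster sizes stored in the annotation data.frame v:
table(v[,1])   # number of genes assigned to each of the 17 clusters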
As an alternative to node-based clustering, we will conclude by performing link-community detection on the gene-gene similarity kernel.
As you will recall from the lecture, link-community detection inverts the clustering problem. Rather than cluster similar nodes, it instead tries to cluster similar edges.
Interestingly, given this formulation, nodes can now belong to several clusters. This is particularly useful in biological contexts, where an investigator might like to express, for example, how the role or function of a particular gene may vary across different contexts.
You can learn more about link-community detection in the original paper.
Perform link-community clustering on the full kernel matrix, mc, using the function getLinkCommunities() from the linkcomm package. To pass the kernel matrix to the link-community algorithm, you will need to format it as an n x 3 edge-list matrix, i.e., [node1, node2, weight], where n is the number of edges (connected node pairs) in the network. An easy way to get this matrix is from the igraph graph object g that we created previously, using the function as_edgelist(). Be sure to set directed = FALSE for link-community clustering.
# link communities
mc_edge = as_edgelist(g)
mc_edge = data.frame(node1 = mc_edge[,1], node2 = mc_edge[,2], weight = E(g)$weight)
g_linkcomm = getLinkCommunities(mc_edge, directed = FALSE, plot = FALSE)
Unlike spectral clustering, the link-community detection algorithm automatically selects the number of clusters. In the next lecture and practical session we will follow up on both sets of clusters and learn how the number of clusters was determined for each approach.
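If you are curious in the meantime, you can already peek at what was found. The sketch below assumes the returned linkcomm object exposes the numbers and nodeclusters elements described in the package documentation.
# total number of edges, nodes and link communities found
g_linkcomm$numbers
# since nodes may belong to several link communities, count genes with multiple memberships
membership_counts = table(g_linkcomm$nodeclusters$node)
sum(membership_counts > 1)   # genes assigned to more than one community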
For now, you can rest easy.