At the end of the previous session, we processed Jean-Karim’s kernels, combining them into a single graph. In this session, we will apply two clustering techniques to the integrated data: spectral clustering and link-community clustering.
As you will recall from the lecture, spectral clustering is a technique for clustering based on transformation by dimensionality reduction.
Before applying spectral clustering to our data, we will derive a simple algorithm for spectral clustering. This will allow us to see more clearly how spectral clustering relates to k-means. We will once again explore the spirals data used during the first tutorial.
The algorithm we will write is similar to the one described by Ng et al. in On Spectral Clustering: Analysis and an algorithm. The R code was obtained and adapted from João Neto.
If it is not already loaded, please load the spirals dataset, which is included in the kernlab package.
# libraries
library(pheatmap)
library(dplyr)
library(reshape2)
library(igraph)
library(linkcomm)
library(kernlab)
# data
data(spirals)
To remind you, spectral clustering appeared to give a more desirable result when clustering data sets that are poorly separated by standard k-means. Below I plot the data and the clustering result for both algorithms.
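If you would like to reproduce that comparison yourself, a minimal sketch is given below; it uses base kmeans() on the raw coordinates and kernlab's specc() with its default Gaussian kernel, so your figure may differ slightly from the original.
# k-means on the raw spiral coordinates vs. spectral clustering (kernlab::specc)
km_raw <- kmeans(spirals, centers=2, nstart=5)
sc_raw <- specc(spirals, centers=2)
par(mfrow=c(1,2))
plot(spirals, col=km_raw$cluster, pch=19, main="k-means")
plot(spirals, col=sc_raw, pch=19, main="spectral clustering")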
In what follows, we will write an algorithm to perform spectral clustering. We will see - perhaps surprisingly - that spectral clustering boils down to performing k-means on the eigenvectors of a graph Laplacian built from a similarity kernel applied to the original data.
To help you write your code, João Neto has provided several pre-rolled functions, including:
s <- function(x1, x2, alpha=1) {
  # Gaussian similarity: decays exponentially with the Euclidean (Frobenius) distance between x1 and x2
  exp(- alpha * norm(as.matrix(x1-x2), type="F"))
}
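# Quick check of what s() computes (my own illustrative calls, not part of the original
# tutorial): distant points get similarities near 0, nearby points near 1
s(c(0, 0), c(3, 4))     # distance 5 apart: exp(-5), roughly 0.0067
s(c(0, 0), c(0, 0.1))   # distance 0.1 apart: exp(-0.1), roughly 0.9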
make.affinity <- function(S, n.neighboors=2) {
  N <- length(S[,1])
  if (n.neighboors >= N) {  # fully connected
    A <- S
  } else {
    A <- matrix(rep(0,N^2), ncol=N)
    for(i in 1:N) { # for each line
      # only connect to those points with larger similarity
      best.similarities <- sort(S[i,], decreasing=TRUE)[1:n.neighboors]
      for (s in best.similarities) {
        j <- which(S[i,] == s)
        A[i,j] <- S[i,j]
        A[j,i] <- S[i,j] # to make an undirected graph, i.e., the matrix becomes symmetric
      }
    }
  }
  A
}
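To build intuition for what make.affinity does, here is a small toy illustration (the 4 x 4 similarity values are made up purely for this example): each point keeps only the edges to its 2 most similar neighbors, and the result is symmetrized.
S_toy <- matrix(c(0.0, 0.9, 0.1, 0.2,
                  0.9, 0.0, 0.3, 0.1,
                  0.1, 0.3, 0.0, 0.8,
                  0.2, 0.1, 0.8, 0.0), ncol=4)
make.affinity(S_toy, 2)   # sparse, symmetric affinity matrix keeping the 2 strongest links per point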
Your algorithm should include the steps listed below, which we will perform one at a time. If you find yourself spending more than 10-15 minutes on this section, please read the solution and move on to the following activities.
- Compute the similarity matrix, S, between all points in the spiral dataset
make.similarity <- function(my.data, similarity) {
  # pairwise similarity between all rows of my.data; the diagonal is set to 0
  N <- nrow(my.data)
  S <- matrix(rep(NA,N^2), ncol=N)
  for(i in 1:N) {
    for(j in 1:N) {
      if (i!=j) {
        S[i,j] <- similarity(my.data[i,], my.data[j,])
      } else {
        S[i,j] <- 0
      }
    }
  }
  S
}
S <- make.similarity(spirals, s)
- Calculate the affinity matrix, A, from S by applying the k-nearest neighbors algorithm
A <- make.affinity(S, 3) # use 3 neighbors (includes self)
- Compute the weighted degree diagonal matrix, D, from A by summing across each row
D <- diag(apply(A, 1, sum))
- Calculate the unnormalized Laplacian, U, by subtracting A from D
U <- D - A
- Compute eigenvectors and eigenvalues of U.
evL <- eigen(U, symmetric=TRUE)
- Perform k-means clustering on the matrix Z, whose columns are the eigenvectors corresponding to the k smallest eigenvalues of U (the smallest eigenvalue is approximately zero and its eigenvector is constant, so it carries no cluster information, but including it does not change the k-means result)
k <- 2
Z <- evL$vectors[,(ncol(evL$vectors)-k+1):ncol(evL$vectors)]
km <- kmeans(Z, centers=k, nstart=5)
Plot the eigenvector matrix (Z) as well as the original data. Are the two spiral clusters correctly identified? What does the data look like in the transformed space?
par(mfrow=c(1,2))
plot(Z, col=km$cluster, pch = 19, xlab = "Eigenvector 1", ylab = "Eigenvector 2")
plot(spirals, col=km$cluster, pch = 19, xlab = "", ylab = "")
You may have noticed a parameter or two lurking around, for example the n.neighboors parameter in the k-nearest neighbors function. These parameters would need to be adjusted depending on the data or estimated automatically. In addition, it should be noted that there are several implementations of spectral clustering, using any of a number of different Laplacians - you have just implemented a relatively simple version for the sake of clarity. For more detailed information, please refer to Ng et al., On Spectral Clustering: Analysis and an algorithm.
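If you would like to see one of those variants, below is a brief sketch in the spirit of Ng et al. that uses the symmetric normalized Laplacian instead of the unnormalized one. It reuses A, D and k from above and additionally row-normalizes the eigenvector matrix before k-means, as the paper prescribes. This is my own adaptation for illustration, not part of the original tutorial.
# symmetric normalized Laplacian: L_sym = I - D^(-1/2) A D^(-1/2)
D_inv_sqrt <- diag(1 / sqrt(diag(D)))
L_sym <- diag(nrow(A)) - D_inv_sqrt %*% A %*% D_inv_sqrt
ev_sym <- eigen(L_sym, symmetric=TRUE)
# eigenvectors for the k smallest eigenvalues (eigen() returns eigenvalues in decreasing order)
Z_sym <- ev_sym$vectors[,(ncol(ev_sym$vectors)-k+1):ncol(ev_sym$vectors)]
# row-normalize each point to unit length before k-means (Ng et al., step 4)
Z_sym <- Z_sym / sqrt(rowSums(Z_sym^2))
km_sym <- kmeans(Z_sym, centers=k, nstart=5)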
Load the combined kernel and graph produced in the preceding practical
GITHUBDIR = "http://scalefreegan.github.io/Teaching/DataIntegration/data/"
load(url(paste(GITHUBDIR, "kernel.rda", sep = "")))
Before clustering the full dataset - which can be hard to visualize and evaluate - we will apply spectral clustering on a representative subset of the graph so that we can visualize its behavior.
I have selected 4 vertices with fairly high degree (degree = 20) from the matrix, mc, and selected all of their first neighbors. The sub-matrix is called mc2.
# node degree: number of non-zero entries in each row of the kernel matrix
degree = apply(mc, 1, function(i) sum(i > 0))
# take the first 4 vertices with degree 20, plus all of their first neighbors
n_i = which(degree == 20)[1:4]
n = unique(c(n_i, which(mc[n_i, ] > 0, arr.ind = T)[, 2]))
# restrict the kernel matrix to these vertices
mc2 = mc[n, n]
Make an igraph network from the sub-matrix mc2.
Plot the graph. How many clusters do you expect based on this visualization?
g_small = graph_from_adjacency_matrix(mc2, mode = "undirected", weighted = TRUE, diag = FALSE)
g_layout = layout.fruchterman.reingold(g_small)
plot.igraph(g_small, vertex.label = NA, layout = g_layout)
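Besides eyeballing the layout, one quick objective check (my own addition, not part of the original practical) is to count the connected components of the sub-graph; a sensible clustering will not merge disconnected components, so this suggests a minimum number of clusters.
components(g_small)$no   # number of connected components in the sub-graph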
Perform spectral clustering on the sub-matrix mc2. Remember that the matrix is a kernel. Therefore, you should pass it to the corresponding functions using the function as.kernelMatrix().
Spectral clustering can be performed with the function specc from the kernlab package. Choose the number of clusters (centers) to be equal to the number you estimated above. Make an annotation data.frame, v_small, that contains a single column denoting the cluster membership of each gene. The values should be encoded as factors. The rownames of this data.frame should correspond to the rownames of mc2.
mc2_specc = specc(as.kernelMatrix(mc2), centers = 3)   # spectral clustering with 3 centers
v_small = as.data.frame(factor(mc2_specc))             # one-column data.frame of cluster assignments
rownames(v_small) = rownames(mc2)
Make an igraph network from the kernel matrix, mc2. Plot the network, coloring each node by cluster membership as contained in the data.frame, v_small.
How do the clusters correspond to those you predicted before clustering?
g_small = graph_from_adjacency_matrix(mc2, mode = "undirected", weighted = TRUE, diag = FALSE)
plot.igraph(g_small,vertex.label=NA,vertex.color=v_small[names(V(g_small)), ], layout = g_layout)
Plot a heatmap representation of the kernel matrix, mc2. Annotate each column (gene) with its corresponding cluster membership. A nice package for plotting heatmaps is the pheatmap package, which allows you to annotate both the rows and the columns of a matrix. Set the diagonal of mc2 to 0, so that the low-weighted edges are better represented.
Are there any differences you notice between this representation and the previous graph-based visualization?
diag(mc2) = 0
pheatmap(mc2[rownames(v_small)[order(v_small[,1])],rownames(v_small)[order(v_small[,1])]], annotation = v_small, show_rownames = F, show_colnames = F, cluster_rows = F, cluster_cols = F)
Having evaluated the behavior of spectral clustering on a small example, we will now apply it to the full data set.
Perform spectral clustering on the full kernel matrix, mc. As before, remember to pass it through as.kernelMatrix(). For this dataset, set the parameter centers = 17. As before, make an annotation data.frame, v, to store each of the cluster assignments.
Plot the graph, coloring the nodes according to their cluster membership
mc_specc = specc(as.kernelMatrix(mc), centers = 17)
v = as.data.frame(factor(mc_specc))
rownames(v) = rownames(mc)
# build a palette with enough distinct colors for 17 clusters
colors = c(RColorBrewer::brewer.pal(9,"Set1"), RColorBrewer::brewer.pal(9,"Set3"), RColorBrewer::brewer.pal(8,"Set2"))
plot.igraph(g, vertex.label = NA, vertex.color = v[names(V(g)), ], edge.curved = T, vertex.size = 7, palette = colors)
You will notice that I have provided you with the number of clusters (or centers) to detect in the dataset. In the following lecture and tutorial you will see how I selected this value.
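In the meantime, a quick sanity check (my own addition, not part of the original practical) is to tabulate the cluster sizes stored in the annotation data.frame v:
table(v[,1])   # number of genes assigned to each of the 17 clusters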
As an alternative to node-based clustering, we will conclude by performing link-community detection on the gene-gene similarity kernel.
As you will recall from the lecture, link-community detection inverts the clustering problem. Rather than cluster similar nodes, it instead tries to cluster similar edges.
Interestingly, given this formulation, nodes can now belong to several clusters. This is particularly useful in biological contexts, where an investigator might like to express, for example, how the role or function of a particular gene may vary across different contexts.
You can learn more about link-community detection in the original paper.
Perform link-community clustering on the full kernel matrix, mc, using the function getLinkCommunities() from the linkcomm package. To pass the kernel matrix to the link-community algorithm, you will need to format it as an n x 3 edge-list matrix, i.e., [node1, node2, weight], where n is the number of edges (connected node pairs) in the network. An easy way to get this matrix is from the igraph graph object g that we created previously, using the function as_edgelist(). Be sure to set directed = FALSE for link-community clustering.
# link communities
mc_edge = as_edgelist(g)
mc_edge = data.frame(node1 = mc_edge[,1], node2 = mc_edge[,2], weight = E(g)$weight)
g_linkcomm = getLinkCommunities(mc_edge, directed = FALSE, plot = FALSE)
Unlike spectral clustering, the link-community detection algorithm automatically selects the number of clusters. In the next lecture and practical session we will follow up on both sets of clusters and learn how the number of clusters was determined for each approach.
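If you are curious in the meantime, you can already peek at what was found. The sketch below assumes the returned linkcomm object exposes the numbers and nodeclusters elements described in the package documentation.
# total number of edges, nodes and link communities found
g_linkcomm$numbers
# since nodes may belong to several link communities, count genes with multiple memberships
membership_counts = table(g_linkcomm$nodeclusters$node)
sum(membership_counts > 1)   # genes assigned to more than one community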
For now, you can rest easy.