Companion lecture

At the end of the previous session, we processed Jean-Karim’s kernels, combining them into a single graph. In this session, we will apply two clustering techniques to the integrated data: spectral clustering and link-community clustering.

Spirals: Spectral Clustering

As you will recall from the lecture, spectral clustering is a technique that clusters data after first transforming it into a lower-dimensional space built from the eigenvectors of a graph Laplacian.

Before applying spectral clustering to our data, we will derive a simple algorithm for spectral clustering. This will allow us to see more clearly how spectral clustering relates to k-means. We will once again explore the spirals data used during the first tutorial.

The algorithm we will build is similar to that described by Ng et al. in "On Spectral Clustering: Analysis and an algorithm". The R code was obtained and adapted from João Neto.

If it is not already loaded, please load the spirals dataset, which is included in the kernlab package.

# libraries
library(pheatmap)
library(dplyr)
library(reshape2)
library(igraph)
library(linkcomm)
library(kernlab)
# data
data(spirals)

To remind you, spectral clustering appeared to give a more desirable result when clustering data sets that standard k-means separates poorly. Below I plot the data and the clustering result for both algorithms.
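
A minimal sketch of that comparison, using kernlab's built-in specc for the spectral result, might look like this (the hand-rolled version we derive below behaves similarly):

# compare standard k-means with kernlab's spectral clustering on spirals
km_raw <- kmeans(spirals, centers = 2, nstart = 5)
sc_raw <- specc(spirals, centers = 2)
par(mfrow = c(1, 2))
plot(spirals, col = km_raw$cluster, pch = 19, main = "k-means")
plot(spirals, col = sc_raw, pch = 19, main = "spectral clustering")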

In what follows, we will write an algorithm to perform spectral clustering. We will see - perhaps surprisingly - that spectral clustering boils down to performing k-means on the eigenvectors of a graph Laplacian built from a similarity kernel applied to the original data.

To help you write your code, João Neto has provided several pre-rolled functions, including:

# Gaussian similarity between two points:
# exp(-alpha * ||x1 - x2||), where ||.|| is the Frobenius norm
s <- function(x1, x2, alpha=1) {
  exp(- alpha * norm(as.matrix(x1-x2), type="F"))
}

# Build an affinity matrix A from a similarity matrix S by connecting
# each point only to its n.neighboors most similar points
make.affinity <- function(S, n.neighboors=2) {
  N <- length(S[,1])

  if (n.neighboors >= N) {  # fully connected
    A <- S
  } else {
    A <- matrix(rep(0,N^2), ncol=N)
    for(i in 1:N) { # for each row
      # only connect to those points with larger similarity
      best.similarities <- sort(S[i,], decreasing=TRUE)[1:n.neighboors]
      for (s in best.similarities) {
        j <- which(S[i,] == s)
        A[i,j] <- S[i,j]
        A[j,i] <- S[i,j] # symmetrize so the graph is undirected
      }
    }
  }
  A
}
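
As a quick sanity check, note that s decays exponentially with the distance between the two points:

s(c(0, 0), c(0, 0))   # identical points: similarity 1
s(c(0, 0), c(3, 4))   # distance 5: exp(-5), roughly 0.0067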

Your algorithm should include the following steps:

  1. Compute the similarity matrix, S, between all points in the spirals dataset
  2. Calculate the affinity matrix, A, from S by applying the k-nearest-neighbors algorithm
  3. Compute the weighted degree diagonal matrix, D, from A by summing across each row
  4. Calculate the unnormalized Laplacian, U, by subtracting A from D
  5. Compute the eigenvectors and eigenvalues of U
  6. Perform k-means clustering on the eigenvectors corresponding to the k smallest eigenvalues, ignoring the smallest (constant) eigenvector

We will perform these steps one at a time. If you find yourself spending more than 10-15 minutes on this section, please read the solution and move on to the following activities.

  1. Compute the similarity matrix, S, between all points in the spirals dataset
Solution
make.similarity <- function(my.data, similarity) {
  N <- nrow(my.data)
  S <- matrix(rep(NA,N^2), ncol=N)
  for(i in 1:N) {
    for(j in 1:N) {
      if (i!=j) {
        S[i,j] <- similarity(my.data[i,], my.data[j,])
      } else {
        S[i,j] <- 0  # zero self-similarity, so a point is never its own neighbor
      }
    }
  }
  S
}

S <- make.similarity(spirals, s)
  2. Calculate the affinity matrix, A, from S by applying the k-nearest-neighbors algorithm
Solution
A <- make.affinity(S, 3)  # connect each point to its 3 most similar neighbors
  3. Compute the weighted degree diagonal matrix, D, from A by summing across each row
Solution
D <- diag(apply(A, 1, sum))  # row sums = weighted degree of each vertex
  4. Calculate the unnormalized Laplacian, U, by subtracting A from D
Solution
U <- D - A
  5. Compute the eigenvectors and eigenvalues of U
Solution
evL <- eigen(U, symmetric=TRUE)
  6. Perform k-means clustering on the matrix, Z, whose columns are the eigenvectors for the k smallest eigenvalues, ignoring the smallest (constant) eigenvector
Solution
k  <- 2
# eigen() returns eigenvalues in decreasing order, so the last k columns
# of evL$vectors correspond to the k smallest eigenvalues
Z  <- evL$vectors[,(ncol(evL$vectors)-k+1):ncol(evL$vectors)]
km <- kmeans(Z, centers=k, nstart=5)
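
Before plotting, it can be instructive to inspect the smallest eigenvalues themselves: if the two spirals fall in disconnected components of the affinity graph, exactly two eigenvalues are zero, one per component.

# eigen() sorts eigenvalues in decreasing order, so the smallest come last;
# the number of (near-)zero eigenvalues suggests the number of clusters
round(tail(evL$values, 5), 6)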

Plot the eigenvector matrix (Z) as well as the original data. Are the two spiral clusters correctly identified? What does the data look like in the transformed space?

Solution
par(mfrow=c(1,2))
plot(Z, col=km$cluster, pch = 19, xlab = "Eigenvector 1", ylab = "Eigenvector 2")
plot(spirals, col=km$cluster, pch = 19, xlab = "", ylab = "")

You may have noticed a parameter or two lurking around, for example the n.neighboors parameter in the k-nearest-neighbors function. These parameters need to be adjusted to the data at hand or estimated automatically. In addition, note that there are several implementations of spectral clustering, using any of a number of different Laplacians; you have just implemented a relatively simple version for the sake of clarity. For more detailed information, please refer to Ng et al., "On Spectral Clustering: Analysis and an algorithm".
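
To see this sensitivity for yourself, a sketch along the following lines reruns the pipeline with different neighborhood sizes and cross-tabulates the assignments (run_spectral is a hypothetical helper that simply wraps steps 2-6 above):

# hypothetical helper: rerun steps 2-6 for a given neighborhood size
run_spectral <- function(n.neighboors, k = 2) {
  A  <- make.affinity(S, n.neighboors)
  D  <- diag(rowSums(A))
  U  <- D - A
  ev <- eigen(U, symmetric = TRUE)
  Z  <- ev$vectors[, (ncol(ev$vectors) - k + 1):ncol(ev$vectors)]
  kmeans(Z, centers = k, nstart = 5)$cluster
}
# label switching aside, a stable parameter choice gives a near-diagonal table
table(run_spectral(3), run_spectral(10))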

Load Kernel Data

Load the combined kernel and graph produced in the preceding practical.

GITHUBDIR = "http://scalefreegan.github.io/Teaching/DataIntegration/data/"
load(url(paste(GITHUBDIR, "kernel.rda", sep = "")))

Spectral Clustering: subset

Before clustering the full dataset - which can be hard to visualize and evaluate - we will apply spectral clustering to a representative subset of the graph so that we can visualize its behavior.

I have selected 4 vertices with fairly high degree (degree = 20) from the matrix, mc, along with all of their first neighbors. The sub-matrix is called mc2.

# degree of each vertex = number of positive entries in its row
degree = apply(mc, 1, function(i) sum(i > 0))
# take the first 4 vertices with degree 20 ...
n_i = which(degree == 20)[1:4]
# ... plus all of their first neighbors
n = unique(c(n_i, which(mc[n_i, ] > 0, arr.ind = T)[, 2]))
mc2 = mc[n, n]

Make an igraph network from the sub-matrix mc2.

Plot the graph. How many clusters do you expect based on this visualization?

Solution
g_small = graph_from_adjacency_matrix(mc2, mode = "undirected", weighted = TRUE, diag = FALSE)
g_layout = layout_with_fr(g_small)  # Fruchterman-Reingold layout
plot.igraph(g_small, vertex.label = NA, layout = g_layout)

Perform spectral clustering on the sub-matrix mc2. Remember that the matrix is a kernel; you should therefore wrap it with the function as.kernelMatrix() when passing it to the corresponding functions.

Spectral clustering can be performed with the function specc from the kernlab package. Choose the number of clusters (centers) to be equal to the number you estimated above.

Make an annotation data.frame, v_small, that contains a single column denoting the cluster membership of each gene. The values should be encoded as factors. The rownames of this data.frame should correspond to the rownames of mc2.

Solution
mc2_specc = specc(as.kernelMatrix(mc2), centers = 3)
# one-column annotation data.frame of factor-encoded cluster assignments
v_small = as.data.frame(factor(mc2_specc))
rownames(v_small) = rownames(mc2)
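
As a quick optional check, tabulate how the genes are distributed over the three clusters; because v_small is keyed by the rownames of mc2, it can later be indexed by vertex name when coloring the graph:

table(v_small[, 1])  # number of genes per cluster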

Make an igraph network from the kernel matrix, mc2.

Plot the network, coloring each node by cluster membership as contained in the data.frame, v_small.

How do the clusters correspond to those you predicted before clustering?

Solution
g_small = graph_from_adjacency_matrix(mc2, mode = "undirected", weighted = TRUE, diag = FALSE)
plot.igraph(g_small, vertex.label = NA, vertex.color = v_small[names(V(g_small)), ], layout = g_layout)

Plot a heatmap representation of the kernel matrix, mc2. Annotate each column (gene) with its corresponding cluster membership. A nice package for plotting heatmaps is the pheatmap package, which allows you to annotate both the rows and the columns of a matrix.

Set the diagonal of mc2 to 0 so that the low-weighted edges are better represented.

Are there any differences you notice between this representation and the previous graph-based visualization?

Solution
diag(mc2) = 0
ord = rownames(v_small)[order(v_small[, 1])]  # order genes by cluster
pheatmap(mc2[ord, ord], annotation = v_small, show_rownames = F, show_colnames = F, cluster_rows = F, cluster_cols = F)

Spectral Clustering: full

Having evaluated the behavior of spectral clustering on a small example, we will now apply it to the full data set.

Perform spectral clustering on the full kernel matrix, mc. As before, remember to pass it through as.kernelMatrix().

For this dataset, set parameter centers = 17.

As before, make an annotation data.frame, v, to store each of the cluster assignments.

Plot the graph, coloring the nodes according to their cluster membership.

Solution
mc_specc = specc(as.kernelMatrix(mc), centers = 17)
v = as.data.frame(factor(mc_specc))
rownames(v) = rownames(mc)
# 26 qualitative colors, enough to distinguish the 17 clusters
colors = c(RColorBrewer::brewer.pal(9, "Set1"),
           RColorBrewer::brewer.pal(9, "Set3"),
           RColorBrewer::brewer.pal(8, "Set2"))
plot.igraph(g, vertex.label = NA, vertex.color = v[names(V(g)), ], edge.curved = T, vertex.size = 7, palette = colors)
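
As an optional check, with 17 clusters it is worth confirming that none of them is vanishingly small:

sort(table(v[, 1]), decreasing = TRUE)  # cluster sizes, largest first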

You will notice that I have provided you with the number of clusters (or centers) to detect in the dataset. In the following lecture and tutorial you will see how I selected this value.