Welcome to the first practical lesson for the clustering section of the EMBL course: Data mining and integration with networks.
This document will introduce basic concepts related to clustering and provide a gentle introduction to clustering with a simple example. At the end of the practical we will read and process Jean-Karim’s kernels so that they are ready to analyze in the subsequent tutorials.
First, a word about the format of these practicals.
The practicals are designed to allow self-guided exploration of the course materials. Each practical is paired with information from the preceding lecture. You will be able to explore this document at your own pace and refer to it again in the future.
In addition to the text, be on the lookout for three special formats in these documents: scripts, exercises, and solutions.
Scripts
Scripts are denoted by blocked text. You can copy and paste these commands into the R terminal. Give the command below a try.
cat("Welcome to the first practical on clustering\n")
Exercises
Throughout the practical you will be asked to evaluate your understanding by writing code. You should solve these exercises by writing your own script to perform the requested operations or answer the question. Exercises are denoted by blocked quotation, as below.
This is an exercise.
You should write your own code to solve it.
Solutions
If you get stuck, or simply want to check your work, use the solution pulldown to reveal the answer to the problem along with the accompanying code.
The solution pulldowns look like this. Give the one below a try:
Don’t cheat (too much). These exercises are designed to make sure that you’re keeping up. Give it a solid effort before revealing the answer.
This tutorial requires a number of R packages. To install these packages you can run the script included below.
PLEASE NOTE: All packages should be installed on the computers used for the Data Integration course.
install.packages("devtools")
devtools::source_url("https://raw.githubusercontent.com/scalefreegan/Teaching/master/DataIntegration/scripts/installPackages.R")
Clustering is an approach to group similar elements of a matrix or graph. The objective is to group the elements such that those within a cluster are more similar to one another than they are to everything else. Finding the optimal partitioning of the data into k clusters is the objective of most clustering algorithms.
There are many different approaches to clustering. Here we will introduce clustering results for data with simple and more complex structure. In the subsequent lectures and tutorials we will return to these examples to understand how the algorithms work and how to assess clustering results.
We will focus first on the iris flower data set, introduced by Ronald Fisher in 1936. The iris data set contains 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor).
Since there are 3 species in the data set, we will see how well three clusters (one for each species) describe the data using only the length and width of the sepals. As we will see later, choosing the correct number of clusters k isn't usually so trivial.
Plot Sepal.Length versus Sepal.Width for the internal R data set iris. Estimate the number of clusters required to describe this data by visual inspection.
plot(iris$Sepal.Length, iris$Sepal.Width, pch = 19)
Try to cluster these data using the function kmeans, with the parameter centers set to the number of clusters you estimated above. Plot the data with each of the data points colored by its cluster membership.
The data plotted with cluster membership (color) and cluster centers (stars) are included below:
ikm <- kmeans(cbind(iris$Sepal.Length, iris$Sepal.Width), centers = 3)
plot(iris$Sepal.Length, iris$Sepal.Width, col=ikm$cluster, pch = 19)
points(ikm$centers, col=1:3, pch=8, cex=2)
We can see how well these clusters correspond to the species labels.
table(iris$Species, ikm$cluster)
##
## 1 2 3
## setosa 22 0 28
## versicolor 11 23 16
## virginica 1 42 7
As you can see, the classification is pretty good, but not perfect, especially for the species that are more poorly resolved by these two measurements.
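One simple way to quantify this agreement is cluster "purity": credit each cluster with its most frequent species and compute the fraction of samples accounted for. A minimal base-R sketch using the objects defined above (since kmeans uses a random initialization, the exact value will vary between runs unless you set a seed first):

tab <- table(iris$Species, ikm$cluster)
# purity: fraction of samples belonging to the majority species of their cluster
sum(apply(tab, 2, max)) / sum(tab)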
We will return to the details of the k-means clustering algorithm in the next lecture.
But what happens if the patterns in your data are more complicated?
Consider the following data set. How many clusters do you see?
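The scatter plot of this data set is not reproduced here, but you can draw it yourself; it is the spirals data set from the kernlab package, which is also used in the exercise below.

library(kernlab)   # provides the spirals data set
data(spirals)
# plot the unclustered points: two interleaved spirals
plot(spirals, pch = 19, xlab = "", ylab = "")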
Run k-means (# clusters = 2) on the spiral data. You can load this data set with data(spirals). K-means can be run with the command kmeans. Specify the number of clusters with centers = 2. Plot the spirals data with clusters annotated. What do you observe? Are these the clusters you would expect?
library(kernlab)
data(spirals)
km <- kmeans(spirals, centers = 2)
plot(spirals, col = km$cluster, pch = 19, xlab = "", ylab = "")
points(km$centers, col=1:2, pch=8, cex=2)
For some data with more complicated (e.g., non-linear) relationships between similar points, alternative approaches like spectral clustering may be more appropriate.
Run spectral clustering (# clusters = 2) on the spirals data. Spectral clustering is called with the function specc. Like k-means, specify the number of clusters with centers = 2. Once again, plot the data with clusters annotated. What do you observe now?
sc <- specc(spirals, centers = 2)
plot(spirals, col = sc, pch = 19, xlab = "", ylab = "")
We will return to spectral clustering with more theoretical and practical information in the subsequent lecture and tutorial.
Before moving on to the next lecture, we will process Jean-Karim’s similarity kernels so that they are ready for subsequent lectures.
To remind you, Jean-Karim has provided us with several kernels that measure similarities between 120 human genes according to information derived from 5 sources. These sources include:
Type | Description | # Edges |
---|---|---|
BP | Pairwise gene semantic similarities (Resnik score) in the biological process domain | 909 |
CODA80 | Predicted protein interactions from domain co-occurrence/fusion events | 273 |
HIPPO | Protein interactions transferred from model organisms to human by orthology | 1432 |
PI | Protein interactions in human (from multiple databases) | 8335 |
TM | Protein interactions derived from the iHOP text mining method | 869 |
The 120 genes in our data set were chosen for their role in the following biological processes:
The remaining 4447 genes have a predicted interaction with one of the 120 genes according to one of the 5 measures described above.
The first thing we need to do is combine each of these kernels into a single matrix (graph). It is on this combined graph that we will perform clustering in the following tutorials.
I have written a script to fetch and process the data. Run the following code to automagically retrieve the data.
devtools::source_url("https://raw.githubusercontent.com/scalefreegan/Teaching/master/DataIntegration/scripts/readData.R")
The data is returned as an 11060 by 7 data.frame with the following structure:
head(data)
## gene1 gene2 BP_resnik CODA80 HIPPO PI TM
## 1 ENSG00000000971 ENSG00000132693 NA NA NA 1 33.33333
## 2 ENSG00000001626 ENSG00000102189 NA NA NA 1 NA
## 3 ENSG00000001626 ENSG00000114520 NA NA NA 1 NA
## 4 ENSG00000001626 ENSG00000137642 NA NA NA 1 NA
## 5 ENSG00000001626 ENSG00000174371 NA NA NA 1 NA
## 6 ENSG00000002016 ENSG00000010292 NA NA NA 1 NA
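Before building the kernels it is worth a quick sanity check on this table. A small sketch (assuming, as above, that readData.R has assigned the edge table to the variable data): counting the non-missing entries in each score column should roughly recover the edge counts listed in the table of data sources.

dim(data)                                  # 11060 rows (gene pairs), 7 columns
colSums(!is.na(data[, 3:7]))               # gene pairs with a score in each data source
length(unique(c(data$gene1, data$gene2)))  # distinct genes; 4567 = 120 + 4447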
We can visualize each of the component networks to get a sense for their structure. Here I plot each data set as a network. To simplify the graphs, I only include genes with non-zero entries for each of the component networks (so each is a subset of the complete dataset we just retrieved).
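The plotting code for these figures is not included here, but as a minimal sketch, one of the component networks (for example the PI kernel) could be drawn with igraph as follows, keeping only gene pairs that have a score in that data source:

library(igraph)
pi_edges <- data[!is.na(data$PI), c("gene1", "gene2", "PI")]   # pairs with a protein-interaction score
g_pi <- graph_from_data_frame(pi_edges, directed = FALSE)      # third column becomes an edge attribute
plot(g_pi, vertex.label = NA, vertex.size = 3, edge.curved = TRUE)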
Keep the topology of these individual networks in mind as we go forward combining them.
As you will recall from Jean-Karim’s lecture, a valid kernel should be positive semidefinite (i.e., a Hermitian matrix whose eigenvalues are nonnegative). One way to make sure that our kernels are positive semidefinite is to add a sufficiently large value to the diagonal. For example, you can add the absolute value of the smallest eigenvalue to the diagonal. Once the validity of the kernels is established, each of them can simply be added together to produce a combined kernel.
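To see why this works: if \(K\) is a symmetric kernel matrix with smallest eigenvalue \(\lambda_{\min} < 0\), then

\[ K' = K + |\lambda_{\min}|\, I \]

has eigenvalues \(\lambda_i + |\lambda_{\min}| \ge 0\) for every \(i\), so \(K'\) is positive semidefinite. Since a sum of positive semidefinite matrices is itself positive semidefinite, the corrected kernels can then be added together and the result remains a valid kernel.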
Combine and normalize the kernels into a single matrix with values on the interval \([0,1]\), verifying that each is a valid kernel.
Return a composite matrix that has dimensions 4567 by 4567.
To save some effort, I’ve included a function to normalize a matrix, \(K\), on the interval \([0,1]\).
normalizeKernel = function(K) {
# from Shawe-Taylor & Cristianini's "Kernel Methods for Pattern Analysis", p113
# original kernel matrix stored in variable K
# output uses the same variable K
# D is a diagonal matrix storing the inverse of the norms
# Based on MATLAB script from: www.kernel-methods.net
rnames = rownames(K)
cnames = colnames(K)
D = diag(1/sqrt(diag(K)))
K = D %*% K %*% D
rownames(K) = rnames
colnames(K) = cnames
return(K)
}
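In terms of individual entries, this function computes

\[ \hat{K}_{ij} = \frac{K_{ij}}{\sqrt{K_{ii}\, K_{jj}}}, \]

which sets every diagonal entry to 1; for a positive semidefinite kernel this places all entries on \([-1,1]\), and on \([0,1]\) when the entries are nonnegative, as they are here.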
library(rARPACK) # for function eigs. Thanks Jean-Karim!
library(dplyr)   # for the select() function used below
ms = lapply(colnames(data)[3:7], function(d){
# extract the two gene columns plus the current data source, d
z = select(data, gene1, gene2, which(colnames(data)==d))
# treat missing similarities as 0
z[is.na(z[,d]),d] = 0
# build a symmetric similarity matrix over all genes in the data set
g = sort(unique(c(z$gene1, z$gene2)))
m = matrix(0, nrow = length(g), ncol = length(g), dimnames = list(g,g))
m[cbind(z$gene1, z$gene2)] = z[,d]
m[cbind(z$gene2, z$gene1)] = z[,d]
# calculate the smallest eigenvalue of m
eigen_m <- eigs(m, 1, which = "SR") # compute the smallest eigenvalue and corresponding eigenvector
# make m a valid kernel by adding the absolute value of the smallest
# eigenvalue to the diagonal, which makes the matrix positive semi-definite
toadd = ceiling(abs(min(eigen_m$values)))
diag(m) = diag(m) + toadd
# normalize the kernel to [0,1] (normalizing each component kernel individually is optional)
nm = normalizeKernel(m)
o = list()
o$m = m
o$nm = nm
return(o)
})
# sum the normalized component kernels into a single combined kernel
mc = ms[[1]]$nm
for (i in 2:length(ms)) {
mc = mc + ms[[i]]$nm
}
# rescale the combined kernel back onto [0,1]
mc = normalizeKernel(mc)
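A quick check that the result matches what the exercise above asked for:

dim(mc)          # should be 4567 x 4567
range(mc)        # values should lie on the interval [0, 1]
isSymmetric(mc)  # the combined kernel should remain symmetric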
Now that we have transformed the similarity kernels into a single combined matrix, we can visualize it as a graph using the R package igraph.
Plotting the graph can take some time, so I’ve simply included an image of it below. The default igraph layout for a graph of this size is the DrL layout, which is a force-directed layout suitable for larger graphs (# vertices > 1000). The resulting image gives us a first clue about how we might cluster the network.
library(igraph)
g = graph_from_adjacency_matrix(mc, mode = "undirected", weighted = TRUE, diag = FALSE)
plot.igraph(g, edge.width = E(g)$weight/(max(E(g)$weight)/5), vertex.label = NA, vertex.size = 5, edge.curved = T)
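If you do want to generate the figure yourself and control the layout explicitly, the DrL layout can be computed once and passed to the plot call; computing the layout is the slow step for a graph of this size. A sketch:

l <- layout_with_drl(g)   # force-directed DrL layout, computed separately so it can be reused
plot.igraph(g, layout = l, edge.width = E(g)$weight/(max(E(g)$weight)/5), vertex.label = NA, vertex.size = 5, edge.curved = T)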
Do you see any structure in this network?
How would you describe that structure with an algorithm?