Welcome to the first practical lesson for the clustering section of the EMBL course: Data mining and integration with networks.
This document will introduce basic concepts related to clustering and provide a gentle introduction to clustering with a simple example. At the end of the practical we will read and process Jean-Karim’s kernels so that they are ready to analyze in the subsequent tutorials.
First, a word about the format of these practicals.
The practicals are designed to allow self-guided exploration of the course materials. Each practical is paired with information from the preceding lecture. You will be able to explore this document at your own pace and refer to it again in the future.
In addition to the text, be on the lookout for three special formats in these documents: scripts, exercises, and solutions.
Scripts
Scripts are denoted by blocked text. You can copy and paste these commands into the R terminal. Give the command below a try.
cat("Welcome to the first practical on clustering\n")
Exercises
Throughout the practical you will be asked to evaluate your understanding by writing code. You should solve these exercises by writing your own script to perform the requested operations or answer the question. Exercises are denoted by blocked quotation, as below.
This is an exercise.
You should write your own code to solve it.
Solutions
If you get stuck, or simply want to check your work, use the solution pulldown to reveal the answer to the problem along with the accompanying code.
The solution pulldowns look like this. Give the one below a try:
Don’t cheat (too much). These exercises are designed to make sure that you’re keeping up. Give it a solid effort before revealing the answer.
This tutorial requires a number of R packages. To install these packages you can run the script included below.
PLEASE NOTE: All packages should be installed on the computers used for the Data Integration course.
install.packages("devtools")
devtools::source_url("https://raw.githubusercontent.com/scalefreegan/Teaching/master/DataIntegration/scripts/installPackages.R")
Clustering is an approach to group similar elements of a matrix or graph. The objective is to group the elements such that those within a cluster are more similar to one another than they are to everything else. Finding the optimal partitioning of the data into k clusters is the objective of most clustering algorithms.
There are many different approaches to clustering. Here we will introduce clustering results for data with simple and more complex structure. In the subsequent lectures and tutorials we will return to these examples to understand how the algorithms work and how to assess clustering results.
We will focus first on the iris flower data set, introduced by Ronald Fisher in 1936. The iris data set contains 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor).
Since there are 3 species in the data set, we will see how well three clusters (one for each species) describe the data using only the length and width of the sepals. As we will see later, choosing the correct number of clusters k isn't usually so trivial.
Plot Sepal.Length versus Sepal.Width for the internal R data set iris. Estimate the number of clusters required to describe this data by visual inspection.
plot(iris$Sepal.Length, iris$Sepal.Width, pch = 19)
Try to cluster these data using the function kmeans, with the parameter centers set to the number of clusters you estimated above. Plot the data with each of the data points colored by its cluster membership.
The data plotted with cluster membership (color) and cluster centers (stars) are included below:
ikm <- kmeans(cbind(iris$Sepal.Length, iris$Sepal.Width), centers = 3)
plot(iris$Sepal.Length, iris$Sepal.Width, col=ikm$cluster, pch = 19)
points(ikm$centers, col=1:3, pch=8, cex=2)
We can see how well these clusters correspond to the species labels.
table(iris$Species, ikm$cluster)
##
## 1 2 3
## setosa 22 0 28
## versicolor 11 23 16
## virginica 1 42 7
As you can see, the classification is pretty good, but not perfect, especially for the species that are more poorly resolved by these two measurements.
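One simple way to quantify this agreement is cluster "purity": credit each cluster with its most frequent species and compute the fraction of samples accounted for. A minimal base-R sketch using the objects defined above (since kmeans uses a random initialization, the exact value will vary between runs unless you set a seed first):

tab <- table(iris$Species, ikm$cluster)
# purity: fraction of samples belonging to the majority species of their cluster
sum(apply(tab, 2, max)) / sum(tab)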
We will return to the details of the k-means clustering algorithm in the next lecture.
But what happens if the patterns in your data are more complicated?
Consider the following data set. How many clusters do you see?
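The scatter plot of this data set is not reproduced here, but you can draw it yourself; it is the spirals data set from the kernlab package, which is also used in the exercise below.

library(kernlab)   # provides the spirals data set
data(spirals)
# plot the unclustered points: two interleaved spirals
plot(spirals, pch = 19, xlab = "", ylab = "")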
Run k-means (# clusters = 2) on the spiral data. You can load this data set with data(spirals). K-means can be run with the command kmeans. Specify the number of clusters with centers = 2. Plot the spirals data with clusters annotated. What do you observe? Are these the clusters you would expect?
library(kernlab)
data(spirals)
km <- kmeans(spirals, centers = 2)
plot(spirals, col = km$cluster, pch = 19, xlab = "", ylab = "")
points(km$centers, col=1:2, pch=8, cex=2)
For some data with more complicated (e.g., non-linear) relationships between similar points, alternative approaches like spectral clustering may be more appropriate.
Run spectral clustering (# clusters = 2) on the spirals data. Spectral clustering is called with the function specc. Like k-means, specify the number of clusters with centers = 2. Once again, plot the data with clusters annotated. What do you observe now?
sc <- specc(spirals, centers = 2)
plot(spirals, col = sc, pch = 19, xlab = "", ylab = "")
We will return to spectral clustering with more theoretical and practical information in the subsequent lecture and tutorial.
Before moving on to the next lecture, we will process Jean-Karim’s similarity kernels so that they are ready for subsequent lectures.
To remind you, Jean-Karim has provided us with several kernels that measure similarities between 120 human genes according to information derived from 5 sources. These sources include:
Type | Description | # Edges |
---|---|---|
BP | Pairwise gene semantic similarities (Resnik score) in the biological process domain | 909 |
CODA80 | Predicted protein interactions from domain co-occurrence/fusion events | 273 |
HIPPO | Protein interactions transferred from model organisms to human by orthology | 1432 |
PI | Protein interactions in human (from multiple databases) | 8335 |
TM | Protein interactions derived from the iHOP text mining method | 869 |
The 120 genes in our data set were chosen for their role in the following biological processes:
The remaining 4447 genes have a predicted interaction with one of the 120 genes according to one of the 5 measures described above.
The first thing we need to do is combine each of these kernels into a single matrix (graph). It is on this combined graph that we will perform clustering in the following tutorials.
I have written a script to fetch and process the data. Run the following code to automagically retrieve the data.
devtools::source_url("https://raw.githubusercontent.com/scalefreegan/Teaching/master/DataIntegration/scripts/readData.R")
The data is returned as an 11060 by 7 data.frame with the following structure:
head(data)
## gene1 gene2 BP_resnik CODA80 HIPPO PI TM
## 1 ENSG00000000971 ENSG00000132693 NA NA NA 1 33.33333
## 2 ENSG00000001626 ENSG00000102189 NA NA NA 1 NA
## 3 ENSG00000001626 ENSG00000114520 NA NA NA 1 NA
## 4 ENSG00000001626 ENSG00000137642 NA NA NA 1 NA
## 5 ENSG00000001626 ENSG00000174371 NA NA NA 1 NA
## 6 ENSG00000002016 ENSG00000010292 NA NA NA 1 NA
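Before building the kernels it is worth a quick sanity check on this table. A small sketch (assuming, as above, that readData.R has assigned the edge table to the variable data): counting the non-missing entries in each score column should roughly recover the edge counts listed in the table of data sources.

dim(data)                                  # 11060 rows (gene pairs), 7 columns
colSums(!is.na(data[, 3:7]))               # gene pairs with a score in each data source
length(unique(c(data$gene1, data$gene2)))  # distinct genes; 4567 = 120 + 4447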
We can visualize each of the component networks to get a sense for their structure. Here I plot each data set as a network. To simplify the graphs, I only include genes with non-zero entries for each of the component networks (so each is a subset of the complete dataset we just retrieved).
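The plotting code for these figures is not included here, but as a minimal sketch, one of the component networks (for example the PI kernel) could be drawn with igraph as follows, keeping only gene pairs that have a score in that data source:

library(igraph)
pi_edges <- data[!is.na(data$PI), c("gene1", "gene2", "PI")]   # pairs with a protein-interaction score
g_pi <- graph_from_data_frame(pi_edges, directed = FALSE)      # third column becomes an edge attribute
plot(g_pi, vertex.label = NA, vertex.size = 3, edge.curved = TRUE)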
Keep the topology of these individual networks in mind as we go forward combining them.
As you will recall from Jean-Karim’s lecture, a valid kernel should be positive semidefinite (i.e., a Hermitian matrix whose eigenvalues are nonnegative). One way to make sure that our kernels are positive semidefinite is to add a sufficiently large value to the diagonal. For example, you can add the absolute value of the smallest eigenvalue to the diagonal. Once the validity of the kernels is established, each of them can simply be added together to produce a combined kernel.
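To see why this works: if \(K\) is a symmetric kernel matrix with smallest eigenvalue \(\lambda_{\min} < 0\), then

\[ K' = K + |\lambda_{\min}|\, I \]

has eigenvalues \(\lambda_i + |\lambda_{\min}| \ge 0\) for every \(i\), so \(K'\) is positive semidefinite. Since a sum of positive semidefinite matrices is itself positive semidefinite, the corrected kernels can then be added together and the result remains a valid kernel.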
Combine and normalize the kernels into a single matrix with values on the interval \([0,1]\), verifying that each is a valid kernel.
Return a composite matrix that has dimensions 4567 by 4567.
To save some effort, I’ve included a function to normalize a matrix, \(K\), on the interval \([0,1]\).
normalizeKernel = function(K) {
# from Shawe-Taylor & Cristianini's "Kernel Methods for Pattern Analysis", p113
# original kernel matrix stored in variable K
# output uses the same variable K
# D is a diagonal matrix storing the inverse of the norms
# Based on MATLAB script from: www.kernel-methods.net
rnames = rownames(K)
cnames = colnames(K)
D = diag(1/sqrt(diag(K)))
K = D %*% K %*% D
rownames(K) = rnames
colnames(K) = cnames
return(K)
}
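In terms of individual entries, this function computes

\[ \hat{K}_{ij} = \frac{K_{ij}}{\sqrt{K_{ii}\, K_{jj}}}, \]

which sets every diagonal entry to 1; for a positive semidefinite kernel this places all entries on \([-1,1]\), and on \([0,1]\) when the entries are nonnegative, as they are here.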
library(rARPACK) # for function eigs. Thanks Jean-Karim!
library(dplyr)   # for the select() function used below
ms = lapply(colnames(data)[3:7], function(d){
# extract the two gene columns plus the current data source, d
z = select(data, gene1, gene2, which(colnames(data)==d))
# treat missing similarities as 0
z[is.na(z[,d]),d] = 0
# build a symmetric similarity matrix over all genes in the data set
g = sort(unique(c(z$gene1, z$gene2)))
m = matrix(0, nrow = length(g), ncol = length(g), dimnames = list(g,g))
m[cbind(z$gene1, z$gene2)] = z[,d]
m[cbind(z$gene2, z$gene1)] = z[,d]
# calculate the smallest eigenvalue of m
eigen_m <- eigs(m, 1, which = "SR") # compute the smallest eigenvalue and corresponding eigenvector
# make m a valid kernel by adding the absolute value of the smallest
# eigenvalue to the diagonal, which makes the matrix positive semi-definite
toadd = ceiling(abs(min(eigen_m$values)))
diag(m) = diag(m) + toadd
# normalize the kernel to [0,1] (normalizing each component kernel individually is optional)
nm = normalizeKernel(m)
o = list()
o$m = m
o$nm = nm
return(o)
})
# sum the normalized component kernels into a single combined kernel
mc = ms[[1]]$nm
for (i in 2:length(ms)) {
mc = mc + ms[[i]]$nm
}
# rescale the combined kernel back onto [0,1]
mc = normalizeKernel(mc)
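A quick check that the result matches what the exercise above asked for:

dim(mc)          # should be 4567 x 4567
range(mc)        # values should lie on the interval [0, 1]
isSymmetric(mc)  # the combined kernel should remain symmetric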
Now that we have transformed the similarity kernels into a single combined matrix, we can visualize it as a graph using the R package igraph.
Plotting the graph can take some time, so I’ve simply included an image of it below. The default igraph layout for a graph of this size is the DrL layout, which is a force-directed layout suitable for larger graphs (# vertices > 1000). The resulting image gives us a first clue about how we might cluster the network.
library(igraph)
g = graph_from_adjacency_matrix(mc, mode = "undirected", weighted = TRUE, diag = FALSE)
plot.igraph(g, edge.width = E(g)$weight/(max(E(g)$weight)/5), vertex.label = NA, vertex.size = 5, edge.curved = T)
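If you do want to generate the figure yourself and control the layout explicitly, the DrL layout can be computed once and passed to the plot call; computing the layout is the slow step for a graph of this size. A sketch:

l <- layout_with_drl(g)   # force-directed DrL layout, computed separately so it can be reused
plot.igraph(g, layout = l, edge.width = E(g)$weight/(max(E(g)$weight)/5), vertex.label = NA, vertex.size = 5, edge.curved = T)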
Do you see any structure in this network?
How would you describe that structure with an algorithm?