Cluster detection, analysis and visualization

An Introduction

Author: Aaron Brooks / @scalefreegan


You can follow along at
http://scalefreegan.github.io/Teaching/DataIntegration

Course Overview

  • Introduction
  • Cluster Detection
  • Cluster Evaluation
  • Cluster Visualization and Interpretation

50/50 Lectures and Practicals

Goals: Introduction

  • What is clustering? Why cluster? Role in data integration
  • Motivating biological and non-biological examples
  • Approaches to clustering
  • Overview of lab data set and practicals

Clustering in a nutshell

A way to group elements so that members of a group are more similar to each other than they are to elements outside it

What does it mean to be more similar?
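
The answer depends on the metric. As a minimal sketch (the vectors below are made-up illustrative values, not course data), Euclidean distance and correlation distance can disagree about which profile is "closest":

```python
import numpy as np

def euclidean_distance(x, y):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(np.asarray(x, float) - np.asarray(y, float)))

def correlation_distance(x, y):
    """1 - Pearson correlation: small when two profiles share a shape."""
    r = np.corrcoef(x, y)[0, 1]
    return float(1.0 - r)

# Hypothetical profiles (illustrative values only).
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])  # same shape as a, larger scale
c = np.array([4.0, 3.0, 2.0, 1.0])      # similar scale to a, opposite trend

# Euclidean distance ranks c closer to a; correlation distance ranks b closer.
print(euclidean_distance(a, b), euclidean_distance(a, c))
print(correlation_distance(a, b), correlation_distance(a, c))
```

Which notion of similarity is appropriate depends on whether magnitude or shape matters for the question being asked.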

Data Clusters

scikit-learn

Network Clusters

Brooks and Reiss et al. (2014)

Many approaches to clustering

Hierarchical agglomerative clustering
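
A minimal sketch of agglomerative clustering with SciPy, assuming toy 2-D points; average linkage and Euclidean distance are illustrative choices, not necessarily the ones used in the practicals:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (assumed): two well-separated groups of 2-D points.
points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])

# Build the merge tree bottom-up: every point starts as its own cluster
# and the two closest clusters are merged at each step.
tree = linkage(points, method="average", metric="euclidean")

# Cut the tree so that at most two flat clusters remain.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

Cutting the same tree at a different level gives a coarser or finer grouping without re-running the algorithm.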

K-means
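
A minimal NumPy sketch of Lloyd's algorithm, the usual iteration behind K-means. The function name, the toy points, and the initialisation scheme are assumptions for illustration; library implementations such as scikit-learn's add better initialisation and restarts.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (assumed): two well-separated groups of 2-D points.
points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Unlike the hierarchical approach, the number of clusters k has to be chosen before the algorithm runs.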

Many others

Different approaches. Different metrics.

Distinguishing features of clustering methods

Hard vs Soft clustering

Hard Clustering

Each element belongs to exactly one cluster

Soft (Fuzzy) Clustering

Elements can belong to more than one cluster, each with a degree of membership

1-dimensional data

A Tutorial on Clustering Algorithms

Hard Membership: Step-function

A Tutorial on Clustering Algorithms

Fuzzy Membership: Smooth-function

A Tutorial on Clustering Algorithms

Fuzzy C-means (FCM)

A Tutorial on Clustering Algorithms
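
A minimal NumPy sketch of the standard FCM updates on 1-dimensional data. The function name, the fuzzifier m=2, and the data values are assumptions for illustration:

```python
import numpy as np

def fuzzy_c_means(x, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Minimal FCM. Returns (centers, memberships); each membership row sums to 1."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    rng = np.random.default_rng(seed)
    # Random initial memberships, normalised so each row sums to 1.
    u = rng.random((len(x), n_clusters))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Centers are means of the data weighted by membership**m.
        w = u ** m
        centers = (w.T @ x) / w.sum(axis=0)[:, None]
        # Memberships fall off smoothly with distance to each center.
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(new_u - u).max() < tol
        u = new_u
        if converged:
            break
    return centers, u

# Hypothetical 1-D data: two groups plus one point halfway between them.
data = np.array([1.0, 1.2, 1.4, 5.0, 8.6, 8.8, 9.0])
centers, memberships = fuzzy_c_means(data, n_clusters=2)
print(np.round(centers.ravel(), 2))
print(np.round(memberships, 2))
```

The point at 5.0 ends up with roughly equal membership in both clusters, which a hard assignment cannot express.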

Gaussian Mixture Model with EM

A Tutorial on Clustering Algorithms
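
A minimal sketch of EM for a 1-dimensional Gaussian mixture in NumPy. The function name, the initialisation at the data extremes, and the simulated data are assumptions for illustration; a library implementation such as scikit-learn's GaussianMixture would normally be used in practice.

```python
import numpy as np

def gmm_em_1d(x, n_components=2, n_iter=200):
    """EM for a 1-D Gaussian mixture. Returns weights, means, variances, responsibilities."""
    x = np.asarray(x, dtype=float)
    # Simple deterministic start: means spread across the data range.
    means = np.linspace(x.min(), x.max(), n_components)
    variances = np.full(n_components, x.var())
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        dens = np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
        dens /= np.sqrt(2.0 * np.pi * variances)
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    # resp holds the responsibilities from the final E-step.
    return weights, means, variances, resp

# Simulated data (assumed): two Gaussians centred at 0 and 6.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(6.0, 1.0, 100)])
weights, means, variances, resp = gmm_em_1d(x)
print(np.round(means, 2), np.round(weights, 2))
```

The responsibilities play the same role as fuzzy memberships: each row is a soft assignment of one point across the components.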

Clustering Nodes vs Edges

Nodes: Spectral clustering
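
A minimal sketch of the simplest spectral approach, bisection with the Fiedler vector of the graph Laplacian. The toy graph is an assumption for illustration; full spectral clustering typically uses several eigenvectors followed by K-means.

```python
import numpy as np

# Toy graph (assumed): two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n_nodes = 6

# Adjacency matrix of the undirected graph.
A = np.zeros((n_nodes, n_nodes))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Unnormalised graph Laplacian L = D - A.
L = np.diag(A.sum(axis=1)) - A

# eigh returns eigenvalues in ascending order; the eigenvector of the
# second-smallest eigenvalue (the Fiedler vector) splits the graph.
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]
labels = (fiedler > 0).astype(int)
print(labels)
```

Each node receives exactly one label, so this is a hard clustering of nodes.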

Edges: Link-community clustering
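
A minimal sketch of clustering edges rather than nodes. It assumes the edge similarity of Ahn, Bagrow and Lehmann (2010), the Jaccard index of the inclusive neighbourhoods of the two unshared endpoints, and merges edges above a fixed threshold. The toy graph and the threshold are assumptions for illustration; the published method instead chooses the cut by maximising partition density.

```python
from itertools import combinations

# Toy graph (assumed): two triangles that share node 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]

# Inclusive neighbourhood: each node together with its neighbours.
neighbours = {}
for u, v in edges:
    neighbours.setdefault(u, {u}).add(v)
    neighbours.setdefault(v, {v}).add(u)

def edge_similarity(e1, e2):
    """Jaccard similarity of the unshared endpoints' inclusive neighbourhoods."""
    shared = set(e1) & set(e2)
    if len(shared) != 1:
        return 0.0  # edges that do not share exactly one node are not compared
    (i,) = set(e1) - shared
    (j,) = set(e2) - shared
    return len(neighbours[i] & neighbours[j]) / len(neighbours[i] | neighbours[j])

# Union-find over edges: merge every pair at or above the threshold
# (equivalent to cutting a single-linkage dendrogram at that level).
parent = {e: e for e in edges}

def find(e):
    while parent[e] != e:
        parent[e] = parent[parent[e]]
        e = parent[e]
    return e

threshold = 0.5  # illustrative cut-off
for e1, e2 in combinations(edges, 2):
    if edge_similarity(e1, e2) >= threshold:
        parent[find(e1)] = find(e2)

# Translate edge clusters back into (possibly overlapping) node sets.
communities = {}
for e in edges:
    communities.setdefault(find(e), set()).update(e)
node_sets = sorted(sorted(s) for s in communities.values())
print(node_sets)  # [[0, 1, 2], [2, 3, 4]]
```

Because the clusters are made of edges, node 2 naturally appears in both communities.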

Why cluster a network?

Define topologically interesting substructure

Examples

Karate Club

Zachary karate club

Yeast protein-protein interaction network

Network biology: understanding the cell's functional organization

Human disease network

The human disease network

Why cluster a network?

Clustering is a way to decipher network structure

Hairball ⇨ Comprehension

Clustering for Data Integration

Kernels derived from 5 sources of information

Similarities between 4567 human genes

Selected for their relationship to 120 genes involved in mitosis, DNA mismatch repair, and BMP signaling

Course goals

  1. Combine graph kernels
  2. Cluster integrated network
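
The two goals above can be sketched end to end on toy data. Everything here is an assumption for illustration: two hand-built kernels over six items stand in for the five real kernels, the kernels are combined by an unweighted mean after cosine normalisation, and the integrated network is split with the Fiedler vector. The practicals may use different choices.

```python
import numpy as np

def block_kernel(within, between, scale=1.0):
    """Hypothetical 6x6 kernel with two groups of three items."""
    K = np.full((6, 6), between)
    K[:3, :3] = within
    K[3:, 3:] = within
    np.fill_diagonal(K, 1.0)
    return scale * K

def normalise(K):
    """Cosine normalisation so every kernel has a unit diagonal."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

# Two made-up information sources on very different scales.
kernels = [block_kernel(0.8, 0.1), block_kernel(0.5, 0.3, scale=10.0)]

# 1. Combine graph kernels: unweighted mean of the normalised kernels.
combined = np.mean([normalise(K) for K in kernels], axis=0)

# 2. Cluster the integrated network: treat the combined kernel as a
#    weighted adjacency matrix and bisect it with the Fiedler vector.
W = combined.copy()
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W
_, vecs = np.linalg.eigh(L)
labels = (vecs[:, 1] > 0).astype(int)
print(np.round(combined, 2))
print(labels)
```

Normalising first keeps the larger-scale kernel from dominating the average.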

Please continue to Practical #1