Welcome to Cloe

Cloe (pronounced like the name Chloë) is a computational biology tool to infer the clonal structure of heterogeneous tumour samples. It implements a phylogenetic latent feature model that discovers hierarchically related patterns (clonal genotypes) in the samples, and uses these to explain the observed mutation data.

News

Version 1.0 (2018-11-22)

  • Mutation clustering
    • This step improves runtime and enables large-scale analyses (1,000-100,000 mutations and >10 clones) with Cloe.
    • A Chinese Restaurant Process clusters mutations with an infinite mixture of binomial distributions, and automatically identifies how many clusters are needed. Do check the resulting clusters before proceeding, to ensure that different biological signals have not been placed in the same cluster. This step works best if you have multiple samples.
    • The code for clustering is written in C++11 with RcppArmadillo.
  • Optimisation of clonal fractions
    • Clonal fractions are no longer sampled by the MCMCMC sampler, but optimised (limSolve::lsei) given the data and the current genotypes. This is faster and helps mixing.
  • Updated tree updates
    • The tree is now updated by Gibbs sampling and with a prune-regraft step. Gibbs sampling goes through each node k and looks for a new parent among all nodes outside of k's subtree. Prune-regraft is a joint update of tree, genotypes (and fractions): genotypes of the moved subtree are updated so as to fit with the new parent; fractions are optimised given the new genotypes.
  • Genotypes updated a random portion at a time.
    • Because genotypes and fractions keep each other in place during inference, each genotype update now acts on a random portion of the mutations rather than on all of them at once.
  • Added AIC and WAIC for model selection
  • Simpler ISA
    • Parallel mutations are defined as mutations (the current clone has the mutation, its parent does not) that occur despite having already appeared in the tree before. A previously seen mutation happens with a modified probability mu * nu, where nu is the ISA penalty, instead of mu.
  • cowplot is now used for all plots
  • Clones have been renamed
    • The normal clone is now called N (instead of C1), while the first non-normal clone is now C1 (instead of C2).
  • Classes have changed somewhat
    • All three classes have changed slightly to accommodate the new features.
  • Leaner code
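
As an illustration of the Gibbs tree update described above, the following base R sketch lists the candidate new parents of a node k, that is, all nodes outside k's subtree. The parent-vector encoding (root marked with parent 0) and the helper names subtree_nodes and candidate_parents are assumptions made for this example; they are not part of Cloe's API, and Cloe's internal representation may differ.

```r
# Hypothetical encoding: parent[i] is the parent of node i; the root has parent 0.

# Collect node k and all of its descendants.
subtree_nodes <- function(parent, k) {
  nodes <- k
  frontier <- k
  while (length(frontier) > 0) {
    frontier <- which(parent %in% frontier)  # children of the current frontier
    nodes <- c(nodes, frontier)
  }
  nodes
}

# Candidate parents for k: every node that is not in k's subtree.
candidate_parents <- function(parent, k) {
  setdiff(seq_along(parent), subtree_nodes(parent, k))
}

# Toy tree: 1 is the root; 2 and 3 are children of 1; 4 and 5 are children of 2.
parent <- c(0, 1, 1, 2, 2)
candidate_parents(parent, 2)  # 1 3
candidate_parents(parent, 4)  # 1 2 3 5
```

The current parent is kept among the candidates, so a Gibbs step built on this list could also leave the tree unchanged.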

Thanks to Jack Kuipers for useful discussions on some of these updates.

Requirements

Cloe has been developed with R >= 3.2.1. It has been tested on Linux (Debian stable) and Mac OS X (10.8.5 and later).

To install the R dependencies, run the following:

install.packages(
  c(
    "R6",
    "cowplot",
    "digest",
    "ggplot2",
    "igraph",
    "limSolve",
    "RColorBrewer",
    "Rcpp",
    "RcppArmadillo",
    "reshape2",
    "scales"
  )
)

Install

Installing Cloe can be done directly from this repository:

library(devtools)
install_bitbucket("fm361/cloe")

If you have pandoc installed, you can also build the vignette:

install_bitbucket("fm361/cloe", build_vignettes=TRUE)

Ready to go

If the above commands have run successfully, you will be ready to run Cloe. Please refer to the vignette for a tutorial on how to run Cloe. For a quick overview of Cloe's workflow, read on.

Running Cloe consists of four steps (plus an optional one):

  1. Create an input object
    1. Cluster the mutations (optional)
  2. Run the sampler
  3. Get the best sets of parameters
  4. Select the model

Here is a brief example:

library(cloe)

# 0. Load in the data
reads  <- as.matrix(read.table("reads.txt", header=TRUE, row.names=1))
depths <- as.matrix(read.table("depths.txt", header=TRUE, row.names=1))

# 1. Create an input object
ci <- cloe_input$new(reads, depths)
# plot(ci)

# 1.5. Cluster the mutations
# ci <- crp(ci)
# plot(ci)

# 2. Run the sampler
cm4 <- sampler(input=ci, iterations=10000, K=4, chains=1)
# plot(cm4)

# 3. Get the best sets of parameters
cs4 <- summarise(cm=cm4, burn=0.5, thin=20)
# plot(cs4)

# 4. Select the model
# (cs3 and cs5 would come from repeating steps 2 and 3 with K=3 and K=5)
# css <- list(cs3, cs4, cs5)
# top_cs <- select_model(l=css, solutions=6L, plot=TRUE)

In the optional, but recommended, step 1.5, you can cluster mutations with a Chinese Restaurant Process. Plot the resulting object to ensure that the clustering has not mixed different biological signals into the same cluster. If that happened, rerun crp with a larger value of alpha (see ?crp for more information).
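
To see why a larger alpha tends to produce more clusters, recall that under a Chinese Restaurant Process prior a new item joins an existing cluster with probability proportional to that cluster's size, and opens a new cluster with probability proportional to alpha. The base R sketch below shows this prior only. The helper crp_prior is hypothetical and not part of Cloe; the actual crp function also weighs in the binomial likelihood of the data, and its internals may differ.

```r
# Prior assignment probabilities for the next mutation under a CRP.
# cluster_sizes: number of mutations already in each existing cluster.
# alpha: concentration parameter (larger values favour new clusters).
# The last element of the result is the probability of opening a new cluster.
crp_prior <- function(cluster_sizes, alpha) {
  weights <- c(cluster_sizes, alpha)
  weights / sum(weights)
}

sizes <- c(50, 30, 20)
crp_prior(sizes, alpha = 1)   # new-cluster probability is 1/101
crp_prior(sizes, alpha = 10)  # new-cluster probability is 10/110
```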

In step 2, the sampler runs our MCMCMC algorithm using the number of clones K that you specify. If you do not know how many clones are present in the data, you should run the sampler for several likely values, and select "the best model" in step 4.
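
One way to run the sampler for several values of K is sketched below. It assumes that library(cloe) has been loaded and that ci is the cloe_input object from step 1. The wrapper run_for_K is a hypothetical helper written for this example, not a Cloe function; the calls to sampler, summarise and select_model use only the arguments shown in the example above.

```r
# Hypothetical helper: run steps 2 and 3 once per candidate number of clones.
# Assumes library(cloe) is loaded, so that sampler() and summarise() are available.
run_for_K <- function(ci, Ks, iterations = 10000) {
  lapply(Ks, function(K) {
    cm <- sampler(input = ci, iterations = iterations, K = K)
    summarise(cm = cm, burn = 0.5, thin = 20)
  })
}

# Usage (uncomment once ci exists):
# css <- run_for_K(ci, Ks = 3:5)
# top_cs <- select_model(l = css, solutions = 6L, plot = TRUE)
```

Each run is independent, so the values of K can also be run in separate R sessions if runtime is a concern.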

By default Cloe runs 4 parallel tempered chains. You can change this behaviour by specifying how many chains you wish and their temperatures (e.g. chains=2, temperatures=c(1, 0.9)). There is no point in running multiple parallel chains if they do not swap their states efficiently and throughout the run. To check that all went smoothly, plot the cloe_mcmc object returned by sampler(). If some chains are not swapping, reduce the temperature intervals between them.
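
The effect of the temperature interval on swapping can be illustrated with the standard Metropolis-coupled MCMC swap rule. The sketch assumes, as is common for this family of samplers, that each temperature is the power to which the posterior is raised (1 for the cold chain); whether Cloe uses exactly this convention is an assumption here, and swap_prob is a hypothetical helper, not a Cloe function.

```r
# Acceptance probability for swapping the states of chains i and j.
# lp_i, lp_j: log-posterior of the current state of each chain.
# beta_i, beta_j: temperatures of the two chains (powers applied to the posterior).
swap_prob <- function(lp_i, lp_j, beta_i, beta_j) {
  min(1, exp((beta_i - beta_j) * (lp_j - lp_i)))
}

# The cold chain holds the better state (log-posterior -990 versus -1000).
swap_prob(-990, -1000, beta_i = 1, beta_j = 0.9)  # exp(-1), about 0.37
swap_prob(-990, -1000, beta_i = 1, beta_j = 0.5)  # exp(-5), about 0.007

# The heated chain holds the better state: the swap is always accepted.
swap_prob(-1000, -990, beta_i = 1, beta_j = 0.9)  # 1
```

Under this rule, a wider gap between neighbouring temperatures makes unfavourable swaps much less likely, which is why closer temperatures help chains that are not swapping.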

The summarise function of step 3 does three things. It discards iterations at the beginning of the chain according to the burn option, which takes a proportion of the iterations (e.g. burn=0.5 discards the first half of the chain). It thins the chain by keeping every i-th iteration with thin=i. Finally, it returns a number of solutions sorted by decreasing log-posterior probability.
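
As a rough guide to how many iterations survive burn-in and thinning, the base R sketch below computes the retained iteration indices for the settings used in the example. The helper kept_iterations is hypothetical, and the exact indexing inside summarise may differ by an iteration or so.

```r
# Indices of the iterations kept after burn-in and thinning.
# iterations: total number of iterations; burn: proportion discarded at the start;
# thin: keep every thin-th iteration after burn-in.
kept_iterations <- function(iterations, burn, thin) {
  first <- floor(iterations * burn) + 1
  seq(from = first, to = iterations, by = thin)
}

kept <- kept_iterations(iterations = 10000, burn = 0.5, thin = 20)
length(kept)  # 250
head(kept, 3) # 5001 5021 5041
```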

Note: you can plot all of Cloe's classes, and plots are automatically written to disk. This behaviour may change in the future.

Model selection

select_model returns a list of cloe_summary objects sorted by the chosen criterion (see ?select_model for more information). The model selection plots show the log-likelihood, log-posterior, AIC and WAIC. The aim is to choose the simplest model that best explains the data. As proxies for this, look for high log-posterior and log-likelihood values, and low AIC and WAIC.
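
For intuition on how AIC trades fit against complexity, the sketch below applies the textbook definition, AIC = 2 * (number of parameters) - 2 * (log-likelihood), to three candidate models. The log-likelihoods and parameter counts are made-up numbers for illustration only, not Cloe output, and how Cloe counts parameters is not specified here.

```r
# Textbook AIC: lower is better.
aic <- function(log_lik, n_params) 2 * n_params - 2 * log_lik

# Made-up values for three models with K = 3, 4 and 5 clones.
models <- data.frame(
  K        = 3:5,
  log_lik  = c(-1250, -1180, -1176),
  n_params = c(30, 40, 50)
)
models$AIC <- aic(models$log_lik, models$n_params)

# Sort from best (lowest AIC) to worst.
models[order(models$AIC), ]
```

In this toy example K = 5 has the highest log-likelihood, but its small gain over K = 4 does not offset the extra parameters, so K = 4 has the lowest AIC.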

Validation dataset

Cloe's validation dataset (mixtures of single-cell diluted cell lines) is available within Cloe's package.

library(cloe)

# data
reads  <- cloe_val_reads
depths <- cloe_val_depths

# correct clonal structure
correct_genotypes <- cloe_val_Z
correct_fractions <- cloe_val_F

Learn more

For more information, please refer to the HTML vignette and to the R documentation of methods and classes.

Citation

Marass F, Mouliere F, Yuan K, Rosenfeld N, Markowetz F. 2016. A phylogenetic latent feature model for clonal deconvolution. The Annals of Applied Statistics. 10(4):2377-2404.

Contacts

Francesco Marass ( francesco.marass __ bsse.ethz.ch )