Welcome to Cloe

Cloe (pronounced like the name Chloë) is a computational biology tool to infer
the clonal structure of heterogeneous tumour samples. It implements a
phylogenetic latent feature model that discovers hierarchically-related
patterns (clonal genotypes) in the samples, and with these describes the
observed mutation data.


Version 1.0 (2018-11-22)

  • Mutation clustering
    • This step improves runtime and enables large-scale analyses
      (1,000-100,000 mutations and >10 clones) with Cloe.
    • A Chinese Restaurant Process clusters mutations with an infinite mixture
      of binomial distributions, and automatically identifies how many clusters
      are needed. Do check the resulting clusters before proceeding, to ensure
      that different biological signals have not been placed in the same
      cluster. This step works best if you have multiple samples.
    • The code for clustering is written in C++11 with RcppArmadillo.
  • Optimisation of clonal fractions
    • Clonal fractions are no longer sampled by the MCMCMC sampler, but
      optimised (limSolve::lsei) given the data and the current genotypes.
      This is faster and helps mixing.
  • Revised tree updates
    • The tree is now updated by Gibbs sampling and with a prune-regraft step.
      Gibbs sampling goes through each node k and looks for a new parent
      among all nodes outside of k's subtree. Prune-regraft is a joint update
      of tree, genotypes (and fractions): genotypes of the moved subtree are
      updated so as to fit with the new parent; fractions are optimised given
      the new genotypes.
  • Genotypes are updated a random portion at a time
    • Because genotypes and fractions constrain each other during inference,
      each genotype update now covers only a random portion of the mutations.
  • Added AIC and WAIC for model selection
  • Simpler infinite sites assumption (ISA)
    • A parallel mutation is a mutation (the current clone has the mutation,
      its parent does not) that has already appeared elsewhere in the tree.
      Such a previously seen mutation occurs with probability mu * nu, where
      nu is the ISA penalty, instead of mu.
  • cowplot is now used for all plots
  • Clones have been renamed
    • The normal clone is now called N (instead of C1), while the first
      non-normal clone is now C1 (instead of C2).
  • Classes have changed somewhat
    • All three classes have changed slightly to accommodate the new features.
  • Leaner code
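
As a rough illustration of the Gibbs tree update described in the list above,
the sketch below lists the candidate new parents of a node k, that is, all
nodes outside of k's subtree. The parent-vector encoding of the tree and the
function names subtree and candidate_parents are assumptions made for this
example; they are not Cloe's internal representation.

```r
# Hypothetical encoding: parent[i] is the index of node i's parent,
# and the root (the normal clone) has parent 0.
subtree <- function(parent, k) {
  nodes <- k
  frontier <- k
  # Walk down the tree one generation at a time, collecting descendants.
  while (length(frontier) > 0) {
    frontier <- which(parent %in% frontier)
    nodes <- c(nodes, frontier)
  }
  nodes
}

# Candidate new parents of k: every node that is not k or a descendant of k.
candidate_parents <- function(parent, k) {
  setdiff(seq_along(parent), subtree(parent, k))
}

# Toy tree: 1 is the root; 2 and 5 are children of 1; 3 and 4 are children of 2.
parent <- c(0, 1, 2, 2, 1)
candidate_parents(parent, 2)  # nodes 1 and 5
```

Excluding the subtree keeps the result a valid tree, since a node can never be
attached below one of its own descendants.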

Thanks to Jack Kuipers for useful discussions on some of these updates.


Cloe has been developed with R >= 3.2.1. It has been tested on Linux (Debian
stable) and Mac OS X (10.8.5 and later).

Cloe's R dependencies should be installed automatically by the command below.


Installing Cloe can be done directly from this repository with devtools:

devtools::install_bitbucket("fm361/cloe", dependencies=TRUE)

If you have pandoc installed, you can also build the vignette:

devtools::install_bitbucket("fm361/cloe", build_vignettes=TRUE)

Ready to go

If the above commands have run successfully, you will be ready to run Cloe.
Please refer to the vignette for a tutorial on how to run Cloe. For a quick
overview of Cloe's workflow, read on.

Running Cloe consists of four steps (plus an optional one):

  1. Create an input object
    1.5. Cluster the mutations (optional)
  2. Run the sampler
  3. Get the best sets of parameters
  4. Select the model

Here is a brief example:


# 0. Load in the data
reads  <- as.matrix(read.table("reads.txt", header=TRUE, row.names=1))
depths <- as.matrix(read.table("depths.txt", header=TRUE, row.names=1))

# 1. Create an input object
ci <- cloe_input$new(reads, depths)
# plot(ci)

# 1.5. Cluster the mutations
# ci <- crp(ci)
# plot(ci)

# 2. Run the sampler
cm4 <- sampler(input=ci, iterations=10000, K=4, chains=1)
# plot(cm4)

# 3. Get the best sets of parameters
cs4 <- summarise(cm=cm4, burn=0.5, thin=20)
# plot(cs4)

# 4. Select the model
# css <- list(cs3, cs4, cs5)
# top_cs <- select_model(l=css, solutions=6L, plot=TRUE)

In the optional, but recommended, step 1.5, you can cluster mutations with a
Chinese Restaurant Process. Plot the resulting object to ensure that the
clustering has not mixed different biological signals into the same cluster. If
that has happened, rerun crp with a larger value of alpha (see ?crp for more
information).
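
To get a feel for why a larger alpha tends to yield more clusters, the sketch
below computes the expected number of clusters under a Chinese Restaurant
Process prior. It assumes that alpha in crp plays the role of the usual CRP
concentration parameter (check ?crp to confirm); the function
expected_clusters is hypothetical and not part of Cloe. The actual clustering
also depends on the data through the binomial likelihood, so these numbers are
only a prior-level guide.

```r
# Expected number of clusters under a CRP prior with concentration alpha:
# item i opens a new cluster with probability alpha / (alpha + i - 1).
expected_clusters <- function(n, alpha) {
  sum(alpha / (alpha + seq_len(n) - 1))
}

n_mutations <- 1000
for (alpha in c(0.1, 1, 10)) {
  cat(sprintf("alpha = %4.1f: %.1f clusters expected a priori\n",
              alpha, expected_clusters(n_mutations, alpha)))
}
```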

In step 2, the sampler runs our MCMCMC algorithm using the number of clones K
that you specify. If you do not know how many clones are present in the data,
you should run the sampler for several likely values, and select "the best
model" in step 4.

By default Cloe runs 4 parallel tempered chains. You can change this behaviour
by specifying how many chains you wish and their temperatures (e.g.
chains=2, temperatures=c(1, 0.9)). There is no point in running multiple
parallel chains if they do not swap their states efficiently and throughout the
run. To check that all went smoothly, plot the cloe_mcmc object returned by
sampler(). If some chains are not swapping, reduce the temperature intervals
between them.
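
The effect of the temperature interval on swapping can be sketched with the
standard swap acceptance probability of Metropolis-coupled MCMC. This assumes
that the temperatures act as exponents applied to the posterior, which the
example values (1 for the cold chain, 0.9 for a heated one) suggest but this
document does not state; swap_prob is a hypothetical helper and the
log-posterior values are made up.

```r
# Probability of accepting a swap between chains i and j, whose states have
# log-posteriors logpost_i and logpost_j and whose temperatures are beta_i
# and beta_j (assumed to be exponents on the posterior).
swap_prob <- function(logpost_i, logpost_j, beta_i, beta_j) {
  min(1, exp((beta_i - beta_j) * (logpost_j - logpost_i)))
}

cold_logpost <- -1000  # made-up state of the cold chain
hot_logpost  <- -1050  # made-up state of a heated chain

swap_prob(cold_logpost, hot_logpost, beta_i = 1, beta_j = 0.90)  # wide interval
swap_prob(cold_logpost, hot_logpost, beta_i = 1, beta_j = 0.98)  # narrow interval
```

With the same pair of states, the narrower interval gives a much higher
acceptance probability, which is why closing up the temperatures helps chains
that are not swapping.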

The summarise function of step 3 first discards iterations at the beginning of
the chain according to the burn option, which takes a proportion of the
iterations (e.g. burn=0.5 discards the first half of the chain). It then thins
the chain, keeping every i-th iteration with thin=i, and returns a number of
solutions sorted by decreasing log-posterior probability.
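
As a minimal sketch of what burn and thin mean for the example settings above,
the code below lists which iterations would be kept. The helper
kept_iterations is hypothetical, and the exact indexing used inside summarise
may differ (for instance in which iteration thinning starts from).

```r
# Indices of the iterations that survive burn-in and thinning.
kept_iterations <- function(iterations, burn, thin) {
  first <- floor(iterations * burn) + 1  # first iteration after burn-in
  seq(from = first, to = iterations, by = thin)
}

kept <- kept_iterations(iterations = 10000, burn = 0.5, thin = 20)
head(kept)    # 5001 5021 5041 5061 5081 5101
length(kept)  # 250
```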

Note: objects of all of Cloe's classes can be plotted, and plots are
automatically written to disk. This behaviour may change in the future.

Model selection

select_model returns a list of cloe_summary objects sorted by the chosen
criterion (see ?select_model for more information). The model selection plots
show the log-likelihood, log-posterior, AIC and WAIC. Aim for the simplest
model that explains the data well. As proxies for this, look for high
log-posterior and log-likelihood values, and low AIC and WAIC.
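
To make the trade-off between fit and complexity concrete, the sketch below
applies the textbook AIC definition (twice the number of parameters minus twice
the log-likelihood) to three made-up models. The log-likelihoods and parameter
counts are invented for illustration, and how Cloe counts parameters is not
described here, so treat this as a reading aid for the plots rather than a
reimplementation of select_model.

```r
# Textbook AIC: lower is better.
aic <- function(log_lik, n_params) 2 * n_params - 2 * log_lik

models <- data.frame(
  K        = c(3, 4, 5),
  log_lik  = c(-1250, -1180, -1176),  # made-up log-likelihoods
  n_params = c(30, 40, 50)            # made-up parameter counts
)
models$AIC <- aic(models$log_lik, models$n_params)

# Sort from best (lowest AIC) to worst.
models[order(models$AIC), ]
```

In this toy example K=5 fits slightly better than K=4, but not by enough to
justify its extra parameters, so K=4 has the lowest AIC.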

Validation dataset

Cloe's validation dataset (mixtures of single-cell diluted cell lines) is
available within Cloe's package.


# data
reads  <- cloe_val_reads
depths <- cloe_val_depths

# correct clonal structure
correct_genotypes <- cloe_val_Z
correct_fractions <- cloe_val_F

Learn more

For more information please refer to the html vignette and to the R
documentation of methods and classes.


Marass F, Mouliere F, Yuan K, Rosenfeld N, Markowetz F. 2016.
A phylogenetic latent feature model for clonal deconvolution.
The Annals of Applied Statistics. 10(4):2377-2404.


Francesco Marass ( francesco.marass __ )