Welcome to Cloe
Cloe (pronounced like the name Chloë) is a computational biology tool to infer the clonal structure of heterogeneous tumour samples. It implements a phylogenetic latent feature model that discovers hierarchically-related patterns (clonal genotypes) in the samples, and with these describes the observed mutation data.
Version 1.0 (2018-11-22)
- Mutation clustering
- This step improves runtime and enables large-scale analyses (1,000-100,000 mutations and >10 clones) with Cloe.
- A Chinese Restaurant Process clusters mutations with an infinite mixture of binomial distributions, and automatically identifies how many clusters are needed. Do check the resulting clusters before proceeding, to ensure that different biological signals have not been placed in the same cluster. This step works best if you have multiple samples.
- The code for clustering is written in C++11 with RcppArmadillo.
- Optimisation of clonal fractions
- Clonal fractions are no longer sampled by the MCMCMC sampler, but optimised (limSolve::lsei) given the data and the current genotypes. This is faster and helps mixing.
- Updated tree updates
- The tree is now updated by Gibbs sampling and with a prune-regraft step. Gibbs sampling goes through each node k and looks for a new parent among all nodes outside of k's subtree. Prune-regraft is a joint update of tree, genotypes (and fractions): genotypes of the moved subtree are updated so as to fit with the new parent; fractions are optimised given the new genotypes.
- Genotypes updated a random portion at a time.
- Because genotypes and fractions keep each other in place during inference, a smaller genotypes update is performed, taking a random portion of mutations each time.
- Added AIC and WAIC for model selection
- Simpler ISA
- Parallel mutations are defined as mutations (the current clone has the mutation, its parent does not) that occur despite having already appeared in the tree before. A previously seen mutation happens with a modified probability mu * nu, where nu is the ISA penalty, rather than with the unpenalised probability.
- cowplot is now used for all plots
- Clones have been renamed
- The normal clone is now called N (instead of C1), while the first non-normal clone is now C1 (instead of C2).
- Classes have changed somewhat
- All three classes have changed a bit to cope with the novelties.
- Leaner code
Thanks to Jack Kuipers for useful discussions on some of these updates.
Cloe has been developed with R >= 3.2.1. It has been tested on Linux (Debian stable) and Mac OS X (10.8.5 and later).
To install its R dependencies, run the following:

    install.packages(c("R6", "cowplot", "digest", "ggplot2", "igraph", "limSolve",
                       "RColorBrewer", "Rcpp", "RcppArmadillo", "reshape2", "scales"))

Alternatively, the dependencies should be installed automatically when you install Cloe directly from this repository:

    library(devtools)
    install_bitbucket("fm361/cloe", dependencies=TRUE)

If you have pandoc installed, you can also build the vignette:
Ready to go
If the above commands have run successfully, you will be ready to run Cloe. Please refer to the vignette for a tutorial on how to run Cloe. For a quick overview of Cloe's workflow, read on.
Running Cloe consists of four steps (plus an optional one):
- Create an input object
- Cluster the mutations (optional)
- Run the sampler
- Get the best sets of parameters
- Select the model
Here is a brief example:

    library(cloe)

    # 0. Load in the data
    reads <- as.matrix(read.table("reads.txt", header=TRUE, row.names=1))
    depths <- as.matrix(read.table("depths.txt", header=TRUE, row.names=1))

    # 1. Create an input object
    ci <- cloe_input$new(reads, depths)
    # plot(ci)

    # 1.5. Cluster the mutations
    # ci <- crp(ci)
    # plot(ci)

    # 2. Run the sampler
    cm4 <- sampler(input=ci, iterations=10000, K=4, chains=1)
    # plot(cm4)

    # 3. Get the best sets of parameters
    cs4 <- summarise(cm=cm4, burn=0.5, thin=20)
    # plot(cs4)

    # 4. Select the model
    #
    # css <- list(cs3, cs4, cs5)
    # top_cs <- select_model(l=css, solutions=6L, plot=TRUE)

In the optional, but recommended, step 1.5, you can cluster mutations with a Chinese Restaurant Process. Plot the resulting object to ensure that the clustering has not mixed different biological signals into the same cluster. If that happened, rerun crp with a larger value of the relevant argument (see ?crp for more information).
In step 2, the sampler runs our MCMCMC algorithm using the number of clones
that you specify. If you do not know how many clones are present in the data,
you should run the sampler for several likely values, and select "the best
model" in step 4.
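Running the sampler for several values of K can be wrapped in a small helper. The sketch below is one possible way to do it: fit_candidates is a hypothetical function, not part of Cloe, and it only reuses the sampler and summarise calls (with the same settings) from the example above.

```r
# Sketch: fit one model per candidate number of clones.
# fit_candidates() is a hypothetical helper, not part of the cloe package.
fit_candidates <- function(ci, Ks, iterations = 10000) {
  lapply(Ks, function(k) {
    # Step 2: run the sampler with k clones
    cm <- cloe::sampler(input = ci, iterations = iterations, K = k, chains = 1)
    # Step 3: summarise the chain (same burn-in and thinning as the example)
    cloe::summarise(cm = cm, burn = 0.5, thin = 20)
  })
}

# Usage (requires the cloe package and an input object ci from step 1):
# css <- fit_candidates(ci, Ks = 3:5)
# top_cs <- cloe::select_model(l = css, solutions = 6L, plot = TRUE)
```

Each value of K is an independent run, so the runs could also be launched as separate jobs and their summaries collected afterwards.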
By default Cloe runs 4 parallel tempered chains. You can change this behaviour by specifying how many chains you wish and their temperatures (e.g. chains=2, temperatures=c(1, 0.9)). There is no point in running multiple parallel chains if they do not swap their states efficiently and throughout the run. To check that all went smoothly, plot the cloe_mcmc object returned by sampler(). If some chains are not swapping, reduce the intervals between their temperatures.
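Why narrower temperature intervals help swapping can be seen from the textbook Metropolis-coupled MCMC swap rule. The sketch below is generic and not taken from Cloe's source. It assumes that each temperature acts as an exponent on the posterior (so that 1 is the untempered chain), and the log-posterior values are made up for illustration.

```r
# Generic Metropolis-coupled MCMC swap rule (illustrative; not Cloe's code).
# Assumption: chain i targets posterior^temperatures[i], so a temperature of 1
# corresponds to the untempered ("cold") chain.
swap_accept_prob <- function(log_post, temperatures, i, j) {
  # Log acceptance ratio for exchanging the current states of chains i and j
  log_ratio <- (temperatures[i] - temperatures[j]) * (log_post[j] - log_post[i])
  min(1, exp(log_ratio))
}

# Hypothetical log-posteriors of the current states of two chains;
# the cold chain (first) currently holds the better state.
log_post <- c(-1040, -1050)

# Narrow temperature interval: swaps are accepted fairly often
p_narrow <- swap_accept_prob(log_post, temperatures = c(1, 0.9), i = 1, j = 2)

# Wide temperature interval: swaps are rarely accepted
p_wide <- swap_accept_prob(log_post, temperatures = c(1, 0.5), i = 1, j = 2)

print(c(narrow = p_narrow, wide = p_wide))
```

Under this rule the acceptance probability shrinks exponentially with the temperature difference, which is why chains that are too far apart stop exchanging states.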
The summarise function of step 3 discards iterations at the beginning of the chain with the burn option (it takes a proportion of the iterations, e.g. burn=0.5 discards the first half of the chain), thins the chain by taking every i-th iteration with thin=i, and returns a number of solutions sorted by decreasing log-posterior probability.
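To make the effect of burn and thin concrete, the sketch below lists which iterations would be retained for the settings used in the example. kept_iterations is a hypothetical helper written for illustration only; the exact indexing used inside summarise may differ.

```r
# Illustrative only: which iterations survive burn-in and thinning.
# kept_iterations() is a hypothetical helper, not part of the cloe package.
kept_iterations <- function(iterations, burn, thin) {
  first <- floor(iterations * burn) + 1  # first iteration after the burn-in
  seq(from = first, to = iterations, by = thin)
}

kept <- kept_iterations(iterations = 10000, burn = 0.5, thin = 20)
length(kept)  # 250
head(kept, 3) # 5001 5021 5041
```

With 10,000 iterations, burn=0.5 and thin=20 therefore leave about 250 samples from which the best solutions are drawn.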
Note: you can plot all of Cloe's classes, and plots are automatically written to disk. This behaviour may change in the future.
select_model returns a list of cloe_summary objects sorted by the chosen criterion (see ?select_model for more information). The model selection plots show the log-likelihood, log-posterior, AIC and WAIC. You would want to choose the simplest model that best explains the data. As proxies for this, look for high log-posterior and log-likelihood values, and low AIC and WAIC.
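For readers unfamiliar with the two criteria, the sketch below gives their textbook definitions in base R. It is not Cloe's internal code: how Cloe counts parameters or stores pointwise log-likelihoods is not stated here, so the function names and arguments are assumptions for illustration.

```r
# Textbook definitions (illustrative; not Cloe's internal code).

# AIC: penalises the maximised log-likelihood by the number of parameters.
aic <- function(log_lik, n_params) {
  2 * n_params - 2 * log_lik
}

# WAIC on the deviance scale, from an S x N matrix of pointwise
# log-likelihoods (S posterior samples in rows, N observations in columns).
waic <- function(ll) {
  lppd <- sum(log(colMeans(exp(ll))))  # log pointwise predictive density
  p_waic <- sum(apply(ll, 2, var))     # effective number of parameters
  -2 * (lppd - p_waic)
}

# Toy input: 4 posterior samples, 3 observations, constant likelihood of 0.5
ll_toy <- matrix(log(0.5), nrow = 4, ncol = 3)
aic(log_lik = -100, n_params = 5)
waic(ll_toy)
```

Both criteria trade goodness of fit against model complexity, so lower values indicate a better balance. For very negative log-likelihoods, exp(ll) can underflow and a log-sum-exp formulation would be safer.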
Cloe's validation dataset (mixtures of single-cell diluted cell lines) is available within Cloe's package.

    library(cloe)

    # data
    reads <- cloe_val_reads
    depths <- cloe_val_depths

    # correct clonal structure
    correct_genotypes <- cloe_val_Z
    correct_fractions <- cloe_val_F

For more information please refer to the html vignette and to the R documentation of methods and classes.
Marass F, Mouliere F, Yuan K, Rosenfeld N, Markowetz F. 2016. A phylogenetic latent feature model for clonal deconvolution. The Annals of Applied Statistics. 10(4):2377-2404.
Francesco Marass ( francesco.marass __ bsse.ethz.ch )