# Welcome to Cloe

Cloe (pronounced like the name Chloë) is a computational biology tool to infer the clonal structure of heterogeneous tumour samples. It implements a phylogenetic latent feature model that discovers hierarchically-related patterns (clonal genotypes) in the samples, and with these describes the observed mutation data.

### News

#### Version 1.0 (2018-11-22)

- Mutation clustering
    - This step improves runtime and enables large-scale analyses (1,000-100,000 mutations and >10 clones) with Cloe.
    - A Chinese Restaurant Process clusters mutations with an infinite mixture of binomial distributions, and automatically identifies how many clusters are needed. Do check the resulting clusters before proceeding, to ensure that different biological signals have not been placed in the same cluster. This step works best if you have multiple samples.
    - The code for clustering is written in C++11 with RcppArmadillo.

- Optimisation of clonal fractions
    - Clonal fractions are no longer sampled by the MCMCMC sampler, but optimised (`limSolve::lsei`) given the data and the current genotypes. This is faster and helps mixing.

- Updated tree updates
    - The tree is now updated by Gibbs sampling and with a prune-regraft step. Gibbs sampling goes through each node `k` and looks for a new parent among all nodes outside of `k`'s subtree. Prune-regraft is a joint update of tree, genotypes (and fractions): genotypes of the moved subtree are updated so as to fit with the new parent; fractions are optimised given the new genotypes.

- Genotypes are updated a random portion at a time
    - Because genotypes and fractions keep each other in place during inference, a smaller genotypes update is performed, taking a random portion of mutations each time.

- Added AIC and WAIC for model selection
- Simpler ISA
    - Parallel mutations are defined as mutations (the current clone has the mutation, its parent does not) that occur despite having already appeared in the tree before. A previously seen mutation happens with a modified probability `mu * nu`, where `nu` is the ISA penalty, instead of `mu`.

- cowplot is now used for all plots
- Clones have been renamed
    - The normal clone is now called N (instead of C1), while the first non-normal clone is now C1 (instead of C2).
- Classes have changed somewhat
    - All three classes have changed a bit to accommodate the new features.

- Leaner code
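To make the clonal-fraction optimisation listed above more concrete, here is a minimal sketch of a constrained least-squares fit with `limSolve::lsei`. The linear model (observed allele fractions equal genotypes times clonal fractions), the toy matrix `Z` and all variable names are assumptions for illustration only; the objective Cloe actually optimises may differ.

```r
library(limSolve)

# Toy genotype matrix (assumed): rows are mutations, columns are clones.
Z <- rbind(
  c(1, 1, 1),
  c(0, 1, 0),
  c(0, 0, 1),
  c(0, 1, 0),
  c(0, 0, 1)
)

# Simulated noise-free observations from known fractions.
true_f <- c(0.5, 0.3, 0.2)
y <- as.vector(Z %*% true_f)

# Minimise ||Z f - y||^2 subject to sum(f) = 1 and f >= 0.
fit <- lsei(
  A = Z, B = y,
  E = matrix(1, nrow = 1, ncol = ncol(Z)), F = 1,
  G = diag(ncol(Z)), H = rep(0, ncol(Z))
)
fractions <- as.vector(fit$X)
```

The equality constraint keeps the fractions summing to one and the inequality constraints keep them non-negative, so the result can be read directly as a mixture of clones.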

Thanks to Jack Kuipers for useful discussions on some of these updates.
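The Gibbs tree update described in the list above considers, for each node `k`, every node outside `k`'s subtree as a possible new parent. The sketch below shows one way to enumerate those candidates in base R. The parent-vector encoding and the function names are hypothetical and are not taken from Cloe's code.

```r
# Parent vector (assumed encoding): parent[i] is the parent of node i,
# and 0 marks the root.
parent <- c(0, 1, 1, 2, 2, 3)

# Collect node k and all of its descendants.
subtree_of <- function(parent, k) {
  nodes <- k
  frontier <- k
  while (length(frontier) > 0) {
    frontier <- which(parent %in% frontier)
    nodes <- c(nodes, frontier)
  }
  nodes
}

# Candidate new parents for k: every node outside k's subtree.
candidate_parents <- function(parent, k) {
  setdiff(seq_along(parent), subtree_of(parent, k))
}

candidate_parents(parent, 2)  # nodes 1, 3 and 6
```

Excluding the subtree guarantees that reattaching `k` cannot create a cycle. The current parent stays among the candidates, so leaving the tree unchanged is one of the options.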

### Requirements

Cloe has been developed with `R >= 3.2.1`. It has been tested on Linux (Debian
stable) and Mac OS X (10.8.5 and later).

To install the R dependencies, run the following:

    install.packages(c(
      "R6", "cowplot", "digest", "ggplot2", "igraph", "limSolve",
      "RColorBrewer", "Rcpp", "RcppArmadillo", "reshape2", "scales"
    ))

### Install

Installing Cloe can be done directly from this repository:

    library(devtools)
    install_bitbucket("fm361/cloe")

If you have pandoc installed, you can also build the vignette:

    install_bitbucket("fm361/cloe", build_vignettes=TRUE)

### Ready to go

If the above commands have run successfully, you will be ready to run Cloe.
Please refer to the **vignette** for a tutorial on how to run Cloe. For a quick
overview of Cloe's workflow, read on.

Running Cloe consists of four steps (plus an optional one):

- Create an input object
- Cluster the mutations (optional)
- Run the sampler
- Get the best sets of parameters
- Select the model

Here is a brief example:

    library(cloe)

    # 0. Load in the data
    reads <- as.matrix(read.table("reads.txt", header=TRUE, row.names=1))
    depths <- as.matrix(read.table("depths.txt", header=TRUE, row.names=1))

    # 1. Create an input object
    ci <- cloe_input$new(reads, depths)
    # plot(ci)

    # 1.5. Cluster the mutations
    # ci <- crp(ci)
    # plot(ci)

    # 2. Run the sampler
    cm4 <- sampler(input=ci, iterations=10000, K=4, chains=1)
    # plot(cm4)

    # 3. Get the best sets of parameters
    cs4 <- summarise(cm=cm4, burn=0.5, thin=20)
    # plot(cs4)

    # 4. Select the model
    #
    # css <- list(cs3, cs4, cs5)
    # top_cs <- select_model(l=css, solutions=6L, plot=TRUE)
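If you just want files that the `read.table` calls in step 0 can load, the following sketch writes a pair of simulated `reads.txt` and `depths.txt` files. The layout (mutations in rows, samples in columns), the row and column names, and the simulated values are assumptions for illustration; refer to the vignette for the input format Cloe expects.

```r
set.seed(1)

n_mut <- 6      # number of mutations (rows, assumed layout)
n_samples <- 3  # number of samples (columns, assumed layout)
labels <- list(paste0("mut", seq_len(n_mut)), paste0("S", seq_len(n_samples)))

# Simulated sequencing depths, and mutant read counts drawn from a binomial.
depths <- matrix(rpois(n_mut * n_samples, lambda = 200),
                 nrow = n_mut, dimnames = labels)
vaf <- matrix(runif(n_mut * n_samples, min = 0, max = 0.5), nrow = n_mut)
reads <- matrix(rbinom(n_mut * n_samples, size = depths, prob = vaf),
                nrow = n_mut, dimnames = labels)

# col.names = NA writes an empty header cell above the row names, so that
# read.table(..., header = TRUE, row.names = 1) restores the row names.
write.table(reads, "reads.txt", sep = "\t", quote = FALSE, col.names = NA)
write.table(depths, "depths.txt", sep = "\t", quote = FALSE, col.names = NA)
```

Both files share the same row and column names, so entry `(i, j)` of `reads` is the mutant read count out of the `depths[i, j]` reads covering mutation `i` in sample `j`.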

In the optional, but recommended, step 1.5, you can cluster mutations with a
Chinese Restaurant Process. Plot the resulting object to ensure that the
clustering has not mixed different biological signals into the same cluster. If
that happened, rerun `crp` with a larger value of `alpha` (see `?crp` for more
information).
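To see why a larger `alpha` tends to split the data into more clusters, the sketch below simulates a plain Chinese Restaurant Process in base R. It assumes that `alpha` plays the role of the CRP concentration parameter, which the README suggests but does not state; `rcrp` and `expected_clusters` are hypothetical helpers, not part of Cloe.

```r
# Draw one partition of n items from a CRP with concentration alpha.
rcrp <- function(n, alpha) {
  z <- integer(n)
  z[1] <- 1L
  for (i in seq_len(n)[-1]) {
    counts <- tabulate(z[seq_len(i - 1)])
    # Join an existing cluster with probability proportional to its size,
    # or open a new cluster with probability proportional to alpha.
    z[i] <- sample.int(length(counts) + 1L, size = 1L, prob = c(counts, alpha))
  }
  z
}

# Expected number of clusters after n items:
# sum over i = 1..n of alpha / (alpha + i - 1).
expected_clusters <- function(n, alpha) {
  sum(alpha / (alpha + seq_len(n) - 1))
}

set.seed(42)
z <- rcrp(100, alpha = 1)
table(z)  # cluster sizes for one simulated partition

expected_clusters(100, alpha = 1)
expected_clusters(100, alpha = 10)  # larger alpha, more clusters on average
```

This only illustrates the prior over partitions. In the actual clustering step the binomial likelihood of the read counts also drives the cluster assignments.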

In step 2, the sampler runs our MCMCMC algorithm using the number of clones `K`
that you specify. If you do not know how many clones are present in the data,
you should run the sampler for several likely values, and select "the best
model" in step 4.

By default Cloe runs 4 parallel tempered chains. You can change this behaviour
by specifying how many chains you wish and their temperatures (e.g.
`chains=2, temperatures=c(1, 0.9)`). There is no point in running multiple
parallel chains if they do not swap their states efficiently and throughout the
run. To check that all went smoothly, plot the `cloe_mcmc` object returned by
`sampler()`. If some chains are not swapping, reduce the temperature intervals
between them.
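The advice to reduce temperature intervals follows from the standard swap rule of Metropolis-coupled MCMC, sketched below. It assumes that each temperature acts as an exponent on the posterior, which is the usual convention but is not confirmed by this README; `log_swap_accept` is a hypothetical helper, not a Cloe function.

```r
# Log acceptance probability for swapping the states of chains i and j,
# assuming chain i targets posterior^temp_i (assumed convention).
log_swap_accept <- function(logpost_i, logpost_j, temp_i, temp_j) {
  min(0, (temp_i - temp_j) * (logpost_j - logpost_i))
}

# Cold chain at log-posterior -1000, hotter chain at -1010 (made-up values).
exp(log_swap_accept(-1000, -1010, temp_i = 1, temp_j = 0.9))  # exp(-1), about 0.37
exp(log_swap_accept(-1000, -1010, temp_i = 1, temp_j = 0.5))  # exp(-5), about 0.007
```

For the same pair of states, a wider temperature gap makes the swap much less likely, which is why chains that are too far apart rarely exchange states.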

The `summarise` function of step 3 discards iterations at the beginning of the
chain with the `burn` option (it takes a proportion of the iterations, e.g.
`burn=0.5` discards the first half of the chain), thins the chain by keeping
every `i`-th iteration with `thin=i`, and returns a number of solutions sorted
by decreasing log-posterior probability.
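As a rough guide to how many iterations survive burn-in and thinning, the sketch below works through the values used in the example above. The exact indexing inside `summarise` is not documented here, so treat the first kept iteration as an assumption.

```r
iterations <- 10000
burn <- 0.5   # proportion of iterations discarded at the start
thin <- 20    # keep every 20th iteration after burn-in

# Assumed indexing: start right after the discarded portion.
first_kept <- floor(iterations * burn) + 1
kept <- seq(from = first_kept, to = iterations, by = thin)

length(kept)  # 250
range(kept)   # 5001 9981
```

With these settings a run of 10,000 iterations leaves 250 states from which the solutions are ranked.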

**Note:** you can plot all of Cloe's classes, and plots are automatically
written to disk. This behaviour may change in the future.

### Model selection

`select_model` returns a list of `cloe_summary` objects sorted by the chosen
criterion (see `?select_model` for more information). The model selection plots
show the log-likelihood, log-posterior, AIC and WAIC. You would want to choose
the simplest model that best explains the data. As proxies for this, look for
high log-posterior and log-likelihood values, and low AIC and WAIC.
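For reference, the sketch below implements the textbook definitions of AIC and WAIC in base R. How Cloe counts parameters and computes these criteria internally is not described in this README, so the functions, the parameter counts and the log-likelihood values here are illustrative assumptions.

```r
# AIC: penalise the maximised log-likelihood by the number of parameters.
aic <- function(loglik, n_params) {
  2 * n_params - 2 * loglik
}

# WAIC from a matrix of pointwise log-likelihoods:
# one row per posterior draw, one column per observation.
waic <- function(ll) {
  # Log pointwise predictive density, using log-mean-exp for stability.
  lppd <- sum(apply(ll, 2, function(x) max(x) + log(mean(exp(x - max(x))))))
  # Effective number of parameters: variance of the log-likelihood per observation.
  p_waic <- sum(apply(ll, 2, var))
  -2 * (lppd - p_waic)
}

# Made-up values: the larger model fits slightly better but is penalised more.
aic(loglik = -1200, n_params = 30)  # 2460
aic(loglik = -1190, n_params = 45)  # 2470
```

In this made-up comparison the smaller model has the lower AIC despite its lower log-likelihood, which is the kind of trade-off the model selection plots are meant to expose.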

### Validation dataset

Cloe's validation dataset (mixtures of single-cell diluted cell lines) is available within Cloe's package.

    library(cloe)

    # data
    reads <- cloe_val_reads
    depths <- cloe_val_depths

    # correct clonal structure
    correct_genotypes <- cloe_val_Z
    correct_fractions <- cloe_val_F

### Learn more

For more information please refer to the html vignette and to the R documentation of methods and classes.

### Citation

Marass F, Mouliere F, Yuan K, Rosenfeld N, Markowetz F. 2016.
A phylogenetic latent feature model for clonal deconvolution.
*The Annals of Applied Statistics*. 10(4):2377-2404.

### Contacts

Francesco Marass ( francesco.marass __ bsse.ethz.ch )