Overview

To run a PyClone analysis you need to perform several steps.

  1. Prepare mutations input file(s).

    1. Prepare .tsv input file.

    2. Run PyClone build_mutations_file --in_files TSV_FILE where TSV_FILE is the input file you have created.

  2. Prepare a configuration file for the analysis.

  3. Run the PyClone analysis using the PyClone run_analysis --config_file CONFIG_FILE command, where CONFIG_FILE is the file you created in step 2.

  4. (Optional) Plot results using the plot_clusters and plot_loci commands.

  5. (Optional) Build summary tables using the build_table command.

Simple usage

The easiest way to run a PyClone analysis is to use the run_analysis_pipeline command, which runs all the steps outlined above. It requires that the tab-delimited input files have already been generated. See the "TSV input file" section for how to prepare these files.

Let's assume you have created three input files: A.tsv, B.tsv and C.tsv. The names of the files will be used as the sample names in the plots, though this behaviour can be overridden with the --samples flag. To run a PyClone analysis, run the following command.

PyClone run_analysis_pipeline --in_files A.tsv B.tsv C.tsv --working_dir pyclone_analysis

This will create a directory pyclone_analysis. After the command completes the directory will contain several folders and the file config.yaml.

config.yaml
plots/
tables/
trace/
yaml/

The contents of the directory are as follows:

  • config.yaml - This file specifies the configuration used for the PyClone analysis.

  • plots - Contains all plots from the analysis. There will be two sub-folders clusters/ and loci/ for cluster and locus specific plots respectively.

  • tables - This contains the output tables with summarized results for the analysis. There will be two tables clusters.tsv and loci.tsv, for cluster and locus specific information.

  • trace - This is the raw trace from the MCMC sampling algorithm. Advanced users may wish to work with these files directly for generating plots and summary statistics.
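
When there are many samples, the pipeline invocation shown earlier can be assembled from a script rather than typed by hand. The following is a minimal Python sketch. The helper name pipeline_command is hypothetical, and the sketch assumes that --samples takes one name per input file, in the same order, which the text above does not spell out.

```python
import shlex


def pipeline_command(in_files, working_dir, samples=None):
    """Build the argument list for PyClone run_analysis_pipeline."""
    cmd = ["PyClone", "run_analysis_pipeline", "--in_files", *in_files,
           "--working_dir", working_dir]
    if samples is not None:
        # Assumption: one sample name per input file, in matching order.
        cmd += ["--samples", *samples]
    return cmd


cmd = pipeline_command(["A.tsv", "B.tsv", "C.tsv"], "pyclone_analysis")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) to launch the analysis once PyClone is installed.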

The run_analysis_pipeline command supports several options. You can run the following command to see the complete list of options.

PyClone run_analysis_pipeline -h

More advanced usage

Each step run by the run_analysis_pipeline command can be run separately. This can be useful if you want to customise the analysis in a way not supported by run_analysis_pipeline, or if you want to parallelise plotting.

The setup_analysis command can be useful to automate the preparation of the YAML format mutations files and configuration file. You can run the following command to see the usage details.

PyClone setup_analysis -h

Prepare input files

To run a PyClone analysis you need to prepare a set of properly formatted YAML files. One of these files is a meta file which points to the other YAML files and contains details of the analysis. The other YAML files contain information about the read counts and genotype priors for each sample.

TSV input file

The build_mutations_file command takes a tab-delimited file with a header as input and produces a YAML formatted file which can be used for running a PyClone analysis. Example files are contained in the examples/mixing/tsv folder which ships with the PyClone software.

The required fields in this file are:

  • mutation_id - A unique ID to identify the mutation. Good IDs are things such as the genomic coordinates of the mutation, e.g. chr22:12345. Gene names are not good IDs because one gene may have multiple mutations, in which case the ID is not unique and PyClone will fail to run or, worse, give unexpected results. If you want to include the gene name, I suggest adding the genomic coordinates, e.g. TP53_chr17:753342.

  • ref_counts - The number of reads covering the mutation which contain the reference (genome) allele.

  • var_counts - The number of reads covering the mutation which contain the variant allele.

  • normal_cn - The copy number of the cells in the normal population. For autosomal chromosomes this will be 2 and for sex chromosomes it could be either 1 or 2. For species besides human other values are possible.

  • minor_cn - The minor copy number of the cancer cells. Usually this value will be predicted from WGSS or array data.

  • major_cn - The major copy number of the cancer cells. Usually this value will be predicted from WGSS or array data.

If you do not have major and minor copy number information you should set the minor copy number to 0 and the major copy number to the predicted total copy number. If you do this, make sure to use total_copy_number for the --prior flag of the build_mutations_file, setup_analysis and run_analysis_pipeline commands. DO NOT use the parental_copy_number or major_copy_number methods, as they assume you have knowledge of the minor and major copy number.

Any additional columns in the tsv file will be ignored so feel free to add additional annotation fields.
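
The required columns described above can be written with any tool that produces tab-delimited text. As a rough illustration, the Python sketch below writes a header plus two rows and rejects duplicate mutation_id values before PyClone ever sees them. The helper name write_mutations_tsv and all counts and copy numbers are made up for the example. The second row shows the case with no major and minor copy number information, where minor_cn is 0 and major_cn is the predicted total.

```python
import csv
import io

REQUIRED_FIELDS = ["mutation_id", "ref_counts", "var_counts",
                   "normal_cn", "minor_cn", "major_cn"]


def write_mutations_tsv(rows, handle):
    """Write rows (dicts) as tab-delimited text with a header, rejecting duplicate IDs."""
    seen = set()
    writer = csv.DictWriter(handle, fieldnames=REQUIRED_FIELDS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if row["mutation_id"] in seen:
            raise ValueError("duplicate mutation_id: " + row["mutation_id"])
        seen.add(row["mutation_id"])
        writer.writerow(row)


# Hypothetical values, for illustration only.
rows = [
    {"mutation_id": "chr22:12345", "ref_counts": 80, "var_counts": 20,
     "normal_cn": 2, "minor_cn": 1, "major_cn": 1},
    # No allele-specific copy number: minor_cn = 0, major_cn = predicted total (3 here).
    {"mutation_id": "TP53_chr17:753342", "ref_counts": 60, "var_counts": 40,
     "normal_cn": 2, "minor_cn": 0, "major_cn": 3},
]

buffer = io.StringIO()
write_mutations_tsv(rows, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To write to disk instead, pass a file opened with open("A.tsv", "w", newline="") in place of the in-memory buffer.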

From TSV -> YAML mutations file

The information provided in this file is deliberately kept simple, but it is insufficient to run a PyClone analysis. In order to produce a file with enough information the PyClone build_mutations_file TSV_FILE command attempts to guess some details about the possible states. All states guessed by this command will be weighted equally.

The key flags which provide some control over build_mutations_file are (typing PyClone build_mutations_file -h will also print out help):

--prior

Method used to set the possible genotypes.

  1. major_copy_number - Considers all possible genotypes with up to the major copy number of B alleles.

  2. parental_copy_number - Considers all possible genotypes compatible with the predicted parental copy number.

  3. total_copy_number - Considers all possible genotypes compatible with the predicted total copy number. If reliable parental copy number is available the parental_copy_number method should be chosen.

Default is major_copy_number.

Advanced input

For more control over the states or prior weights, a YAML file can be created directly. The files under the examples/mixing/yaml/parental_copy_number directory show the basic format. Rather than manually trying to format the file, it is highly recommended that a YAML library such as PyYAML be used to help create the files.

One top-level node is required in the file.

  • mutations - A list of mutations and their possible states. See below.

Under the mutations node we define each mutation as an item with the following nodes.

  • id - The unique ID of the mutation.

  • ref_counts - The number of reads covering the mutation which contain the reference (genome) allele.

  • var_counts - The number of reads covering the mutation which contain the variant allele.

  • states - A list of possible states for the sample at this mutation along with prior weights.

Under the states node we define the following elements.

  • g_n - The genotype of the normal population in the state.

  • g_r - The genotype of the reference population in the state.

  • g_v - The genotype of the variant population in the state.

  • prior_weight - The relative prior weight of the state. The values will be normalised across all states to create a valid probability.
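
To make the nesting concrete, the sketch below builds the structure described above as plain Python data and shows what normalising the prior weights amounts to. PyClone performs the normalisation itself, so the helper normalise_prior_weights is purely illustrative. The genotype strings (AA, AB, BB), the counts and the weights are assumptions chosen for the example and are not taken from the shipped example files.

```python
def normalise_prior_weights(states):
    """Return the prior weights of the states rescaled to sum to one."""
    total = float(sum(state["prior_weight"] for state in states))
    if total <= 0:
        raise ValueError("prior weights must have a positive sum")
    return [state["prior_weight"] / total for state in states]


# One mutation with three candidate states (all values hypothetical).
mutation = {
    "id": "chr22:12345",
    "ref_counts": 80,
    "var_counts": 20,
    "states": [
        {"g_n": "AA", "g_r": "AA", "g_v": "AB", "prior_weight": 1},
        {"g_n": "AA", "g_r": "AA", "g_v": "BB", "prior_weight": 1},
        {"g_n": "AA", "g_r": "AB", "g_v": "BB", "prior_weight": 2},
    ],
}

# The single required top-level node.
mutations_file = {"mutations": [mutation]}

probabilities = normalise_prior_weights(mutation["states"])
print(probabilities)  # [0.25, 0.25, 0.5]
```

A dictionary shaped like mutations_file could then be serialised with a YAML library such as PyYAML, as recommended above.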

Building a PyClone configuration file

Before you can run a PyClone analysis you need to prepare a YAML formatted configuration file. This file will specify the details of the MCMC run including the number of iterations, what model you want to use, and what samples need to be included.

The entries in the configuration file will vary depending on what value you set for the 'density' entry. The following entries will be in all configuration files. For example files see the examples/mixing directory.

Common entries

Below is an example PyClone configuration file.

# Specifies working directory for analysis. All paths in the rest of the file are relative to this.
working_dir: /some/where

# Where the trace (output) from the PyClone MCMC analysis will be written.
trace_dir: trace

# Specifies which density will be used to model read counts. Most people will want pyclone_beta_binomial or pyclone_binomial
density: pyclone_beta_binomial

# Number of iterations of the MCMC chain.
num_iters: 10000

# Specifies parameters in Beta base measure for DP. Most people will want the values below.
base_measure_params:
  alpha: 1
  beta: 1

# Specifies initial values and prior parameters for the prior on the concentration (alpha) parameter in the DP. If the prior node is not set the concentration will not be estimated and the specified value will be used.
concentration:
  # Initial value if prior is set, or fixed value otherwise for concentration parameter.
  value: 1.0

  # Specifies the parameters in the Gamma prior over the concentration parameter.
  prior:
    shape: 1.0
    rate: 0.001

The entries are as follows.

  • working_dir - This is the working directory for the analysis. All paths in the configuration file will be relative to this. This is mainly used to reduce the verbosity of the configuration file.

  • trace_dir - This is where the trace files for each parameter of the MCMC analysis will be written. This folder will be created if it does not exist.

  • num_iters - The number of iterations of the MCMC chain.

  • base_measure_params - This is a node which will have sub-entries depending on the density chosen. These sub-entries supply the parameters for the base measure of the Dirichlet process.

  • concentration - This is a node which has several sub-entries related to the concentration parameter (alpha) for the Dirichlet Process.

    • value - Specifies the value for the concentration parameter. If this is to be inferred this value is simply the starting value.

    • prior - This node has two sub-entries. If this node is omitted the value specified for the concentration parameter will be fixed and not inferred.

      • shape - The shape parameter in the Gamma prior on the concentration parameter. A reasonable value seems to be 1.0.

      • rate - The rate parameter in the Gamma prior on the concentration parameter. A reasonable value seems to be 0.001.

  • samples - This is a node which will contain one or more sub-entries which specify details about the samples used in the analysis. Each sub-entry is a node which contains, at a minimum, a unique sample ID and a path to a TSV or YAML format input file. For the genotype-naive densities (gaussian, binomial and beta-binomial) the input files follow the simple TSV format outlined above. Copy number information will be ignored for these methods. For the pyclone_binomial or pyclone_beta_binomial methods you need to pass YAML formatted mutations files.

A sample input for a genotype-naive (binomial, beta_binomial, gaussian) method is

samples:
  # Unique sample ID
  SRR385938:
    # Path where tsv formatted mutations file for the sample is placed.
    mutations_file: tsv/SRR385938.tsv

and for a PyClone (pyclone_binomial, pyclone_beta_binomial) method is

samples:
  # Unique sample ID
  SRR385938:
    # Path where YAML formatted mutations file for the sample is placed.
    mutations_file: yaml/parental_copy_number/SRR385938.yaml

    tumour_content:
      # The predicted tumour content for the sample. If you have no estimate set this to 1.0.
      value: 1.0

    # Expected sequencing error rate for sample
    error_rate: 0.001

Run PyClone

Once the required YAML files have been created, PyClone can be run using the PyClone run_analysis --config_file CONFIG_FILE command. To see the possible flags for the command run PyClone run_analysis -h. The run_analysis command will run a full MCMC analysis of the data, writing the results to several files in the specified output directory. All files are bz2-compressed, tab-separated files, so they can easily be read by external tools.
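
Because the output files are bz2-compressed, tab-separated text, they can be read with standard library modules alone. The Python sketch below is one way to do it. The helper name read_trace is hypothetical, and the demonstration file is a made-up stand-in written on the spot, since the exact file names and column layout of the real trace files are not described here.

```python
import bz2
import csv
import os
import tempfile


def read_trace(path):
    """Return all rows of a bz2-compressed, tab-separated file as lists of strings."""
    with bz2.open(path, mode="rt", newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


# Self-contained demonstration: write a tiny stand-in file, then read it back.
# The file name and contents are invented; real trace files may be laid out differently.
demo_path = os.path.join(tempfile.mkdtemp(), "example_trace.tsv.bz2")
demo_rows = [["chr22:12345", "chr1:100"], ["0.5", "0.9"], ["0.4", "0.8"]]
with bz2.open(demo_path, mode="wt", newline="") as handle:
    csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(demo_rows)

rows = read_trace(demo_path)
print(rows[0])
```

For a real analysis, point read_trace at a file inside the trace directory named in the configuration file.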

Post-process results

PyClone provides some convenience methods for post-processing the raw trace files from the PyClone MCMC sampler. These tools are useful for getting a quick feel for the data, but for publication quality plots you will likely need to work with the raw trace produced by PyClone.

If there is a plotting feature which would be generally helpful, feel free to make a feature request on the issues tracker.

In the following, CONFIG_FILE refers to the file used with the PyClone run_analysis command and OUT_FILE is the path where the plot file will be written.

About burnin and thin

PyClone uses MCMC sampling to approximate the posterior distribution of the model. A common idea in MCMC analysis is the need to burnin and thin samples.

The burnin defines how many samples will be discarded from the start of the MCMC run. This is done to allow the sampler to reach the true posterior density. A reasonable rule of thumb is to set the burnin to 10% of the length of the MCMC run. For example, if you ran the sampler for 10,000 iterations you would discard the first 1,000 samples.

The thin defines the interval at which post-burnin samples are kept. For example, if thin was 10, only every 10th post-burnin sample would be used. Some argue that thinning decreases the correlation between samples, while others hold that it is not useful. In general, leaving thin at the default value of 1, which uses every post-burnin sample, is recommended. If you find the post-processing commands are running very slowly you may want to increase thin, since it reduces the computational burden.
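
The effect of the two settings can be illustrated on a plain list standing in for an MCMC trace. This is only a sketch of the idea, not PyClone's internal code. The helper name apply_burnin_and_thin is hypothetical, and keeping the first post-burnin sample when thinning is an assumption.

```python
def apply_burnin_and_thin(samples, burnin, thin=1):
    """Drop the first `burnin` samples, then keep every `thin`-th remaining sample."""
    if burnin < 0 or thin < 1:
        raise ValueError("burnin must be >= 0 and thin must be >= 1")
    return samples[burnin::thin]


# Stand-in trace: one value per iteration of a 10,000 iteration run.
trace = list(range(10000))

# Rule of thumb from the text: discard the first 10% of the run.
burnin = len(trace) // 10

kept_all = apply_burnin_and_thin(trace, burnin)               # thin = 1 (default)
kept_thinned = apply_burnin_and_thin(trace, burnin, thin=10)  # every 10th sample

print(len(kept_all), len(kept_thinned))  # 9000 900
```

With thin set to 10 the post-processing step handles a tenth as many samples, which is where the saving in computation comes from.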

Plotting posterior cellular prevalence densities

To plot the posterior density of the cellular prevalences use

PyClone plot_loci --config_file CONFIG_FILE --plot_file OUT_FILE --plot_type density

See the help, PyClone plot_loci -h, for command options.

Plotting the posterior similarity matrix

To output the posterior similarity matrix, which shows how often mutations were sampled to be in the same cluster, use

PyClone plot_loci --config_file CONFIG_FILE --plot_file OUT_FILE --plot_type similarity_matrix

See the help, PyClone plot_loci -h, for command options.

Plotting multiple sample parallel coordinate plots

To plot the mean cellular prevalences of mutations, colour coded by cluster ID, use

PyClone plot_clusters --config_file CONFIG_FILE --plot_file OUT_FILE --plot_type parallel_coordinates

See the help, PyClone plot_clusters -h, for command options.

Build loci results table

PyClone provides a method to output the mean cellular prevalence across each sample and the cluster id for each mutation. This can be used to generate various plots using your own code.

To get this table use the

PyClone build_table --config_file CONFIG_FILE --table_file OUT_FILE --table_type loci

command.

See the help, PyClone build_table -h, for command options.
