1. Andrew Roth
  2. pyclone

Wiki

Clone wiki

pyclone / Tutorial

UNDER CONSTRUCTION

Overview

In this tutorial we will run a PyClone analysis on four samples generated by mixing healthy normal tissue from four cases from the 1000 genomes project. This tutorial assumes you are using PyClone 0.12.3. Most of this tutorial will apply to the older 0.12.x versions, though some functions may not exist.

This tutorial also assumes you are working in a Unix like environment. I have tested it on Linux, and it should work on Mac though I can't guarantee that. It almost certainly will not work for Windows and there will likely never be support in PyClone for Windows.

A Note About Nomenclature

Historically PyClone has referred to the proportion of cancer cells which contain a mutation as the cellular frequency of the mutation. Recent publications in the field have referred to the same quantity as the cellular prevlance of the mutation. In this document we will generally use prevalence, but some of the PyClone functions use frequency.

Getting Input Data

PyClone requires that you generate prior information about the genotype of each mutation in the sample. In principle this depends on the algorithm(s) you use for copy number inference, tumour content prediction and your own beliefs. In practice specifying priors is challenging so PyClone includes some functionality to automatically generate the required prior information based on simpler input information.

Before you can use PyClone you will need the following information for each mutation in the sample.

  1. Allelic count data from a sequencing experiment. PyClone requires that you specify the number of reads overlapping the mutation which contain the reference allele and the number of reads which contain the variant allele.

  2. The copy number of the genomic region containing the mutation. PyClone can work with either predictions of total copy number or parental (allele specific) copy number. In general performance will be better if you can specify parental copy number.

In addition to above data for each mutation, PyClone can also use an estimate of the tumour content of the sample. The tumour content is not strictly required, but if it is not passed performance will be worse. In addition the estimates of cellular prevalence of the mutations will need to be interpreted differently.

  • If you specify the tumour content, then the cellular prevalence estimates represent the fraction of cancer cells which contain the mutation.

  • If you do not specify the tumour content, which in practice means setting it to 1.0, then the cellular prevalence estimates represent the fraction of all cells in the sample which contains the mutation. This means that the highest estimated cellular prevalence will be less then or equal to the true tumour content.

Note : If you have estimates of tumour content use them, otherwise the inferred cellular prevalences may exceed the predicted tumour content.

Where To Get The Data

The sequencing data can be obtained from any sequencing platform which provides digital allelic count information. For examples: HiSeq, MiSeq, 454 and Ion Torrent sequencers would all provided the required data. The data will of course need to be aligned and the allelic counts extracted.

In principle whole genomes shotgun sequencing (WGSS) or exome capture sequencing data could be used. In practice the depth of these approaches will be to low for an accurate PyClone analysis. The preferred approach is to use deep sequence data acquired by targeted amplicon sequencing or custom capture arrays.

The copy number information can be elicited from either arrays such as the Affymetrix SNP6.0 platform, or from WGSS data. Computational tools will need to be applied in either case to infer the copy number profiles of the genomes. Tools which can predict parental copy number are ideal, and even better are tools which also provide an estimate of tumour content.

Assuming you have derived a copy number profile for your samples, you will need to extract the copy number of the segments which contain your mutations.

Tutorial Data

For this tutorial we will use data from a mixture of normal tissue samples. There are four samples in this dataset which were generated by mixing tissues from four 1000 genomes cases in varying proportions. The data was deeply sequenced using the Illumina MiSeq platform.

Positions in which only one of the cases has a variant genotype (AB or BB) are included in this dataset. Conceptually this is equivalent to a sample with four populations of diploid clones, which share no mutations. Because we excluded sex chromosomes, the total copy number of all positions is 2. For this dataset we still need to get the parental copy number for mutation. This can be done since we have the predicted genotype of the variants (AB or BB). In the case the mutation is AB the major and minor copy numbers would both be 1. In the case the genotype was BB the major copy number would be 2 and the minor copy number would be one.

Below is an example of the first two rows of one of the input files, SRR385938.tsv. The input files are located under the tsv/ folder.

mutation_id ref_counts var_counts normal_cn minor_cn major_cn variant_case variant_freq genotype
NA12156:BB:chr2:175263063 3812 14 2 0 2 NA12156 0.0036591741 BB

Most PyClone analyses will start from one or more files like this. The first row is the header, the subsequent rows correspond to mutations.

There are 6 mandatory fields in for the input file.

  1. mutation_id : This is unique identifier for a mutation. In general specifying the gene for the mutation is a bad idea in case a gene contains multiple mutations. Usually some combination of gene name and genomic coordinates is a good choice. In this case I have used the case with the variant, the genotype of mutation in the variant case and the genomic coordinates.

  2. ref_counts : This is the number of reads which contain the reference allele for the mutation.

  3. var_counts : This is the number of reads which contain the variant (mutant) allele for the mutation.

  4. normal_cn : This is the copy number of the mutant locus for the normal cells in the sample. In most cases this will be 2, with the following exceptions (there may be some others I haven't considered).

    i. If the sample is from a male and the mutation is on a sex chromosomes (X or Y) you would expect the normal cells to have copy number 1.

    ii. If the normal tissue has a germline copy number variant you would need to set the copy number to the predicted value. The only way to get this is to run a copy number analysis on normal tissue from the same donor.

  5. minor_cn : This is the minor parental copy number predicted from the tumour sample.

  6. major_cn : This is the major parental copy number predicted from the tumour sample.

Note : We cannot usually tell what allele is maternal or paternal, so we use minor and major copy number instead. The convention is that the major copy number is the larger of the two values.

Note : If you only have total copy number for the tumour, not parental copy number, you can set the minor_cn to 0 and the major_cn to the predicted total copy number. When we use the PyClone build_mutations_file file command you will need to set the flag --var_prior total_copy_number in this case. By default the command assumes parental copy number information is being passed.

This files also contains 3 additional fields. PyClone will ignore these fields during analyses, and they are only useful for your own annotations. Any fields beyond the 6 mandatory ones mentioned above will be ignored by PyClone.

For reference the additional 3 fields in this file are

  1. variant_case : The 1000 genomes ID of the case which contains the variant genotype for this mutation.

  2. variant_freq : The fraction of reads showing the variant allele.

  3. genotype : The genotype of the case with the variant.

Building A Genotype Prior File

Once you have downloaded the data and installed PyClone you will need to unpack the data somewhere using

tar -zxvf tutorial.tar.gz

This will create a directory called tutorial which you can enter using cd tutorial. There is one file and one folder in this directory.

  • config.yaml is template file which we will edit later to setup a PyClone analysis.

  • tsv/ is a folder which contains input files for this analysis.

The first thing we will need to do is take the files in tsv/ and convert them to a format PyClone can work with. Specifically we need to specify the possible states of each mutation and their prior probabilities. The state of a mutation is combination of genotypes for the normal, reference and variant populations. PyClone includes a command build_mutations_file for this purpose, which will read the simple .tsv files and output a file with the state priors.

To do this we will create a folder for the output.

mkdir yaml

Now we will use the build_mutations_file command.

PyClone build_mutations_file tsv/SRR385938.tsv yaml/SRR385938.yaml

PyClone build_mutations_file tsv/SRR385939.tsv yaml/SRR385939.yaml

PyClone build_mutations_file tsv/SRR385940.tsv yaml/SRR385940.yaml

PyClone build_mutations_file tsv/SRR385941.tsv yaml/SRR385941.yaml

This will generate the files for each of the four samples. There are some options for the build_mutations_file command, which you can see by typing

PyClone build_mutations_file --help

For this tutorial the defaults will suffice. The first few lines of the file yaml/SRR385938.yaml look like

mutations:
- id: NA12156:BB:chr2:175263063
  ref_counts: 3812
  states:
  - {g_n: AA, g_r: AA, g_v: AB, prior_weight: 1}
  - {g_n: AA, g_r: AA, g_v: BB, prior_weight: 1}
  var_counts: 14
- id: NA12156:BB:chr1:46500613
  ref_counts: 3933
  states:
  - {g_n: AA, g_r: AA, g_v: AB, prior_weight: 1}
  - {g_n: AA, g_r: AA, g_v: BB, prior_weight: 1}
  var_counts: 42

This file uses the YAML format, which a machine parsable format like XML. The first line

mutations:

specifies we are listing out the mutations for this sample. The next line

- id: NA12156:BB:chr2:175263063

specifies we are starting an entry for a mutation with the id NA12156:BB:chr2:175263063. Note the - indicates were are starting a new mutation entry.

The lines

ref_counts: 3933

and

var_counts: 14

specify the reference and variant counts for the mutations.

The state priors are specified by

states:
  - {g_n: AA, g_r: AA, g_v: AB, prior_weight: 1}
  - {g_n: AA, g_r: AA, g_v: BB, prior_weight: 1}

Note there is an entry (started by -) for each possible state we believe is possible. For each state we specify 4 values:

  1. g_n : A string representing the genotype of the cells in the normal sub-population.

  2. g_r : A string representing the genotype of the cells in the reference sub-population. Recall the reference sub-populations consists of all cancer cells which lack the mutation. Thus valid genotypes here should not contain a B allele. For examples A, AA, AAA, AAAAAAAA are all valid but B, AB, BB, AAAAB are invalid.

  3. g_v : A string representing the genotype of the cells in the variant sub-population. Recall the variant sub-populations consists of all cancer cells which posses the mutation. Thus valid genotypes should contain at least one B allele. For example B, AB, BB, ABBB are all valid but A, AA, AAB are not valid.

  4. prior : The relative prior belief you have in this state versus the others specified. This value will normalised so that the sum over the prior beliefs for all states for a mutation equals 1.0.

PyClone won't warn you about invalid entries so you need to be careful if you create the yaml files by hand. However, the build_mutations_file will always create valid files.

If you really understand the PyClone model you can write the yaml files yourself without using build_mutations_file. This means that you could include germline mutations or LOH events. You can also re-purpose the model for other applications.

Building An Analysis File

Before PyClone can perform an analysis we need to create one more YAML format file. This file tells PyClone

  • The directory structure on the system

  • Where the file with the mutation informations reside

  • The model which we want to run

  • The tumour content and error rates for sequencing for each sample were are going to analyse

  • Various parameter settings for the model.

Their is no helper function included with PyClone so this file will have to be created manually.

The Template File

The tutorial data includes a file config.yaml which will be the template we start from. The file looks like

num_iters: 10000

base_measure_params:
  alpha: 1
  beta: 1

concentration:
  value: 1.0

  prior:
    shape: 1.0
    rate: 0.001

The first line

num_iters: 10000

sepcifies the number of MCMC iterations PyClone will perform for the analysis. More iterations will lead to a more accurate estimate of the posterior distribution at the cost of more computational effort.

Note: One way to check if you have run a sufficient number of iterations is to perform the analysis twice from different initialisations (default if you don't set the --seed flag with the analyse command). If the results are the same then it is likely that enough iterations have been run. If not you will need to re-run for more iterations.

Note: In my experience 10,000 iterations with a burnin of 1,000 seems sufficient for convergence. Burnin is set during the post-processing phase.

The next lines

base_measure_params:
  alpha: 1
  beta: 1

specify the prior distribution on the base measure for the clonal frequencies. The default values are uniformly spread over [0, 1]. Most users will not need to change these default values.

The next few lines

concentration:
  value: 1.0

  prior:
    shape: 1.0
    rate: 0.001

set the initial value and priors for the concentration parameter in the Dirichlet Process (DP). Again most users will not need to edit these and the defaults should suffice.

If the lines

  prior:
    shape: 1.0
    rate: 0.001

are omitted the concentration parameter will not be inferred and the value specified by the line

  value: 1.0

will be used for the analysis.

Specifying The Model

The first thing we will do is add an entry to specify which density (model) we want to use. There are several options

  • gaussian : This option will use an infinite Gaussian mixture model (IGMM) fit to the variant allelic frequencies.

  • binomial : This option will use an infinite Binomial mixture model (IBMM) fit to allelic count data.

  • beta_binomial : This option will use an infinite Beta Binomial mixture model (IBBMM) fit to allelic count data.

  • pyclone_binomial : This option will use an the PyClone density with a Binomial emission.

  • pyclone_beta_binomial : This option will use an the PyClone density with a Beta Binomial emission. This will be the choice most users will want.

We will use the PyClone model with the Beta Binomial emission, to specify this add following line to the config.yaml file.

density: pyclone_beta_binomial

Since we are using the Beta Binomial emission model we will need to specify some priors over the precision of the distribution. We will add the following lines to the file to do this.

beta_binomial_precision_params:
  # Starting value
  value: 1000

  # Parameters for Gamma prior distribution
  prior:
    shape: 1.0
    rate: 0.0001

  # Precision of Gamma proposal function for Metropolis Hastings step
  proposal:
    precision: 0.01

In general these values should work for most users.

Specifying Directory Structures and File Locations

Now we will need to tell PyClone where the files can be found. To keep the configuration file more readable PyClone specifies a working directory where all analyses will be performed, every other path in the config.yaml file is relative to this location.

To do this add the line

working_dir: /path/to/tutorial/directory

where /path/to/tutorial/directory is the location you unpacked the tutorial data.

Now we will tell PyClone where to place the raw output of the MCMC sampler. To do this add the following line

trace_dir: trace

which will cause PyClone to place the output in /path/to/tutorial/directory/trace.

Specifying Sample Information

Next we need to tell PyClone where to find the files containing the information about the mutations for each sample, the tumour content for each sample and the error rate of the sequencing instrument used to generate the data for the sample.

We can do this by adding the following lines

# Specify the samples for the analysis.
samples:
  # Unique sample ID
  SRR385938:
    # Path where YAML formatted mutations file for the sample is placed.
    mutations_file: yaml/SRR385938.yaml

    tumour_content:
      # The predicted tumour content for the sample. If you have no estimate set this to 1.0.
      value: 1.0

    # Expected sequencing error rate for sample
    error_rate: 0.001

  SRR385939:
    mutations_file: yaml/SRR385939.yaml

    tumour_content:
      value: 1.0

    error_rate: 0.001

  SRR385940:
    mutations_file: yaml/SRR385940.yaml

    tumour_content:
      value: 1.0

    error_rate: 0.001

  SRR385941:
    mutations_file: yaml/SRR385941.yaml

    tumour_content:
      value: 1.0

    error_rate: 0.001

The line

samples:

states the next block of the file pertains to the sample configuration. For each sample we will have an entry like

  # Unique sample ID
  SRR385938:
    # Path where YAML formatted mutations file for the sample is placed.
    mutations_file: yaml/SRR385938.yaml

    tumour_content:
      # The predicted tumour content for the sample. If you have no estimate set this to 1.0.
      value: 1.0

    # Expected sequencing error rate for sample
    error_rate: 0.001

The first part

  SRR385938:

specifies the unique ID of the sample.

The next line

    mutations_file: yaml/SRR385938.yaml

specifies that the mutation data file for this sample is at /path/to/tutorial/directory/yaml/SRR385938.yaml.

The next line

    tumour_content:
      # The predicted tumour content for the sample. If you have no estimate set this to 1.0.
      value: 1.0

specifies the tumour content of the sample.

Since these samples are an artificial mixture of healthy tissue, the tumour content is actually 1.0.

The final line

    error_rate: 0.001

specifies the sequencing error rate for the sample. This can vary between samples, though it likely will be the same. If you don't know this value you can use the one here as it has a small impact on inference provide the value is not extremely large like 0.01 - 1.0. A value of 0.001 says that we expect 1 in every 1,000 reads to incorrectly report an allele.

Note: Not every sample needs to contain the same set of mutations. In the event that there are different mutations between samples, PyClone will restrict the analysis to the subset of mutations present in all samples.

Final Config File

Your configuration files should now look like

num_iters: 10000

base_measure_params:
  alpha: 1
  beta: 1

concentration:
  value: 1.0

  prior:
    shape: 1.0
    rate: 0.001

density: pyclone_beta_binomial

beta_binomial_precision_params:
  value: 1000

  prior:
    shape: 1.0
    rate: 0.0001

  proposal:
    precision: 0.01

working_dir: /path/to/tutorial/directory

trace_dir: trace

samples:
  SRR385938:
    mutations_file: yaml/SRR385938.yaml

    tumour_content:
      value: 1.0

    error_rate: 0.001

  SRR385939:
    mutations_file: yaml/SRR385939.yaml

    tumour_content:
      value: 1.0

    error_rate: 0.001

  SRR385940:
    mutations_file: yaml/SRR385940.yaml

    tumour_content:
      value: 1.0

    error_rate: 0.001

  SRR385941:
    mutations_file: yaml/SRR385941.yaml

    tumour_content:
      value: 1.0

    error_rate: 0.001

Running An Analysis

Once you have the configuration file specified running an analysis can be done using the following command.

PyClone analyse config.yaml

The only option the analyse command has is the --seed flag. This sets the random seed for the analysis so that you can reproduce the results of your run. If this not set PyClone will use a random value for the seed.

Note: The analysis will take a while (~10 minutes on my computer). The memory footprint is small (<100 MB) for this analysis so you should be able to use a Desktop computer.

Post-Processing The Results

Clustering The Results

The command

PyClone cluster config.yaml cluster.tsv --burnin 1000

will output a clustering file, cluster.tsv. This file has two columns

  1. mutation_id : The unique ID of the mutation.

  2. cluster_id : The ID of the cluster the mutation was assigned to. The cluster IDs are only meaningful in the sense that two mutations which share the same cluster ID belong in the same cluster. The actual numerical values have no meaning.

You will notice we specified the option --burnin 1000. This tells the command to discard the first 1,000 samples from the MCMC trace as burnin. Inference will then be based on the remaining 9,000 samples. All subsequent post-processing commands will use the same flag.

Plotting The Similarity Matrix

The posterior similarity matrix shows how frequently two mutations appeared in the same cluster in the post-burning MCMC trace. Values near 1.0 indicate the mutations likely cluster together, whereas values near 0.0 indicate the mutations are likely in separate clusters. The hard clustering generated by the cluster command uses this matrix internally, though this matrix is potentially more informative as it reflects uncertainty.

The command for this is

PyClone plot_similarity_matrix config.yaml similarity_matrix.pdf --burnin 1000

which creates the file similarity_matrix.pdf in pdf format.

If you change the extension of the file to .svg the file will be output in svg format. This should also works for .png and .jpg.

Plotting The Posterior Cellular Frequency Densities

The posterior density of the cellular frequencies show the probability that a mutation has a given cellular frequency in the sample.

Since we have multiple samples, there will be multiple plots (one per sample). We need to create a folder to output these plots in.

Note: Even if you only use a single sample the plot_cellular_frequencies takes a folder as the output argument.

mkdir cellular_frequencies

Now we can run the plotting command pointing the output to the folder we created.

PyClone plot_cellular_frequencies config.yaml cellular_frequencies/ --burnin 1000

Plotting The Multi-Sample Parallel Coordinates

To plot a summary of the clustering and cellular frequencies across all samples you can use the following command

PyClone plot_multi_sample config.yaml multi_sample.pdf

This command will create a parallel coordinates plot, where the mean value of each cluster is plotted across the samples and connected by a line whose width is proportional to the number of mutations in the cluster.

If you would like to see each mutation plotted as a separate line, but still colour coded by cluster, use the following command.

PyClone plot_multi_sample config.yaml multi_sample.pdf --separate_lines

Note: It may look like there are fewer lines then expected, this is because the mean cellular frequencies for many mutations are identical.

It can be useful to visualise the underlying variant allelic frequencies as well. To plot the allelic frequencies of each mutation across the samples, colour coded by cluster, use the following command.

PyClone plot_multi_sample config.yaml multi_sample.pdf --separate_lines --prevalence allelic

Note: The plot_multi_sample will fail if there are more then 12 clusters identified. The reason is that the colour palette does not support more than 12 colours. If you run into this problem you can output the underlying results using the build_table and generate your own plots.

To get the table used to generate this plot use the command

PyClone build_table config.yaml results.tsv

Publication Quality Plots

PyClone includes various plotting commands with the hope that they will be useful. They are limited however, and getting publication quality plots may require some more work.

Note: If there is some plotting functionality you think would be generally useful, feel free to open a feature request under the issue tracker. I will try to address it in subsequent versions.

If the built in commands are insufficient for your needs, you can always extract the underlying data from the files in trace directroy. If you look in the trace folder

ls -l trace

you should see the following

-rw-r--r-- 1 andrew andrew  76687 Jun 25 12:42 alpha.tsv.bz2
-rw-r--r-- 1 andrew andrew   7049 Jun 25 12:42 labels.tsv.bz2
-rw-r--r-- 1 andrew andrew  74343 Jun 25 12:42 precision.tsv.bz2
-rw-r--r-- 1 andrew andrew 202990 Jun 25 12:42 SRR385938.cellular_frequencies.tsv.bz2
-rw-r--r-- 1 andrew andrew 198842 Jun 25 12:42 SRR385939.cellular_frequencies.tsv.bz2
-rw-r--r-- 1 andrew andrew 138864 Jun 25 12:42 SRR385940.cellular_frequencies.tsv.bz2
-rw-r--r-- 1 andrew andrew 209196 Jun 25 12:42 SRR385941.cellular_frequencies.tsv.bz2

The file labels.tsv.bz2 contains the cluster labels of each mutation for each iteration of the MCMC chain. The file is a bz2 compressed text file, which is tab delimited. The first row is a header which lists each mutation, and the subsequent rows are the samples from each iteration. Each column corresponds to the labels sampled for the respective mutations.

The files *.cellular_frequencies.tsv.bz2 contain the sampled cellular frequencies for each mutation. They follow the same format as the labels file and should be straightforward to parse.

The files alpha.tsv.bz2 and precision.tsv.bz2 contain the MCMC traces for the DP concentration parameter and Beta Binomial precision parameter. These are probably only interesting for debugging purposes, or for power users with a deep understanding of the model and parameter meanings.

Getting Help

Every PyClone command can take the argument --help which will print a list of options. Try the following

PyClone --help

PyClone analyse --help

PyClone build_mutations_file --help

If you have any questions about the software or the model the user group is a good place to ask them. The issues tracker at the top of the page is a good place to post any bugs or feature requests.

Updated