MuClone

About

MuClone is a statistical framework for simultaneous detection and classification of mutations across multiple samples of a patient from whole genome or exome sequencing data. MuClone incorporates prior knowledge about the cellular prevalences of clones to improve the performance of detecting mutations.

Dependencies

  • numpy (tested for 1.8.0, 1.8.1 and 1.10.4)
  • PyYAML >= 3.10
  • pandas (tested for 0.13.1, 0.14.0 and 0.18.0)
  • PyDP >= 0.2.3

Tutorial

Getting input data

For each sample, the following information is needed before you can use MuClone:

  1. Allelic count data from sequencing: the number of reads overlapping the locus that match the reference allele and the number that match the variant allele. The sequencing data can come from any platform that provides digital allelic count information.
  2. The copy number of the genomic region containing the locus, including the major and minor (parental) copy numbers.
  3. The tumour content of the sample.
  4. Cellular prevalence information for the sample.

Any tool that predicts parental copy number and estimates tumour content can be used to obtain the copy number and tumour content information.

Assuming you have derived a copy number profile for your samples, you will need to extract the copy number of the segments that contain your locus.
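
The lookup described above can be sketched as follows. This is only an illustration: the segment tuple layout `(chrom, start, end, minor_cn, major_cn)`, the segment values, and the function name `copy_number_at` are assumptions and not part of MuClone; adapt them to whatever your copy number caller produces.

```python
# Hypothetical copy number segments: (chrom, start, end, minor_cn, major_cn).
# The layout and the values are illustrative assumptions, not a MuClone format.
SEGMENTS = [
    ("chr1", 1, 100000000, 1, 1),
    ("chr1", 100000001, 200000000, 0, 2),
    ("chr21", 1, 40000000, 1, 2),
]


def copy_number_at(locus, segments):
    """Return (minor_cn, major_cn) of the segment containing 'chrom:pos', or None."""
    chrom, pos = locus.split(":")
    pos = int(pos)
    for seg_chrom, start, end, minor_cn, major_cn in segments:
        # Segment bounds are treated as inclusive on both ends.
        if seg_chrom == chrom and start <= pos <= end:
            return minor_cn, major_cn
    return None  # locus not covered by any segment
```

A locus that falls outside every segment returns None, so such positions can be filtered out before building the input files.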

Tools that predict cellular prevalence from either bulk targeted sequencing or single cell sequencing data can be used to obtain the cellular prevalence information.

Input tsv (tab separated) file

Below is an example of the first two rows of one of the input files.

Position ref_counts var_counts normal_cn minor_cn major_cn variant_freq
chr1:156382063 34 3 2 0 2 0.08823
  • Position: identifier of the mutation's position in the genome (the mutation id).
  • ref_counts: the number of reads that contain the reference allele for the mutation.
  • var_counts: the number of reads that contain the variant allele for the mutation.
  • normal_cn: the copy number of the mutant locus in the normal cells of the sample.
  • minor_cn: the minor parental copy number predicted from the tumour sample.
  • major_cn: the major parental copy number predicted from the tumour sample.
  • variant_freq: the fraction of reads showing the variant allele.
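
As a quick check that an input file has the expected columns, the example rows can be parsed as in the sketch below. It uses only the Python standard library and the header names from the example row; the helper name `read_positions` and the sanity check are assumptions added for illustration, not part of MuClone.

```python
import csv
import io

# The two example rows from above, joined with tabs.
EXAMPLE_TSV = (
    "Position\tref_counts\tvar_counts\tnormal_cn\tminor_cn\tmajor_cn\tvariant_freq\n"
    "chr1:156382063\t34\t3\t2\t0\t2\t0.08823\n"
)

INT_COLUMNS = ("ref_counts", "var_counts", "normal_cn", "minor_cn", "major_cn")


def read_positions(handle):
    """Parse a tab-separated position file into a list of typed dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        row["variant_freq"] = float(row["variant_freq"])
        # Sanity check: the minor copy number cannot exceed the major one.
        if row["minor_cn"] > row["major_cn"]:
            raise ValueError("minor_cn > major_cn at %s" % row["Position"])
        rows.append(row)
    return rows


positions = read_positions(io.StringIO(EXAMPLE_TSV))
```

For a real file, pass an open file handle instead of the in-memory example.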

tsv/ is the folder that contains the input files for this analysis.

Input yaml files

Next, we need to take the files in tsv/ and convert them to a format MuClone can work with.

  • position.yaml First, convert the tsv position files into yaml files located under the yaml/ directory.

Create the yaml/ directory (mkdir yaml) before running the conversion.

Each position yaml file should look like the example below. It lists the positions, the number of reads matching the reference and variant alleles, and the possible genotype states of each mutation.

mutations:
- id: chr21:1752663
  ref_counts: 32
  states:
  - {g_n: AA, g_r: AA, g_v: AB, prior_weight: 1}
  - {g_n: AA, g_r: AA, g_v: BB, prior_weight: 1}
  var_counts: 2
- id: chr1:26245613
  ref_counts: 26
  states:
  - {g_n: AA, g_r: AA, g_v: AB, prior_weight: 1}
  - {g_n: AA, g_r: AA, g_v: BB, prior_weight: 1}
  var_counts: 12
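
To make the layout above concrete, the following sketch formats one entry of the mutations list as text. It is not MuClone's own converter: the function name `mutation_entry` is hypothetical, and the genotype states are supplied by the caller rather than derived from the copy number, since that derivation is not described here.

```python
def mutation_entry(mutation_id, ref_counts, var_counts, states):
    """Format one entry of the 'mutations:' list in the layout shown above.

    'states' is a list of (g_n, g_r, g_v, prior_weight) tuples supplied by the
    caller; choosing them from the copy number is outside this sketch.
    """
    lines = [
        "- id: %s" % mutation_id,
        "  ref_counts: %d" % ref_counts,
        "  states:",
    ]
    for g_n, g_r, g_v, prior_weight in states:
        lines.append(
            "  - {g_n: %s, g_r: %s, g_v: %s, prior_weight: %s}"
            % (g_n, g_r, g_v, prior_weight)
        )
    lines.append("  var_counts: %d" % var_counts)
    return "\n".join(lines)


# Reproduce the first entry of the example file.
states = [("AA", "AA", "AB", 1), ("AA", "AA", "BB", 1)]
text = "mutations:\n" + mutation_entry("chr21:1752663", 32, 2, states) + "\n"
```

Writing `text` to a file under yaml/ gives a file with the same structure as the example.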
  • config.yaml config.yaml is a template file which we will edit later to set up the MuClone analysis.

Before MuClone can run any analysis, we need to create one more YAML file. This file gives MuClone information about:

  • The directory structure on the system
  • Where the files with the mutation information reside
  • The tumour content and sequencing error rate for each sample
  • Prior clonal information

Here is an example of the config file, which needs to be filled out manually.

working_dir: /Users/fdorri/Documents/UBC/projects/muClone/results/DG1133
trace_dir: tmp
samples:
  P1k:
    mutations_file: yaml/P1k/P1k.yaml
    flat_cluster_files: tmp/P1k_cluster2Phi.tsv

    tumour_content:
      value: 0.44
    error_rate: 0.01
  P1i:
    mutations_file: yaml/P1i/P1i.yaml
    flat_cluster_files: tmp/P1i_cluster2Phi.tsv

    tumour_content:
      value: 0.46

working_dir is the directory in which MuClone performs the whole analysis; all other paths in config.yaml are relative to this location.
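
The relative-path rule can be illustrated with a short sketch. The dict below mirrors the example config as a YAML loader would return it; the placeholder working_dir and the helper name `resolve_sample_paths` are assumptions for illustration, and this is not how MuClone itself necessarily resolves paths.

```python
import os

# A plain dict mirroring the example config, as a YAML loader would return it.
# The working_dir here is a placeholder, not a real path.
config = {
    "working_dir": "/path/to/results",
    "trace_dir": "tmp",
    "samples": {
        "P1k": {
            "mutations_file": "yaml/P1k/P1k.yaml",
            "flat_cluster_files": "tmp/P1k_cluster2Phi.tsv",
            "tumour_content": {"value": 0.44},
            "error_rate": 0.01,
        },
    },
}


def resolve_sample_paths(config):
    """Return {sample: {key: full path}} for the per-sample file entries."""
    base = config["working_dir"]
    resolved = {}
    for name, sample in config["samples"].items():
        resolved[name] = {
            key: os.path.join(base, sample[key])
            for key in ("mutations_file", "flat_cluster_files")
        }
    return resolved


paths = resolve_sample_paths(config)
```

Checking that every resolved path exists before starting a run is a cheap way to catch typos in config.yaml.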

  • Input cellular prevalence file Each line of the cellular prevalence file describes one clone.

    prior phi
    0.27 0.35
    • prior: the prevalence of the clone
    • phi: the cellular prevalence of the clone
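
A minimal sketch for reading such a file is shown below. It assumes a header line `prior phi` followed by one whitespace- or tab-separated pair of numbers per clone, as in the example; the helper name `read_clones` and the range check on phi are additions for illustration only.

```python
import io


def read_clones(handle):
    """Parse lines of 'prior phi' into a list of (prior, phi) float pairs."""
    header = handle.readline().split()
    if header != ["prior", "phi"]:
        raise ValueError("unexpected header: %r" % header)
    clones = []
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        prior, phi = (float(x) for x in line.split())
        # A cellular prevalence is a fraction of cells, so it must lie in [0, 1].
        if not 0.0 <= phi <= 1.0:
            raise ValueError("phi must lie in [0, 1], got %r" % phi)
        clones.append((prior, phi))
    return clones


clones = read_clones(io.StringIO("prior\tphi\n0.27\t0.35\n"))
```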

Running MuClone

Once you have the input files, running an analysis can be done using the following command.

python2.7 $code_dir/mutation_classifier/main.py analysis --config_file $config_file

where $code_dir is the directory into which you have cloned the MuClone code. You also need to add this path to $PYTHONPATH.

Here is the list of MuClone's arguments. config_file is the only mandatory one.

--config_file
--sample_name
--precision
--wildtype_prior
--phi_threshold
  • config_file: the config.yaml file that sets up the analysis.
  • precision: depends on your data. The default for WGS data is 1000.
  • wildtype_prior: the default is 0.5.
  • phi_threshold: the threshold for distinguishing mutation clones.
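
When launching several runs from a script, the command line can be assembled as in the sketch below. The flag names are the ones listed above; the helper name `build_command` and the example paths are hypothetical. Options that are left out are simply not passed, so MuClone's own defaults apply.

```python
import os


def build_command(code_dir, config_file, **options):
    """Return the argument list for the analysis command.

    'options' may hold sample_name, precision, wildtype_prior, phi_threshold.
    """
    allowed = ("sample_name", "precision", "wildtype_prior", "phi_threshold")
    unknown = set(options) - set(allowed)
    if unknown:
        raise ValueError("unknown options: %s" % ", ".join(sorted(unknown)))
    command = [
        "python2.7",
        os.path.join(code_dir, "mutation_classifier", "main.py"),
        "analysis",
        "--config_file", config_file,
    ]
    # Append only the optional flags that were explicitly requested.
    for name in allowed:
        if name in options:
            command += ["--" + name, str(options[name])]
    return command


# Placeholder paths; replace with your clone location and config file.
command = build_command("/path/to/MuClone", "config.yaml", precision=1000)
```

The resulting list can be passed to `subprocess.call`, with an environment in which PYTHONPATH includes the code directory.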

Output

MuClone's output files are written to a time-stamped directory in tmp/trace/. Here is the list of output files:

MuClone-labels.tsv 
MuClone-posterior.tsv
MuClone-sample-results.tsv
parameters.txt

MuClone-sample-results.tsv is the main output file showing the result of MuClone.
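
Because each run writes to its own time-stamped directory, a small helper can locate the results of the latest run. The sketch below picks the most recently modified subdirectory instead of parsing the directory name, since the exact time-stamp format is not documented here; the helper name `latest_results` is hypothetical.

```python
import os


def latest_results(trace_root, filename="MuClone-sample-results.tsv"):
    """Return the path of 'filename' in the most recently modified run directory."""
    run_dirs = [
        os.path.join(trace_root, name)
        for name in os.listdir(trace_root)
        if os.path.isdir(os.path.join(trace_root, name))
    ]
    if not run_dirs:
        raise IOError("no run directories under %s" % trace_root)
    # Use modification time as a stand-in for the time stamp in the name.
    newest = max(run_dirs, key=os.path.getmtime)
    return os.path.join(newest, filename)
```

For example, `latest_results("tmp/trace")` called from inside working_dir returns the path at which the main results file of the latest run is expected.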

References

  • Roth, A., Khattra, J., Yap, D., Wan, A., Laks, E., Biele, J., Ha, G., Aparicio, S., Bouchard-Cote, A., and Shah, S. P. (2014). PyClone: statistical inference of clonal population structure in cancer. Nature Methods, 11(4), 396–8.

Contact

  • fdorri@cs.ubc.ca
  • sshah@bccrc.ca