Wiki

Clone wiki

invar / Documentation

Home / Methods / Documentation / Source / Contact / Rosenfeld Lab

This repository contains code for detection and quantification of ctDNA from plasma samples where a list of patient-specific mutations is already known using the INVAR algorithm.



What is the INVAR pipeline?

INVAR uses error-suppressed sequencing data, and weights and aggregates signal across up to thousands of patient-specific loci. Detailed background can be found here.

The pipeline is composed of a series of shell scripts, which run other parallelised jobs via Slurm on a cluster, based on bash, R and Python scripting. The pipeline is run in stages, and takes up to 1-2 days to run on a cluster depending on the size of the input data.


Requirements

The INVAR code is optimised to run in parallel on a Slurm cluster, with the following software:

  • slurm 17.11.4
  • samtools 1.9
  • htslib 1.7
  • bcftools 1.6
  • picard 2.10.2
  • bedtools 2.27.1
  • R 3.6.1, with the following libraries:
    • data.table 1.12.8
    • dplyr 0.8.5
    • ggplot2 3.3.0
    • ggpubr 0.2.5
    • plotROC 2.2.1
    • plyr 1.8.6
    • readr 1.3.1
  • Python 2.7.5, with the following libraries:
    • argparse 1.1
    • csv 1.0
    • pysam 0.9.0

The following reference files were used in our analysis. Similar files are obtainable online (these exact files are available upon request):

  • UCSC hg19 fasta
    • hg19/ucsc.hg19.fasta
  • Bedtools .genome file of hg19 fasta
    • bedtools_hg19/hg19.genome
  • 1000 genomes database for annotation of SNPs
    • 1000genomes/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
  • COSMIC database for annotation of mutations
    • /scratcha/nrlab/resources/COSMIC/v82/CosmicCodingMuts.vcf.gz

The paths to software and reference files must be defined in config/config.sh, please see Setting up config files.


Input files

The INVAR pipeline requires error-suppressed BAM files, a patient-specific mutation list (as a BED and as a CSV), and sample details as a CSV. Example files can be found in the example_files/ folder of the repository, shown here.

  • BED file of patient-specific mutations
    • BED file containing all patient-specific locations, annotated with REF and ALT bases. An example file can be found in example_files/example.bed. Columns required, in order:
    • chromosome, position-1, position, ref base, alt base.
  • TXT file with path to bam files
    • Please generate a text file to point to all the bam files to be run in the INVAR analysis. One file per line. If all your files are present (or symbolically linked) in the input/ folder, you can generate the text file, as follows: input/*bam > to_run.txt. An example file can be found in example_files/to_run.txt.
  • CSV of sample details (SLX_table)
    • Table containing sample details, e.g. sequencing lane ID (SLX) and barcode. An example file can be found in example_files/example_SLX.csv. Example values are given in square brackets, and are left blank where optional to fill in. Although columns can be left blank, entire columns should not be skipped. Columns required, in order:
    • study_name (string) [MELR]
    • sample_name (string) [timepoint1]
    • sample_description []
    • further_comments []
    • sample_type ["plasma_DNA", "urine_DNA", "xxxx"]
    • case_or_control ["case", "control"]
    • SLX_ID (string to identify sequencing lane) ["SLX-10000"]
    • barcode (string to identify barcode) ["D705_D501"]
    • input_into_library_ng []
  • CSV of patient-specific mutation list (TUMOUR_MUTATIONS_CSV)
    • This CSV contains tumour mutation information for subsequent analysis steps that are performed in R. An example file can be found in example_files/tumour.mutations.csv. Columns required, in no particular order:
    • REF ["A", "C", "G", "T"]
    • ALT ["A", "C", "G", "T"]
    • uniq_pos (string in chrX:XXXX format, 0-based) ["chr1:11111"]
    • uniq_alt (string, combining uniq_pos with REF and ALT, separated by a "/") ["chr1:11111_A/T"]
    • Patient (string) ["Patient1", "Patient222", "Patient52"]
    • pt_uniq_pos (string, combine Patient and uniq_pos with an underscore, i.e. the mutation belonging to this patient) ["Patient222_chr1:11111"]
    • tumour_AF (numeric, tumour allele fraction between 0 and 1) [0.2]

The paths to each of these input files must be defined in config/config.sh, please see Setting up config files.


Setting up config files

Next, both config.sh and config2.R scripts must be updated before starting the pipeline. These are found in the config/ folder of the cloned pipeline, and start with default values that were used in the INVAR paper.

config.sh
This config file is read by each shell script in the pipeline, and the variables are passed to the subsequent scripts. Variables to be defined by the user for each project:

  • STUDY_ID (string, any string allowed)
  • ERROR_SUPPRESSION_NAME (string, we have used "f0.9_s2", indicating a minimum error-suppression family size of 2, with a minimum consensus of 90% between reads
  • BED (path to BED containing patient-specific mutations)
  • MAPQ (numeric, minimum mapping quality for pileup, default = 40)
  • BASEQ (numeric, minimum base quality for pileup, default = 20)
  • INPUT_FILES (path, "to_run.txt")
  • LIBRARY_PREP (string, "Rubicon")
  • SLX_TABLE (path to CSV listing sequencing information i.e. SLX lane ID, and barcode)
  • TUMOUR_MUTATIONS_CSV (path to tumour mutations CSV)

Paths to software to be defined if used on a new cluster:

  • MPILEUP
  • SAMTOOLS_INDEX (path to samtools, followed by index)
  • BCFTOOLS
  • PICARD (path to picard.jar)
  • BEDTOOLS

Paths to reference files to be defined if used on a new cluster:

  • FASTAREF (UCSC hg19 fasta file)
  • HG19_GENOME (hg19.genome file, which can be generated by bedtools from FASTAREF)
  • K1G_DB (1000 genome database in vcf.gz)
  • COSMIC_DB (COSMIC database in vcf.gz)

Other variables to consider modifying depending on your cluster or data set:

  • CHUNKS (numeric, default = 5)
    • Number of chunks to split the mpileup VCF prior to annotation into, in order to parallelise this step.
  • MIN_DP (numeric, default = 5)
    • Minimum DP for mpileup, note that there is a separate minimum DP that is used as a filter that is set in config2.R.
  • SLOP_BP (numeric, default = 10)
    • Base pairs around the loci given in the patient-specific BED file to be used to determine background error rates per trinucleotide.
  • SMOOTH (numeric, default = 0.25)
    • The fragment size profile is smoothed with a kernel density estimator (see Supplementary Methods) to give a smoothed size profile with which to weight signal from each read.
  • MAX_DP (numeric, default = 100000)

config2.R
This config is for use by R scripts in the pipeline. The variables that can potentially be modified are as follows:

  • max_DP (default = 2000)
  • min_ref_DP (default = 10)
    • Require a minimum of N reference reads at a locus, otherwise exclude locus
  • individual_MQSB_threshold (default = 0.01)
    • Data points with MQSB (mapping quality/strand bias) less than this threshold will be excluded (1 data point = 1 locus in 1 sample).
  • n_alt_alleles_threshold (default = 3)
    • Any loci showing error-suppressed families with all 3 non-ref bases are classified as noisy and excluded.
  • minor_alt_allele_threshold (default = 2)
    • Multi-allelic loci that have a minor alternate allele with >= 2 error-suppressed families supporting the minor alt allele are classified as noisy loci and excluded.

Important notes

Please note the following when running the INVAR pipeline:

  • This script requires a cluster that runs Slurm. If you wish to run on an alternative job scheduler, please edit each of the shell scripts to use an alternative to sbatch.
  • Please tail .err and .out files to ensure that each step has run correctly. Please check the sbatch -o and -e command for each of the shell scripts to see where these output files are saved.
  • Please do not combine samples generated with different library preparation methods in the same run of the pipeline. Also, do not combine data from different body fluids in the same run. This is because background error rates are determined from each of the samples from near-target bases, and so only samples of the same type (and therefore, error rate) should be combined.
  • It is essential that your paths in config.sh and the settings in config2.R are set before running any of the pipeline steps. Do not run separate parts of the pipeline with different config settings as this will likely cause errors in file naming syntax.

Please note the following when running the INVAR pipeline on sWGS:

  • In config.sh change MIN_DP to 1, allowing to analyse all the loci with signal
  • In config2.R change min_ref_DP to 0. In sWGS data it is quote common that at a given locus only one mutant read is present without any normal reads
  • In config2.R change is.blood_spot to T. This omits outlier-suppression on samples with deduplicated depth of <5x because high AF loci cannot be reliably identified with low depth sequencing data

Command-line usage

The INVAR pipeline is split into multiple executable shell scripts, grouped into steps. Please first read the important notes above before attempting to run any parts of the pipeline. Details on the theory behind INVAR can be found on the Methods page.

Step 1
Perform the following: mpileups, annotation with trinucleotide context, COSMIC mutation status, 1000 Genomes SNP information. This is followed by reading into R, where data is parsed and is annotated with sample details, and background error rates are determined after performing data filtering.

  • INVAR1.sh
    • Generates pileups on each BAM that is run (based on the input BED), followed by annotation with additional features.
    • To run: ./bin/step1_parse/INVAR1.sh.
    • out and err files are in temp folder [NAME].pileups.out and [NAME].pileups.err
  • INVAR2.sh
    • Takes the individual mpileup files generated by INVAR1.sh and combines them into one file that is easily readable by R which is called in INVAR3.sh.
    • To run: ./bin/step1_parse/INVAR2.sh
    • out and err files are in temp folder INVAR2.out and INVAR2.err
  • INVAR3.sh
    • Splits the dataframe into on and off-target read data.frames based on patient-specific loci provided in the input CSV.
    • Generates an error rate data.frame based on control samples at all patient-specific loci. This pipeline requires at least 10-20 separate individuals run with the same panel in the same cohort in order to adequately control for locus-specific noise.
    • Filters loci for mapping quality (MQ), strand bias, low depth (DP).
    • Presently excludes indels. Inclusion of patient-specific indel analysis is ongoing.
    • Saves a dataframe of background error rates determined in the near-target window.
    • To run: ./bin/step1_parse/INVAR3.sh.
    • out and err files are in output_R folder INVAR3.out and INVAR3.err
  • INVAR4.sh
    • Annotate data with sample information from CSV and parse data.frame
    • Split data into patient-specific and non-patient-specific (locus does not belong that individual) based on patient-specific mutation list.
    • Annotate with background error rates generated in INVAR3.sh.
    • To run: ./bin/step1_parse/INVAR4.sh.
    • This step generates an .rds file of each locus and the signal per locus. A small example output can be found here: example_files/INVAR4_example_output.rds.
    • out and err files are in output_R folder INVAR4.out and INVAR4.err

Step 2
This step generates size information for each error-suppressed read family (representing one cfDNA fragment), then annotates each data point generated from step 1 with individual molecule fragment size data. The data in step 1 consists of data.frames with information on a per locus-level, whereas after merging with size information, data.frames will contain information per molecule. The data.frames generated in this step are much longer, and so each sample is split into its own data.frame and processed in parallel.

  • INVAR_SIZE_ANN1.sh
    • For each of the BAM files used in INVAR1.sh, the fragment size of mutant and wild-type reads is determined for each molecule overlying patient-specific loci listed in the BED file provided by the user.
    • To run: ./bin/step2_size_ann/INVAR_SIZE_ANN1.sh.
    • out and err files are in temp_size folder [NAME].get_fragment_size.out and [NAME].get_fragment_size.err
  • INVAR_SIZE_ANN2.sh
    • Annotate the output generated from INVAR4.sh with the size information generated by INVAR_SIZE_ANN1.sh. The initial data.frame starts with each row representing one locus, whereas after this step, each row represents one molecule and is split by sample (and is saved in output_R/combined.polished.size/).
    • There will be multiple non-patient-specific samples generated for each input BAM if a BED file containing the mutations from multiple patients was used, since each individual can control for other individual (so long as they do not share the same mutation).
    • To run: ./bin/step2_size_ann/INVAR_SIZE_ANN2.sh.
    • out and err files are in temp_size folder [NAME].os_[setting].INVAR_SIZE_MERGING.out and [NAME].os_[setting].INVAR_SIZE_MERGING.err

Step 3
This step consists of patient-specific outlier-suppression, which identifies noise based on knowledge of a large number of patient-specific loci. In addition, characterisation of fragment sizes is carried out in order to generate an Rdata object that is later used to weight signal by fragment sizes.

  • OUTLIER_SUPRESSION.sh
    • Filter out loci that show outlying signal relative to the remainder of the patient-specific loci.
    • A default Bonferroni-corrected P-value threshold is set in config2.R of 0.05.
    • To run: ./bin/step3_outlier_suppression/INVAR_OUTLIER_SUPPRESSION.sh
    • The err and out files are stored in temp/, called os.err and os.out. The output is saved in output_R/os/ and output_R/os_nonzero/.
  • INVAR_SIZE_CHARACTERISATION.sh
    • Aggregates fragment size information across all samples and saves an object that is later used for weighting of reads based on the enrichment ratio between mutant and wild-type fragments.
    • The output file is called [PREFIX].size_characterisation.rds and will be stored in output_R/.
    • To run: ./bin/step3_outlier_suppression/INVAR_SIZE_CHARACTERISATION.sh.
    • out and err files are in output_R/ folder INVAR_SIZE_CHARACTERISATION.out and INVAR_SIZE_CHARACTERISATION.err
  • ANNOTATE_WITH_OS.sh
    • Annotates the .rds data.frame (containing each data point) with whether that locus passes or fails the patient-specific outlier-suppression filter.
    • The output file is called [PREFIX].combined.os_[outlier_suppression_threshold_settings].rds and will be stored in output_R/.
    • To run: ./bin/step3_outlier_suppression/ANNOTATE_WITH_OS.sh.
    • The ANN_OS.err and ANN_OS.out files are stored in output_R/.

Step 4
This step generates an INVAR score per sample based on the parameters that each error-suppressed read was annotated with in the previous steps. Detailed statistical methods are provided in the Methods and Supplementary Methods of the INVAR manuscript. Subsequently, important files that can be explored locally are collected into one folder for ease of downloading.

  • GLRT.sh
    • Generates an INVAR score per sample using the Generalised Likelihood Ratio Test.
    • samples will be weighted by fragment length and tumour AF
    • signal will be accumulated across all loci for a given sample
    • To run: ./bin/step4_detection/GLRT.sh.
    • out and err files are in GLRT folder [NAME].GLRT.out and [NAME].GLRT.err

After GLRT script has completed run the following code to collect the results as a .txt file in the working: source config/config.sh; DATE=$(date "+%y%m%d"); grep "INVAR_SCORE" GLRT/* > ${FINAL_PREFIX}.INVAR_scores.${DATE}.txt this step runs immediately and no err and out files are generated An example output file can be found here: example_files/GLRT_example.txt.

  • FINALISE.sh
    • Combines the most important files into a folder called output_final/.
    • To run: ./bin/step4_detection/FINALISE.sh
    • Download each of the output files to a local folder, and proceed with step 5.
    • this step runs immediately and no err and out files are generated

Step 5
In this step, the INVAR score per sample is processed in a local instance of R. In R, run the get.INVAR_score() function, passing it the path to the text file output of scores INVAR_score.path, your SLX layout SLX_layout. A dataframe of sample names, INVAR scores and detected/non-detected will be generated. The ctDNA level as determined by the INVAR algorithm is also given.


Software license

The software license for the INVAR pipeline will be stated here upon publication. Patent applications protecting the methods described in this manuscript have been filed by Cancer Research Technologies/CRUK.


Contact

To report bugs, or if you need support, please contact us or create an issue.

Authors: J. C. M. Wan* & K. Heider* et al.
Contact: jonathan.wan@cruk.cam.ac.uk & katrin.heider@cruk.cam.ac.uk & rosenfeld.labadmin@cruk.cam.ac.uk
Website: www.rosenfeldlab.org

Updated