SCRIPT PURPOSE
Generate reference intervals for the Practical Haplotype Graph (PHG)
NOTES
- This script assumes it is run inside the PHG Docker container with predefined I/O paths
- The .gff file is assumed to be in JGI format: gene models are named "gene", and an "ID=..." field is present in each annotation
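The annotation assumption in the notes above can be checked before starting the container. The following is a minimal sketch, not part of the PHG scripts: `check_gff_genes` is a hypothetical helper, and it assumes the standard GFF3 layout in which the feature type is column 3 and the attributes (where "ID=..." lives) are column 9.

```shell
# Hypothetical helper: count "gene" features and those lacking an ID= attribute.
# Assumption: tab-separated GFF3 with feature type in column 3, attributes in column 9.
check_gff_genes() {
    awk -F'\t' '
        /^#/ { next }                             # skip header and comment lines
        $3 == "gene" {
            total++
            if ($9 !~ /(^|;)ID=/) missing++       # no ID= key among the attributes
        }
        END { printf "genes=%d missing_id=%d\n", total, missing }
    ' "$1"
}
```

For example, `check_gff_genes /your_data_folder/your_reference.gene.gff3` prints one summary line; a `genes` count of 0 or a non-zero `missing_id` suggests the file may not match the expected format.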
RUNNING THE SCRIPT
#!bash
docker run --rm \
    -v /your_data_folder/:/tempFileDir/data \
    maizegenetics/phg \
    /CreateReferenceIntervals.sh -f your_reference.fasta -a your_reference.gene.gff3 [ ... optional parameters]
REQUIRED PARAMETERS
#!bash
-f <file name>
name of fasta file containing the reference sequence
-a <file name>
name of genome annotation file in .gff format containing gene model annotation
OPTIONAL PARAMETERS
#!bash
-k <integer>
Length of kmer used for determining repetitive regions
Default: 11
-e <integer>
Number of bases by which to expand gene models for initial reference interval selection
Default: 1000
-m <integer>
Distance (in bp) between genes below which gene models are merged
Default: 100
-p <double>
Proportion of kmers to be considered repetitive. This determines the high kmer count tail which is considered repetitive (e.g. the top 0.05 most frequent)
Default: 0.1
-n <integer>
Number of kmer copies (genome-wide) above which a kmer is considered repetitive. Overrides -p
Default: none, -p is used by default
-l <integer>
The number of bases to consider when evaluating if a location in the genome is repetitive
Default: 100
-s <integer>
The step size (in bp) by which to proceed outward from a gene model when evaluating flanking regions
Default: 10
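To make the relationship between -p and -n more concrete, the sketch below shows one possible reading of "the high kmer count tail": rank kmers by count and report the smallest count that still falls within the top proportion. This is an illustration only, not the script's actual implementation. `kmer_tail_threshold` is a hypothetical helper, and the input layout (tab-separated, kmer in column 1, count in column 2) is an assumption.

```shell
# Hypothetical helper: smallest count among the top fraction p of kmers, ranked by count.
# Assumption: tab-separated input with the kmer in column 1 and its count in column 2.
# Usage: kmer_tail_threshold <kmer_count_file> <proportion>
kmer_tail_threshold() {
    sort -t "$(printf '\t')" -k2,2nr "$1" |
    awk -F'\t' -v p="$2" '
        { count[NR] = $2 }               # counts in descending order
        END {
            if (NR == 0) exit 1          # empty input: nothing to report
            n = int(NR * p)              # number of kmers in the tail
            if (n < 1) n = 1             # always keep at least one kmer
            print count[n]
        }'
}
```

A count obtained this way is the kind of value one might pass to -n to fix the cutoff explicitly instead of relying on a proportion. Note that a table restricted to kmers with count > 1 would not give the same proportion as one computed over all kmers.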
SCRIPT RESULTS
Output location (subfolder in the input data folder)
- genomic_intervals_unique-timestamp
Relevant output contents
- reference_intervals_run.log -- a log file summarizing parameters for the run
- your_fasta.gene.expand.trimmed.summary_report.tsv -- a summary of seed gene model expansion
- your_fasta.kmer_count.tsv -- a complete list of kmer counts for kmers with count > 1
- your_fasta.gene.expand.trimmed.bed -- the final reference intervals, in BED format
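A quick way to sanity-check the final intervals is to count them and total their length. The sketch below is a hypothetical helper (`bed_summary` is not part of the PHG scripts) and relies only on the standard BED convention that column 2 is a 0-based start and column 3 an exclusive end, so each interval spans end minus start bases.

```shell
# Hypothetical helper: number of intervals and total bases covered in a BED file.
# Assumption: standard BED columns (chrom, 0-based start, exclusive end), tab-separated.
bed_summary() {
    awk -F'\t' '
        /^(#|track|browser)/ { next }            # skip optional header lines
        NF >= 3 { n++; bp += $3 - $2 }           # interval length is end - start
        END { printf "intervals=%d total_bp=%d\n", n, bp }
    ' "$1"
}
```

Running `bed_summary` on your_fasta.gene.expand.trimmed.bed in the output subfolder gives a one-line summary that can be compared between runs with different -e, -m, or -p settings.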