PracticalHaplotypeGraph / Pipeline_version1 / FindPathMinimap2

Overview

This shell script uses the Minimap2 index created by IndexPangenome.sh to align a set of reads to the graph, then uses an HMM to find the most likely path through the graph given the alignments. The plugins called by this script populate the database "paths" and "read_mapping_paths" tables.

Pipeline Steps

  1. Run FastqToMappingPlugin to map a set of reads to the pangenome fasta file. This plugin will make use of a keyfile and will store the mappings in the DB.

  2. Run HapCountBestPathToTextPlugin to take the mappings from the DB and use an HMM to find the optimal path through the graph. The resulting paths are then stored in the DB.

KeyFiles

The FindPathMinimap2 pipeline requires 2 keyfiles. The second is autogenerated, but can be edited if different results are desired. Both files must be tab-separated. If a keyfile entry refers to a file that is not found on the filesystem, the pipeline skips that entry.

More information about the keyfiles can be found here: https://bitbucket.org/bucklerlab/practicalhaplotypegraph/wiki/DockerPipeline/FindPathKeyFiles
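Because entries with missing files are skipped rather than reported as errors, it can help to check a keyfile before running the pipeline. The following is a minimal sketch, assuming the keyfile has a header row and a column named "filename" that holds the fastq file name. The function name list_missing_fastq and that column name are assumptions made for illustration; the keyfile documentation linked above describes the actual layout.

```shell
# Hypothetical helper: print keyfile entries whose fastq file is absent.
# Assumes a tab-separated keyfile with a header row and a "filename" column.
list_missing_fastq() {
    keyfile=$1
    fastq_dir=$2
    # Find the 1-based index of the "filename" column in the header row.
    col=$(awk -F'\t' 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "filename") print i }' "$keyfile")
    if [ -z "$col" ]; then
        echo "no filename column in $keyfile" >&2
        return 2
    fi
    # Print every listed file name that does not exist in fastq_dir.
    awk -F'\t' -v c="$col" 'NR > 1 && $c != "" { print $c }' "$keyfile" |
    while IFS= read -r name; do
        [ -e "$fastq_dir/$name" ] || printf '%s\n' "$name"
    done
}
```

Calling list_missing_fastq with the keyfile and the fastq directory prints one missing file name per line; empty output means every listed file was found.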

Example Run Command

#!bash

FindPathMinimap2.sh [BASE_HAPLOTYPE_NAME] [CONFIG_FILE_NAME] [HAPLOTYPE_METHOD] [HAPLOTYPE_METHOD_FIND_PATH] [HAPCOUNT_METHOD_NAME] [PATH_METHOD_NAME] [READ_KEY_FILE] [PATH_KEY_FILE]

Command Line Flags

#!bash
BASE_HAPLOTYPE_NAME: This is the base of the name of the haplotype fasta file. This file should have been indexed in the previous step.
CONFIG_FILE_NAME: This is the path to the config.txt file used to create the DB.  All the needed DB connection information should be in here.
HAPLOTYPE_METHOD: Method name of the haplotypes in the graph that were used to generate the haplotype fasta file in IndexPangenome.sh.  This needs to match exactly otherwise the results will not be correct.
HAPLOTYPE_METHOD_FIND_PATH: This method can be the same as HAPLOTYPE_METHOD, but it is typically used to include only anchor reference ranges when running FindPath.  Finding paths through the interanchor reference ranges can cause additional errors.  An example of what could be used here is HAPLOTYPE_METHOD,refRegionGroup, which includes only the refRegionGroup refRange group in the graph used for finding the path.
HAPCOUNT_METHOD_NAME: Name of the Haplotype Mapping method used to upload the ReadMapping files to the DB.  This is currently not used so a Dummy value can be used.  It will be implemented in the future.
PATH_METHOD_NAME: Name of the Path Mapping method used to upload the Paths to the DB. This method name will be used in the next step (ExportPath.sh) to extract out the paths from the DB.
READ_KEY_FILE: This name needs to match the name of the keyfile.  This keyfile will describe what fastq files need to be aligned together and also denotes the required metadata fields which are stored in the DB.
PATH_KEY_FILE:  This name is what the path finding keyfile will be named.  FastqToMappingPlugin will create this file and then BestHaplotypePathPlugin will use it to find paths.  Note that FastqToMappingPlugin will group reads by taxon and all the mappings for a given taxon will be used when finding the paths.
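
Since the script takes eight positional arguments in a fixed order, a small pre-flight check can catch a missing or misplaced argument before anything runs. This is a minimal sketch; the function name check_findpath_args is hypothetical and the check is not part of FindPathMinimap2.sh itself.

```shell
# Hypothetical pre-flight check: FindPathMinimap2.sh expects eight
# positional arguments in the order given in the usage line.
check_findpath_args() {
    if [ "$#" -ne 8 ]; then
        echo "expected 8 arguments, got $#" >&2
        return 1
    fi
    # Print each argument next to its name so the order can be verified.
    for name in BASE_HAPLOTYPE_NAME CONFIG_FILE_NAME HAPLOTYPE_METHOD \
                HAPLOTYPE_METHOD_FIND_PATH HAPCOUNT_METHOD_NAME \
                PATH_METHOD_NAME READ_KEY_FILE PATH_KEY_FILE; do
        printf '%s=%s\n' "$name" "$1"
        shift
    done
}
```

Passing the same arguments intended for FindPathMinimap2.sh prints one NAME=value line per flag, which makes it easy to spot two method names that have been swapped.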

Docker Commands

When FindPathMinimap2.sh is run as part of a Docker container script, the Docker script expects the following directory mount points:

  • Mount localMachine:/pathToInputs/FastQFiles/ to docker:/tempFileDir/data/fastq
  • Mount localMachine:/pathToOutputs/ to docker:/tempFileDir/outputDir/
  • Mount localMachine:/pathToPangenomeIndex/ to docker:/tempFileDir/outputDir/pangenome/
  • Mount localMachine:/pathToInputs/config.txt to docker:/tempFileDir/data/config.txt
  • Mount localMachine:/pathToInputs/phg.db to docker:/tempFileDir/outputDir/phgTestMaizeDB.db. This needs to match what is in the config file.

The database is expected to be stored in the user-specified outputDir that is mounted into the container, and config.txt specifies the database name and login parameters.

It is critical that the .mmi file is mounted to /tempFileDir/outputDir/pangenome/ in the Docker container. Otherwise the script will not work correctly.

If you see this error: ERROR net.maizegenetics.plugindef.AbstractPlugin - Haplotype count methodid not found in db for method : HAP_COUNT_METHOD, it means that something went wrong during the ReadMapping step. Double check the -v parameters and make sure the mmi file is in /tempFileDir/outputDir/pangenome/
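
Because a missing index leads to this failure, it may be worth confirming that the host directory to be mounted actually contains a .mmi file before starting the container. This is a minimal sketch; has_mmi_index is a hypothetical helper, not part of the PHG scripts.

```shell
# Hypothetical helper: succeed only if the directory contains at least
# one Minimap2 index (*.mmi) file.
has_mmi_index() {
    dir=$1
    for f in "$dir"/*.mmi; do
        # If the glob matched nothing, $f is the literal pattern and -f fails.
        [ -f "$f" ] && return 0
    done
    echo "no .mmi file found in $dir" >&2
    return 1
}
```

Run it against the host-side pangenome directory (the path before the ":" in the corresponding -v parameter) and only launch the container if it succeeds.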

An example Docker script to run the FindPathMinimap2.sh shell script is:

#!bash
DB=/workdir/user/DockerTuningTests/DockerOutput/phgTestMaizeDB.db
PANGENOME_DIR=/workdir/user/DockerTuningTests/DockerOutput/PangenomeFasta/  

docker run --name small_seq_test_container --rm \
                    -w / \
                    -v /workdir/user/DockerTuningTests/DockerOutput/:/tempFileDir/outputDir/ \
                    -v /workdir/user/DockerTuningTests/InputFiles/GBSFastq/:/tempFileDir/data/fastq/ \
                    -v /workdir/user/DockerTuningTests/InputFiles/config.txt:/tempFileDir/data/configSQLite.txt \
                    -v ${DB}:/tempFileDir/outputDir/phgTestMaizeDB.db \
                    -v ${PANGENOME_DIR}:/tempFileDir/outputDir/pangenome/ \
                    -t maizegenetics/phg:latest \
                    /FindPathMinimap2.sh phgSmallSeqSequence configSQLite.txt \
                    CONSENSUS CONSENSUS,refRegionGroup \
                    HAP_COUNT_METHOD PATH_METHOD \
                    /tempFileDir/data/fastq/genotypingKeyFile.txt \
                    /tempFileDir/data/fastq/genotypingKeyFile_pathKeyFile.txt

PANGENOME_DIR must match the directory set in the IndexPangenome step.

The --name parameter provides a name for the container. This is optional.

The --rm parameter indicates the container should be deleted when the program finishes executing. This is optional.

The -v directives mount data from the user machine into the Docker container. The path preceding the ":" is the path on the user machine. The path following the ":" is the path inside the container where that data will be mounted.
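
To make the two halves of a -v value concrete, the sketch below splits a mount specification at the first ":" using shell parameter expansion. split_mount is a hypothetical helper written for illustration, and it assumes neither path contains a colon.

```shell
# Hypothetical helper: show which part of a -v value is the host path
# and which part is the path inside the container.
split_mount() {
    spec=$1
    host=${spec%%:*}       # text before the first ":" (path on the user machine)
    container=${spec#*:}   # text after the first ":" (path inside the container)
    printf 'host=%s\ncontainer=%s\n' "$host" "$container"
}
```

For instance, split_mount applied to the fastq mount from the example script reports /workdir/user/DockerTuningTests/InputFiles/GBSFastq/ as the host path and /tempFileDir/data/fastq/ as the container path.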

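The "-t" directive allocates a pseudo-terminal for the container. The image name that follows it (maizegenetics/phg:latest) identifies the Docker image of which this container will be an instance. The remaining lines tell the Docker container to run the FindPathMinimap2.sh script, which is found in the root directory. The items following the script name are the parameters to FindPathMinimap2.sh.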
