Impute variants from any fastq file

Quick Start

  1. Change param1, param2, param3, and param4 to match file paths on your computer.
  2. Run phg findpaths_diploid [config.txt]

By default, this process runs FindPathMinimap2.sh and the DiploidPath plugin.

Details

This process aligns all fastq files in the directory indicated by param1 to the indexed pangenome created by IndexPangenome.sh. Mappings to haplotypes are then reported, picking the one or two best haplotypes hit at every reference range. An HMM is then run to find the most probable path through the haplotype graph. By default (minreads=0), haplotypes are reported at every reference range; to restrict the output to sampled ranges, increase this value. The process is performed twice in parallel to produce two paths through the graph, representing the diploid genome.

This shell script uses the Minimap2 index created by IndexPangenome.sh to align a set of reads to the graph, then uses an HMM to find the most likely path through the graph given the alignments. The plugins called by this script populate the database "paths" and "read_mapping_paths" tables.

Kitchen Sink

  1. Run FastqToMappingPlugin to map a set of reads to the pangenome fasta file. This plugin will make use of a keyfile and will store the mappings in the DB.

  2. Run HapCountBestPathToTextPlugin to take the mappings from the DB and use an HMM to find the optimal path through the haplotype graph. The paths are then stored in the DB.

There are 8 parameters used in this step:

  • BASE_HAPLOTYPE_NAME: This is the base of the name of the haplotype fasta file. This file should have been indexed in the previous step.
  • CONFIG_FILE_NAME: This is the path to the config.txt file used to create the DB. All the needed DB connection information should be in here.
  • HAPLOTYPE_METHOD: Method name of the haplotypes in the graph that were used to generate the haplotype fasta file in IndexPangenome.sh. This needs to match exactly otherwise the results will not be correct.
  • HAPLOTYPE_METHOD_FIND_PATH: This method can be the same as HAPLOTYPE_METHOD, but it is typically used to only include anchor reference ranges when running FindPath. Typically, finding paths through the interanchor reference ranges can cause additional errors. An example of what could be used here is this: HAPLOTYPE_METHOD,refRegionGroup to only include the refRegionGroup refRange group in the Graph used for finding the path.
  • HAPCOUNT_METHOD_NAME: Name of the Haplotype Mapping method used to upload the ReadMapping files to the DB. This is currently not used so a dummy value can be used. It will be implemented in the future.
  • PATH_METHOD_NAME: Name of the Path Mapping method used to upload the Paths to the DB. This method name will be used in the next step (phg exportPath) to extract out the paths from the DB.
  • READ_KEY_FILE: This name needs to match the name of the keyfile. This keyfile will describe what fastq files need to be aligned together and also denotes the required metadata fields which are stored in the DB.
  • PATH_KEY_FILE: The name to give the path-finding keyfile. FastqToMappingPlugin creates this file and BestHaplotypePathPlugin then uses it to find paths. Note that FastqToMappingPlugin groups reads by taxon, and all the mappings for a given taxon are used when finding the paths.

Details on running this step with wrapper scripts

When running this step on the command line, all file paths and parameters are set in the config file. The only call that needs to be run in the terminal is phg findPath /path/to/config.txt. To override a parameter set in the config file, set it directly on the command line.

For example, to ignore the HAPCOUNT_METHOD_NAME value in the config file and set one directly, you could run:

phg findPaths -configFile /path/to/config.txt -HAPCOUNT_METHOD_NAME MyNewMethod1

You can also run the FindPathMinimap2.sh bash script directly with the following run command:

#!bash

FindPathMinimap2.sh [BASE_HAPLOTYPE_NAME] [CONFIG_FILE_NAME] [HAPLOTYPE_METHOD] [HAPLOTYPE_METHOD_FIND_PATH] [HAPCOUNT_METHOD_NAME] [PATH_METHOD_NAME] [READ_KEY_FILE] [PATH_KEY_FILE]
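
Because the script takes eight positional arguments, it is easy to pass them in the wrong order. The following is a minimal sketch, not part of PHG, of a wrapper that names each argument and prints the resulting command as a dry run. The helper name build_findpath_cmd is hypothetical; the example values are the ones used in the Docker example on this page.

```shell
#!/usr/bin/env bash
# Sketch only: build_findpath_cmd is a hypothetical helper, not a PHG command.
# It names each positional argument of FindPathMinimap2.sh and prints the
# command line (dry run); it does not execute the script.
build_findpath_cmd() {
    if [ "$#" -ne 8 ]; then
        echo "expected 8 arguments, got $#" >&2
        return 1
    fi
    local base_haplotype_name="$1"
    local config_file_name="$2"
    local haplotype_method="$3"
    local haplotype_method_find_path="$4"
    local hapcount_method_name="$5"
    local path_method_name="$6"
    local read_key_file="$7"
    local path_key_file="$8"
    printf '%s %s %s %s %s %s %s %s %s\n' FindPathMinimap2.sh \
        "$base_haplotype_name" "$config_file_name" \
        "$haplotype_method" "$haplotype_method_find_path" \
        "$hapcount_method_name" "$path_method_name" \
        "$read_key_file" "$path_key_file"
}

# Example values taken from the Docker example on this page.
build_findpath_cmd phgSmallSeqSequence configSQLite.txt \
    CONSENSUS CONSENSUS,refRegionGroup \
    HAP_COUNT_METHOD PATH_METHOD \
    genotypingKeyFile.txt genotypingKeyFile_pathKeyFile.txt
```

Once the printed command looks right, it can be copied into the terminal or the Docker invocation.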

Details on running this step through docker

When FindPathMinimap2.sh is run as part of a Docker container script, the Docker script expects the following directory mount points:

  • Mount localMachine:/pathToInputs/FastQFiles/ to docker:/tempFileDir/data/fastq
  • Mount localMachine:/pathToOutputs/ to docker:/tempFileDir/outputDir/
  • Mount localMachine:/pathToPangenomeIndex/ to docker:/tempFileDir/outputDir/pangenome/
  • Mount localMachine:/pathToInputs/config.txt to docker:/tempFileDir/data/config.txt
  • Mount localMachine:/pathToInputs/phg.db to docker:/tempFileDir/outputDir/phgTestMaizeDB.db. This needs to match what is in the config file.

The database is expected to be stored in the user-specified outputDir that is mounted in the example below, and config.txt must specify the database name and login parameters.

It is critical that the .mmi file is mounted to /tempFileDir/outputDir/pangenome/ in the docker. Otherwise this will not work correctly.

If you see this error: ERROR net.maizegenetics.plugindef.AbstractPlugin - Haplotype count methodid not found in db for method : HAP_COUNT_METHOD, it means that something went wrong during the ReadMapping step. Double check the -v parameters and make sure the mmi file is in /tempFileDir/outputDir/pangenome/
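
Since a missing index is the usual cause of this error, it can help to check the host directory before starting the container. The following is a minimal sketch, assuming the index file carries the .mmi extension as described above; has_mmi_index is a hypothetical helper and not part of PHG.

```shell
#!/usr/bin/env bash
# Sketch only: has_mmi_index is a hypothetical pre-flight helper.
# It succeeds if the given directory contains at least one *.mmi file.
has_mmi_index() {
    dir="$1"
    [ -d "$dir" ] || return 1
    for f in "$dir"/*.mmi; do
        # If the glob matched nothing, $f is the literal pattern and -f fails.
        [ -f "$f" ] && return 0
    done
    return 1
}

# Demonstration on a throwaway directory.
demo_dir=$(mktemp -d)
has_mmi_index "$demo_dir" || echo "no .mmi index in $demo_dir"
touch "$demo_dir/example.mmi"
has_mmi_index "$demo_dir" && echo "index found"
rm -rf "$demo_dir"
```

In practice the function would be called with the host directory that is mounted to /tempFileDir/outputDir/pangenome/.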

An example Docker script to run the FindPathMinimap2.sh shell script is:

#!bash
DB=/workdir/user/DockerTuningTests/DockerOutput/phgTestMaizeDB.db
PANGENOME_DIR=/workdir/user/DockerTuningTests/DockerOutput/PangenomeFasta/  

docker run --name small_seq_test_container --rm \
                    -w / \
                    -v /workdir/user/DockerTuningTests/DockerOutput/:/tempFileDir/outputDir/ \
                    -v /workdir/user/DockerTuningTests/InputFiles/GBSFastq/:/tempFileDir/data/fastq/ \
                    -v /workdir/user/DockerTuningTests/InputFiles/config.txt:/tempFileDir/data/configSQLite.txt \
                    -v ${DB}:/tempFileDir/outputDir/phgTestMaizeDB.db \
                    -v ${PANGENOME_DIR}:/tempFileDir/outputDir/pangenome/ \
                    -t maizegenetics/phg:latest \
                    /FindPathMinimap2.sh phgSmallSeqSequence configSQLite.txt \
                    CONSENSUS CONSENSUS,refRegionGroup \
                    HAP_COUNT_METHOD PATH_METHOD \
                    /tempFileDir/data/fastq/genotypingKeyFile.txt \
                    /tempFileDir/data/fastq/genotypingKeyFile_pathKeyFile.txt

PANGENOME_DIR must match the directory set in the IndexPangenome step.

The --name parameter provides a name for the container. This is optional.

The --rm parameter indicates the container should be deleted when the program finishes executing. This is optional.

The -v directives mount data from the user machine into the Docker container. The path preceding the ":" is the path on the user machine; the path following the ":" is the location inside the container where it is mounted.

The "-t" flag allocates a pseudo-terminal for the container; the argument that follows it, maizegenetics/phg:latest, is the Docker image of which this container will be an instance. The remaining lines tell the container to run the FindPathMinimap2.sh script, which is found in the root directory, followed by the parameters to that script.

Files

Config file

An example can be found here: Master config file

KeyFiles

The FindPathMinimap2 pipeline requires two keyfiles. The second is autogenerated, but can be edited if different results are desired. Both files must be tab-separated. If an entry in a keyfile refers to a file that is not found on the filesystem, the pipeline skips that entry.
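
A keyfile that was saved with spaces instead of tabs is easy to miss by eye. The following is a minimal sketch of a format check, assuming only that the first non-empty line is a header and that every row should have the same number of tab-separated fields. check_keyfile_tabs is a hypothetical helper, and the column names in the demonstration are placeholders, not the real keyfile headers.

```shell
#!/usr/bin/env bash
# Sketch only: check_keyfile_tabs is a hypothetical helper, not part of PHG.
# It fails if any non-empty line has a different number of tab-separated
# fields than the first non-empty line, or has fewer than two fields.
check_keyfile_tabs() {
    awk -F '\t' '
        NF == 0 { next }                 # skip blank lines
        !seen   { cols = NF; seen = 1 }  # first non-empty line sets the count
        NF != cols || NF < 2 {
            printf "line %d: expected %d tab-separated fields, found %d\n", NR, cols, NF
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Demonstration with placeholder column names (not the real keyfile headers).
demo=$(mktemp)
printf 'col1\tcol2\nvalueA\tvalueB\n' > "$demo"
check_keyfile_tabs "$demo" && echo "keyfile looks tab-separated"
rm -f "$demo"
```

This only checks the layout; it does not verify that the files named in the keyfile exist.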

More information about the keyfiles can be found here:

Plugins

Plugin 1

Plugin 2

Troubleshooting
