polyCRACKER Version 1.0

Quick summary

polyCRACKER can be used to:

  1. Identify subgenomes

  2. Extract subgenomes

  3. Validate subgenomes

  4. Perform exploratory analysis of subgenomes relative to genomic features

A full description of polyCRACKER is provided in the manuscript: PolyCRACKER, a robust method for the unsupervised partitioning of polyploid subgenomes by signatures of repetitive DNA evolution. Sean P Gordon, Joshua J Levy, John P Vogel https://www.biorxiv.org/content/early/2018/12/03/484832

Getting Started With PolyCRACKER (requires Linux)

Run PolyCRACKER:

  1. Make sure the conda environment is properly installed. You can find it here: https://anaconda.org/jlevy44/polyCRACKER
    • Download the latest yml file from the anaconda repo
    • Create the environment: conda env create -f [environment yml file].yml
    • Activate it: source activate [environment name]
  2. Clone this repository to your project directory.
  3. cd [your project directory containing polyCRACKER.py]
  4. Move fasta file in question to ./fasta_files
  5. Edit config_polyCRACKER.txt (See below)
  6. Running pipeline: python polyCRACKER.py run_pipeline # use -h to list options
  7. Results should be in the ./analysisOutputs/*/* directories.
    * There is a cluster results directory containing the initial clusters of subsequences, and a final results directory containing the final clusters after signal amplification.
    * Signal amplification sometimes fails. In that case, intermediate results can be found in the ./analysisOutputs/*/*/bootstrap_* directories, in the extractedSubgenomes subdirectory containing fastas.
    * Extracted subgenome fasta files are still "chunked", but contain positional information with respect to the scaffold of origin.
  8. Clustering plots are found in *html files in the project directory.
  9. Additional plots can be made using python polyCRACKER.py plotPositions -h, and there are a few other plotting utilities.
  10. Pro tip: You can rerun or resume the pipeline at various parts by setting the parts of the config that have already run to 0 instead of 1.
  11. Pro tip: Use the command python polyCRACKER.py number_repeatmers_per_subsequence to produce a histogram of the number of repeat-mers present in each chunked genome fragment. The file is saved as kmers_per_subsequence.png
    * If this histogram is too skewed toward low kmer counts in each subsequence, then either:
      * Reduce kmer size
      * Increase chunk size splitFastaLineLength
      * Reduce the low_count threshold
      * Set perfectmode to 1
      * Consider adding NonChunk = 1 to the config
      * And/or enforce a higher MinChunkSize.
    * VERY IMPORTANT! If there is not enough repeat content included in the subsequences, they will be hard to bin.
  12. Of course, run python polyCRACKER.py -h for more tips.
  13. Other tips on setting up the config file and running the pipeline are found by running the jupyter notebook ./tutorials/RunningPipeline.ipynb
    * Information on what each config parameter means is in this notebook. Highly recommend that you check this out.
    * Other examples of old configuration files in ./tutorials/old_configs
  14. Other downstream analyses are not covered here; check out the html file described below for more commands.
  • Accessing additional help docs:
    * You can find them here after you download the repository: ./tutorials/help_docs/index.html
    * This is an html file that specifies some of the polyCRACKER commands. Still being updated.

  • Additional notes:
    * You may need to reinstall pyamg if SpectralClustering is not working: pip uninstall pyamg, then pip install pyamg.
    * Will add others but feel free to open up issues.
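
To make tip 11 above more concrete, here is a minimal sketch of why k-mer size, chunk size, and the count threshold change how many repeat k-mers land in each chunk. It is a toy illustration, not polyCRACKER's implementation: the function names and the min_count parameter are hypothetical, and min_count is only assumed to play a role similar to the low_count threshold.

```python
from collections import Counter


def chunk_sequence(seq, chunk_size):
    """Split seq into consecutive chunks of chunk_size (the last may be shorter)."""
    return [seq[i:i + chunk_size] for i in range(0, len(seq), chunk_size)]


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repeat_kmers_per_chunk(seq, k, chunk_size, min_count):
    """Return, for each chunk, the number of distinct repeat k-mers it contains.

    A k-mer is treated as a "repeat" k-mer if it occurs at least min_count
    times in the whole sequence (an assumption made for this sketch).
    """
    genome_counts = count_kmers(seq, k)
    repeats = {kmer for kmer, n in genome_counts.items() if n >= min_count}
    return [
        len(repeats & set(count_kmers(chunk, k)))
        for chunk in chunk_sequence(seq, chunk_size)
    ]


# Toy sequence: the second chunk carries much less repeat signal than the first.
toy = "ACGACGACGTTT"
per_chunk = repeat_kmers_per_chunk(toy, k=3, chunk_size=6, min_count=2)
```

In this toy case, raising min_count or shrinking chunk_size lowers the per-chunk counts, which is the kind of skew the histogram is meant to reveal.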

Running the test script to split algae genomes:

  1. cd [your project directory containing polyCRACKER.py]
  2. tar -xzvf ./test_data/test_fasta_files/algae.fa.tar.gz && mv algae.fa ./test_data/test_fasta_files/
  3. Activate conda environment
  4. python polyCRACKER.py test_pipeline -env [Your polyCRACKER conda environment]
  5. Results are stored in the test_results directory. Because of random seeding, the script may occasionally fail or produce poor results; rerun it if that happens.
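
Since the test script may need to be rerun, one option is a small retry wrapper. The sketch below is an assumption about how you might automate reruns, not part of polyCRACKER: run_with_retries is a hypothetical helper, and it only detects a non-zero exit code, so results that complete but look poor still need to be checked by eye.

```python
import subprocess
import sys


def run_with_retries(cmd, max_attempts=3):
    """Run cmd until it exits with status 0, at most max_attempts times.

    Returns the 1-based number of the attempt that succeeded.
    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} failed with exit code {result.returncode}",
              file=sys.stderr)
    raise RuntimeError(f"command failed after {max_attempts} attempts: {cmd}")


# Example call (run from the directory containing polyCRACKER.py, with the
# conda environment active; replace the placeholder with your environment name):
# run_with_retries(["python", "polyCRACKER.py", "test_pipeline",
#                   "-env", "[Your polyCRACKER conda environment]"])
```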

For more testing data:

Tobacco (pseudomolecule-anchored and unanchored)
* ftp://ftp.solgenomics.net/genomes/Nicotiana_tabacum/edwards_et_al_2017/assembly/
Wheat
* New wheat genome, not included in paper, though it has been analyzed before (ftp://ftp.ensemblgenomes.org/pub/plants/release-41/fasta/triticum_aestivum/dna/)
* 2017 wheat genome https://urgi.versailles.inra.fr/download/iwgsc/IWGSC_RefSeq_Assemblies/v1.0/
* https://www.ncbi.nlm.nih.gov/assembly/GCA_002220415.2
Algae genomes (included in ./test_data/ folder)
* https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Creinhardtii
* https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_CsubellipsoideaC_169
Fungi (Ustilago)
* https://genome.jgi.doe.gov/Ustma1/Ustma1.home.html
* https://genome.jgi.doe.gov/Usthor1/Usthor1.home.html
Fungi (Aspergillus)
* https://genome.jgi.doe.gov/Aspergillus/Aspergillus.info.html

Genome Comparison Tool and K-Mer Conservation Rules

A separate utility of polyCRACKER, beyond what is demonstrated in the paper, is the ability to compare the distribution of k-mers between different genomes/assemblies and create a plotly/dash app for visualization. To establish a matrix of k-mers versus genomes for downstream analysis, use the bio_hyp_class command (run it with -h to list options).
* E.g. nohup python polyCRACKER.py bio_hyp_class -f ../../,_,n -dk 5 -w ../../results/ -m 150 -l 23 -min 2 -max 25 > ../../analysis.log &
* There are then scripts that can be used for downstream analysis (clustering, etc.; not detailed here).
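
As a rough picture of what a matrix of k-mers versus genomes looks like, here is a minimal sketch that counts k-mers in a few in-memory sequences and arranges the counts as rows (k-mers) by columns (genomes). It is an illustration only and makes no claim about how bio_hyp_class computes or stores its matrix; the function names and the toy sequences are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_genome_matrix(genomes, k):
    """Build a k-mer by genome count matrix.

    genomes maps a genome name to its sequence string.
    Returns (kmers, names, matrix), where matrix[i][j] is the number of
    times kmers[i] occurs in the genome called names[j].
    """
    names = sorted(genomes)
    counts = {name: kmer_counts(genomes[name], k) for name in names}
    # Union of all k-mers seen in any genome, in a stable order.
    kmers = sorted(set().union(*counts.values()))
    # Counter returns 0 for k-mers absent from a genome.
    matrix = [[counts[name][kmer] for name in names] for kmer in kmers]
    return kmers, names, matrix


# Toy input standing in for real assemblies.
kmers, names, matrix = kmer_genome_matrix({"A": "ACGAC", "B": "CGACG"}, k=2)
```

Rows with similar counts across all columns correspond to k-mers conserved between genomes, while rows dominated by one column are genome-specific; downstream clustering would operate on a matrix of this shape.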