Overview

This is pretty much a work in progress. The pipeline currently consists of the
following scripts (reported with default settings):

*** Genereate the underlying taxonomy
$ ./generate_taxonomy.py --output taxonomy.txt --output_red taxonomy_reduced.txt /
  --pickle taxonomy.pkl | tee generate_taxonomy.log

[ ~3 hours ]. This script downloads, processes, and saves the NCBI taxonomy in 
a standard format with a fixed number of taxonomic levels.

*** Download the viral genomes (fna,ffn,faa) with corresponding taxonomy
$ ./repophlan_get_viruses.py --taxonomy taxonomy_reduced.txt --out_dir /
  repophlan_viruses --out_summary repophlan_viruses.txt | tee repophlan_viruses.log
[ ~1 hour ]. Downloads and save all sequences available for viruses (from
RefSeq). The files are saved in the 'repophlan_viruses' folder and the taxonomy
of each downloaded set of files is in 'repophlan_viruses.txt'

*** Download the microbial (bacteria+archaea) genomes (fna,ffn,faa,frn) with 
    corresponding taxonomy
$ ./repophlan_get_microbes.py --taxonomy taxonomy_reduced.txt --out_dir /
  microbes --nproc 20 --out_summary repophlan_microbes.txt /
  | tee repophlan_microbes.log
[ ~6 hours ]. Downloads and save all available sequences for microbes. It uses
multiple parallel connections to speed up the retrieval: common exceptions 
including issues with the NCBI ftp which temporarily rejects connections for
ftp load problems are handled gracefully with a 'delay and retry' policy. 
IMPORTANT: specifying more than 20 processors to run in parallel causes serious
problems in terms of exceeding the allowed number of connections by NCBI causing
long delays.

*** Download the single-celled Eukaryotes genomes (i.e. fungi and protozoa, with 
fna,ffn,faa) with corresponding taxonomy
$ ./repophlan_get_euks.py --taxonomy taxonomy_reduced.txt --out_dir_fungi fungi /
  --out_dir_protozoa protozoa --out_summary_fungi repophlan_fungi.txt /
  --out_summary_protozoa repophlan_protozoa.txt | tee repophlan_euks.log

As of Nov 8 2013, RepoPhlAn retrieves:
* 4958 viruses (each with fna, ffn, faa files)
* 12277 microbes (each with fna, ffn, frn, faa files)

For microbes, four files are generated for each genome:
* .fna : the genome in multifasta format (one or more contigs)
* .ffn : all the protein coding genes (no rRNA, tRNA)
* .faa : all the proteins
* .frn : all the non-coding genes (rRNA, tRNA)

Several aspects can be optimized and with a bit of refining about 20% more 
microbes could be retrieved. 


Known issues:

* For almost all the assembly additional informative file can be automatically
  downloaded:
  - .asn (probably not useful)
  - .gbk (genbank file for the assembly and for each scaffold)
  - .gff (annotations in almost free text format)
  - .ptt (tab-separated file including COG assignments and product names)
  - .rnt (annotation of rRNA and tRNAs)
  - .rpt (few metadata)
  - .val (binary file, not sure what's inside)
* about ~30 strains are missing in the generated taxonomy file. These are
  the representatives of sets of strains with multiple assemblies. It should
  be an easy fix in generate_taxonomy.py
* for some microbes and euks only a subset of the four types of files are present. These
  files are downloaded but the assembly will not appear in the 
  repophlan_microbes.txt for consistency. However, for some downstream analyses
  not all the four files are actually needed.
* several inconsistencies are present in the following file:
  ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt
  entries with inconsistencies are skipped (with a 'try catch' policy), but few
  of these inconsistencies could actually be handled.
* it would be great to plug in into RePoPhlAn a gene-to-function mapping (e.g.
  KEGG, CAZY, ...)
* need extensive testing of both the taxonomy and the retrieved files





http://www.standardsingenomics.com/content/9/1/20#B11