mi-faser documentation

Prerequisites

mi-faser runs on LINUX, MacOSX and WINDOWS systems.

Dependencies

Python >= 3.6
DIAMOND >= 0.8.8 (included; sources: https://github.com/bbuchfink/diamond)
WINDOWS: Visual C++ Redistributable *

Note: mi-faser was developed and optimized using DIAMOND v0.8.8, which is included in all releases up to v1.11.4. This is also the version used in the accompanying publication [1]. All newer releases of mi-faser use the latest stable release of DIAMOND. mi-faser results for the first release (v1.2) with an updated version of DIAMOND (v0.9.13) were essentially unaffected by the change (<0.1% difference, based on results for the artificial metagenome supplied as example dataset). According to the DIAMOND authors, more recent versions offer substantial improvements in speed and memory usage as well as bug fixes. We therefore strongly recommend always using the latest version of DIAMOND (see Section: DIAMOND upgrade). This might alter mi-faser results slightly. However, results are expected to be enriched by new correct annotations rather than to gain mis-annotations.

Note that we recommend downloading and compiling DIAMOND locally (https://github.com/bbuchfink/diamond), as a local build can use CPU-specific instructions and may therefore perform significantly better. For convenience, this repository also includes a pre-compiled version of DIAMOND.

Note that different split sizes can, on very rare occasions, result in minor deviations in mi-faser annotations. This is due to certain heuristics applied by DIAMOND when generating sequence alignments. We suggest keeping the same split size across analyses that are meant to be compared.

Optional extensions

SRA Toolkit >= 2.9.1 (NCBI)

If installed, the SRA Toolkit enables mi-faser to automatically retrieve and process read files deposited in the NCBI Sequence Read Archive (SRA). Currently SRR, ERR and DRR identifiers are supported.
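As an illustration of the identifier types listed above, the following sketch checks whether an argument of the form sra:<accession_id> carries one of the supported prefixes. The function name and the regular expression (a prefix followed by digits) are assumptions made for this example; they are not taken from the mi-faser code base.

```python
import re

# Assumed pattern: one of the supported prefixes followed by digits only.
_RUN_ACCESSION = re.compile(r"^(SRR|ERR|DRR)\d+$")


def parse_sra_argument(argument):
    """Return the accession from 'sra:<accession_id>', or None if it does not match."""
    if not argument.startswith("sra:"):
        return None
    accession = argument[len("sra:"):]
    return accession if _RUN_ACCESSION.match(accession) else None
```

A check like this could be run before starting a job, so that an unsupported identifier is reported early rather than during the download.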

Reference Database

mi-faser was developed using a manually curated reference database of protein functions (GS database; DOI 10.5281/zenodo.1048269).

Since version 1.5, mi-faser also contains a new GS+ database, which extends the default GS database. The GS+ database includes 55 additional manually curated protein sequences, introducing 28 new E.C.s that represent important microbial functions in the environment.

To create a new reference database, see the section Creating a reference database.

Usage

Standalone VS Web Service

The Standalone version of mi-faser partitions the user input into subsets in the same way as the Web Service (http://services.bromberglab.org/mifaser/). However, those partitions are processed sequentially and not in parallel as in the Web Service. The Standalone version is therefore only recommended for smaller jobs and is mainly intended to provide the mi-faser code base.
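The idea of partitioning the input and processing the partitions one after another can be sketched as follows. This is a simplified illustration and not the mi-faser implementation: the function names are hypothetical, and annotate stands in for whatever is done with each partition.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def partition(records, size):
    """Yield lists of at most `size` records, one list per partition."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def process_sequentially(lines, size, annotate):
    """Apply `annotate` (a placeholder callable) to each partition in turn."""
    return [annotate(batch) for batch in partition(read_fasta(lines), size)]
```

Because each partition is independent of the others, the same partitions could in principle be handed to parallel workers, which is the difference to the Web Service described above.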

Docker

The pre-built mi-faser docker image is probably the most convenient way to run mi-faser locally or in any cloud infrastructure. The docker image can be used in the same way as the standalone version; however, a common working directory has to be mounted into the container.

To create and execute a single instance of mi-faser using a locally mounted working directory run:

docker run --rm \
    -v <LOCAL_INPUT_DIRECTORY>:/input \
    -v <LOCAL_OUTPUT_DIRECTORY>:/output \
    bromberglab/mifaser -f <INPUT_FILE>

<INPUT_FILE> is a valid mi-faser input file located in <LOCAL_INPUT_DIRECTORY> on your host environment. By default, mi-faser reads input files relative to /input and writes any output to /output. Thus, by bind mounting your local <LOCAL_INPUT_DIRECTORY> to /input inside the docker container, input files can be passed simply as paths relative to your <LOCAL_INPUT_DIRECTORY>. Similarly, by mounting a <LOCAL_OUTPUT_DIRECTORY> to /output inside the docker container, all mi-faser outputs can be accessed at the <LOCAL_OUTPUT_DIRECTORY>.
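For scripted runs, the docker invocation above can be assembled programmatically. The following is a minimal sketch that only uses the options shown above; the function name docker_command is hypothetical.

```python
from pathlib import Path


def docker_command(input_dir, output_dir, input_file):
    """Build the argument list for a single containerized mi-faser run."""
    # An absolute host path makes sure -v is treated as a bind mount
    # rather than as the name of a docker volume.
    host_in = Path(input_dir).resolve()
    host_out = Path(output_dir).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{host_in}:/input",
        "-v", f"{host_out}:/output",
        "bromberglab/mifaser", "-f", input_file,
    ]
```

The resulting list can be passed to subprocess.run(..., check=True). Note that input_file is given relative to the input directory, since that directory is mounted at /input.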

Standalone Python

Open a terminal and clone the mi-faser repository:

git clone https://git@bitbucket.org/bromberglab/mifaser.git

or download the zipped version:

curl --remote-name https://bitbucket.org/bromberglab/mifaser/get/master.zip
unzip master.zip

Navigate to the mi-faser base directory and run mi-faser (Single or 2-Lane mode):

Single: input file containing DNA reads, single http[s]/ftp[s] url or SRA accession ID (sra:<accession_id>):

$ python mifaser.py -f/--inputfile <INPUT_FILE>

2-Lane: two files (R1/R2), http[s]/ftp[s] urls or SRA accession IDs (sra:<accession_id1> sra:<accession_id2>):

$ python mifaser.py -l/--lanes <R1_FILE> <R2_FILE>
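When mi-faser is called from another script, the choice between the two modes can be derived from the number of inputs. The helper below is a sketch with a hypothetical name; it only assembles the -f, -l and -o options listed in the help output and assumes it is run from the mi-faser base directory.

```python
import sys


def mifaser_command(inputs, output_folder=None):
    """Build an argument list for mifaser.py from one or two inputs."""
    inputs = list(inputs)
    if len(inputs) == 1:
        # Single mode: one file, url or sra:<accession_id>
        command = [sys.executable, "mifaser.py", "-f", inputs[0]]
    elif len(inputs) == 2:
        # 2-Lane mode: R1 and R2
        command = [sys.executable, "mifaser.py", "-l", inputs[0], inputs[1]]
    else:
        raise ValueError("expected one input (Single) or two inputs (2-Lane)")
    if output_folder is not None:
        command += ["-o", output_folder]
    return command
```

The returned list can be executed with subprocess.run(command, check=True).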

CLI

mi-faser help:

usage: python mifaser.py [-h] [-f INPUTFILE] [-l R1 R2] [-o OUTPUTFOLDER]
                         [-d DATABASEFOLDER] [-i DIAMONDFOLDER] [-m]
                         [-s SPLIT] [-S [SPLITMB]] [-t THREADS] [-c CPU] [-p]
                         [-n] [-u UPDATE] [-D [arg [arg ...]]] [-v] [-q]

mi-faser, microbiome - functional annotation of sequencing reads

a super-fast ( < 10min/10GB of reads ) and accurate ( > 90% precision ) method
for annotation of molecular functionality encoded in sequencing read data
without the need for assembly or gene finding.

Public web service: https://services.bromberglab.org/mifaser

Version: 1.55 [10/16/19]

optional arguments:
  -h, --help            show this help message and exit
  -f INPUTFILE, --inputfile INPUTFILE
                        input DNA reads file, http[s]/ftp[s] url or SRA
                        accession id (sra:<id>)
  -l R1 R2, --lanes R1 R2
                        2-Lane format (R1/R2) files, http[s]/ftp[s] url or SRA
                        accession ids (sra:<id_1> sra:<id_2>)
  -o OUTPUTFOLDER, --outputfolder OUTPUTFOLDER
                        path to base output folder; default: INPUTFILE_out
  -d DATABASEFOLDER, --databasefolder DATABASEFOLDER
                        name of database located in database/ directory OR
                        absolute path to folder containing database files
  -i DIAMONDFOLDER, --diamondfolder DIAMONDFOLDER
                        path to folder containing diamond binary
  -m, --mapping         if flag is set all reads mappings will be generated
                        (reads{n=*} -> EC{n=1}, fasta)
  -s SPLIT, --split SPLIT
                        split by X sequences; default: 100k; 0 forces no split
  -S [SPLITMB], --splitmb [SPLITMB]
                        split by X MB; default: 25; (requires split from GNU
                        Coreutils)
  -t THREADS, --threads THREADS
                        number of threads; default: 1
  -c CPU, --cpu CPU     max cpus per thread; default: all available
  -p, --preserve        if flag is set intermediate results are kept
  -n, --no-check        if flag is set check for compatibility between diamond
                        database and binary is omitted
  -u UPDATE, --update UPDATE
                        valid update commands: { diamond[:version] }
  -D [arg [arg ...]], --createdb [arg [arg ...]]
                        create new reference database: <db_name>
                        <db_sequences.fasta> [merge_db=<name of db to merge
                        with>] [update_ec_annotations=<1|0>; default: 0]
  -v, --version         print mi-faser version
  -q, --quiet           if flag is set console output is logged to file

If you use *mi-faser* in published research, please cite:

Zhu, C., Miller, M., ... Bromberg, Y. (2017).
Functional sequencing read annotation for high precision microbiome analysis.
Nucleic Acids Res. [doi:10.1093/nar/gkx1209]
(https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkx1209/4670955)

mi-faser is developed by Chengsheng Zhu and Maximilian Miller.
Feel free to contact us for support at services@bromberglab.org.

This project is licensed under [NPOSL-3.0](http://opensource.org/licenses/NPOSL-3.0)

Test: python mifaser.py -f files/test/artificial_mg.fasta -o files/test/out

Example

A demo dataset containing 10k reads is provided to verify a local mi-faser installation. Navigate to the mifaser base directory and run mi-faser with the following arguments:

$ python mifaser.py -f files/test/artificial_mg.fasta -o files/test/out

The resulting analysis will be located relative to the mifaser base directory at: files/test/out/.

DIAMOND upgrade

As DIAMOND (https://github.com/bbuchfink/diamond) is actively developed, we provide an easy way to upgrade (or downgrade) to another version. If a specific version of DIAMOND is given as a parameter, that version is downloaded and installed automatically; otherwise the latest release is used.

$ python mifaser.py --update diamond[:<DIAMOND_VERSION>]
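The optional version suffix of the --update value can be built as shown in this small sketch. The function names are hypothetical, and the exact format in which a version has to be written (for example with or without a leading "v") is not specified here, so the value is passed through unchanged.

```python
def update_argument(version=None):
    """Return the value for --update: 'diamond' or 'diamond:<version>'."""
    return "diamond" if version is None else "diamond:%s" % version


def update_command(version=None):
    """Build the argument list for a DIAMOND upgrade or downgrade."""
    return ["python", "mifaser.py", "--update", update_argument(version)]
```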

Creating a reference database

mi-faser uses a manually curated reference database of protein functions. To create an alternative reference database, first store the desired set of protein sequences in a multi-FASTA file using the following format for the sequence headers:

>id|annotation|e.c.-number|additional_details

sequences.fasta:

>id|annotation|e.c.-number|additional_details
MKPNTDFMLIADGAKVLTQGNLTEHCAIEVSDGIICGLKSTISAEWTADKPHYRLTSGTL
VAGFIDTQVNGGGGLMFNHVPTLETLRLMMQAHRQFGTTAMLPTVITDDIEVMQAAADAV
AEAIDCQVPGIIGIHFEG
>id|annotation|e.c.-number|additional_details
MYYGLDIGGTKIELAIFDTQLALQDKWRLSTPGQDYSAFMATLAEQIEKADQQCGERGTV
GIALPGVVKADGTVISSNVPCLNQRRVAHDLAQLLNRTVAIGNDCRCFALSEAVLGVGRG
YSRVLGMI
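Before building a database, it can be useful to check that every header follows the four-field format shown above. The following sketch is one way to do this; the names Header and parse_header are hypothetical, and the handling of extra '|' characters is an assumption of this example rather than documented mi-faser behavior.

```python
from collections import namedtuple

Header = namedtuple("Header", "id annotation ec_number additional_details")


def parse_header(line):
    """Split a '>id|annotation|e.c.-number|additional_details' header line."""
    if not line.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    # Split on the first three separators only, so that any further '|'
    # characters stay inside additional_details (an assumption of this sketch).
    fields = line[1:].rstrip("\n").split("|", 3)
    if len(fields) != 4:
        raise ValueError("expected 4 '|'-separated fields, got %d" % len(fields))
    return Header(*fields)
```

Applied to every line of sequences.fasta that starts with '>', this raises an error on the first malformed header.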

Then run mi-faser using the -D/--createdb argument to create a new reference database my_database:

$ python mifaser.py -D my_database path/to/sequences.fasta

To use the new database run:

$ python mifaser.py -d my_database -f files/test/artificial_mg.fasta -o files/test/out

See the help menu (--help) for more details.

License

This project is licensed under NPOSL-3.0.

Citation

If you use mi-faser in published research, please cite:

Zhu, C., Miller, M., Marpaka, S., Vaysberg, P., Rühlemann, M. C., Wu, G. H. F.-A., . . . Bromberg, Y. (2017). Functional sequencing read annotation for high precision microbiome analysis. Nucleic Acids Res. doi:10.1093/nar/gkx1209

About

mi-faser is developed by Chengsheng Zhu and Maximilian Miller. Feel free to contact us for support: services@bromberglab.org.
