mi-faser documentation
Pre-Requirements
mi-faser runs on LINUX, MacOSX and WINDOWS systems.
Dependencies
- Python >= 3.6
- DIAMOND >= 0.8.8 (included; sources: https://github.com/bbuchfink/diamond)
- WINDOWS: Visual C++ Redistributable *
Note: mi-faser was developed and optimized using DIAMOND v0.8.8, which is included in all releases up to v1.11.4. This is also the version used in the accompanying publication [1]. All newer releases of mi-faser use the latest stable release of DIAMOND. mi-faser results for the first release (v1.2) with an updated version of DIAMOND (v0.9.13) were not noticeably affected by the change (<0.1% difference, based on results for the artificial metagenome supplied as the example dataset). According to the DIAMOND authors, more recent versions offer substantial improvements in speed and memory usage as well as bug fixes. We therefore strongly recommend always using the latest version of DIAMOND (see Section: DIAMOND upgrade). This might alter mi-faser results slightly. However, any differences are expected to come from new correct annotations rather than from mis-annotations.
Note that it is recommended to download and compile DIAMOND locally (https://github.com/bbuchfink/diamond) as this might have a significant impact on performance (due to special CPU instructions). However, this repository includes a pre-compiled version of DIAMOND to use.
Note that different split sizes can, on very rare occasions, result in minor deviations in mi-faser annotations. This is due to certain heuristics applied by DIAMOND when generating sequence alignments. We suggest keeping the split size constant across analyses that are meant to be compared.
Optional extensions
- SRA Toolkit >= 2.9.1 (NCBI)
If installed, the SRA Toolkit enables mi-faser to automatically retrieve and process read files deposited in the NCBI Sequence Read Archive (SRA). Currently SRR, ERR and DRR identifiers are supported.
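As a small illustration of the identifier types listed above, the following sketch checks an accession before it is handed to mi-faser in the `sra:<accession_id>` form described under Usage. The helper name `to_sra_argument` is hypothetical and not part of mi-faser, and the pattern (one of the three prefixes followed by digits) is an assumption about what a run accession looks like.

```python
import re

# Assumption: a run accession is SRR, ERR or DRR followed by one or more digits.
_RUN_ACCESSION = re.compile(r"^(SRR|ERR|DRR)\d+$")


def to_sra_argument(accession: str) -> str:
    """Return the 'sra:<accession_id>' string for a supported run accession.

    Hypothetical helper for illustration; mi-faser does its own input handling.
    """
    cleaned = accession.strip().upper()
    if not _RUN_ACCESSION.match(cleaned):
        raise ValueError(f"unsupported SRA accession: {accession!r}")
    return f"sra:{cleaned}"
```

The returned string can then be passed as the value of -f/--inputfile, or as one of the two values of -l/--lanes.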
Reference Database
mi-faser was developed using a manually curated reference database of protein functions (GS database; DOI 10.5281/zenodo.1048269).
Since version 1.5, mi-faser also contains a new GS+ database, which extends the default GS database. The GS+ database includes 55 additional manually curated protein sequences, introducing 28 new E.C.s that represent important microbial functions in the environment.
To create a new reference database, refer to the paragraph Creating a reference database.
Usage
Standalone VS Web Service
The Standalone version of mi-faser partitions the user input into subsets in the same way as the Web Service (http://services.bromberglab.org/mifaser/). However, those partitions are processed sequentially and not in parallel as in the Web Service. The Standalone version is therefore recommended only for smaller jobs and is mainly intended to make the mi-faser code base available.
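To make the idea of partitioning concrete, here is a minimal sketch of splitting FASTA records into fixed-size subsets that are then handled one after another. It is not mi-faser's actual implementation; the function names `read_fasta` and `partition` are hypothetical, and the handling of a size of 0 simply mirrors the "0 forces no split" wording of the -s/--split option in the help text.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def partition(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Group records into lists of at most `size`; size <= 0 means no split."""
    if size <= 0:
        yield list(records)
        return
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

In a sequential setup such as the Standalone version, each batch would be processed in turn, for example with `for batch in partition(read_fasta(handle), 100_000): ...`.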
Docker
The pre-built mi-faser docker image is probably the most convenient way to run mi-faser locally or in any cloud infrastructure. The docker image can be used in the same way as the standalone version; however, a common working directory must be mounted into the container.
To create and execute a single instance of mi-faser using a locally mounted working directory run:
docker run --rm \
    -v <LOCAL_INPUT_DIRECTORY>:/input \
    -v <LOCAL_OUTPUT_DIRECTORY>:/output \
    bromberglab/mifaser -f <INPUT_FILE>
Inside the container, mi-faser reads input from /input and writes any output to /output. Thus, by bind mounting your local <LOCAL_INPUT_DIRECTORY> to /input inside the docker container, input files can be passed simply as paths relative to your <LOCAL_INPUT_DIRECTORY>. Similarly, by mounting a <LOCAL_OUTPUT_DIRECTORY> to /output inside the docker container, all mi-faser outputs can be accessed at the <LOCAL_OUTPUT_DIRECTORY>.
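When the docker invocation is scripted, it can help to assemble the command programmatically. The sketch below builds the same `docker run` command as an argument list; the function name `build_docker_command` and the example directory names are hypothetical. Host directories are converted to absolute paths on the assumption that bind mounts given with -v should be absolute.

```python
import shlex
from pathlib import Path
from typing import List


def build_docker_command(input_dir: str, output_dir: str, input_file: str) -> List[str]:
    """Assemble the docker command as an argument list (no shell quoting needed).

    `input_file` is given relative to `input_dir`, which is mounted at /input.
    """
    host_in = Path(input_dir).resolve()    # absolute host path for the bind mount
    host_out = Path(output_dir).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{host_in}:/input",
        "-v", f"{host_out}:/output",
        "bromberglab/mifaser",
        "-f", input_file,
    ]


if __name__ == "__main__":
    # Hypothetical directory and file names, printed for inspection only.
    command = build_docker_command("reads", "results", "sample.fasta")
    print(shlex.join(command))
```

To actually run the container, the list can be passed to `subprocess.run(command, check=True)`.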
Standalone Python
Open a terminal and check out the mi-faser repository:
git clone https://git@bitbucket.org/bromberglab/mifaser.git
Alternatively, download and extract a zip archive of the repository:
curl --remote-name https://bitbucket.org/bromberglab/mifaser/get/master.zip
unzip master.zip
Navigate to the mi-faser base directory and run mi-faser (Single or 2-Lane mode):
Single: input-file containing DNA reads, single http[s]/ftp[s] url or SRA accession ID (sra:<accession_id>):
$ python mifaser.py -f/--inputfile <INPUT_FILE>
2-Lane: two files (R1/R2), http[s]/ftp[s] urls or SRA accession IDs (sra:<accession_id1> sra:<accession_id2>):
$ python mifaser.py -l/--lanes <R1_FILE> <R2_FILE>
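The two modes above differ only in which option carries the input. As a rough sketch, the argument list for either mode could be assembled as follows; the helper `build_mifaser_args` is hypothetical, uses only the -f, -l and -o options shown in this document, and assumes the command is started from the mi-faser base directory.

```python
import sys
from typing import List, Optional, Sequence


def build_mifaser_args(
    inputfile: Optional[str] = None,
    lanes: Optional[Sequence[str]] = None,
    outputfolder: Optional[str] = None,
) -> List[str]:
    """Build the argument list for Single (-f) or 2-Lane (-l) mode."""
    if (inputfile is None) == (lanes is None):
        raise ValueError("provide exactly one of inputfile or lanes")
    args = [sys.executable, "mifaser.py"]
    if inputfile is not None:
        args += ["-f", inputfile]           # Single mode
    else:
        if len(lanes) != 2:
            raise ValueError("2-Lane mode needs exactly two inputs (R1, R2)")
        args += ["-l", lanes[0], lanes[1]]  # 2-Lane mode
    if outputfolder is not None:
        args += ["-o", outputfolder]
    return args
```

The resulting list can be run with `subprocess.run(args, check=True, cwd=<mi-faser base directory>)`.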
CLI
mi-faser help:
usage: python mifaser.py [-h] [-f INPUTFILE] [-l R1 R2] [-o OUTPUTFOLDER]
                         [-d DATABASEFOLDER] [-i DIAMONDFOLDER] [-m] [-s SPLIT]
                         [-S [SPLITMB]] [-t THREADS] [-c CPU] [-p] [-n]
                         [-u UPDATE] [-D [arg [arg ...]]] [-v] [-q]

mi-faser, microbiome - functional annotation of sequencing reads

a super-fast ( < 10min/10GB of reads ) and accurate ( > 90% precision ) method
for annotation of molecular functionality encoded in sequencing read data
without the need for assembly or gene finding.

Public web service: https://services.bromberglab.org/mifaser

Version: 1.55 [10/16/19]

optional arguments:
  -h, --help            show this help message and exit
  -f INPUTFILE, --inputfile INPUTFILE
                        input DNA reads file, http[s]/ftp[s] url or SRA
                        accession id (sra:<id>)
  -l R1 R2, --lanes R1 R2
                        2-Lane format (R1/R2) files, http[s]/ftp[s] url or
                        SRA accession ids (sra:<id_1> sra:<id_2>)
  -o OUTPUTFOLDER, --outputfolder OUTPUTFOLDER
                        path to base output folder; default: INPUTFILE_out
  -d DATABASEFOLDER, --databasefolder DATABASEFOLDER
                        name of database located in database/ directory OR
                        absolute path to folder containing database files
  -i DIAMONDFOLDER, --diamondfolder DIAMONDFOLDER
                        path to folder containing diamond binary
  -m, --mapping         if flag is set all reads mappings will be generated
                        (reads{n=*} -> EC{n=1}, fasta)
  -s SPLIT, --split SPLIT
                        split by X sequences; default: 100k; 0 forces no split
  -S [SPLITMB], --splitmb [SPLITMB]
                        split by X MB; default: 25; (requires split from GNU
                        Coreutils)
  -t THREADS, --threads THREADS
                        number of threads; default: 1
  -c CPU, --cpu CPU     max cpus per thread; default: all available
  -p, --preserve        if flag is set intermediate results are kept
  -n, --no-check        if flag is set check for compatibility between
                        diamond database and binary is omitted
  -u UPDATE, --update UPDATE
                        valid update commands: { diamond[:version] }
  -D [arg [arg ...]], --createdb [arg [arg ...]]
                        create new reference database: <db_name>
                        <db_sequences.fasta> [merge_db=<name of db to merge
                        with>] [update_ec_annotations=<1|0>; default: 0]
  -v, --version         print mi-faser version
  -q, --quiet           if flag is set console output is logged to file

If you use *mi-faser* in published research, please cite:

Zhu, C., Miller, M., ... Bromberg, Y. (2017). Functional sequencing read
annotation for high precision microbiome analysis. Nucleic Acids Res.
[doi:10.1093/nar/gkx1209]
(https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkx1209/4670955)

mi-faser is developed by Chengsheng Zhu and Maximilian Miller. Feel free to
contact us for support at services@bromberglab.org.

This project is licensed under [NPOSL-3.0](http://opensource.org/licenses/NPOSL-3.0)

Test: python mifaser.py -f files/test/artificial_mg.fasta -o files/test/out
Example
A demo dataset containing 10k reads is provided to verify a local mi-faser installation. Navigate to the mifaser base directory and run mi-faser with the following arguments:
$ python mifaser.py -f files/test/artificial_mg.fasta -o files/test/out
DIAMOND upgrade
As DIAMOND (https://github.com/bbuchfink/diamond) is actively developed, we provide an easy way to upgrade (or downgrade) to another version. If a specific DIAMOND version is given as a parameter, that version is downloaded and installed automatically; otherwise the latest release is used.
$ python mifaser.py --update diamond[:<DIAMOND_VERSION>]
Creating a reference database
mi-faser uses a manually curated reference database of protein functions. To create an alternative reference database, first store the desired set of protein sequences in a multi-FASTA file using the following format for the sequence headers:
>id|annotation|e.c.-number|additional_details
sequences.fasta:
>id|annotation|e.c.-number|additional_details
MKPNTDFMLIADGAKVLTQGNLTEHCAIEVSDGIICGLKSTISAEWTADKPHYRLTSGTL
VAGFIDTQVNGGGGLMFNHVPTLETLRLMMQAHRQFGTTAMLPTVITDDIEVMQAAADAV
AEAIDCQVPGIIGIHFEG
>id|annotation|e.c.-number|additional_details
MYYGLDIGGTKIELAIFDTQLALQDKWRLSTPGQDYSAFMATLAEQIEKADQQCGERGTV
GIALPGVVKADGTVISSNVPCLNQRRVAHDLAQLLNRTVAIGNDCRCFALSEAVLGVGRG
YSRVLGMI
Then run mi-faser using the -D/--createdb argument to create a new reference database my_database:
$ python mifaser.py -D my_database path/to/sequences.fasta
To use the new database run:
$ python mifaser.py -d my_database -f files/test/artificial_mg.fasta -o files/test/out
See the help menu (--help) for more details.
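If the protein sequences come from another source, the multi-FASTA file can be written with a few lines of Python. The sketch below formats one record with the `>id|annotation|e.c.-number|additional_details` header described above and wraps the sequence at 60 characters, matching the example file. The function names are hypothetical, and rejecting `|` inside a field is an assumption made here to keep the header unambiguous, not a documented mi-faser rule.

```python
from typing import Tuple


def format_record(seq_id: str, annotation: str, ec_number: str,
                  details: str, sequence: str, width: int = 60) -> str:
    """Return one FASTA record with a four-field, pipe-separated header."""
    fields = [seq_id, annotation, ec_number, details]
    for field in fields:
        if "|" in field or "\n" in field:
            # Assumption: fields must not contain the separator or line breaks.
            raise ValueError(f"invalid header field: {field!r}")
    header = ">" + "|".join(fields)
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *body]) + "\n"


def parse_header(header: str) -> Tuple[str, str, str, str]:
    """Split a header line back into (id, annotation, e.c.-number, details)."""
    parts = header.lstrip(">").rstrip("\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 pipe-separated fields: {header!r}")
    return parts[0], parts[1], parts[2], parts[3]
```

Writing the concatenated output of `format_record` for all sequences to a file gives an input suitable for the -D/--createdb call shown above.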
License
This project is licensed under NPOSL-3.0.
Citation
If you use mi-faser in published research, please cite:
Zhu, C., Miller, M., Marpaka, S., Vaysberg, P., Rühlemann, M. C., Wu, G. H. F.-A., . . . Bromberg, Y. (2017). Functional sequencing read annotation for high precision microbiome analysis. Nucleic Acids Res. doi:10.1093/nar/gkx1209
About
mi-faser is developed by Chengsheng Zhu and Maximilian Miller. Feel free to contact us for support: services@bromberglab.org.