An automated binning pipeline for single metagenomes, in particular host-associated and highly complex ones. Autometa is copyright 2018 Ian J. Miller, Evan Rees, Izaak Miller and Jason C. Kwan, and is released under the GNU Affero General Public License v3 (see LICENSE.txt). If you find Autometa useful to your work, please cite:
Miller, I. J.; Rees, E. R.; Ross, J.; Miller, I.; Baxa, J.; Lopera, J.; Kerby, R. L.; Rey, F. E.; Kwan, J. C. Autometa: Automated extraction of microbial genomes from individual shotgun metagenomes. bioRxiv, 2018. DOI: https://doi.org/10.1101/251462
Third party programs
Databases (these will be automatically downloaded the first time you run make_taxonomy_table.py). NOTE: You will need ~150 GB (at time of writing) to download and process these.
Python packages (Note: this list assumes the use of Anaconda Python)
Additionally, if you want to calculate your own contig coverages (rather than trusting the coverage values given by the SPAdes assembler), you will need:
First we install Prodigal, HMMER and DIAMOND.
sudo apt-get install prodigal hmmer build-essential zlib1g-dev mkdir diamond cd diamond wget http://github.com/bbuchfink/diamond/releases/download/v0.9.14/diamond-linux64.tar.gz tar xvf diamond-linux64.tar.gz
At this point you should add the diamond directory to your $PATH environmental variable.
If you want to calculate coverages, then also install bowtie2, samtools and bedtools:
sudo apt-get install bowtie2 bedtools wget https://github.com/samtools/samtools/releases/download/1.6/samtools-1.6.tar.bz2 tar -vxjf samtools-1.6.tar.bz2 cd samtools-1.6 ./configure --prefix=~/samtools make make install
You should add the ~/samtools/bin folder to your $PATH environmental variable.
Install Anaconda Python.
wget https://repo.continuum.io/archive/Anaconda2-5.0.1-Linux-x86_64.sh chmod +x Anaconda2-5.0.1-Linux-x86_64.sh ./Anaconda2-5.0.1-Linux-x86_64.sh
Install Python modules.
conda install tqdm joblib biopython sudo apt-get install libatlas-base-dev conda install -c maxibor tsne
Now download Autometa.
git clone https://bitbucket.org/jason_c_kwan/autometa
You should now add the autometa/pipeline directory to your $PATH environmental variable
Running with Docker
As an alternative to doing all the above manually, we have provided an Autometa Docker image. First install Docker, then run the following commands:
git clone https://bitbucket.org/jason_c_kwan/autometa docker pull jasonkwan/autometa:latest
This will download a load of stuff and set up the image for you on your system (~3.5 GB).
You will now be able to run the Docker versions of the following scripts using the same command line arguments shown under "Usage".
|Normal script||Docker version|
Note that if you use SPAdes or something else that names contigs like this:
...then Autometa can use the coverage information in the contig names. If you've used another assembler, then you first have to make a coverage table.
calculate_read_coverage.py --assembly ~/autometa/test_data/scaffolds.fasta --processors 16 \ --forward_reads reads_R1.fastq.gz --reverse_reads reads_R2.fastq.gz
The above command will produce a table called coverage.tab, containing the coverages of each contig in the assembly file. In tests, this took 4.4 hours with 16 CPUs and 127 million paired reads.
Here we demonstrate the use of Autometa with the test dataset:
This dataset is the simulated dataset referred to as "78.125 Mbp" in our paper. For your convenience we have also included a pre-calculated coverage table that you can try out below.
Step 1: Split data into kingdom bins [optional]
We found that in host-associated metagenomes, this step vastly improves the binning performance of Autometa (and other pipelines) because less eukaryotic contigs will be binned into bacterial bins. However, if you are confident that you do not have eukaryotic contamination, or a mixture of Archaea and Bacteria, then you can skip this stage because it is rather computationally intensive.
make_taxonomy_table.py --assembly ~/autometa/test_data/scaffolds.fasta --processors 16 \ --length_cutoff 3000
If you want to use an external coverage table (see above), you can use the --cov_table flag to specify the path to the output of calculate_read_coverage.py in the command above.
The first time you run this script, it will automatically download the database files listed above, and format the nr database for DIAMOND to use. By default, unless you specify a directory with the --db_dir flag, a "databases" subdirectory will be made in the Autometa directory. In our testing, the above command took 9.8 hours to run on 16 CPUs the first time (where databases had to be downloaded and compiled), and 7.8 hours when the databases had already been downloaded. Of course, your mileage will vary depending on connection speed, computational specifications, etc., although the most important determinant of the time required will be the complexity of the input dataset.
Make_taxonomy_table.py will do the following:
- Identify genes in each contig with Prodigal.
- Search gene protein sequences against nr with DIAMOND.
- Determine the lowest common ancestor (LCA) of blast hits within 10% of the top bitscore.
- Determine the taxonomy of each contig by examining the LCA of each component protein (see paper for details)
Output files produced by make_taxonomy_table.py
|Bacteria.fasta||Contigs classified as bacterial|
|scaffolds_filtered.fasta||All contigs above the length cutoff|
|scaffolds_filtered.fasta.tab||Table describing the GC content, length and coverage of filtered contigs|
|scaffolds_filtered.orfs.daa||The output from DIAMOND (binary format)|
|scaffolds_filtered.orfs.faa||ORF translations obtained from Prodigal|
|scaffolds_filtered.orfs.lca||Table describing the lowest common ancestor (LCA) for each ORF|
|scaffolds_filtered.orfs.tab||The output from DIAMOND (tab delimited format)|
|scaffolds_filtered.txt||The full output from Prodigal|
|taxonomy.tab||Table with information from scaffolds_filtered.fasta.tab, and also the full assigned taxonomy for each contig|
|unclassified.fasta||Contigs unclassified on the Kingdom level|
Note that in our test data, there are no non-bacterial contigs. For other datasets, make_taxonomy_table.py will produce other fasta files, for example Eukaryota.fasta and Viruses.fasta.
Step 2: Bin bacterial contigs with BH-tSNE and DBSCAN
Note: This procedure can also be used on archaeal sequences, but will most likely not work well on other kingdoms, because the relevant single-copy gene markers are not included in Autometa.
Here we are running Autometa on the Bacteria.fasta file made in step 1.
run_autometa.py --assembly Bacteria.fasta --processors 16 --length_cutoff 3000 \ --taxonomy_table taxonomy.tab
If you want to use an external coverage table, use the --cov_table flag to specify the path to the output of calculate_read_coverage.py in the command above.
In the above command, we are supplying Bacteria.fasta to Autometa, and also the taxonomy table (taxonomy.tab) produced in step 1. If we supply a taxonomy table, then this information is used to help with clustering. Otherwise, Autometa clusters solely on 5-mer frequency and coverage. We are using the default output directory of the current working directory (this can be set with the --output_dir flag), and by default the pipeline assumes we are looking at bacterial contigs (use --kingdom archaea otherwise). The script will do the following:
- Find single-copy marker genes in the input contigs with HMMER
- Reduce the dimensions of 5-mer frequencies to two through BH-tSNE for each contig
- Cluster contigs based on BH-tSNE coordinates, coverage and (optionally) taxonomy
- Accept clusters that are estimated to be over 20% complete and 90% pure based on single-copy marker genes
- Unclustered contigs leftover will be re-clustered until no more acceptable clusters are yielded
For more details on the above process, please see our paper. If you include a taxonomy table in the run_autometa.py command, Autometa will attempt to further partition the data based on ascending taxonomic specificity (i.e. in the order phylum, class, order, family, genus, species) when clustering unclustered contigs from a previous attempt. We found that this is mainly useful if you have a highly complex metagenome (lots of species), or you have several related species at similar coverage level.
Output files produced by run_autometa.py
|Bacteria_filtered.hmm.tbl||Output from HMMER|
|Bacteria_filtered_marker.tab||Table describing the marker genes found in each contig|
|k-mer_matrix||Raw 5-mer frequencies for each contig|
|recursive_dbscan_output.tab||Output table containing the cluster (bin) for each contig|
Step 3: Recruit unclustered contigs to bins through supervised machine learning [optional]
In this step we use supervised machine learning to classify the unclustered contigs to the bins that we have produced (formally, the bins produced in step 2 are the training set). Depending on the size of your dataset, this step can be computationally intensive.
ML_recruitment.py --contig_tab recursive_dbscan_output.tab --recursive \ --k_mer_matrix k-mer_matrix --out_table ML_recruitment_output.tab
In the above command, we give ML_recruitment.py the output table from step 2 (recursive_dbscan_output.tab), as well as the k-mer_matrix file produced in step 2, and specify the output file (ML_recruitment_output.tab). We also use the --recursive flag, specifying that the script will run recursively, adding contigs it classifies to the training set and re-classifying until 0 more classifications are yielded. By default, classifications are only made if 10 out of 10 repeat classifications agree, and only if the classification would not increase the apparent contamination estimated by the presence of single-copy marker genes.
The specified output file is a table with the following columns:
|contig||Name of the contig in the input fasta|
|length||Length of the contig in bp|
|gc||GC (%) of contig|
|cov||Coverage of the contig|
|kingdom||Assigned taxonomic kingdom for the contig|
|phylum||Assigned taxonomic phylum for the contig|
|class||Assigned taxonomic class for the contig|
|order||Assigned taxonomic order for the contig|
|family||Assigned taxonomic family for the contig|
|genus||Assigned taxonomic genus for the contig|
|species||Assigned taxonomic species for the contig|
|taxid||Assigned NCBI taxonomy ID number for the contig|
|single_copy_PFAMs||Comma-delimited list of the PFAM accession numbers of single-copy marker genes found within the contig|
|num_single_copies||The number of single-copy marker genes found within the contig|
|bh_tsne_x||The X coordinate after dimension reduction by BH-tSNE|
|bh_tsne_y||The Y coordinate after dimension reduction by BH-tSNE|
|cluster||The cluster assigned by run_autometa.py|
|ML_expanded_clustering||The cluster assigned by ML_recruitment.py (includes previous assignments by run_autometa.py)|
Running all steps in sequence
For convenience, it is possible to run all three of the above steps through run_autometa.py, as shown below:
run_autometa.py --assembly ~/autometa/test_data/scaffolds.fasta --processors 16 \ --length_cutoff 3000 --maketaxtable --ML_recruitment
In the above command, the '--maketaxtable' flag tells run_autometa.py to run make_taxonomy_table.py, and the '--ML_recruitment' flag tells run_autometa.py to run ML_recruitment.py. Note, this runs ML_recruitment.py with the '--recursive' flag.
Analysis of results
At the end of the above process, you will have a master table (either recursive_dbscan_output.tab or ML_recruitment_output.tab) which describes each contig in your metagenome, including which bin each was assigned to. However, this does not directly tell you much about each bin, so you can further process your data with the following command:
cluster_process.py --bin_table ML_recruitment_output.tab --column ML_expanded_clustering \ --fasta Bacteria.fasta --do_taxonomy --db_dir ~/autometa/databases \ --output_dir cluster_process_output
This will do a number of things:
- Create a directory called "cluster_process_output" in the current working directory
- Split Bacteria.fasta into separate fastas for each cluster (including unclustered contigs) and write them to the cluster_process_output folder
- Create a file called cluster_summary_table, and store it in the cluster_process_output folder
- Create a file called cluster_taxonomy.tab, and store it in the cluster_process_output folder
Note: if you don't want to analyze the taxonomy of your clusters, you can omit the --do_taxonomy and --db_dir arguments. To analyze taxonomy here, you also need to have run make_taxonomy_table.py above.
Cluster_summary_table contains the following columns:
|cluster||The name of the cluster|
|size||The size of the cluster in bp|
|longest_contig||The length in bp of the longest contig in the cluster|
|n50||The n50 of the contigs in the cluster|
|number_contigs||The number of contigs in the cluster|
|completeness||The estimated completeness of the cluster, based on single-copy marker genes|
|purity||The estimated purity of the cluster, based on the number of single-copy marker genes that are duplicated in the cluster|
|av_cov||The average coverage of contigs in the cluster, weighted by contig length|
|av_gc||The average GC (%) of the contigs in the cluster, weighted by contig length|
The cluster_taxonomy.tab file similarly contains full taxonomic information for each cluster, assigned based on the assigned taxonomy of the component contigs.
Visualizing your results
Cluster_process.py gives you some information about the clusters, but it's a good idea to manually look at the data to convince yourself that Autometa did a good job. It's worth noting here that no binning pipeline (including Autometa) is guaranteed to give 100% correct results all the time, and before using the results in publications etc. you should use your biological common sense/judgement. With the tables you have produced, this is easy to do using R. Try typing the following in to R:
library(ggplot2) data = read.table('ML_recruitment_output.tab', header=TRUE, sep='\t') ggplot( data, aes( x = bh_tsne_x, y = bh_tsne_y, col = ML_expanded_clustering )) + \ geom_point( aes( alphs = 0.5, size = sqrt( data$length ) / 100 )) + \ guides( color = 'legend', size = 'none', alpha = 'none' ) + \ theme_classic() + xlab('BH-tSNE X') + ylab('BH-tSNE Y') + \ guides( color = guide_legend( title = 'Cluster/bin' ))
In the above chart the points represent contigs. They are plotted on the two dimensions that result from dimension-reduction by BH-tSNE - you can think of the distance between points as being roughly proportional to their differences in 5-mer frequency. The points are also scaled in size according to the contig length, and they are colored by the assigned cluster/bin. You can see that there are some bins which are well-separated from others, but there are other bins that are close together. Cases like these might be worth investigating manually if you think, for instance, that multiple Autometa bins close together could actually be different parts of the same genome. If clusters are close together, there is also the possibility that contigs in the region have been misassigned.
In addition to using nucleotide composition, Autometa uses coverage and can also use taxonomy to distinguish contigs with similar composition. We can also visualize these differences with R.
ggplot( data, aes( x = bh_tsne_x, y = bh_tsne_y, col = phylum )) + \ geom_point( aes( alphs = 0.5, size = sqrt( data$length ) / 100 )) + \ guides( color = 'legend', size = 'none', alpha = 'none' ) + \ theme_classic() + xlab('BH-tSNE X') + ylab('BH-tSNE Y') + \ guides( color = guide_legend( title = 'Phylum' ))
In the above plot, we have now colored the points by taxonomic phylum, and this reveals that several clusters that are close together in BH-tSNE space are in fact quite divergent from one another. This is probably the basis for Autometa's assignment of separate bins in these cases. In some cases, the contigs in a bin may in fact look divergent in plots like this (see, for example the grouping in the bottom left of the plot). You may want to manually examine cases such as these, but they could well be real if, for example, some contigs have few protein coding genes, or the organism is highly divergent from known sequences (see our paper here for some examples). In this particular dataset, the coverages of all genomes are fairly similar, as revealed in the next plot:
ggplot( data, aes( x = cov, y = gc, col = ML_expanded_clustering )) + \ geom_point( aes( alphs = 0.5, size = sqrt( data$length ) / 100 )) + \ guides( color = 'legend', size = 'none', alpha = 'none' ) + \ theme_classic() + xlab('Coverage') + ylab('GC (%)') + \ guides( color = guide_legend( title = 'Cluster/bin' )) + \ scale_x_continuous( limits = c( 200, 250 ))
In the above plot, the points are colored by cluster/bin again, and you can see that in this case, coverage is not much of a distinguishing feature. In other datasets, you may see closely related genomes at different coverages, which will be separatable by Autometa.