MiniScrub: de novo long read scrubbing using approximate alignment and deep learning

Current version: 0.4 | October 1, 2018

For those in a hurry, see the "Installation and Basic Usage" section below.

MiniScrub is a de novo long sequencing read preprocessing method that improves read quality by predicting and removing ("scrubbing") read segments that have a high concentration of errors. Since long read technologies have high error rates, read scrubbing can be used to improve downstream applications such as alignment or assembly. MiniScrub uses the de novo approximate read-to-read alignment method MiniMap, then uses a Convolutional Neural Network (CNN) to predict read segments with high concentrations of errors based on these read-to-read alignments. MiniScrub is simple to use, BSD licensed, and completely de novo. As MiniScrub uses deep learning, it will be much faster on GPUs, which are highly recommended.

For more details on how to use MiniScrub, please see below. For more details on the method itself, please see the paper below, and please cite it if you use it for your research:

http://biorxiv.org/cgi/content/short/433573v1

Please see the README.md, legal.txt, and license.txt files for relevant legal and licensing information. MiniScrub is under a modified BSD license and is built on software under other permissive free software licenses.

NOTE: the pretrained model that we released on this repo was trained on nanopore reads and may not work well on PacBio data. In this case, it may be advisable to train your own model using the cigarToPctId.py and train.py scripts.

Installation and Basic Usage

It is highly recommended that you simply use the docker containers, because MiniScrub has many dependencies to set up. There are two docker containers available, miniscrub-gpu and miniscrub-cpu, depending on whether you have a GPU available (highly recommended, as GPUs are much faster for deep learning). Note that you need docker set up to use either version (see https://docs.docker.com/install/) and for the GPU version you additionally need nvidia-docker (see https://github.com/NVIDIA/nvidia-docker).

How to use miniscrub-gpu docker:

docker pull nlapier2/miniscrub-gpu
nvidia-docker run --rm -it -v /path/to/directory/with/reads:/app/workdir/ nlapier2/miniscrub-gpu
python3 jgi-miniscrub/miniscrub.py workdir/my-reads.fastq

How to use miniscrub-cpu docker:

docker pull nlapier2/miniscrub-cpu
docker run --rm -it -v /path/to/directory/with/reads:/app/workdir/ nlapier2/miniscrub-cpu
python3 jgi-miniscrub/miniscrub.py workdir/my-reads.fastq

The -v flag mounts the directory containing your reads into the docker container, which is necessary for accessing your reads from within the container. Here we mount the directory with your reads to /app/workdir/ inside the container. The home directory is /app.
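For example, if your reads live in /home/user/reads (a hypothetical path), the CPU container could be run as:

docker run --rm -it -v /home/user/reads:/app/workdir/ nlapier2/miniscrub-cpu
python3 jgi-miniscrub/miniscrub.py workdir/my-reads.fastq --output workdir/scrubbed.fastq

Writing the output into workdir/ keeps the scrubbed file on the host after the container exits.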

Otherwise, if you do not have docker, you can install the repo with git. We have included a script called nodocker-setup.py to help with installation; run "python3 nodocker-setup.py -h" for instructions on which flags to use. By default, the script assumes that you have GPUs and root access and that you do not want to install the other tools used in the MiniScrub paper; each of these defaults can be changed with the --nogpu, --noroot, and --full options, respectively. Run the setup script with sudo if you choose the root install.
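For example, a hypothetical install on a machine without GPUs or root access would be:

python3 nodocker-setup.py --nogpu --noroot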

How to install with git:

git clone https://nlapier2@bitbucket.org/berkeleylab/jgi-miniscrub.git
cd jgi-miniscrub
sudo python3 nodocker-setup.py
python3 miniscrub.py my-reads.fastq

If you do not have git or docker:

wget https://bitbucket.org/berkeleylab/jgi-miniscrub/get/f87c5e89f3b0.zip
unzip f87c5e89f3b0.zip
rm f87c5e89f3b0.zip
mv berkeleylab-jgi-miniscrub-f87c5e89f3b0 jgi-miniscrub
cd jgi-miniscrub
sudo python3 nodocker-setup.py
python3 miniscrub.py my-reads.fastq

In these examples, the output would be scrubbed-my-reads.fastq. The output filename can be set with the --output flag. Many more options are described below in greater detail.

Data and Replication

Some users may wish to download the data and replicate the results for the paper. We caution that MiniScrub's results will not be exactly the same on each run due to random factors in training and pileup image generation, but they should be similar.

The Low Complexity (LC) and High Complexity (HC) community reads can be downloaded from the following links, respectively:

https://portal.nersc.gov/archive/home/r/regan/www/X0124/

https://portal.nersc.gov/archive/home/r/regan/www/X0125/

To run the experiments with the assemblers and other preprocessing tools, either use a docker container or do a --full install using nodocker-setup.py, and then follow the details in the MiniScrub paper. For the other tools, we used the default settings and ran them using the instructions on their GitHub pages.
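If you want to fetch an entire dataset from the command line, something like the following may work (a sketch, assuming the portal serves standard directory listings; adjust as needed):

wget -r -np https://portal.nersc.gov/archive/home/r/regan/www/X0124/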

Main script: miniscrub.py

The miniscrub.py script is an end-to-end solution for read scrubbing, performing the MiniMap, pileup image generation, and deep learning steps all in one. The vast majority of users will only need this script, as other scripts are for advanced users who wish to customize the options of each step. Basic usage, as shown below, is very simple:

python3 miniscrub.py my-reads.fastq

This will produce (by default) a file called scrubbed-my-reads.fastq that contains the scrubbed reads. The name of the output file can be controlled with the --output flag, for example:

python3 miniscrub.py my-reads.fastq --output custom-name.fastq

For a detailed list of options, enter:

python3 miniscrub.py -h

If this produces an error, please ensure that you have installed this repository properly following the instructions above. Below, each option for this script is described.

Common Options

--compression --> Should be set to "gzip" (--compression gzip) if the file is compressed with gzip (.gz extension). Options are "none" (default) and "gzip".

--cutoff --> Specifies the percent identity (percent of correct bases) below which a read segment will be scrubbed. Default is 0.8 (80% identity). To retain more data at the cost of retaining more errors, lower this number, for example to 0.75. To remove more errors at the cost of losing more data, raise it, for example to 0.85. Values below 0.7 will perform almost no scrubbing, while values of 0.9 or higher will scrub away almost all data; neither is useful.

--mask --> Use this flag to replace scrubbed segments with Ns instead of removing them. May be useful for some applications, such as assembly.

--min_length --> Read segments shorter than this length will not be kept after scrubbing. Default is 500 (bases). Useful for avoiding short, noisy read segments that can cause problems in some applications, like assembly.

--output --> Specify the name of the file that you would like to write the scrubbed reads to. Default is to add "scrubbed-" to the front of the input file name.

--processes --> Number of processes to use; pileup image generation is multiprocessed. Default is 16. Decreasing this will decrease memory usage but increase runtime; increasing will increase memory usage and decrease runtime.

--reads --> Path to your fastq file. Only required argument.

--verbose --> Print status updates along the way. Default is to produce no output. Takes no value; just specify "--verbose" if you want the status updates. A combined example using several of these options follows this list.
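As an example combining several of these options, the following command (with placeholder file names) scrubs a gzipped fastq with a stricter cutoff, masks scrubbed segments with Ns, and prints status updates:

python3 miniscrub.py my-reads.fastq.gz --compression gzip --cutoff 0.85 --mask --output masked-reads.fastq --verbose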

Advanced Options

--debug --> Specify the number of reads to produce output for. Useful for testing the script on a smaller dataset, but not generally needed.

--input --> If you use pileup.py to generate your own pileup images, specify the directory that they are stored in with this flag.

--labels --> If you have ground truth labels for your data, generated by cigarToPctId.py, you can specify them with this flag to test how accurate the model is on your data.

--limit_length --> Excludes reads above a certain length from your analysis. For instance, "--limit_length 1000000" excludes reads with over 1,000,000 bases. Recommended if you have limited memory.

--limit_fastq --> Limits the number of reads scanned from the fastq file specified with --reads. Useful for getting debugging results faster, but not usually needed.

--limit_paf --> Limits the number of reads scanned from the minimap .paf file. Useful for fast debugging, but not usually needed.

--load --> Used to load a different Keras model if you have trained one for yourself. If you haven't trained your own model, do not use this flag.

--mode --> DEPRECATED. Can be set to "minimizers" or "whole", depending on how pileup images are encoded. Default is "minimizers"; do not change unless you create your own pileup images with pileup.py.

--paf --> Location of the minimap .paf file. Only needed if you choose not to run minimap as part of this script.

--pileup --> Options are "generate" (default) and "load". Only set to "load" if you have used pileup.py to generate custom pileup images.

--segment_size --> The size of read segments to predict percent identity for, in either minimizers or bases. Only use if you have trained your own neural network model; in this case it must match the parameters you used to train the model.

--window_size --> The size of windows containing read segments to predict percent identity for, in either minimizers or bases. Only use if you have trained your own neural network model; in this case it must match the parameters you used to train the model. An example combining several of these advanced options follows.
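For instance, a hypothetical run that loads custom pileup images from pileup.py and a self-trained Keras model (the file names, and the .h5 extension, are illustrative) might look like:

python3 miniscrub.py my-reads.fastq --pileup load --input my-pileups/ --load my-model.h5 --segment_size 48 --window_size 72

Note that --segment_size and --window_size must match the values used when training the model.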

Running MiniMap Yourself

You may want to control the MiniMap parameters rather than use the default settings that we use. This is recommended only for advanced users who know what they are doing.

MiniMap does not by default output the locations of minimizers in reads. Because MiniScrub relies on this information for pileup image generation, we created a custom version of MiniMap that records that information. We have included that version of MiniMap in this repository. From the main jgi-miniscrub directory, do:

cd minimap/
make

This will run the MiniMap Makefile to build it properly. MiniScrub's standard usage of MiniMap is as follows:

minimap -M -w5 -L100 -m0 -k15 -f0 my-reads.fastq my-reads.fastq | gzip -1 > my-reads.paf.gz

Notice the -w and -k flags; these are the ones you are most likely to want to modify. For instance, you may change -w5 to -w3 and -k15 to -k13. For full details on minimap usage, please consult its GitHub page: https://github.com/lh3/minimap2
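For example, the smaller window and k-mer sizes suggested above would be run as:

minimap -M -w3 -L100 -m0 -k13 -f0 my-reads.fastq my-reads.fastq | gzip -1 > my-reads.paf.gz

If you change -k, remember to pass the same value to pileup.py via its -k flag (see below).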

Custom Pileup Image Generation with pileup.py

The pileup.py script affords extra options for generating pileup images that are not available in the main miniscrub.py script. It is generally not recommended to use this script: several of its options are deprecated, the default behavior that works for most users is already in miniscrub.py, and this script stores one .png image file per read, leading to a large number of files. Nevertheless, advanced users who wish to explore the extra options may do so. Options are described below, followed by an example invocation.

--color --> Specifies color encoding for the images. Options are "bw" and "rgb" (default). Black-and-white encoding (bw) runs a bit faster but encodes less information and is DEPRECATED. RGB is recommended.

--compression --> Specify "gzip" if the mapping and reads files are compressed with gzip (.gz extension).

--debug --> Runs a short "debug" test run. Useful for debugging but not needed for most users.

-k --> "K" setting for minimap. Default: 15. Only change this if you changed the "-k" argument of minimap to something other than 15.

--limit_fastq --> Limit number of reads to scan from the fastq reads file. Useful for debugging but not necessary for most users.

--limit_reads --> Limit number of reads to generate pileups for. Useful for debugging but not necessary for most users.

--mapping --> Path to the .paf file produced by minimap. Required.

--maxdepth --> Maximum depth of pileup images; in other words, number of matching reads to retain per reference read. Default: 24.

--mode --> Choices are "whole" and "minimizers" (default). "Whole" will generate images with one pixel per base, while "minimizers" generates images with one pixel per minimizer. Minimizer mode is faster and does not appear to lead to a decrease in performance, so it is strongly recommended. Whole mode is DEPRECATED.

--plotdir --> Directory to save pileup images in. Only active if --saveplots is used.

--processes --> Number of processes to use; pileup image generation is multiprocessed. Default is 16. Decreasing this will decrease memory usage but increase runtime; increasing will increase memory usage and decrease runtime.

--reads --> Path to .fastq reads file. Required.

--saveplots --> Use this flag to save the pileup image plots. Nothing to set, just specify "--saveplots".

--verbose --> Print status updates along the way. Default is to produce no output. Nothing to set, just specify "--verbose".
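As a sketch using only the flags documented above (file and directory names are placeholders), after decompressing the minimap output:

gunzip my-reads.paf.gz
python3 pileup.py --reads my-reads.fastq --mapping my-reads.paf --saveplots --plotdir my-pileups/ --verbose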

Training Your Own Model with cigarToPctId.py and train.py

The miniscrub.py script is an end-to-end solution for users who are content to use the pretrained model on their reads. However, some advanced users may wish to train their own model, perhaps because they want a model for a different sequencing technology such as PacBio, or because they want to explore different parameter settings that may improve results. This can be done with the cigarToPctId.py and train.py scripts. We provide details on how to use these scripts below, but we expect users who choose this option to be "advanced" in the sense that they have some familiarity with deep learning and understand the concept of minimizers.

cigarToPctId.py

cigarToPctId.py is a helper script that generates the ground truth labels required for training a custom CNN model. Before running it, you will need a mapping from the reads to reference genome(s) in .sam format; for this step, we recommend GraphMap (https://github.com/isovic/graphmap). This script is intended for advanced users. Options are described below, followed by an example.

--amount --> Number of bases/minimizers per label. In other words, length of read segments to generate predictions for. Must be consistent with the --segment_size argument in other scripts. Default: 48.

--compression --> Set to "gzip" if sam file is compressed with gzip (.gz file extension).

--limit_length --> Optionally do not generate labels for reads above a certain length in bases. Useful if limited memory is an issue.

--limit_paf --> Optionally limit the number of reads to read from .paf file. Useful for debugging purposes, but not needed for most users.

--limit_sam --> Optionally limit number of lines to read from .sam file. Useful for debugging purposes, but not needed for most users.

--mode --> Set to "bases" or "minimizers" (default). Generate labels for a certain --amount of either bases or minimizers.

--outfile --> File to write labels to.

--paf --> Path to .paf file with output from minimap. Needed if --mode is set to minimizers.

--sam --> Path to .sam file with reads mapped to reference genome. Used for ground truth, and required.
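As a sketch using only the flags documented above (file names are placeholders; my-reads.sam is assumed to come from mapping the reads with GraphMap):

python3 cigarToPctId.py --sam my-reads.sam --paf my-reads.paf --mode minimizers --amount 48 --outfile labels.txt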

train.py

train.py is the script that trains a model on input sequence data. It is intended for advanced users. You will need the labels output by cigarToPctId.py and a .sam file from aligning reads to a reference genome; we suggest GraphMap (https://github.com/isovic/graphmap) for producing the latter. Options are described below, followed by an example.

--baseline --> This will train an SVM on the minimizers data used by the CNN. This was used to produce the comparisons in the paper but is not useful for most users.

--classify --> Instead of regression (predicting percent identity for each read segment), turn the task into a classification problem, in which segments with accuracy above the specified threshold are considered "positive" and those below are considered "negative". For instance, specify --classify 0.8. This tends not to be helpful.

--debug --> Train on a limited number of examples. Useful for debugging purposes.

--epochs --> Number of epochs to train the network, meaning how many times each training example is seen by the network. Default is 5.

--extra --> This can add extra fully connected layers (with 4096 neurons each) to the end of the neural network. May provide minor benefit at the cost of increased computation time. Default is 0 (no added layers).

--input --> Specify the directory with your custom pileup images.

--labels --> Specify the labels file (output from cigarToPctId.py).

--load --> Can be used to load a saved Keras model if you only want to load a model and test it on your data without training. Saved model can be produced using the --output argument.

--loadsvm --> Similarly, you can specify a saved SVM model, which can be produced using the --outputsvm argument.

--output --> Where to write trained Keras model. DOES NOT save your model if you do not use this flag.

--outputsvm --> Where to write trained SVM model, if you used --baseline. DOES NOT save your model if you do not use this flag.

--segment_size --> The size of read segments to predict percent identity for, in minimizers. Must be consistent with setting in cigarToPctId.py. Default: 48.

--streaming --> Instead of reading all data into memory and making all predictions at once, train by batches. This is highly recommended.

--test_input --> Can be used to specify a test set of saved pileup images on which to evaluate a model loaded with --load.

--test_labels --> The labels for --test_input. Used for evaluation purposes.

--window_size --> The size of windows provided to the neural network containing read segments to predict percent identity for, in minimizers. Default: 72.
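Putting these together, a hypothetical training run on custom pileup images and labels (file names are placeholders; the .h5 extension is an assumed Keras save format) might be:

python3 train.py --input my-pileups/ --labels labels.txt --segment_size 48 --window_size 72 --epochs 5 --streaming --output my-model.h5

The saved model can then be passed to miniscrub.py with --load, along with matching --segment_size and --window_size values.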

License Usage Information

The scripts in MiniScrub make use of a modified version of the Keras implementation of VGG.

We use TensorFlow as our Keras backend, which is under the Apache License (https://www.apache.org/licenses/LICENSE-2.0). We use a modified version of the Minimap code to output the lists of colinear minimizers; Minimap is available under the MIT license.