RaFAH results were empty

Issue #57 closed
gennuo wang created an issue

Dear iphop team,

Thank you for creating this good tool.

I am running iPHoP version 1.3.2 on Linux and have added my own MAGs to the database. I run it with: $ iphop predict --fa_file Input_viral_contigs.fasta --db_dir Sept_2021_pub_rw_w_Wetland_hosts/ --out_dir test_add_db -t 4

I have run into the following issue: “RaFAH results were empty. This may be ok but is still unusual, so you may want to check the rafah log (Wdir/rafah.log).” Please see the attachments for more details.

Could you please take a look and give me some suggestions on how to solve it?

Thank you very much.

Best regards,

Wang

Comments (12)

  1. Simon Roux repo owner

    Hi Wang,

    It looks like this is an out-of-memory error (in rafah.log, you can see the line “Error: std::bad_alloc Ranger will EXIT now.”, which typically means R has run out of memory). Unfortunately, in its current implementation, rafah can require a lot of memory even for a relatively small number of sequences. The recommendation would thus be to try running on a machine with more memory and see if you do get rafah results this time.

    Best,

    Simon

  2. gennuo wang reporter

    Dear Simon,

    Thank you for your reply. I ran it again and got another error, shown below:

    [1] "Passing data to Random Forest using 4 threads"
    Error: std::bad_alloc Ranger will EXIT now.
    Error in predict.ranger.forest(forest, data, predict.all, num.trees, type, :
    User interrupt or internal error.
    Calls: predict -> predict.ranger -> predict -> predict.ranger.forest
    Execution halted
    No such file or directory at /home/wangg/miniconda3/envs/iphop_env/lib/python3.8/site-packages/iphop/utils/RaFAH_v0.3.pl line 313.
    Parsing output of host prediction /work/wangg/iphop2023/Data_test_add_to_db/outputs/Wdir/rafah_out/Full_Host_Predictions.tsv

    Do you have any ideas on how to solve this?

    Thank you very much.

    Best,

    Wang

  3. Simon Roux repo owner

    This looks like the same error (“std::bad_alloc”). How many sequences are you trying to process, and do you know how much memory you have available?

  4. gennuo wang reporter

    Dear Simon,

    I am trying to process 88 sequences, and I have 350 GB of memory available.

    Thank you.

    Best,

    Wang

  5. Simon Roux repo owner

    Hi Wang,

    Ok, so you have more than enough resources, so I’m not sure why Ranger is not happy. In my experience, this kind of error (when not linked to e.g. a memory limitation) comes from a single problematic input sequence. So my suggestion would be:

    • run the test set provided with the tool (https://bitbucket.org/srouxjgi/iphop/src/main/test/test_input_phages.fna) and see if you get the same issue
    • If the test set is fine, then you can split your set of 88 into two groups of 44, and check whether you get this error with only 1 of the 2 groups. If that’s the case, keep dividing the problematic batch into two groups until you find the culprit (a quick way to do the split is sketched below).

    Sorry, it’s not the most fun exercise, but it’s the most straightforward way I can see out of this.
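
    As an illustration, one way to do the split on the command line would be something like the sketch below (assuming your input file is still named Input_viral_contigs.fasta as in your first message; the group file names and output folders are just placeholders):

    # write the first 44 sequences to one file and the rest to a second file
    awk '/^>/{n++} {print > (n <= 44 ? "group_A.fasta" : "group_B.fasta")}' Input_viral_contigs.fasta
    # then run each half separately with the same options as before
    iphop predict --fa_file group_A.fasta --db_dir Sept_2021_pub_rw_w_Wetland_hosts/ --out_dir test_group_A -t 4
    iphop predict --fa_file group_B.fasta --db_dir Sept_2021_pub_rw_w_Wetland_hosts/ --out_dir test_group_B -t 4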

    Best,

    Simon

  6. gennuo wang reporter

    Dear Simon,

    Thank you very much for your kind help.

    I tried running the test set provided with the tool, and it works. So I split my set of 88 into two groups of 44, but I got the same error again.

    Then I added my MAGs to the database again, and I got a different issue: Can't find MAG** in the trees, so can't calculate distances. I tried to solve it following the advice you gave other people before:

    • In the new host database folder, look into “db_infos/Host_Genomes.tsv”, and remove all the lines that correspond to MAGs that are not in the trees (you can make a copy of the file before, just in case). The key part is: if a genome is not listed in this file, iPHoP should not try to include it later on
    • Once this is done, you need to go into the output folder of iPHoP, then into “Wdir”, and remove all the files ending with “…parsed.csv”. Then run iPHoP again.
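
    In practice, the cleanup looked roughly like this on my side (a rough sketch: the paths are from my own run and will differ for other setups, and missing_mags.txt is just a placeholder for the list of MAG names I removed):

    # keep a backup of the original file before editing it
    cp Sept_2021_pub_rw_w_Wetland_hosts/db_infos/Host_Genomes.tsv Host_Genomes.tsv.bak
    # drop the lines for the MAGs that are not in the trees
    grep -v -f missing_mags.txt Host_Genomes.tsv.bak > Sept_2021_pub_rw_w_Wetland_hosts/db_infos/Host_Genomes.tsv
    # remove the parsed files in the output folder so iPHoP recomputes them, then rerun iPHoP
    rm test_add_db/Wdir/*parsed.csv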

    Finally, it worked and I got the results.

    I have 88 viral contigs and 307 of my own MAGs. However, in Host_prediction_to_genome_m90.csv, only 9 viral contigs got host predictions, assigned to 17 hosts (7 of them my own MAGs). I am just wondering whether this is normal or not.

    Do you have any ideas about this?

    Thank you very much.

    Best,

    Wang

  7. Simon Roux repo owner

    Hi Wang,

    Glad to hear that you managed to get your results in the end! Regarding the number of predictions (9 viral contigs with host predictions out of an input of 88), even when the tool works as expected, there are a few potential reasons for getting relatively few predictions:

    • The length/completeness of your virus sequences will influence the number of predictions you get. Shorter sequences (e.g. 5 kb vs 50 kb) and/or lower completeness (i.e. working with ~10% of the genome vs 90%) will yield fewer results. You can see that in Fig. S12 of the paper (https://journals.plos.org/plosbiology/article/file?type=supplementary&id=10.1371/journal.pbio.3002083.s012), where the recall decreases substantially for short sequences.
    • The environment your sequences were sampled from also matters. This is something you can see in Fig. 4 https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3002083#pbio-3002083-g004 : sequences from human gut microbiomes for instance get a lot more predictions than sequences from soil, and that has to do with the quality of the reference databases.
    • Finally, you can somewhat work around these limitations by using a different score cutoff. By default, iPHoP uses a minimum score of 90, which is relatively conservative but recommended to get predictions at the genus level. In my experience, a score of 75 can be used, especially for cases like yours where you don’t get a lot of predictions above 90. In that score range (~75-90), I typically recommend interpreting the host prediction at the family level rather than the genus level, but it’s still something. To get the same output files with a minimum score of 75, you can just rerun the same command you did (using the same output folder), but adding “--min_score 75” (see the example command below). iPHoP will re-use all the existing results and generate new output files with a “_m75” suffix, i.e. using a cutoff of 75 (this should be pretty quick).
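
    For example, reusing the command and paths from your first message (adjust them to whatever you used for your latest run), the rerun would look something like:

    iphop predict --fa_file Input_viral_contigs.fasta --db_dir Sept_2021_pub_rw_w_Wetland_hosts/ --out_dir test_add_db -t 4 --min_score 75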

    Let me know if you have any questions!

  8. gennuo wang reporter

    Dear Simon,

    Thank you very much for your patience and help.

    My sequences indeed come from soil. We got more viruses and hosts when I changed the score to 75: 22 viral contigs out of the 88 input now have host predictions, and 79 unique hosts were predicted (including 3 of my own MAGs). I am wondering about the case where some of my MAGs are the same hosts as genomes already in the iPHoP host database but have different host genome names. How does iPHoP manage them? Will my MAGs be deleted?

    Thank you very much.

    Best,

    Wang

  9. Simon Roux repo owner

    Hi Wang,

    That’s a good question. Your MAGs will not be deleted; however, if they are closely related to a GTDB genome, this GTDB genome will be used as the representative genome instead of your MAG. You can check this in “Host_Genomes.tsv” by looking for your MAGs there: the 5th column is the representative genome used by iPHoP for each MAG (see the example below).
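
    For instance, something along these lines would print each of your MAGs next to its representative (a rough sketch: it assumes the file is tab-separated, that your MAG names all contain “MAG”, and that the MAG name is in the first column, so adjust the pattern and column to your file):

    awk -F'\t' '$1 ~ /MAG/ {print $1 "\t" $5}' db_infos/Host_Genomes.tsv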

    Best,

    Simon

  10. gennuo wang reporter

    Dear Simon,

    Thank you for your help.

    I checked the file and found 299 of my own MAGs there. Does this mean these 299 MAGs will also be run in WIsH?

    Actually, I got my 88 viral contigs after CheckV, and then I ran WIsH alone. These viral contigs could be paired with 65 unique MAGs. I am wondering whether iPHoP identifies complete viral contigs more precisely than CheckV? And do you think I need to lower the minimum score below 75?

    Thank you very much.

    Best,

    Wang

  11. Simon Roux repo owner

    Hi Wang,

    MAGs are only added to WIsH if they are new species representatives. If they are clustered into an existing species-level group, then only the representative genome for this group is used in WIsH. The problem with running WIsH alone is that the WIsH score by itself is not reliable enough to distinguish correct from incorrect predictions; this is why iPHoP was designed to take multiple tools into account for each input sequence. iPHoP does not identify complete viral contigs; for that, CheckV is the tool you should rely on. Finally, in terms of score, you can certainly lower your cutoff to 75, in which case I would encourage you to interpret the host predictions at the family level rather than the genus level.

    Hope this helps !

    Best,

    Simon
