Rafah not working

Issue #104 closed
Carlos Naf created an issue

Tried the test file with the test_db. It does seem to compute something, but at the end it reports “but the file iphop_test_results/test_input_phages_iphop/Wdir/rafahparsed.csv has no data”.

I installed it with mamba and did not observe any issues while installing, except for some warnings (of the “this command will be deprecated in future versions” kind).

I also had to modify the file “dataprep_rf.py” to remove quotation marks, as happened in some other issue.

Traceback (most recent call last):
  File "/home/csic/mgr/mgr/.conda/envs/iphop_env/bin/iphop", line 0, in <module>
    sys.exit(cli())
  File "/home/csic/mgr/mgr/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/iphop.py", line 128, in cli
    args["func"](args)
  File "/home/csic/mgr/mgr/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/master_predict.py", line 105, in main
    dataprep_rf.aggregate_rf(args)
  File "/home/csic/mgr/mgr/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/dataprep_rf.py", line 35, in aggregate_rf
    compute_matrices(df_blast,df_crispr,df_labels,args)
  File "/home/csic/mgr/mgr/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/dataprep_rf.py", line 154, in compute_matrices
    selected_blast = selected_blast.sort_values(by = ["Dist","N match","Id %"], ascending = ["False","False","False"])
  File "/home/csic/mgr/mgr/.local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/csic/mgr/mgr/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 6286, in sort_values
    ascending = validate_ascending(ascending)
  File "/home/csic/mgr/mgr/.local/lib/python3.8/site-packages/pandas/util/_validators.py", line 435, in validate_ascending
    return [validate_bool_kwarg(item, "ascending", **kwargs) for item in ascending]
  File "/home/csic/mgr/mgr/.local/lib/python3.8/site-packages/pandas/util/_validators.py", line 435, in <listcomp>
    return [validate_bool_kwarg(item, "ascending", **kwargs) for item in ascending]
  File "/home/csic/mgr/mgr/.local/lib/python3.8/site-packages/pandas/util/_validators.py", line 251, in validate_bool_kwarg
    raise ValueError(
ValueError: For argument "ascending" expected type bool, received type str.

Once this was corrected, it finished the pipeline, but Rafah did nothing.
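
For anyone hitting the same ValueError, here is a minimal sketch of the correction, assuming a small made-up DataFrame that only borrows the column names from the traceback. The quoted "False" values are strings, and recent pandas versions refuse them for `ascending`; unquoted booleans are what `sort_values` expects.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the traceback.
selected_blast = pd.DataFrame(
    {"Dist": [1, 3, 3], "N match": [10, 5, 7], "Id %": [90.0, 95.0, 99.0]}
)
sort_cols = ["Dist", "N match", "Id %"]

# Quoted "False" is a str, not a bool; recent pandas versions reject it.
try:
    selected_blast.sort_values(by=sort_cols, ascending=["False", "False", "False"])
except ValueError as err:
    print(f"quoted flags rejected: {err}")

# Unquoted booleans sort every listed column in descending order.
fixed = selected_blast.sort_values(by=sort_cols, ascending=[False, False, False])
print(fixed)
```

Older pandas versions may accept the quoted form silently, which could explain why the problem only shows up in some environments.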

Comments (8)

  1. Simon Roux repo owner

    Hi Carlos,

    Sorry to hear about RaFAH / iPHoP scripts causing issues. If you re-ran iPHoP with the same output folder, and the file rafahparsed.csv was already created (even if empty), iPHoP will not try to re-run RaFAH. If that’s the case, I would try to re-run on a completely new output folder, and hopefully this time RaFAH produces the expected output ?
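
    A quick way to check whether a leftover rafahparsed.csv is the reason RaFAH gets skipped is sketched below. The helper name `rafah_output_status` is hypothetical (it is not part of iPHoP), and it assumes the file sits at `<output folder>/Wdir/rafahparsed.csv` as in the paths quoted in this issue. It only counts non-blank lines, so a file holding nothing but a header line would still need a manual look.

```python
from pathlib import Path


def rafah_output_status(out_dir):
    """Describe the state of <out_dir>/Wdir/rafahparsed.csv (hypothetical helper)."""
    parsed = Path(out_dir) / "Wdir" / "rafahparsed.csv"
    if not parsed.exists():
        return "missing"
    with parsed.open() as handle:
        # Count lines that contain something other than whitespace.
        n_lines = sum(1 for line in handle if line.strip())
    return "empty" if n_lines == 0 else f"{n_lines} non-blank line(s)"


if __name__ == "__main__":
    # Folder name taken from the error message quoted in this issue.
    print(rafah_output_status("iphop_test_results/test_input_phages_iphop"))
```

    If this reports "empty", re-running in a fresh output folder as suggested above avoids the stale file.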

  2. Simon Roux repo owner

    Actually re-reading this, “dataprep_rf.py” should happen after RaFAH, so the two issues are likely unrelated (although I’m confused why you saw an issue with quotation marks, but glad you could fix it). For RaFAH, you can look into “rafah.log” to get the trace, but most often this type of issue happens because RaFAH runs out of memory and the process stops midway through. Are you running on e.g. a laptop maybe? (I know that happens on my smaller laptop.)

  3. Carlos Naf reporter

    Thank you for answering!

    I’m running it on a computer cluster (sometimes it is complex to make things work); at the moment I’m using 16 GB of memory and eight cores. Should that be enough?

    Edit: tried with more memory and got the same issue.

    In any case, I checked the rafah.log:

    Running host prediction mode
    Indexing sequences from prueba_iphop/Wdir/split_input/
    Processing AJ421943.1.fasta
    Processing CP017905.1.fasta
    Processing IMGVR_UViG_3300013274_000001.fasta
    Processing IMGVR_UViG_3300013456_000001.fasta
    Processing MT657335.1.fasta
    Processed 5 Genomic Sequences
    Running Prodigal
    Indexing sequences from prueba_iphop/Wdir/rafah_out/Full_CDS_Prediction.faa
    Running hmmsearch. Query: prueba_iphop/Wdir/rafah_out/Full_CDS_Prediction.faa DB: test_db/Test_db_rw/db/rafah_data/HP_Ranger_Model_3_Filtered_0.9_Valids.hmm
    Obtained 43644 ids from test_db/Test_db_rw/db/rafah_data/HP_Ranger_Model_3_Valid_Cols.txt
    Parsing prueba_iphop/Wdir/rafah_out/Full_CDSxClusters_Prediction
    Detected 653 OGs across 5 genomic sequences
    Performing host prediction
    r_script_predict_file_name
    /home/csic/mgr/mgr/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/utils/RaFAH_src/RaFAH_Predict_Host.R
    r_model_file_name
    test_db/Test_db_rw/db/rafah_data/MMSeqs_Clusters_Ranger_Model_1+2+3_Clean.RData
    genomexog_table_file_name
    prueba_iphop/Wdir/rafah_out/Full_Genome_to_OG_Score_Min_Score_50-Max_evalue_1e-05_Prediction.tsv
    host_pred_file_name
    prueba_iphop/Wdir/rafah_out/Full_Host_Predictions.tsv
    [1] "Loading Model from  test_db/Test_db_rw/db/rafah_data/MMSeqs_Clusters_Ranger_Model_1+2+3_Clean.RData"
    No such file or directory at /home/csic/mgr/mgr/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/utils/RaFAH_v0.3.pl line 321.
    Parsing output of host prediction prueba_iphop/Wdir/rafah_out/Full_Host_Predictions.tsv
    

    Checked the RaFAH_v0.3.pl, the predict_host function - system().

    The prueba_iphop/Wdir/rafah_out/Full_Host_Predictions.tsv file is missing.

    Edit again: it has something to do with Perl (it should have been obvious, previous issues have dealt with this…) although I did not receive any error message; going to try to solve this.

  4. Simon Roux repo owner

    More than 16 GB of RAM should be enough, in theory. I’m not sure if this has to do with Perl or with R (iPHoP, in Python, calls the RaFAH Perl script, which itself calls an R script :-( we have plans/hopes to simplify it in the future but.. this is where we are right now, sorry about that!).

    The missing file is the one that the R part should generate, so I would maybe try to run the R call separately in a terminal to get all the output and see what happens ? You should be able to recreate the command line based on the line “system("Rscript $r_script_predict_file_name $r_model_file_name $genomexog_table_file_name $host_pred_file_name $threads");”, i.e. it should look something like:

    $ conda activate iphop
    
    $ Rscript /home/csic/mgr/mgr/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/utils/RaFAH_src/RaFAH_Predict_Host.R test_db/Test_db_rw/db/rafah_data/MMSeqs_Clusters_Ranger_Model_1+2+3_Clean.RData prueba_iphop/Wdir/rafah_out/Full_Genome_to_OG_Score_Min_Score_50-Max_evalue_1e-05_Prediction.tsv prueba_iphop/Wdir/rafah_out/Full_Host_Predictions.tsv 8
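
    If it is easier to drive that call from Python and keep the full output, the sketch below is one way to do it. The helper names `describe_exit` and `run_and_report` are hypothetical and not part of iPHoP or RaFAH. It relies on the documented behaviour of `subprocess` on POSIX, where a negative return code means the child was terminated by that signal number; a SIGKILL is consistent with, though not proof of, the kernel stopping the process for running out of memory.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable description."""
    if returncode == 0:
        return "finished normally"
    if returncode < 0:
        # On POSIX, -N means the child was terminated by signal N.
        return f"killed by {signal.Signals(-returncode).name}"
    if returncode > 128:
        # Shells report a signal N as exit code 128 + N.
        return f"exit code {returncode}, possibly signal {returncode - 128} via a shell"
    return f"failed with exit code {returncode}"


def run_and_report(cmd):
    """Run cmd (a list of arguments), echo its output, and describe how it ended."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    sys.stdout.write(result.stdout)
    sys.stderr.write(result.stderr)
    return result.returncode, describe_exit(result.returncode)


# Example call, with the four paths and the thread count taken from rafah.log:
# run_and_report(["Rscript", r_script, r_model, genomexog_table, host_pred, "8"])
```

    The commented call at the end uses placeholder variable names; they stand for the script, model, table, and output paths listed in rafah.log.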
    

  5. Carlos Naf reporter

    Thank you very much, now it’s working!

    On the one hand, your suggestion made me redownload the test databases, and I could observe further progress, somehow the file was corrupted (I mv and cp databases, guess something happened then).

    But in the end it seems it was a memory problem. I’ve used up to 128 GB of memory and it worked… guess it might work with 64 as well. However, the job efficiency report said I used at most 17 GB. Is it possible that this R script, which is relatively simple and is handling relatively small things, can consume so many resources? When using 32 GB, executing just the Rscript ended in “Killed”.

  6. Simon Roux repo owner

    Right, the Rscript ending up with “Killed” is, in my experience, indeed a sign of running out of memory. And yes, unfortunately, it’s loading a large data frame in memory (even for processing a single virus), and it consumes many more resources than it could / should. We have on our list to hopefully re-implement the same logic directly in Python, which would remove the need to go through a Perl script and an R script, but in the meantime, you’ll probably need to allocate at least 64 GB of memory. The good news is: this is not dependent on your input size, so it won’t increase even when you switch to the complete database and the processing of hundreds of input sequences.

    Best,

    Simon
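
    To compare against a scheduler's efficiency report, the peak memory of the R step can also be read directly from the operating system. The sketch below is a rough, Unix-only illustration using Python's standard `resource` module; the helper name `run_and_measure_peak_gib` is hypothetical. It assumes `ru_maxrss` is reported in kilobytes on Linux and in bytes on macOS, and because the value covers all child processes waited for so far, it is best used once per Python session.

```python
import resource
import subprocess
import sys


def run_and_measure_peak_gib(cmd):
    """Run cmd and return (return code, peak resident memory of children in GiB)."""
    returncode = subprocess.run(cmd).returncode
    # Largest resident set size among child processes waited for so far.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # Linux reports kilobytes; macOS reports bytes.
    peak_bytes = peak if sys.platform == "darwin" else peak * 1024
    return returncode, peak_bytes / 1024**3


if __name__ == "__main__":
    # Harmless stand-in command; replace it with the Rscript call to measure RaFAH.
    code, peak_gib = run_and_measure_peak_gib(
        [sys.executable, "-c", "x = bytearray(50_000_000)"]
    )
    print(f"exit code {code}, peak memory about {peak_gib:.2f} GiB")
```

    If the process is killed for exceeding its allocation, the number only shows how far it got before being stopped, not how much it would have needed.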
