Run only specific steps (after step 8)

Issue #8 resolved
Former user created an issue

Hi Simon, hope you are doing well, and thank you for developing this great tool :-) I'm not sure if I can ask this kind of question here, as it is not a program issue. I ran a quite large dataset (~20K viral contigs) through your tool, but it failed after two days because it hit the time limit of our server (with 32 CPUs). I should have requested 3 days... Anyway, my question is: is there a way to resume the run from a certain step? All steps were completed except those after step 8. Here is the error message.

    [8/1.2] Run blast classifier Model_blast_Conv-87 (by batch)..
    slurmstepd: error: JOB 691099 ON skylake-f32-01 CANCELLED AT 2022-09-14T22:25:30 DUE TO TIME LIMIT
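For context, a minimal SLURM submission sketch with a longer walltime (the resource values and file paths are placeholders, and the iphop flags should be checked against iphop predict --help for your version):

    #!/bin/bash
    #SBATCH --job-name=iphop
    #SBATCH --cpus-per-task=32
    #SBATCH --time=3-00:00:00   # 3 days instead of 2
    #SBATCH --mem=128G          # RaFAH can be memory-hungry (see below in the thread)

    iphop predict --fa_file my_contigs.fna --db_dir /path/to/iphop_db --out_dir iphop_output -t 32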

Best wishes,

Mia

Comments (7)

  1. Sungeun LEE

    Hi, it's me again,

    I finally found in master_predict.py that the --step classify option allows the run to restart from step 7 :-), so please ignore my previous message.
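    For anyone finding this thread later, the resume call would look something like this (paths are placeholders, and flag names may vary between versions, so double-check against iphop predict --help):

        # Point --out_dir at the same directory as the failed run so iPHoP
        # re-uses the files already in Wdir, then jump straight to classify
        iphop predict --fa_file my_contigs.fna \
                      --db_dir /path/to/iphop_db \
                      --out_dir iphop_output \
                      --step classify \
                      -t 32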

    Best wishes

    Mia

  2. Simon Roux repo owner

    Hi Mia,

    That is correct, “--step classify” will bypass all the previous steps. iPHoP also always tries to re-use existing files by default, which can help in this kind of case, but it sometimes causes issues if an incomplete output file was produced in the failed run. So if you get unexpected results or errors, my solution is typically to go into the “Wdir” folder and remove the latest files (this gets a bit into the inner workings of iPHoP though, so if you're unsure, feel free to email me and I can help identify which file is likely incomplete).
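    For example, a quick way to spot and clear the most recently written files (paths are illustrative, and <suspect_file> is a placeholder):

        # List Wdir contents, most recently modified first; the newest files
        # are the most likely to be incomplete after a killed run
        ls -lt iphop_output/Wdir | head

        # Remove the suspect file(s) so iPHoP regenerates them on the next run
        rm iphop_output/Wdir/<suspect_file>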

    Best,

    Simon

  3. Sungeun LEE

    Good morning Simon, and thank you very much for your quick reply ;)

    I've checked all the created folders & output files, and “--step classify” worked well, but I ran into another issue during the final step, the one combining all results:

    [9/2] Combining all results (Blast, CRISPR, iPHoP, and RaFAH) in a single file: iphop_output/Wdir/All_combined_scores.csv
    Traceback (most recent call last):
      File "/home/ampere/slee/anaconda3/envs/iphop_env/bin/iphop", line 10, in <module>
        sys.exit(cli())
      File "/home/ampere/slee/anaconda3/envs/iphop_env/lib/python3.8/site-packages/iphop/iphop.py", line 121, in cli
        args["func"](args)
      File "/home/ampere/slee/anaconda3/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/master_predict.py", line 102, in main
        runaggregatormodel.run_model(args)
      File "/home/ampere/slee/anaconda3/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/runaggregatormodel.py", line 65, in run_model
        merged = merge_all_results(args)
      File "/home/ampere/slee/anaconda3/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/runaggregatormodel.py", line 238, in merge_all_results
        rafah_results = rafah.filter_rafah(rafah_results,args)
      File "/home/ampere/slee/anaconda3/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/rafah.py", line 165, in filter_rafah
        rafah_full_clusters = os.path.join(args["rafah_out_dir"], "Full_Genome_to_OG_Score_Min_Score_50-Max_evalue_1e-05_Prediction.tsv")
    KeyError: 'rafah_out_dir'

    I was wondering if this may be related to an error from step 8? The following came up, but the run still continued up to step 9:

    [8/1.1] Getting blast-based scores..
    2022-09-15 13:16:02.897672: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/ampere/slee/anaconda3/envs/iphop_env/x86_64-conda-linux-gnu/lib/:/home/ampere/slee/anaconda3/envs/iphop_env/lib/:
    2022-09-15 13:16:02.897714: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
    2022-09-15 13:16:02.897734: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (cascade-f32-07): /proc/driver/nvidia/version does not exist
    2022-09-15 13:16:02.898147: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

    Best wishes

    Mia

  4. Simon Roux repo owner

    Hi Mia,

    So, as far as I know, the messages you see are only “warnings”, i.e. you may not get the best performance from TensorFlow, but it should still produce the expected output. The error you have in step 9/2 suggests, however, that RaFAH did not finish (the final output file from RaFAH does not exist). This can happen especially because of “out of memory” issues when running RaFAH, which are not always captured well by the error logs.
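    As an aside, these are standard TensorFlow environment variables (not iPHoP-specific) that can quiet the GPU probing on a CPU-only node:

        # Hide GPUs from TensorFlow so it stops looking for libcuda.so.1
        export CUDA_VISIBLE_DEVICES=-1
        # 0 = all messages, 1 = filter INFO, 2 = filter INFO+WARNING, 3 = errors only
        export TF_CPP_MIN_LOG_LEVEL=2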

    A few things you could check / try:

    • You can look into the file “Wdir/rafah.log” to see what RaFAH did, and whether it hit any error (the last line should be “Printing results to …”).
    • Because the dataset is so big (20k contigs), it is likely that you will run into memory / walltime issues. Our advice is typically to split the dataset into smaller batches (e.g. 1k contigs) and process these separately. This does not change the results at all (all virus sequences are processed independently), but it can alleviate these issues. We also included a small utility to split an input fasta file into smaller batches: iphop split --input_file <fasta file of the contigs> --split_dir <empty directory where individual batches will be written> [options]. See the sketch after this list.
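    Put together, a minimal sketch of both checks (the batch file extension and loop structure are illustrative; see iphop split --help for the actual options and output naming):

        # 1. Check whether RaFAH finished; the last line should read
        #    "Printing results to ..."
        tail -n 5 iphop_output/Wdir/rafah.log

        # 2. Split the input into smaller batches and run each one separately
        iphop split --input_file all_contigs.fna --split_dir batches/
        for batch in batches/*.fna; do
            iphop predict --fa_file "$batch" \
                          --db_dir /path/to/iphop_db \
                          --out_dir "results_$(basename "$batch" .fna)" \
                          -t 32
        done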

    Hope it helps !

    Best,

    Simon

  5. Sungeun LEE

    Hi Simon ;)

    Good news: as you said, the error was due to running out of memory during RaFAH, so I reran the job with more memory.

    By increasing the memory to 768GB (the maximum for our server), I was able to finish the job even with 20k contigs :)

    Now I'm very excited to see the results, and thanks again for your help ;-)

    Mia
