Adding MAGs to standard database - [6] Add new genomes to VHM database...

Issue #34 closed
Docente EAD Oscar Salgado created an issue

Hello,

Thanks so much for this tool.

I am adding 1,765 MAGs from 37 samples to the standard database. Everything looks OK (I appreciate the detailed instructions), but it has been running for three days at step 6 ([6] Add new genomes to VHM database...), so I would like to ask whether that is expected.
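(For context, my invocation followed the add_to_db pattern from the instructions, roughly like the sketch below. The directory names here are placeholders rather than my real paths, and the flag names are as I understand them from the README, so please double-check against "iphop add_to_db --help":)

    # Sketch of the add step; placeholder paths, flags per my reading of the docs
    iphop add_to_db --fna_dir new_MAGs/ \
                    --gtdb_dir gtdbtk_output/ \
                    --out_dir extended_db/ \
                    --db_dir standard_db/ \
                    --num_threads 8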

Thank you very much.

Regards.

Comments (9)

  1. Simon Roux repo owner

    Hi Oscar,

    1,765 MAGs is a lot to add; however, 3 days also seems relatively long for the VHM step. What seems more likely is that the log has not (yet) been updated, and the long database-creation step actually running is the one for WIsH. You should certainly keep an eye on this, and copy the log into this issue if the program never finishes.

    Best,

    Simon

  2. Docente EAD Oscar Salgado reporter

    Hi. Thank you for your answer.

    Finally, I stopped it. The only log file that I can see is wish.log, in the rewish_tmp folder:

    __

    Processing /media/ddbb/iphop_db/Sept_2021_pub_rw/db/wish_data/Decoy_db/Decoy_phages.fna -- compiling kmers and matching to hosts in 1 batches
    processing batch 1
    loading virus kmers for 1 to 460
    Processing all host packages in /media/ddbb/iphop_db/Sept_2021_pub_rw_hotspring/db/rewish_models_extra/
    Compiling individual batches results from /media/ddbb/iphop_db/Sept_2021_pub_rw_hotspring/db/rewish_tmp/wish_results into /media/ddbb/iphop_db/Sept_2021_pub_rw_hotspring/db/rewish_tmp/llikelihood.matrix

    __

    Thinking that the number of MAGs could be the problem, I will try two groups of MAGs, feeding the second group into the database resulting from the first group. I would appreciate your feedback.

    Best regards.

  3. Simon Roux repo owner

    Sounds good. I think it’s definitely worth starting with a few MAGs just to make sure the pipeline works. If it does, then it is indeed a numbers issue, unfortunately.
    The problem with the option of adding one custom database on top of another is that it may not work (this has never been tested). If your test with a few MAGs works, your better option is to dedicate more threads to the “add_to_db” script (ideally run it with 32 or even 64 threads); this should speed things up, as in the example below.
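    For example, something along these lines (the paths are placeholders; only the thread count really changes):

        # same add_to_db invocation as before, just with more threads (64 as an example)
        iphop add_to_db --fna_dir new_MAGs/ --gtdb_dir gtdbtk_output/ --out_dir extended_db/ --db_dir standard_db/ --num_threads 64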

  4. Docente EAD Oscar Salgado reporter

    Dear Simon,

    Finally, I ran the script successfully and added the MAGs. I changed two things: I shortened the paths and removed any files with other extensions from the working directory.

    Also, I can run the standard database without problems. However, when I try to run with the extended database (iphop predict --fa_file /media/oscarwd/hotspring_vmags/vrhyme_results/concatenated_vmags_vcontigs/cat_vmags_vcontigs_37mg.fasta --db_dir /media/ddbb/refineM_MAGS_hotsprings/Sept_2021_pub_rw_37mg_1.3.1/ --out_dir iphop_out_37mg_db_v1.3.1_intento2/ --num_threads 32 --debug), the process fails with:

    ___
    [3/1/Run] Running WIsH extra database...
    multiprocessing.pool.RemoteTraceback:
    """
    Traceback (most recent call last):
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
    File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
    File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
    File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
    File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: 'M65_SRR5580902_DOE_057_rm'

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/iphop/modules/wish.py", line 208, in process_batch
    rewish_results = add_pvalues(rewish_results,ref_file)
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/iphop/modules/wish.py", line 221, in add_pvalues
    rewish_results["normalized"] = rewish_results.apply(lambda x: transform(x['LL'],x['Host'],ref_mat), axis=1)
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/pandas/core/frame.py", line 8740, in apply
    return op.apply()
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/pandas/core/apply.py", line 688, in apply
    return self.apply_standard()
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/pandas/core/apply.py", line 812, in apply_standard
    results, res_index = self.apply_series_generator()
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/pandas/core/apply.py", line 828, in apply_series_generator
    results[i] = self.f(v)
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/iphop/modules/wish.py", line 221, in <lambda>
    rewish_results["normalized"] = rewish_results.apply(lambda x: transform(x['LL'],x['Host'],ref_mat), axis=1)
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/iphop/modules/wish.py", line 227, in transform
    ref_row = ref_mat.loc[host,['Average','Stdev']]
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/pandas/core/indexing.py", line 925, in getitem
    return self._getitem_tuple(key)
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/pandas/core/indexing.py", line 1100, in _getitem_tuple
    return self._getitem_lowerdim(tup)
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/pandas/core/indexing.py", line 838, in _getitem_lowerdim
    section = self._getitem_axis(key, axis=i)
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/pandas/core/indexing.py", line 1164, in _getitem_axis
    return self._get_label(key, axis=axis)
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/pandas/core/indexing.py", line 1113, in _get_label
    return self.obj.xs(label, axis=axis)
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/pandas/core/generic.py", line 3776, in xs
    loc = index.get_loc(key)
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
    KeyError: 'M65_SRR5580902_DOE_057_rm'
    """

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/bin/iphop", line 10, in <module>
    sys.exit(cli())
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/iphop/iphop.py", line 128, in cli
    args["func"](args)
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/iphop/modules/master_predict.py", line 87, in main
    wish.run_and_parse_wish(args)
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/iphop/modules/wish.py", line 48, in run_and_parse_wish
    run_rewish(args["fasta_file"],extra_raw_results,args["wish_db_dir_extra"],extra_negfit,extra_out_tmpdir,threads_tmp)
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/iphop/modules/wish.py", line 156, in run_rewish
    async_parallel(process_batch, args_list, threads)
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/iphop/modules/wish.py", line 245, in async_parallel
    return [r.get() for r in results]
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/site-packages/iphop/modules/wish.py", line 245, in <listcomp>
    return [r.get() for r in results]
    File "/home/osalgado/anaconda3/envs/iphop_1.3.1/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
    KeyError: 'M65_SRR5580902_DOE_057_rm'

    ___

    I think removing that MAG could be a quick workaround, but really I am not sure.
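    (In case it is useful, here is a quick sanity check I could run; this is just my own idea, not something from the docs, to see which files in the custom database mention that host name:)

        # list the files in the custom database folder that contain the failing host name
        grep -rl "M65_SRR5580902_DOE_057_rm" /media/ddbb/refineM_MAGS_hotsprings/Sept_2021_pub_rw_37mg_1.3.1/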

    I appreciate your help.

    Regards.

  5. Simon Roux repo owner

    Hi Oscar,

    You don’t need to remove any MAG for now. This looks like the same bug that was reported a few days ago for custom databases. I have a fix in a new version (1.3.2) that is being uploaded to bioconda right now. I will let you know as soon as it’s available for you to download and test.

    Best,

    Simon

  6. Simon Roux repo owner

    Hi Oscar,

    There is a new version on bioconda (iPHoP v1.3.2) in which this bug should be fixed. Please update your iPHoP install (“conda install iphop=1.3.2”), and rebuild your custom database (you will unfortunately need to start from scratch here, i.e. re-run the “add_to_db” part). With the new custom database built with iPHoP v1.3.2, you should not see this error anymore.
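    The whole sequence would look roughly like this (the environment name is taken from the paths in your traceback, and the bioconda channel flag is an assumption about your setup):

        # update iPHoP inside the existing conda environment (env name from your traceback)
        conda activate iphop_1.3.1
        conda install -c bioconda iphop=1.3.2
        conda list iphop    # confirm that version 1.3.2 is now installed
        # then re-run the "add_to_db" step from scratch to rebuild the custom database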

    Let me know if it works!

    Thanks,

    Simon

  7. Docente EAD Oscar Salgado reporter

    Hi Simon,

    I followed your advice and everything is OK now. I obtained the GTDB files with GTDB-Tk 2.3.0 (R214) and added the 1,696 MAGs to the iPHoP database. I now have results for both the standard and custom databases.

    Thank you very much for your work here and for this great tool.

    Best regards.
