KeyError with custom db

Issue #37 closed
Grégoire M created an issue

Hi,

I’m using version 1.3.1.

I get an error similar to https://bitbucket.org/srouxjgi/iphop/issues/30/keyerror-3300006428_5_vs_rs_gcf_0087277351

except that it happens with a custom database. I used the latest version of GTDB-Tk for the denovo option.

I also get the same error with another custom database.

Here are the last lines of the log file (run with the --debug option).

Also, the log file is 2.4 GB, with “Processing” written about 90 million times; I don’t know if that’s related…

        Processing OTE_20_concoct_167_sub
        Processing OTE_20_metabat_38
        Processing OTE_64_metabat_289
        Processing VAR_48_metabat_265_sub
        Processing OTE_38_metabat_126
        Processing SOY_34_metabat_33
        Processing VAR_17_metabat_392
        Processing VAR_34_metabat_152
        Processing SOY_34_metabat_159
        Processing VAR_61_metabat_364
        Processing OTE_20_metabat_260_sub
        Processing VAR_83_metabat_510

iphop_db_ensemble/db/rewish_models_extra/Batch_extra.pkl LL processed, now we calculate the p-values and export
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'VAR_4_metabat_244'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 208, in process_batch
rewish_results = add_pvalues(rewish_results,ref_file)
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 221, in add_pvalues
rewish_results["normalized"] = rewish_results.apply(lambda x: transform(x['LL'],x['Host'],ref_mat), axis=1)
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/pandas/core/frame.py", line 8740, in apply
return op.apply()
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/pandas/core/apply.py", line 688, in apply
return self.apply_standard()
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/pandas/core/apply.py", line 812, in apply_standard
results, res_index = self.apply_series_generator()
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/pandas/core/apply.py", line 828, in apply_series_generator
results[i] = self.f(v)
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 221, in <lambda>
rewish_results["normalized"] = rewish_results.apply(lambda x: transform(x['LL'],x['Host'],ref_mat), axis=1)
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 227, in transform
ref_row = ref_mat.loc[host,['Average','Stdev']]
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/pandas/core/indexing.py", line 925, in __getitem__
return self._getitem_tuple(key)
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/pandas/core/indexing.py", line 1100, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/pandas/core/indexing.py", line 838, in _getitem_lowerdim
section = self._getitem_axis(key, axis=i)
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/pandas/core/indexing.py", line 1164, in _getitem_axis
return self._get_label(key, axis=axis)
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/pandas/core/indexing.py", line 1113, in _get_label
return self.obj.xs(label, axis=axis)
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/pandas/core/generic.py", line 3776, in xs
loc = index.get_loc(key)
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
raise KeyError(key) from err
KeyError: 'VAR_4_metabat_244'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/work/river/Software/miniconda3/envs/iphop_env/bin/iphop", line 10, in <module>
sys.exit(cli())
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/iphop/iphop.py", line 128, in cli
args["func"](args)
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/master_predict.py", line 87, in main
wish.run_and_parse_wish(args)
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 48, in run_and_parse_wish
run_rewish(args["fasta_file"],extra_raw_results,args["wish_db_dir_extra"],extra_negfit,extra_out_tmpdir,threads_tmp)
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 156, in run_rewish
async_parallel(process_batch, args_list, threads)
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 245, in async_parallel
return [r.get() for r in results]
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 245, in <listcomp>
return [r.get() for r in results]
File "/work/river/Software/miniconda3/envs/iphop_env/lib/python3.8/multiprocessing/pool.py", line 771, in get
raise self._value
KeyError: 'VAR_4_metabat_244'
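For context, the failing call in `wish.py` (`ref_mat.loc[host,['Average','Stdev']]`) is a plain pandas label lookup, and any label absent from the index raises exactly this KeyError. A minimal, self-contained reproduction (MAG names taken from the log above, data values made up):

```python
import pandas as pd

# Stand-in for ref_mat: per-host average and stdev of WIsH log-likelihoods
ref_mat = pd.DataFrame(
    {"Average": [-1.2, -0.9], "Stdev": [0.1, 0.2]},
    index=["VAR_83_metabat_510", "OTE_38_metabat_126"],
)

# A host present in the index: the lookup succeeds
row = ref_mat.loc["VAR_83_metabat_510", ["Average", "Stdev"]]

# A host missing from the index: the same KeyError as in the traceback above
try:
    ref_mat.loc["VAR_4_metabat_244", ["Average", "Stdev"]]
except KeyError as exc:
    msg = str(exc)
    print(msg)  # 'VAR_4_metabat_244'
```

This is why a single MAG missing from the reference table makes the whole multiprocessing pool fail.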

Best

greg

Comments (14)

  1. Grégoire M reporter

    Hi,

    If it helps, I looked into it a bit, and a potential issue is that this MAG, and quite a lot of others, are absent from the file Wish_extra_negFits.csv and also from likelihood.matrix.
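A quick way to run this membership check could be a short pandas helper; the CSV layout (MAG names in the first column) is an assumption here, adjust `index_col` if your file differs:

```python
import pandas as pd

def missing_mags(negfits_csv, mag_names):
    """Return the given MAG names that are absent from the negFits table.

    Assumes the first CSV column holds the MAG name.
    """
    index = set(pd.read_csv(negfits_csv, index_col=0).index)
    return [name for name in mag_names if name not in index]
```

For example, `missing_mags("Wish_extra_negFits.csv", ["VAR_4_metabat_244"])` returns the names that would trigger the KeyError.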

    Best

    Greg

  2. Simon Roux repo owner

    Hi Greg,

    Thanks for reporting this issue. I don’t think this is similar to https://bitbucket.org/srouxjgi/iphop/issues/30/keyerror-3300006428_5_vs_rs_gcf_0087277351 (this one was fixed in 1.3.1), but it does look similar to https://bitbucket.org/srouxjgi/iphop/issues/36/wish-keyerror (custom database). I need to look into this, but I think you are on the right track with some MAGs missing from “Wish_extra_negFits.csv” (MAGs missing from likelihood.matrix is OK; only the “best hits” will be in there).

    Most likely, this means the issue occurred at the database creation step, not the iPHoP prediction step. Do you still have the log from this custom database creation? If not, what may be worth trying is:

    • In the new custom database “db” folder, remove the directory “rewish_models_extra”
    • In the new custom database “db_infos” folder, remove the file “Wish_extra_negFits.csv”
    • Re-run the “add_to_db” script, with “--debug” (and store the standard output & standard error).
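A sketch of these clean-up steps as shell commands (run here against a throwaway copy of the layout so it is safe to execute; the real paths depend on where your custom database lives, and the add_to_db arguments are left as placeholders, so check `iphop add_to_db --help`):

```shell
# Throwaway copy of the assumed custom-db layout, so the demo is safe to run
DB=$(mktemp -d)
mkdir -p "$DB/db/rewish_models_extra" "$DB/db_infos"
touch "$DB/db_infos/Wish_extra_negFits.csv"

# Step 1: remove the WIsH models directory from the "db" folder
rm -rf "$DB/db/rewish_models_extra"
# Step 2: remove the negFits file from the "db_infos" folder
rm -f "$DB/db_infos/Wish_extra_negFits.csv"
# Step 3: re-run add_to_db with --debug, keeping stdout and stderr
# (printed rather than executed; the arguments are placeholders)
echo "iphop add_to_db ... --debug > add_to_db.log 2>&1"
```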

    What I suspect is happening is that something goes wrong when generating the “Wish_extra_negFits” file (and maybe the “rewish_models_extra” models, but I’m not sure), but the “add_to_db” script does not catch it and acts as if it had succeeded, which leads to the failure when running the prediction with this database. Since this add_to_db step works well on my side, we’ll need the logs of “add_to_db”, and/or I would need to try running it with your MAGs to understand what’s going on.

    Thanks !

  3. Grégoire M reporter

    Hi Simon

    I attached the add_to_db and the predict logs for you to check.

    Also, I checked the Batch_extra.pkl file with this script

    import pickle
    
    # Load the list of per-MAG WIsH results stored in the batch pickle
    with open('Batch_extra.pkl', 'rb') as f:
        dat = pickle.load(f)
    

    and all the MAGs are present inside it, including

    {'Name': 'GFS_134', 'LL': array([-1.2144441 , -0.98082925, -2.21297293, ..., -1.38629436,
    -1.38629436, -1.38629436])}
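To cross-check the two files systematically, one could diff the names in the pickle against the negFits table. This assumes the pickle holds a list of `{'Name': …, 'LL': …}` dicts (as the entry above suggests) and that the CSV’s first column is the MAG name:

```python
import pickle

import pandas as pd

def missing_from_negfits(pkl_path, negfits_csv):
    """Return MAG names present in the rewish pickle but absent from the negFits table."""
    with open(pkl_path, "rb") as f:
        entries = pickle.load(f)
    in_pkl = {entry["Name"] for entry in entries}
    in_csv = set(pd.read_csv(negfits_csv, index_col=0).index)
    return sorted(in_pkl - in_csv)
```

An empty result would mean the two files agree; any names returned are the ones that would crash `iphop predict`.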

    Best

    Greg

  4. Simon Roux repo owner

    Great, it is good news that all the MAGs are in “Batch_extra.pkl”. Can you check if all the MAGs are also present in the Wish_extra_negFits.csv file? Never mind, this was checked before, you already mentioned it.

  5. Simon Roux repo owner

    So that is consistent with the log you shared, i.e. GFS_134 is seen as “processed”; however, the resulting table, which should have all the MAGs (with log-likelihood average and standard deviation), seems incomplete: it only has 1,450 rows, which I suspect are the same ones you see in Wish_extra_negFits.csv (the standard deviation is also “NaN” for a number of these MAGs, which is weird).

    Since this only seems to happen in some cases (possibly linked to the size of the MAGs, but it could also be some specifics of the MAGs themselves, e.g. sequence name format, etc.), would you be OK sharing the input to iPHoP add_to_db (i.e. a tar archive of the GTDB-Tk result and of the fasta file folder), so I can try to recreate the bug on my side? If that’s OK, you should be able to upload them to https://drive.google.com/drive/folders/1L72nXsUOdFo-6xM0KjNgia1Vz-LRZ8YV?usp=sharing

    Thanks !

  6. Simon Roux repo owner

    Hi Greg,

    Thanks again for sharing these files. I think I understood what was happening (it was a bug in the way the WIsH references were built when adding “too many” custom MAGs). I have fixed this bug and released a new version of iPHoP on bioconda. Can you update your version (“conda install iphop=1.3.2”) and then re-build the custom databases? Unfortunately, it is probably best to completely rebuild the database from the GTDB-Tk output and the fasta files, but the good news is that, this time, you should see the 2,739 MAGs in the file “Wish_extra_negFits.csv”, and then iPHoP predict on this database should work. I also changed the way “add_to_db” uses multiple threads, so hopefully it is quicker than it used to be (on my side, building a custom database with your MAGs took a little more than 2 hours using one node and 32 threads).

    Let me know when you have had a chance to test this; hopefully this fixes the issue on your side as well!

    Best,

    Simon

  7. Simon Roux repo owner

    Perfect, thanks for checking so quickly, and for the detailed bug report, which helped us figure out quickly what was happening!
