Missing genome_by_genome_overview.csv in output directory

Issue #14 resolved
Chris created an issue

Hi,

I’m trying to run 355 phage sequences through vContact2. The run appears to complete without any errors, but I do not get the genome_by_genome_overview.csv file. Below is the command I’ve been using:

vcontact -t 30 --raw-proteins all_prophage_proteins.faa --rel-mode "Diamond" --proteins-fp gene_to_genome.csv --db "ProkaryoticViralRefSeq97-Merged" --pcs-mode MCL --vcs-mode ClusterONE --c1-bin /home/Documents/Programs/MAVERICLab-vcontact2-6d6fe8cf260a/bin/cluster_one-1.0.jar --output-dir vContact_output_97 &> vContact_97.log

and this is the output I get after running the vContact command above:

============================This is vConTACT2 0.9.13============================



----------------------------------Pre-Analysis----------------------------------


------------------------------Reference databases-------------------------------


-------------------------------Protein clustering-------------------------------


----------------------------------Loading data----------------------------------


--------------------------------Adding Taxonomy---------------------------------


------------------------Calculating Similarity Networks-------------------------
Loaded graph with 2901 nodes and 127532 edges
[====================] 100% Growing clusters from seeds...
[====================] 100% Finding highly overlapping clusters...
[====================] 100% Merging highly overlapping clusters...
Detected 351 complexes
.................................................. 1M
.................................................. 2M
..................
[mcl] new tab created
[mcl] pid 13812
 ite -------------------  chaos  time hom(avg,lo,hi) m-ie m-ex i-ex fmv
  1  ................... 126.06  1.99 0.96/0.01/12.04 5.63 1.85 1.85  71
  2  ...................  82.16  4.03 0.72/0.01/4.91 10.17 0.11 0.21  90
  3  ...................   8.43  0.30 0.90/0.07/10.78 1.73 0.22 0.05  14
  4  ...................   1.79  0.03 0.98/0.37/14.44 1.02 0.58 0.03   1
  5  ...................   1.04  0.02 0.99/0.50/8.41 1.00 0.88 0.02   0
  6  ...................   0.31  0.02 1.00/0.58/1.30 1.00 0.96 0.02   0
  7  ...................   0.23  0.02 1.00/0.77/1.00 1.00 0.99 0.02   0
  8  ...................   0.23  0.02 1.00/0.78/1.00 1.00 1.00 0.02   0
  9  ...................   0.00  0.02 1.00/1.00/1.00 1.00 1.00 0.02   0
 10  ...................   0.00  0.02 1.00/1.00/1.00 1.00 1.00 0.02   0
[mcl] cut <1> instances of overlap
[mcl] jury pruning marks: <97,99,99>, out of 100
[mcl] jury pruning synopsis: <97.8 or superb> (cf -scheme, -do log)
[mcl] output is in vContact_output_97/modules_mcl_5.0.clusters
[mcl] 765 clusters found
[mcl] output is in vContact_output_97/modules_mcl_5.0.clusters

Please cite:
    Stijn van Dongen, Graph Clustering by Flow Simulation.  PhD thesis,
    University of Utrecht, May 2000.
       (  http://www.library.uu.nl/digiarchief/dip/diss/1895620/full.pdf
       or  http://micans.org/mcl/lit/svdthesis.pdf.gz)
OR
    Stijn van Dongen, A cluster algorithm for graphs. Technical
    Report INS-R0010, National Research Institute for Mathematics
    and Computer Science in the Netherlands, Amsterdam, May 2000.
       (  http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z
       or  http://micans.org/mcl/lit/INS-R0010.ps.Z)

'Pseudomonas~virus~D3'


------------------------Contig Clustering & Affiliation-------------------------


--------------------------------Protein modules---------------------------------


---------------------------Link modules and clusters----------------------------


----------------------------Exporting results files-----------------------------
There were 564 genomes (including refs) that were singleton, outlier or overlaps.

I think I get all the other files produced in the output directory except that final file. These are the files I get:

$ ls vContact_output_97/
c1.clusters    merged.self-diamond.tab                 modules_mcl_5.0.clusters        sig1.0_mcl2.0_contigs.csv                                          vConTACT_profiles.csv
c1.ntw         merged.self-diamond.tab.abc             modules_mcl_5.0_modules.pandas  sig1.0_mcl2.0_modsig1.0_modmcl5.0_minshared3_link_mod_cluster.csv  vConTACT_proteins.csv
merged_df.csv  merged.self-diamond.tab.mci             modules_mcl_5.0_pcs.pandas      sig1.0_mcl5.0_minshared3_modules.csv                               viral_cluster_overview.csv
merged.dmnd    merged.self-diamond.tab_mcl20.clusters  modules.ntwk                    vConTACT_contigs.csv
merged.faa     merged.self-diamond.tab_mcxload.tab     sig1.0_mcl2.0_clusters.csv      vConTACT_pcs.csv

My phage are present in these files (e.g. the c1.clusters, c1.ntw, and viral_cluster_overview.csv) and have been assigned clusters, so all looks fine as far as I can tell. There just isn’t a genome_by_genome_overview.csv.

Thanks in advance for any help you can give and also thanks for making this tool… minus this little issue I love it!

Comments (21)

  1. Julian Zaugg

    Hi, has there been any progress in resolving this issue? I too have failed to get the genome_by_genome_overview.csv included in the output (similar run parameters as those described above). No errors reported in the log file.

  2. Ben Bolduc

    Hi Chris and Julian,

    Could you attach the viral_cluster_overview file, either here or email (bolduc.10 at osu edu)? I haven’t been able to reproduce this issue, yet it’s still a lingering issue for several people. The weird aspect of this is that the genome_by_genome file is a re-formed version of viral_cluster_overview.

    Thanks for giving vConTACT2 a shot with your research!

  3. HandymanAlan

    Hi Ben, I just ran into this problem too. Didn’t happen with a smaller sample size (~200 contigs), but I just ran it with >1000 contigs and there’s no genome_by_genome_overview.csv.

    There’s no error in the log either.

    I will send you the viral_cluster_overview file.

    Cheers

    Alan

  4. Stanley Ho

    “ERROR:vcontact2: Error in exporting the final summary data: first argument must be string or compiled pattern”

    This is the error I get at the end of the run

  5. Susheel Bhanu Busi

    @Ben Bolduc Has there been any update on this issue? I ran vConTACT2 as well, and don’t have the genome_by_genome_overview.csv file. And it doesn’t show if the run was complete or incomplete. Just have the following:

    Thu Jul  2 00:38:55 CEST 2020
    
    ============================This is vConTACT2 0.9.17============================
    
    
    
    ----------------------------------Pre-Analysis----------------------------------
    
    
    ------------------------------Reference databases-------------------------------
    
    
    -------------------------------Protein clustering-------------------------------
    
    
    ----------------------------------Loading data----------------------------------
    
    
    --------------------------------Adding Taxonomy---------------------------------
    
    
    ------------------------Calculating Similarity Networks-------------------------
    
    
    ------------------------Contig Clustering & Affiliation-------------------------
    
    
    --------------------------------Protein modules---------------------------------
    
    
    ---------------------------Link modules and clusters----------------------------
    
    
    ----------------------------Exporting results files-----------------------------
    There were 812 genomes (including refs) that were singleton, outlier or overlaps.
    

    Thanks for your help with this!

  6. Ben Bolduc

    Hi All,

    Thank you for reporting these issues. This issue has taken more effort to identify than I anticipated. The error most often appears together with a seemingly “random” genome name being printed to stdout, followed by no genome-by-genome file. Susheel’s version indicates it’s still occurring in the most recent version (0.9.17) and has occurred since at least 0.9.13 - and I’m assuming all runs have been using the v97 prokaryotes. It’s also been mentioned that this hasn’t happened with small numbers (200), but when it gets larger (350+?) there’s an issue (or rather, lack of an output file).

    Has anyone tried to run with a lower database version (i.e. "ProkaryoticViralRefSeq94-Merged")?

    Likewise, increasing the verbosity? vcontact2 <command> -vv

    Also, for anyone with a failed run (well, not generating a genome_by_genome file), have you tried to restart the run using the intermediate files?

    vcontact2 --contigs vConTACT_contigs.csv --pcs vConTACT_pcs.csv --pc-profiles vConTACT_profiles.csv --output-dir output --db "ProkaryoticViralRefSeq97-Merged" 
    

    And has anyone tried using the vConTACT2 app on CyVerse?

    The annoying part here is that I can’t reproduce the error - but clearly it’s occurring to multiple people. I’ll need to test with a much larger dataset, outside of those that have worked successfully for me in the past (i.e. the ~15K contigs from the GOV dataset).

    If anyone whose runs consistently fail would like to share their gene-to-genome and proteins files, please send them to my bolduc.10 at osu.edu address. The data will only be used to identify the issue, and I’ll remove it once it’s solved. At this point, I’m not sure if it’s an issue stemming from genome names or some complex interaction with certain datasets' network connectivity.

    Apologies for this taking so long to resolve. There isn’t really funding for v2 (grants have end dates and they’re not too kind for infinite-length support of tools), and the recent climate has adjusted my priorities as many researchers find themselves doing computational work instead of lab work, so I haven’t had enough spare time to focus on this. Though I’ll continue to try and solve this!

  7. Susheel Bhanu Busi

    Hey @Ben Bolduc ,

    Thank you for the support despite the lack of funding. I’m sure everybody here (me included) appreciates your efforts to help our science. I tried to restart the run with the intermediate files using the following:

    vcontact2 --contigs vConTACT_contigs.csv --pcs vConTACT_pcs.csv --c1-bin /home/users/sbusi/apps/miniconda3/bin/cluster_one-1.0.jar \
        --pc-profiles vConTACT_profiles.csv --output-dir test_output --db "ProkaryoticViralRefSeq97-Merged"
    

    The output I got was the following, but still no genome_by_genome_overview.csv file:

    INFO:vcontact2.modules: Loading the clustering results
    
    
    ---------------------------Link modules and clusters----------------------------
    INFO:vcontact2.modules: 3327 contigs-modules owning association, 46543 filtered (a contig must have 50% of the PCs to own a module).
    INFO:vcontact2.modules: Linking 652 modules with 371 contigs clusters...
    INFO:vcontact2.modules: Network done 371 clusters, 652 modules and 314 edges.
    
    
    ----------------------------Exporting results files-----------------------------
    INFO:vcontact2.exports.summaries: There were 729 sequences (including references) that were singleton, outlier or overlaps.
    There were 729 genomes (including refs) that were singleton, outlier or overlaps.
    INFO:vcontact2.exports.summaries: Reading edges for 2862 contigs
    INFO:vcontact2.exports.summaries: Building PC array
    INFO:vcontact2.exports.summaries: Calculating comparisons for back-calculations
    ERROR:vcontact2.exports.summaries: 'contig_11'
    

    Not sure if it has anything to do with the contig_11 error though.

    Then I tried the following with the v94 database.

    vcontact2 --contigs vConTACT_contigs.csv --pcs vConTACT_pcs.csv --c1-bin /home/users/sbusi/apps/miniconda3/bin/cluster_one-1.0.jar \
        --pc-profiles vConTACT_profiles.csv --output-dir test_output --db "ProkaryoticViralRefSeq94-Merged"
    

    and here’s the output from that:

    ---------------------------Link modules and clusters----------------------------
    INFO:vcontact2.modules: 3327 contigs-modules owning association, 46543 filtered (a contig must have 50% of the PCs to own a module).
    INFO:vcontact2.modules: Linking 652 modules with 371 contigs clusters...
    INFO:vcontact2.modules: Network done 371 clusters, 652 modules and 314 edges.
    
    
    ----------------------------Exporting results files-----------------------------
    INFO:vcontact2.exports.summaries: There were 729 sequences (including references) that were singleton, outlier or overlaps.
    There were 729 genomes (including refs) that were singleton, outlier or overlaps.
    INFO:vcontact2.exports.summaries: Reading edges for 2862 contigs
    INFO:vcontact2.exports.summaries: Building PC array
    INFO:vcontact2.exports.summaries: Calculating comparisons for back-calculations
    ERROR:vcontact2.exports.summaries: 'contig_11'
    

  8. Ben Bolduc

    Thank you to all who sent data. Unfortunately, I'm still unable to reproduce the error (on Mac, Linux), so it's probably a package versioning issue. I do, however, think I've identified the block of code that is likely responsible. I am unable to finish it this week, but should have available time the week after.

  9. Ben Bolduc
    • changed status to open

    I've identified the cause and am working out why it wasn't caught earlier in the code (I specifically have code that checks for this). The cause is viral genome naming: when one virus' name is a "subset" (substring) of another, i.e. "phage G1" and "phage G12", only one of them gets saved to the genome summary. However, since all genomes are iterated through for the final genome summary file, the virus that wasn't written gets read, but can't be found... which is why the virus genome name gets printed to screen.

    Re-opening while I squash this bug.
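
    A minimal sketch of how this kind of name collision can arise, assuming the lookup is a plain substring match against a comma-separated member list. The cluster table, its values, and all function names here are hypothetical illustrations, not vConTACT2's actual code:

```python
# Hypothetical cluster table: cluster id -> comma-separated member names.
clusters = {
    "VC_1": "phage G12,phage X",
    "VC_2": "phage G1,phage Y",
}


def clusters_by_substring(clusters, genome):
    """Naive lookup: a cluster matches if the name occurs anywhere in its member string."""
    return [vc for vc, members in clusters.items() if genome in members]


def clusters_by_exact_name(clusters, genome):
    """Stricter lookup: split the member string and compare whole names only."""
    return [vc for vc, members in clusters.items() if genome in members.split(",")]


def nested_names(names):
    """Return (short, long) pairs where one genome name is contained in another."""
    return [(a, b) for a in names for b in names if a != b and a in b]


# "phage G1" is a substring of "phage G12", so the naive lookup hits both clusters.
print(clusters_by_substring(clusters, "phage G1"))
# The exact lookup only returns the cluster that really contains "phage G1".
print(clusters_by_exact_name(clusters, "phage G1"))
# Screening a list of genome names for nested pairs before a run.
print(nested_names(["phage G1", "phage G12", "phage X"]))
```

    Running `nested_names` over your own genome names before a run is one way to spot names that could trigger this behaviour; renaming them so that no name is contained in another sidesteps the problem regardless of version.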

  10. Ben Bolduc

    I have updated vConTACT2 to 0.9.18, which includes handling of this issue. If anyone who has encountered this issue would like to try the new version, please do so. I’ve also tightened the dependencies, so a fresh vConTACT2 install (from bitbucket) or update should work.

  11. Susheel Bhanu Busi

    Thanks @Ben Bolduc! I tested it out today, and am running into the error below.

    INFO:vcontact2: Saving intermediate files...
    INFO:vcontact2: Read 229672 entries (dropped 2609 singletons) from /scratch/users/sbusi/cosmic_review/vibrant/VIBRANT/vcontact2_output/V6/C120/vConTACT_profiles.csv
    INFO:vcontact2.contig_clusters: Exporting for ClusterONE
    INFO:vcontact2.contig_clusters: Clustering the PC Similarity-Network using ClusterONE
    INFO:vcontact2.contig_clusters: 372 clusters loaded (singletons and non-connected nodes are dropped).
    ERROR:vcontact2: Error in contig clustering
    ERROR:vcontact2: 'Acidianus~bottle-shaped~virus~2'
    Traceback (most recent call last):
      File "/mnt/lscratch/users/sbusi/cosmic_review/vibrant/VIBRANT/.snakemake/conda/1cc7c9fa/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
        return self._engine.get_loc(key)
      File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: 'Acidianus~bottle-shaped~virus~2'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/mnt/lscratch/users/sbusi/cosmic_review/vibrant/VIBRANT/.snakemake/conda/1cc7c9fa/bin/vcontact2", line 607, in main
        gc = vcontact2.contig_clusters.ContigCluster(pcp, output_dir, cluster_one_fp, cluster_one_args,
      File "/mnt/lscratch/users/sbusi/cosmic_review/vibrant/VIBRANT/.snakemake/conda/1cc7c9fa/lib/python3.8/site-packages/vcontact2/contig_clusters.py", line 91, in __init__
        self.clusters, self.cluster_results = self.one_cluster(os.path.join(self.folder, self.name),
      File "/mnt/lscratch/users/sbusi/cosmic_review/vibrant/VIBRANT/.snakemake/conda/1cc7c9fa/lib/python3.8/site-packages/vcontact2/contig_clusters.py", line 227, in one_cluster
        return self.load_one_clusters(fi_clusters)
      File "/mnt/lscratch/users/sbusi/cosmic_review/vibrant/VIBRANT/.snakemake/conda/1cc7c9fa/lib/python3.8/site-packages/vcontact2/contig_clusters.py", line 340, in load_one_clusters
        if pd.isnull(self.contigs.loc[n, "pos_cluster"]):  # If never seen before
      File "/mnt/lscratch/users/sbusi/cosmic_review/vibrant/VIBRANT/.snakemake/conda/1cc7c9fa/lib/python3.8/site-packages/pandas/core/indexing.py", line 1418, in __getitem__
        return self._getitem_tuple(key)
      File "/mnt/lscratch/users/sbusi/cosmic_review/vibrant/VIBRANT/.snakemake/conda/1cc7c9fa/lib/python3.8/site-packages/pandas/core/indexing.py", line 805, in _getitem_tuple
        return self._getitem_lowerdim(tup)
      File "/mnt/lscratch/users/sbusi/cosmic_review/vibrant/VIBRANT/.snakemake/conda/1cc7c9fa/lib/python3.8/site-packages/pandas/core/indexing.py", line 929, in _getitem_lowerdim
        section = self._getitem_axis(key, axis=i)
      File "/mnt/lscratch/users/sbusi/cosmic_review/vibrant/VIBRANT/.snakemake/conda/1cc7c9fa/lib/python3.8/site-packages/pandas/core/indexing.py", line 1850, in _getitem_axis
        return self._get_label(key, axis=axis)
      File "/mnt/lscratch/users/sbusi/cosmic_review/vibrant/VIBRANT/.snakemake/conda/1cc7c9fa/lib/python3.8/site-packages/pandas/core/indexing.py", line 160, in _get_label
        return self.obj._xs(label, axis=axis)
      File "/mnt/lscratch/users/sbusi/cosmic_review/vibrant/VIBRANT/.snakemake/conda/1cc7c9fa/lib/python3.8/site-packages/pandas/core/generic.py", line 3737, in xs
        loc = self.index.get_loc(key)
      File "/mnt/lscratch/users/sbusi/cosmic_review/vibrant/VIBRANT/.snakemake/conda/1cc7c9fa/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
        return self._engine.get_loc(self._maybe_cast_indexer(key))
      File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: 'Acidianus~bottle-shaped~virus~2'
    

    I checked the pandas version in the conda environment; it is 0.25.3:

    pandas                    0.25.3           py38hb3f55d8_0    conda-forge
    

    I used the “ProkaryoticViralRefSeq97-Merged” database. Should I be using ProkaryoticViralRefSeq94-Merged instead?

    Thank you!

  12. Ben Bolduc

    Hmm. The contig clustering errors are usually due to issues with ClusterONE (in this instance, like not having it installed or java issues, etc). That said, it doesn’t appear that the archaeal viruses are in v97 (!). They’re in 94 and 201. I’ll need to re-update 97 asap.

    Please do try another database while I update v97 and let me know if that works.

  13. Susheel Bhanu Busi

    Funnily enough, I didn’t have issues with ClusterONE previously. And the test run worked fine as well. I updated it nonetheless with a clean new installation. Trying with 94 so will let you know.

  14. Susheel Bhanu Busi

    @Ben Bolduc I can confirm that with the ProkaryoticViralRefSeq94-Merged database everything works as expected, and I also get the genome_by_genome_overview.csv file.

    Thanks a lot for your help with fixing the issues.

    Sat Jul 25 00:59:15 CEST 2020
    
    ============================This is vConTACT2 0.9.18============================
    
    
    
    ----------------------------------Pre-Analysis----------------------------------
    
    
    ------------------------------Reference databases-------------------------------
    
    
    -------------------------------Protein clustering-------------------------------
    
    
    ----------------------------------Loading data----------------------------------
    
    
    --------------------------------Adding Taxonomy---------------------------------
    
    
    ------------------------Calculating Similarity Networks-------------------------
    
    
    ------------------------Contig Clustering & Affiliation-------------------------
    
    
    --------------------------------Protein modules---------------------------------
    
    
    ---------------------------Link modules and clusters----------------------------
    
    
    ----------------------------Exporting results files-----------------------------
    There were 564 genomes (including refs) that were singleton, outlier or overlaps.
    Sat Jul 25 01:34:22 CEST 2020
    

  15. 敬哲 姜

    Hi,

    I had the same problem. I’m trying to run 267,783 viral metagenome contigs through vContact2 (v 0.9.19), and couldn’t get the genome_by_genome_overview.csv. Below is the command I’ve been using:

    vcontact2 --raw-proteins oyster/ALL.contigs.cd-hit.phages_combined.simple.faa --rel-mode 'Diamond' --proteins-fp oyster/VIBRANT_genbank_table_ALL.contigs.cd-hit.tsv --db 'ProkaryoticViralRefSeq94-Merged' --pcs-mode MCL --vcs-mode ClusterONE --c1-bin /home/ubuntu/miniconda3/bin/cluster_one-1.0.jar --output-dir output-oyster -t 8
    

    The contig IDs in the file ALL.contigs.cd-hit.phages_combined.simple.faa look like this:

    all-k141_3960179 flag=1 multi=30.9838 len=3478_1
    all-k141_3960179 flag=1 multi=30.9838 len=3478_2
    all-k141_3960179 flag=1 multi=30.9838 len=3478_3
    all-k141_3960179 flag=1 multi=30.9838 len=3478_4
    all-k141_3960179 flag=1 multi=30.9838 len=3478_5
    all-k141_3960179 flag=1 multi=30.9838 len=3478_6
    KZY2-k141_66375 flag=1 multi=14.2750 len=1170_1
    KZY2-k141_66375 flag=1 multi=14.2750 len=1170_2
    KZY2-k141_66375 flag=1 multi=14.2750 len=1170_3
    KZY2-k141_66375 flag=1 multi=14.2750 len=1170_4
    ZH1-k141_68976 flag=0 multi=4.8691 len=3854_1
    ZH1-k141_68976 flag=0 multi=4.8691 len=3854_2
    ZH1-k141_68976 flag=0 multi=4.8691 len=3854_3
    ZH1-k141_68976 flag=0 multi=4.8691 len=3854_4
    ZH1-k141_68976 flag=0 multi=4.8691 len=3854_5
    T4S1-k141_394333 flag=1 multi=4.0000 len=2367_1
    T4S1-k141_394333 flag=1 multi=4.0000 len=2367_2
    T4S1-k141_394333 flag=1 multi=4.0000 len=2367_3

    I have also tried the other format, like this:

    all-k141_3960179-flag=1-multi=30.9838-len=3478_1
    all-k141_3960179-flag=1-multi=30.9838-len=3478_2
    all-k141_3960179-flag=1-multi=30.9838-len=3478_3
    all-k141_3960179-flag=1-multi=30.9838-len=3478_4
    all-k141_3960179-flag=1-multi=30.9838-len=3478_5
    all-k141_3960179-flag=1-multi=30.9838-len=3478_6
    KZY2-k141_66375-flag=1-multi=14.2750-len=1170_1
    KZY2-k141_66375-flag=1-multi=14.2750-len=1170_2
    KZY2-k141_66375-flag=1-multi=14.2750-len=1170_3
    KZY2-k141_66375-flag=1-multi=14.2750-len=1170_4
    ZH1-k141_68976-flag=0-multi=4.8691-len=3854_1
    ZH1-k141_68976-flag=0-multi=4.8691-len=3854_2
    ZH1-k141_68976-flag=0-multi=4.8691-len=3854_3
    ZH1-k141_68976-flag=0-multi=4.8691-len=3854_4
    ZH1-k141_68976-flag=0-multi=4.8691-len=3854_5
    T4S1-k141_394333-flag=1-multi=4.0000-len=2367_1
    T4S1-k141_394333-flag=1-multi=4.0000-len=2367_2
    T4S1-k141_394333-flag=1-multi=4.0000-len=2367_3

    No matter which format I use, I get the same error message, shown below:

    ------------------------Contig Clustering & Affiliation-------------------------

    --------------------------------Protein modules---------------------------------

    ---------------------------Link modules and clusters----------------------------

    ----------------------------Exporting results files-----------------------------

    There were 517 genomes (including refs) that were singleton, outlier or overlaps.
    Traceback (most recent call last):
      File "/home/ubuntu/miniconda3/envs/vContact2/bin/vcontact2", line 757, in <module>
        main(options)
      File "/home/ubuntu/miniconda3/envs/vContact2/bin/vcontact2", line 749, in main
        profiles_fp, vc, excluded)
      File "/home/ubuntu/miniconda3/envs/vContact2/lib/python3.7/site-packages/vcontact2/exports/summaries.py", line 269, in final_summary
        genome_df = summary_df.loc[summary_df['Members'].str.contains(genome, regex=False)]
      File "/home/ubuntu/miniconda3/envs/vContact2/lib/python3.7/site-packages/pandas/core/indexing.py", line 1424, in __getitem__
        return self._getitem_axis(maybe_callable, axis=axis)
      File "/home/ubuntu/miniconda3/envs/vContact2/lib/python3.7/site-packages/pandas/core/indexing.py", line 1839, in _getitem_axis
        return self._getitem_iterable(key, axis=axis)
      File "/home/ubuntu/miniconda3/envs/vContact2/lib/python3.7/site-packages/pandas/core/indexing.py", line 1133, in _getitem_iterable
        keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
      File "/home/ubuntu/miniconda3/envs/vContact2/lib/python3.7/site-packages/pandas/core/indexing.py", line 1092, in _get_listlike_indexer
        keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
      File "/home/ubuntu/miniconda3/envs/vContact2/lib/python3.7/site-packages/pandas/core/indexing.py", line 1177, in _validate_read_indexer
        key=key, axis=self.obj._get_axis_name(axis)
    KeyError: "None of [Float64Index([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n ...\n nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],\n dtype='float64', length=438)] are in the [index]"

    Would you please check it and see what went wrong? Thank you very much!

  16. Ben Bolduc

    Hi 敬哲 姜,

    I think the format of the gene-to-genome file might have different headers than the faa file. Could you copy-and-paste the first few lines of both files?

    For example, if your FAA file headers are like this:

    all-k141_3960179 flag=1 multi=30.9838 len=3478_1
    all-k141_3960179 flag=1 multi=30.9838 len=3478_2
    all-k141_3960179 flag=1 multi=30.9838 len=3478_3
    all-k141_3960179 flag=1 multi=30.9838 len=3478_4
    all-k141_3960179 flag=1 multi=30.9838 len=3478_5
    all-k141_3960179 flag=1 multi=30.9838 len=3478_6
    

    Then you’ll need to replace the spaces (“ “) with an underscore (“_”), to:

    all-k141_3960179_flag=1_multi=30.9838_len=3478_1
    all-k141_3960179_flag=1_multi=30.9838_len=3478_2
    all-k141_3960179_flag=1_multi=30.9838_len=3478_3
    all-k141_3960179_flag=1_multi=30.9838_len=3478_4
    all-k141_3960179_flag=1_multi=30.9838_len=3478_5
    all-k141_3960179_flag=1_multi=30.9838_len=3478_6
    

    and have the gene-to-genome file like this:

    genome_id,gene_id,keywords
    all-k141_3960179,all-k141_3960179_flag=1_multi=30.9838_len=3478_1,none
    all-k141_3960179,all-k141_3960179_flag=1_multi=30.9838_len=3478_2,none
    all-k141_3960179,all-k141_3960179_flag=1_multi=30.9838_len=3478_3,none
    all-k141_3960179,all-k141_3960179_flag=1_multi=30.9838_len=3478_4,none
    all-k141_3960179,all-k141_3960179_flag=1_multi=30.9838_len=3478_5,none
    all-k141_3960179,all-k141_3960179_flag=1_multi=30.9838_len=3478_6,none
    

    (Note the comma (“,”) between the genome_id, gene_id and keywords)
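
    The renaming described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the genome ID is taken to be everything before the first space of the original header, the column names and their order are copied from the example above (check them against the documentation for your vConTACT2 version), and every function name is hypothetical:

```python
import csv
import io


def sanitize_header(header):
    """Replace spaces in a FASTA header (without the leading '>') with underscores."""
    return header.strip().replace(" ", "_")


def genome_id_from_header(header):
    """Assumption: the genome ID is everything before the first space of the raw header."""
    return header.strip().split(" ", 1)[0]


def gene_to_genome_rows(raw_headers):
    """Build (genome_id, gene_id, keywords) rows from the raw FASTA headers."""
    return [
        (genome_id_from_header(raw), sanitize_header(raw), "none")
        for raw in raw_headers
    ]


def write_gene_to_genome(rows, handle):
    """Write the rows as comma-separated text with a header line."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["genome_id", "gene_id", "keywords"])
    writer.writerows(rows)


def missing_gene_ids(faa_gene_ids, rows):
    """Gene IDs present in the FAA file but absent from the gene-to-genome rows."""
    listed = {gene_id for _, gene_id, _ in rows}
    return sorted(set(faa_gene_ids) - listed)


raw_headers = [
    "all-k141_3960179 flag=1 multi=30.9838 len=3478_1",
    "all-k141_3960179 flag=1 multi=30.9838 len=3478_2",
]
rows = gene_to_genome_rows(raw_headers)

buffer = io.StringIO()
write_gene_to_genome(rows, buffer)
print(buffer.getvalue())

# An empty list means every sanitized FAA header has a matching gene_id row.
print(missing_gene_ids([sanitize_header(h) for h in raw_headers], rows))
```

    The same `sanitize_header` call has to be applied to the headers in the FAA file itself, so that the IDs in both files stay identical; `missing_gene_ids` is a quick way to confirm that before starting a long run.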

    Also, 250K genomes is quite a lot. Usually, the error associated with too many genomes is a subprocess error, so I don’t think it’s due to that.

    Cheers,

    Ben
