Protein clustering : OverflowError

Issue #80 new
Former user created an issue

Hi all, first of all, thanks for the great tool.

I am trying to run the program on a set of 60k prophages, but I get an error message at the "Protein clustering" step:

============================This is vConTACT2 0.9.19============================



----------------------------------Pre-Analysis----------------------------------


------------------------------Reference databases-------------------------------


-------------------------------Protein clustering-------------------------------
Traceback (most recent call last):
  File "/home/conchae/.conda/envs/vContact2/bin/vcontact2", line 757, in <module>
    main(options)
  File "/home/conchae/.conda/envs/vContact2/bin/vcontact2", line 470, in main
    pcs_fp, gene2genome_df, pcs_mode)
  File "/home/conchae/.conda/envs/vContact2/lib/python3.7/site-packages/vcontact2/protein_clusters.py", line 187, in build_clusters
    clusters_df, name, c = load_mcl_clusters(fp)
  File "/home/conchae/.conda/envs/vContact2/lib/python3.7/site-packages/vcontact2/protein_clusters.py", line 249, in load_mcl_clusters
    formatter = "PC_{{:>0{}}}".format(int(round(np.log10(nb_clusters))+1))
OverflowError: cannot convert float infinity to integer
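
The last line of the traceback is the message Python raises when an infinite float is passed to `int(round(...))`. `np.log10(0)` evaluates to `-inf` (NumPy only emits a runtime warning), so the traceback is consistent with `nb_clusters` being 0, meaning no clusters were read from the MCL file. A minimal sketch that reproduces the same exception without vConTACT2, using `float("-inf")` as a stand-in for the `np.log10(0)` result:

```python
# Stand-in for np.log10(0), which evaluates to -inf (NumPy emits a
# divide-by-zero RuntimeWarning rather than raising).
log_of_zero_clusters = float("-inf")

try:
    # Same expression shape as line 249 of protein_clusters.py.
    width = int(round(log_of_zero_clusters)) + 1
except OverflowError as err:
    message = str(err)
else:
    message = "no error"

print(message)
```

If this reading is right, the package versions are not the trigger; the thing to look at is why the clusters list ends up empty.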

I know that package versions can sometimes be an issue, but I can't spot anything wrong with the current ones:

conda list
# packages in environment at /home/user/.conda/envs/vContact2:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
biopython                 1.78             py37h5e8e339_2    conda-forge
blas                      1.1                    openblas    conda-forge
blast                     2.5.0                hc0b0e79_3    bioconda
blosc                     1.21.0               h9c3ff4c_0    conda-forge
boost                     1.70.0           py37h9de70de_1    conda-forge
boost-cpp                 1.70.0               h7b93d67_3    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.17.1               h7f98852_1    conda-forge
ca-certificates           2022.5.18.1          ha878542_0    conda-forge
certifi                   2022.5.18.1      py37h89c1867_0    conda-forge
decorator                 4.4.2                      py_0    conda-forge
diamond                   2.0.8                h56fc30b_0    bioconda
hdf5                      1.10.6          nompi_h6a2412b_1114    conda-forge
icu                       67.1                 he1b5a44_0    conda-forge
joblib                    1.0.1              pyhd8ed1ab_0    conda-forge
krb5                      1.17.2               h926e7f8_0    conda-forge
ld_impl_linux-64          2.35.1               hea4e1c9_2    conda-forge
libcurl                   7.75.0               hc4aaa36_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-ng                 9.3.0               h2828fa1_18    conda-forge
libgfortran               3.0.0                         1    conda-forge
libgfortran-ng            9.3.0               hff62375_18    conda-forge
libgfortran5              9.3.0               hff62375_18    conda-forge
libgomp                   9.3.0               h2828fa1_18    conda-forge
libnghttp2                1.43.0               h812cca2_0    conda-forge
libssh2                   1.9.0                ha56f1ee_6    conda-forge
libstdcxx-ng              9.3.0               h6de172a_18    conda-forge
lz4-c                     1.9.3                h9c3ff4c_0    conda-forge
lzo                       2.10              h516909a_1000    conda-forge
mcl                       14.137          pl526h516909a_5    bioconda
mock                      4.0.3            py37h89c1867_1    conda-forge
ncurses                   6.2                  h58526e2_4    conda-forge
networkx                  2.5                        py_0    conda-forge
numexpr                   2.7.1            py37h0da4684_1    conda-forge
numpy                     1.19.0                   pypi_0    pypi
openblas                  0.3.3                ha44fe06_1    conda-forge
openssl                   1.1.1k               h7f98852_0    conda-forge
pandas                    0.25.0           py37hb3f55d8_0    conda-forge
perl                      5.26.2            h36c2ea0_1008    conda-forge
pip                       21.0.1             pyhd8ed1ab_0    conda-forge
psutil                    5.8.0            py37h5e8e339_1    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pytables                  3.6.1            py37h56451d4_2    conda-forge
python                    3.7.9                h7579374_0  
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.7                     1_cp37m    conda-forge
pytz                      2021.1             pyhd8ed1ab_0    conda-forge
readline                  8.0                  he28a2e2_2    conda-forge
scikit-learn              0.20.4          py37_blas_openblashebff5e3_0  [blas_openblas]  conda-forge
scipy                     1.2.0           py37_blas_openblashb06ca3d_200  [blas_openblas]  conda-forge
setuptools                49.6.0           py37h89c1867_3    conda-forge
singularity               2.4.2                         0    bioconda
six                       1.15.0             pyh9f0ad1d_0    conda-forge
sqlite                    3.35.2               h74cdb3f_0    conda-forge
threadpoolctl             2.1.0              pyh5ca1d4c_0    conda-forge
tk                        8.6.10               h21135ba_1    conda-forge
vcontact2                 0.9.19                     py_0    bioconda
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h516909a_1010    conda-forge
zstd                      1.4.9                ha95c52a_0    conda-forge

Thanks in advance, Best, Robby

Comments (5)

  1. Robby Concha-Eloko

    So this change in protein_clusters.py, line 249, did the job:

    # Original function :
    def load_mcl_clusters(fi):
    
        """
        Load given clusters file
    
        Args:
            fi (str): path to clusters file
            proteins_df (dataframe): A dataframe giving the protein and its contig.
        Returns: 
            tuple: dataframe proteins and dataframe clusters
        """
    
        # Read MCL
        with open(fi) as f:
            c = [line.rstrip("\n").split("\t") for line in f]
        c = [x for x in c if len(c) > 1]
        nb_clusters = len(c)
        formatter = "PC_{{:>0{}}}".format(int(round(np.log10(nb_clusters))+1))
        name = [formatter.format(str(i)) for i in range(nb_clusters)]
        size = [len(i) for i in c]
        clusters_df = pd.DataFrame({"size": size, "pc_id": name}).set_index("pc_id")
    
        return clusters_df, name, c
    
    
    # Modified :
    def round_int(x):
        if x in [float("-inf"),float("inf")]: return float("nan")
        return int(round(x))
    
    
    def load_mcl_clusters(fi):
    
        """
        Load given clusters file
    
        Args:
            fi (str): path to clusters file
            proteins_df (dataframe): A dataframe giving the protein and its contig.
        Returns: 
            tuple: dataframe proteins and dataframe clusters
        """
    
        # Read MCL
        with open(fi) as f:
            c = [line.rstrip("\n").split("\t") for line in f]
        c = [x for x in c if len(c) > 1]
        nb_clusters = len(c)
        formatter = "PC_{{:>0{}}}".format(round_int(np.log10(nb_clusters))+1)
        name = [formatter.format(str(i)) for i in range(nb_clusters)]
        size = [len(i) for i in c]
        clusters_df = pd.DataFrame({"size": size, "pc_id": name}).set_index("pc_id")
    
        return clusters_df, name, c
    

    However, I then hit this error:

    ERROR:vcontact2: 'DataFrame' object has no attribute 'ix'

    Changing the pandas version to 0.25.3 led me to another error. I’ll try Jeffrey’s solution.

    Best

    Robby
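
The `round_int` patch in the comment above avoids the crash, but it leaves a NaN in the format width (`nan + 1` is still `nan`). That only goes unnoticed because no names are generated when there are zero clusters. A possible alternative, sketched here under the assumption that an empty cluster list should be reported rather than silently accepted, keeps the original width formula and checks the count first. The helper name `pc_names` is hypothetical and not part of vConTACT2:

```python
import math


def pc_names(nb_clusters):
    """Return zero-padded protein-cluster IDs such as PC_00, PC_01, ...

    Hypothetical helper, not part of vConTACT2. The width follows the
    formula used in load_mcl_clusters: int(round(log10(n))) + 1.
    """
    if nb_clusters < 1:
        # log10(0) is -inf, which is what triggered the OverflowError.
        raise ValueError(
            "No protein clusters were loaded; check that the MCL clusters "
            "file is not empty."
        )
    width = int(round(math.log10(nb_clusters))) + 1
    return ["PC_{:0{}d}".format(i, width) for i in range(nb_clusters)]


print(pc_names(3))
```

Failing early with a clear message is a design choice, not the maintainers' fix: the later `KeyError: 'cluster'` report in this thread suggests that continuing with zero clusters may simply move the failure further down the pipeline.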

  2. jiaojiao guan

    Hi there, I have the same error as you:

    OverflowError: cannot convert float infinity to integer

    So I changed protein_clusters.py, line 249, the same way you did. However, I then got another error: KeyError: 'cluster'

    INFO:vcontact2: Found Diamond: /home/xubotang2/miniconda3/envs/vContact2/bin/diamond
    INFO:vcontact2: Found MCL: /home/xubotang2/miniconda3/envs/vContact2/bin/mcxload
    INFO:vcontact2: Identified 4 CPUs
    INFO:vcontact2: Using reference database: ProkaryoticViralRefSeq94-Merged
    INFO:vcontact2: Using existing directory ./output.
    INFO:vcontact2: Identified existing 'merged.faa' in output path: re-using...
    INFO:vcontact2: Re-using existing Diamond file...
    INFO:vcontact2: Loading proteins...
    INFO:vcontact2: Merging ProkaryoticViralRefSeq94-Merged to user gene-to-genome mapping...
    DEBUG:vcontact2: Read 333767 proteins from out_map.csv.
    DEBUG:vcontact2: File merged.self-diamond.tab_mcl20.clusters exists and will be used. Use -f to overwrite.
    INFO:vcontact2: Building the cluster and profiles (this may take some time...)
    If it fails, try re-running using --blast-fp flag and specifying merged.self-diamond.tab (or merged.self-blastp.tab)
    Traceback (most recent call last):
      File "/home/xubotang2/miniconda3/envs/vContact2/bin/vcontact2", line 834, in <module>
        main(options)
      File "/home/xubotang2/miniconda3/envs/vContact2/bin/vcontact2", line 526, in main
        protein_df, clusters_df, profiles_df, contigs_df = vcontact2.protein_clusters.build_clusters(
      File "/home/xubotang2/miniconda3/envs/vContact2/lib/python3.8/site-packages/vcontact2/protein_clusters.py", line 209, in build_clusters
        for clust, prots in gene2genome.groupby("cluster"):
      File "/home/xubotang2/miniconda3/envs/vContact2/lib/python3.8/site-packages/pandas/core/frame.py", line 6717, in groupby
        return DataFrameGroupBy(
      File "/home/xubotang2/miniconda3/envs/vContact2/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 560, in __init__
        grouper, exclusions, obj = get_grouper(
      File "/home/xubotang2/miniconda3/envs/vContact2/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 811, in get_grouper
        raise KeyError(gpr)
    KeyError: 'cluster'

    Did you have the same issues?

    Thank you !

  3. Wilson Elias Castillo

    Hi, I have the same issue as Jiaojiao. Has anyone managed to solve it?

    Thank you!

    Wilson

  4. Robby Concha-Eloko

    Hi all,

    This is the recipe that worked for me in the end:

    conda create -n vcontact2 python=3.7
    conda activate vcontact2
    
    conda install -c anaconda networkx=2.2
    conda install -c anaconda numpy=1.15.4
    conda install -c anaconda scipy=1.2.0
    conda install -c anaconda pandas=1.0.5
    conda install -c anaconda scikit-learn=0.20.2
    conda install -c anaconda biopython=1.73
    conda install -c anaconda hdf5=1.10.4
    conda install -c anaconda pytables=3.6.1
    conda install -c anaconda pyparsing=2.4.6
    
    conda install -c bioconda diamond=0.9.14
    conda install -c bioconda mcl=14.137
    conda install -c bioconda blast=2.7
    conda install -c bioconda clusterone
    
    conda install -y -c bioconda vcontact2
    

    Then make the following changes in the respective .py files:

    AttributeError: 'DataFrame' object has no attribute 'ix' 
    
    ==> Change ".ix" to ".loc" and it should work correctly.
    /home/user/.conda/envs/vcontact2/lib/python3.7/site-packages/vcontact/matrices.py line 70
    /home/user/.conda/envs/vcontact2/lib/python3.7/site-packages/vcontact/modules.py line 252
    

    Best
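
`.ix` was removed in pandas 1.0, which is why an environment with pandas 1.0.5 needs the two edits listed in this comment. A rough sketch of applying the same substitution with a script instead of by hand follows. The function name `replace_ix` is hypothetical, and because `.ix` accepted both labels and integer positions while `.loc` is label-based, each replaced line is still worth reviewing:

```python
import re

# Matches the removed indexer only when it is used as `.ix[`.
IX_PATTERN = re.compile(r"\.ix\[")


def replace_ix(source):
    """Return (patched_source, n_replacements) with `.ix[` turned into `.loc[`."""
    return IX_PATTERN.subn(".loc[", source)


# Made-up source text for illustration; the second line must stay untouched.
example = "row = df.ix[name]\nprefix = text.ixia\n"
patched, count = replace_ix(example)
print(patched, count)
```

To use it on the installed package, one would read the text of `matrices.py` and `modules.py`, pass it through `replace_ix`, and write the result back after keeping a backup copy of each file.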

  5. PP Q

    To whom it may concern,

    I ran into the same problem. It was fixed when I changed the file “gene_to_genome.csv” from two columns to three.

    For example, from “F1608-028contig-4748.fna_1,F1608-028contig-4748.fna” to “F1608-028contig-4748.fna_1,F1608-028contig-4748.fna,None”
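
The fix described in this comment amounts to padding every two-field row with a third value. A minimal sketch with the standard `csv` module, assuming the file is plain comma-separated text. The function name `pad_to_three_columns` and the `filler` parameter are illustrative, and a header row, if the file has one, would be padded the same way and may need a proper column name instead:

```python
import csv
import io


def pad_to_three_columns(text, filler="None"):
    """Append `filler` to every CSV row that has exactly two fields."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        if len(row) == 2:
            row.append(filler)
        writer.writerow(row)
    return out.getvalue()


# Row taken from the example in the comment above.
two_cols = "F1608-028contig-4748.fna_1,F1608-028contig-4748.fna\n"
print(pad_to_three_columns(two_cols))
```

Rows that already have three fields pass through unchanged, so running it twice on the same file should be harmless.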
