cannot serialize a bytes object larger than 4 GiB

Issue #10 closed
Biao He created an issue

Hello,

I tried to cluster my viral sequences with the latest version of vConTACT2. During the similarity network calculation, vcontact consumed a very large amount of memory and ended with an OverflowError: cannot serialize a bytes object larger than 4 GiB. My dataset does contain a very large number of sequences, almost 1 million.

Below is the detailed error.

------------------------Calculating Similarity Networks-------------------------
Traceback (most recent call last):
  File "/ifs1/User/hebiao/miniconda3/bin/vcontact", line 4, in <module>
    __import__('pkg_resources').run_script('vcontact2==0.9.13', 'vcontact')
  File "/ifs1/User/hebiao/miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/ifs1/User/hebiao/miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1469, in run_script
    exec(script_code, namespace, namespace)
  File "/ifs1/User/hebiao/miniconda3/lib/python3.7/site-packages/vcontact2-0.9.13-py3.7.egg/EGG-INFO/scripts/vcontact", line 750, in <module>
  File "/ifs1/User/hebiao/miniconda3/lib/python3.7/site-packages/vcontact2-0.9.13-py3.7.egg/EGG-INFO/scripts/vcontact", line 585, in main
  File "/ifs1/User/hebiao/miniconda3/lib/python3.7/site-packages/vcontact2-0.9.13-py3.7.egg/vcontact/pcprofiles.py", line 71, in __init__
  File "/ifs1/User/hebiao/miniconda3/lib/python3.7/site-packages/vcontact2-0.9.13-py3.7.egg/vcontact/pcprofiles.py", line 150, in network
  File "/ifs1/User/hebiao/miniconda3/lib/python3.7/site-packages/vcontact2-0.9.13-py3.7.egg/vcontact/pcprofiles.py", line 150, in <listcomp>
  File "/ifs1/User/hebiao/miniconda3/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
  File "/ifs1/User/hebiao/miniconda3/lib/python3.7/multiprocessing/pool.py", line 431, in _handle_tasks
    put(task)
  File "/ifs1/User/hebiao/miniconda3/lib/python3.7/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/ifs1/User/hebiao/miniconda3/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB

How can I solve this problem? I would be deeply grateful for any information or help.

Kind regards,

Biao

Comments (3)

  1. Ben Bolduc

    Thank you for the bug report. Unfortunately, it looks like this is an open issue, with no good workaround. I’ve only seen 350K genomes run on vConTACT2 myself. I have heard of larger runs, but not 1 million!

    I know it doesn’t fix the issue, but have you tried clustering or de-duplicating your viral genomes before running vConTACT? dRep, ClusterGenomes, dedup, or CD-HIT might work. 95-97% identity should be sufficient, or whatever level you’re comfortable with for merging genomes. That should hopefully reduce the number of sequences.
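
As a much cruder first pass than the tools named above, exact duplicates can be dropped with a few lines of Python before any identity-based clustering. This is only a minimal sketch: the function names `parse_fasta` and `drop_exact_duplicates` are hypothetical, it removes only sequences that are identical after upper-casing, and it does not cluster at 95-97% identity or detect reverse complements.

```python
import hashlib
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_exact_duplicates(records: Iterable[Record]) -> Iterator[Record]:
    """Keep the first record for each distinct (case-insensitive) sequence."""
    seen = set()
    for header, seq in records:
        # Store a fixed-size digest rather than the full sequence to save memory.
        digest = hashlib.sha256(seq.upper().encode("ascii")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield header, seq


if __name__ == "__main__":
    example = [">a", "ACGT", "ACGT", ">b", "acgtACGT", ">c", "TTTT"]
    for header, seq in drop_exact_duplicates(parse_fasta(example)):
        print(f">{header}\n{seq}")
```

With a real file, the same generators can be fed from `open("genomes.fasta")` (a placeholder path), since both functions stream records one at a time.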

  2. Ben Bolduc

    Won't fix for current release due to lack of an "easy" workaround.

    Will work towards finding solutions that don't compromise result quality for a future release.

    Thanks again for putting this on the radar!
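
For context on the error message itself: the 4 GiB limit belongs to pickle protocol 3 and earlier. Protocol 4, added in Python 3.4, supports larger objects, and it became pickle's default in Python 3.8. The Python 3.7 `multiprocessing` code in the traceback pickles tasks with the default protocol, which is 3 on that version. The sketch below only reports which default the running interpreter uses; the helper names are hypothetical, and whether vConTACT2 0.9.13 works on a newer Python, or whether memory would then suffice for a dataset of this size, is not something this check can establish.

```python
import pickle
import sys

# Protocol 4 is the first pickle protocol that supports very large objects.
LARGE_OBJECT_PROTOCOL = 4


def default_protocol_handles_large_objects() -> bool:
    """Return True if pickle's default protocol can serialize objects over 4 GiB."""
    return pickle.DEFAULT_PROTOCOL >= LARGE_OBJECT_PROTOCOL


def describe_pickle_support() -> str:
    """Summarize the interpreter version and its pickle protocol settings."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    verdict = "can" if default_protocol_handles_large_objects() else "cannot"
    return (
        f"Python {version}: default pickle protocol {pickle.DEFAULT_PROTOCOL}, "
        f"highest {pickle.HIGHEST_PROTOCOL}; the default {verdict} "
        f"serialize objects larger than 4 GiB"
    )


if __name__ == "__main__":
    print(describe_pickle_support())
```

Even where the default protocol lifts the limit, a single task larger than 4 GiB still has to be copied to each worker process, so the memory pressure described in the report would remain.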
