Program still running but throwing a memory error after or during WIsH run

Issue #68 closed
Haley Hallowell created an issue

Hi Simon!

I am currently running a ~90MB fasta through iPHoP and have not gotten the program to complete with this dataset: while it is calculating distances in Step 2 (I think), it throws a memory error (see below). I am running the program with 180GB of memory and 2 nodes (96 cores total) and still getting this error. I’ve checked my fasta file and nothing seems weird about the headers or spacing, for example:

>M16W0_k127_154605_1
ACATATGGCGACGTCATCCCGGAGAACCATGAGGGCAGCGGGATGACGTTTGACGTCGAT
GCGGAAATCTTCGCTGGCAGGACACTGGTGGTGTACGAGCGGATGTACCTCGAAAATGGC
TACGGCGCAGGAAGCATCTTGTGGCGGAGCATCAGGTCCTTCTGGACGAGGACCAGACCA 

This was my submission command:

sbatch --partition defq -D ./ --mem=180G --time 72:0:0 --nodes=2 --wrap 'iphop predict --fa_file vOTUs_numbered.fna --db_dir /home/hhallow1/scratch4-jsuez1/shared_databases/iphop_db/Sept_2021_pub_rw --out_dir ./iphop_number2 -t 96' -o iphop.log

And here is the full log with the error:

Looks like everything is now set up, we will first clean up the input file, and then we will start the host prediction steps themselves
[1/1/Run] Running blastn against genomes...
[1/3/Run] Get relevant blast matches...
[2/1/Run] Running blastn against CRISPR...
[2/2/Run] Get relevant crispr matches...
[3/1/Run] Running (recoded)WIsH...
### Welcome to iPHoP ###
Process ForkPoolWorker-46:
Traceback (most recent call last):
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/queues.py", line 358, in get
    return _ForkingPickler.loads(res)
MemoryError
Process ForkPoolWorker-47:
Traceback (most recent call last):
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/queues.py", line 358, in get
    return _ForkingPickler.loads(res)
MemoryError
Process ForkPoolWorker-48:
Traceback (most recent call last):
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/queues.py", line 356, in get
    res = self._reader.recv_bytes()
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/connection.py", line 421, in _recv_bytes
    return self._recv(size)
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/connection.py", line 386, in _recv
    buf.write(chunk)
MemoryError

The program is still running, which I thought was strange. Any insight into what is going on? This is probably my 3rd or 4th try, continually increasing memory and cores as I go. Thanks!!

Comments (14)

  1. Simon Roux repo owner

    Yikes, this is not great, and unfortunately it seems to be an issue in one of the underlying libraries, so not the easiest to fix. Could you try running with “--single_thread_wish” (just add this option to “iphop predict …”)? That should bypass the “multiprocessing” step; it will be much slower, but hopefully it would complete.
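
    For reference, the resubmission could look something like this (just a sketch: a single node, the new option appended to the “iphop predict” call, “-t” and the output directory/log names below are placeholders so the previous attempt is not overwritten):

    sbatch --partition defq -D ./ --mem=180G --time 72:0:0 --nodes=1 --wrap 'iphop predict --fa_file vOTUs_numbered.fna --db_dir /home/hhallow1/scratch4-jsuez1/shared_databases/iphop_db/Sept_2021_pub_rw --out_dir ./iphop_singlethread --single_thread_wish -t 48' -o iphop_singlethread.log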

    Best,

    Simon

  2. Haley Hallowell reporter

    Thanks! Will try that. I also tried splitting the fasta into 2, hoping that subsetting gives pickle enough memory to create the dictionary. I will try --single_thread_wish if this doesn’t work!

  3. Simon Roux repo owner

    I also wonder whether running on multiple nodes is even possible with this library, so if you have not already tried it, you may want to check what happens when you run on only a single node?

  4. Haley Hallowell reporter

    I tried a single node with the original large file, but it was stuck on calculating distances for ~3 days and ran out of the max time I can allot on the node (72 hours). I’m a little surprised, as I only have about ~25k sequences in the file.

  5. Haley Hallowell reporter

    Trying a single node with the split files might be a good option; I’ll submit that as well and report back!

  6. Simon Roux repo owner

    Right, so looking back at the memory errors you see, I’m more and more convinced they come from the job being split over multiple nodes. ~25k sequences can take a while :-) I typically process batches of ~2 to 3k to make sure the job runs in a reasonable time. So I would try running smaller batches on individual nodes (without “--single_thread_wish”) and see if that fixes everything.
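
    As a rough sketch (the file names, output directories, and “-t 48” below are placeholders, not tested commands), each smaller batch could then be submitted as its own single-node job:

    # one single-node iPHoP job per split fasta (names, partition, and memory are examples only)
    for f in vOTUs_split_*.fna; do
        sbatch --partition defq -D ./ --mem=180G --time 72:0:0 --nodes=1 \
            --wrap "iphop predict --fa_file $f --db_dir /home/hhallow1/scratch4-jsuez1/shared_databases/iphop_db/Sept_2021_pub_rw --out_dir ./iphop_${f%.fna} -t 48" \
            -o "iphop_${f%.fna}.log"
    done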

  7. Haley Hallowell reporter

    Hey Simon! I had a chance to split my fasta file into 2,000-sequence chunks using this script:

    from Bio import SeqIO
    
    # Define the input .fna file and the number of sequences per split file
    input_file = 'vOTUs_numbered.fna'
    sequences_per_file = 2000  # You can adjust this as needed
    
    # Initialize variables
    sequence_count = 0
    file_counter = 1
    output_file = None  # Initialize the output_file variable
    with open(input_file, "r") as f:
        records = SeqIO.parse(f, "fasta")
    
        for record in records:
            sequence_count += 1
    
            if sequence_count == 1 or sequence_count > sequences_per_file:
                # Close the previous split file and open a new one
                if output_file:
                    output_file.close()
                output_file = open(f'vOTUs_numbered_split_{file_counter}.fna', 'w')
                file_counter += 1
                sequence_count = 1
    
            # Write the current sequence to the output file
            SeqIO.write(record, output_file, "fasta")
    
    # Close the last output file
    if output_file:
        output_file.close()
    

    And then I ran a test sample to make sure things were running smoothly. The error message I received was quite different, but also a lengthy one (sorry!); I’ve attached it below. I have a feeling this might be a result of me splitting the files? Something funky with the headers? I double-checked the line count to make sure it was even, and head/tailed a few files to make sure they were not being cut off in the middle of a sequence. I also double-checked the WIsH output, and there is a column titled ‘Normalized’. Any thoughts on what might be causing this? Let me know if any additional files or outputs might be helpful here.


    Looks like everything is now set up, we will first clean up the input file, and then we will start the host prediction steps themselves
    [1/1/Run] Running blastn against genomes...
    [1/3/Run] Get relevant blast matches...
    [2/1/Run] Running blastn against CRISPR...
    [2/2/Run] Get relevant crispr matches...
    [3/1/Run] Running (recoded)WIsH...
    ### Welcome to iPHoP ###
    multiprocessing.pool.RemoteTraceback: 
    """
    Traceback (most recent call last):
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
        return self._engine.get_loc(casted_key)
      File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: 'normalized'
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/frame.py", line 3751, in _set_item_mgr
        loc = self._info_axis.get_loc(key)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
        raise KeyError(key) from err
    KeyError: 'normalized'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/pool.py", line 125, in worker
        result = (True, func(*args, **kwds))
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 214, in process_batch
        rewish_results = add_pvalues(rewish_results,ref_file)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 227, in add_pvalues
        rewish_results["normalized"] = rewish_results.apply(lambda x: transform(x['LL'],x['Host'],ref_mat), axis=1)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/frame.py", line 3602, in __setitem__
        self._set_item_frame_value(key, value)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/frame.py", line 3742, in _set_item_frame_value
        self._set_item_mgr(key, arraylike)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/frame.py", line 3754, in _set_item_mgr
        self._mgr.insert(len(self._info_axis), key, value)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1162, in insert
        block = new_block(values=value, ndim=self.ndim, placement=slice(loc, loc + 1))
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 1937, in new_block
        check_ndim(values, placement, ndim)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 1979, in check_ndim
        raise ValueError(
    ValueError: Wrong number of items passed 3, placement implies 1
    """
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/home/hhallow1/.conda/envs/iphop_env/bin/iphop", line 10, in <module>
        sys.exit(cli())
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/iphop.py", line 128, in cli
        args["func"](args)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/master_predict.py", line 87, in main
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 44, in run_and_parse_wish
        run_rewish(args["fasta_file"],args["wishrawresult"],args["rewish_db_dir"],args["wish_negfit"],args["tmp"],threads_tmp,n_host_by_phage)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 159, in run_rewish
        async_parallel(process_batch, args_list, threads)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 251, in async_parallel
        return [r.get() for r in results]
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 251, in <listcomp>
        return [r.get() for r in results]
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/pool.py", line 771, in get
        raise self._value
    ValueError: Wrong number of items passed 3, placement implies 1
    

  8. Simon Roux repo owner

    Hi !

    Sorry, it’s a known bug we recently fixed but have not released in conda yet. It’s an easy fix though: the problem only happens when you have batches of exactly 1,000 sequences or exact multiples of 1,000 (like 2,000 :-) ). So the fix is to use another batch size (e.g. 1,500 or 2,500), and the error should disappear.
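
    To double-check the regenerated batches, one option (just a sketch; the file name pattern is an example, adjust it to however the new batches are named) is to count the fasta headers per file and flag any exact multiple of 1,000:

    # count sequences per batch file; an exact multiple of 1,000 would still trigger the bug
    for f in vOTUs_numbered_split_*.fna; do
        n=$(grep -c '^>' "$f")
        if [ $((n % 1000)) -eq 0 ]; then
            echo "$f: $n sequences - pick another batch size"
        else
            echo "$f: $n sequences - fine"
        fi
    done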

  9. Haley Hallowell reporter

    Ha ha!! Okay, sounds great. I will go ahead and regenerate the files and rerun. Thank you!

  10. Haley Hallowell reporter

    Just wanted to update that this fixed the issue, thank you for all the help! The ticket can be closed.
