Program still running but throwing a memory error after or during WIsH run

Issue #68 closed
Haley Hallowell created an issue

Hi Simon!

I am currently running a ~90MB fasta through iPHoP and have not gotten the program to complete with this dataset: while it is calculating distances in Step 2 (I think), it throws a memory error (see below). I am running the program with 180GB of memory and 2 nodes (96 cores total) and still getting this error. I’ve checked my fasta file and nothing seems weird about the headers or spacing, for example:

>M16W0_k127_154605_1
ACATATGGCGACGTCATCCCGGAGAACCATGAGGGCAGCGGGATGACGTTTGACGTCGAT
GCGGAAATCTTCGCTGGCAGGACACTGGTGGTGTACGAGCGGATGTACCTCGAAAATGGC
TACGGCGCAGGAAGCATCTTGTGGCGGAGCATCAGGTCCTTCTGGACGAGGACCAGACCA 

This was my submission command:

sbatch --partition defq -D ./ --mem=180G --time 72:0:0 --nodes=2 --wrap 'iphop predict --fa_file vOTUs_numbered.fna --db_dir /home/hhallow1/scratch4-jsuez1/shared_databases/iphop_db/Sept_2021_pub_rw --out_dir ./iphop_number2 -t 96' -o iphop.log

And here is the full log with the error:

Looks like everything is now set up, we will first clean up the input file, and then we will start the host prediction steps themselves
[1/1/Run] Running blastn against genomes...
[1/3/Run] Get relevant blast matches...
[2/1/Run] Running blastn against CRISPR...
[2/2/Run] Get relevant crispr matches...
[3/1/Run] Running (recoded)WIsH...
### Welcome to iPHoP ###
Process ForkPoolWorker-46:
Traceback (most recent call last):
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/queues.py", line 358, in get
    return _ForkingPickler.loads(res)
MemoryError
Process ForkPoolWorker-47:
Traceback (most recent call last):
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/queues.py", line 358, in get
    return _ForkingPickler.loads(res)
MemoryError
Process ForkPoolWorker-48:
Traceback (most recent call last):
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/queues.py", line 356, in get
    res = self._reader.recv_bytes()
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/connection.py", line 421, in _recv_bytes
    return self._recv(size)
  File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/connection.py", line 386, in _recv
    buf.write(chunk)
MemoryError

The program is still running, which I thought was strange. Any insight into what is going on? This is probably my 3rd or 4th try, continually increasing memory and cores as I go. Thanks!!

Comments (14)

  1. Simon Roux repo owner

    Yikes, this is not great, and unfortunately it seems to be an issue in one of the underlying libraries, so not the easiest to fix. Could you try running with “--single_thread_wish” (just add this option to “iphop predict …”)? That should bypass the “multiprocessing” step; it will be much slower, but hopefully it would complete.
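
    For reference, the resubmission could look something like this (just a sketch: a single node, the new option appended to the “iphop predict” call, “-t” and the output directory/log names below are placeholders so the previous attempt is not overwritten):

    sbatch --partition defq -D ./ --mem=180G --time 72:0:0 --nodes=1 --wrap 'iphop predict --fa_file vOTUs_numbered.fna --db_dir /home/hhallow1/scratch4-jsuez1/shared_databases/iphop_db/Sept_2021_pub_rw --out_dir ./iphop_singlethread --single_thread_wish -t 48' -o iphop_singlethread.log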

    Best,

    Simon

  2. Haley Hallowell reporter

    Thanks! Will try that. I also tried splitting the fasta into 2, hoping that subsetting gives pickle enough memory to create the dictionary. I will try --single_thread_wish if this doesn’t work!

  3. Simon Roux repo owner

    I also wonder whether running on multiple nodes is even possible with this library, so if you have not already tried it, you may want to check what happens when you run on only a single node?

  4. Haley Hallowell reporter

    I tried a single node with the original large file, but it was stuck on calculating distances for ~3 days and ran out of the max time I can allot on the node (72 hours). I’m a little surprised, as I only have about ~25k sequences in the file.

  5. Haley Hallowell reporter

    Trying a single node with the split files might be a good option; I’ll submit that as well and report back!

  6. Simon Roux repo owner

    Right, so looking back at the memory errors you see, I’m more and more convinced they come from the job being split over multiple nodes. ~25k sequences can take a while :-) I typically process batches of ~2 to 3k to make sure the job runs in a reasonable time. So I would try running smaller batches on individual nodes (without “--single_thread_wish”) and see if that fixes everything.
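
    As a rough sketch (the file names, output directories, and “-t 48” below are placeholders, not tested commands), each smaller batch could then be submitted as its own single-node job:

    # one single-node iPHoP job per split fasta (names, partition, and memory are examples only)
    for f in vOTUs_split_*.fna; do
        sbatch --partition defq -D ./ --mem=180G --time 72:0:0 --nodes=1 \
            --wrap "iphop predict --fa_file $f --db_dir /home/hhallow1/scratch4-jsuez1/shared_databases/iphop_db/Sept_2021_pub_rw --out_dir ./iphop_${f%.fna} -t 48" \
            -o "iphop_${f%.fna}.log"
    done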

  7. Haley Hallowell reporter

    Hey Simon! I had a chance to split my fasta file into 2,000-sequence chunks using this script:

    from Bio import SeqIO
    
    # Define the input .fna file and the number of sequences per split file
    input_file = 'vOTUs_numbered.fna'
    sequences_per_file = 2000  # You can adjust this as needed
    
    # Initialize variables
    sequence_count = 0
    file_counter = 1
    output_file = None  # Initialize the output_file variable
    with open(input_file, "r") as f:
        records = SeqIO.parse(f, "fasta")
    
        for record in records:
            sequence_count += 1
    
            if sequence_count == 1 or sequence_count > sequences_per_file:
                # Close the previous split file and open a new one
                if output_file:
                    output_file.close()
                output_file = open(f'vOTUs_numbered_split_{file_counter}.fna', 'w')
                file_counter += 1
                sequence_count = 1
    
            # Write the current sequence to the output file
            SeqIO.write(record, output_file, "fasta")
    
    # Close the last output file
    if output_file:
        output_file.close()
    

    And then I ran a test sample to make sure things were running smoothly. The error message I received was quite different, but also a lengthy one (sorry!); I’ve attached it below. I have a feeling this might be a result of me splitting the files? Something funky with the headers? I double-checked the line count to make sure it was even, and head/tailed a few files to make sure they were not being cut off in the middle of a sequence. I also double-checked the WIsH output, and there is a column titled ‘Normalized’. Any thoughts on what might be causing this? Let me know if any additional files or outputs might be helpful here.


    Looks like everything is now set up, we will first clean up the input file, and then we will start the host prediction steps themselves
    [1/1/Run] Running blastn against genomes...
    [1/3/Run] Get relevant blast matches...
    [2/1/Run] Running blastn against CRISPR...
    [2/2/Run] Get relevant crispr matches...
    [3/1/Run] Running (recoded)WIsH...
    ### Welcome to iPHoP ###
    multiprocessing.pool.RemoteTraceback: 
    """
    Traceback (most recent call last):
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
        return self._engine.get_loc(casted_key)
      File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: 'normalized'
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/frame.py", line 3751, in _set_item_mgr
        loc = self._info_axis.get_loc(key)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
        raise KeyError(key) from err
    KeyError: 'normalized'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/pool.py", line 125, in worker
        result = (True, func(*args, **kwds))
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 214, in process_batch
        rewish_results = add_pvalues(rewish_results,ref_file)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 227, in add_pvalues
        rewish_results["normalized"] = rewish_results.apply(lambda x: transform(x['LL'],x['Host'],ref_mat), axis=1)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/frame.py", line 3602, in __setitem__
        self._set_item_frame_value(key, value)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/frame.py", line 3742, in _set_item_frame_value
        self._set_item_mgr(key, arraylike)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/frame.py", line 3754, in _set_item_mgr
        self._mgr.insert(len(self._info_axis), key, value)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1162, in insert
        block = new_block(values=value, ndim=self.ndim, placement=slice(loc, loc + 1))
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 1937, in new_block
        check_ndim(values, placement, ndim)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 1979, in check_ndim
        raise ValueError(
    ValueError: Wrong number of items passed 3, placement implies 1
    """
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/home/hhallow1/.conda/envs/iphop_env/bin/iphop", line 10, in <module>
        sys.exit(cli())
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/iphop.py", line 128, in cli
        args["func"](args)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/master_predict.py", line 87, in main
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 44, in run_and_parse_wish
        run_rewish(args["fasta_file"],args["wishrawresult"],args["rewish_db_dir"],args["wish_negfit"],args["tmp"],threads_tmp,n_host_by_phage)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 159, in run_rewish
        async_parallel(process_batch, args_list, threads)
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 251, in async_parallel
        return [r.get() for r in results]
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/wish.py", line 251, in <listcomp>
        return [r.get() for r in results]
      File "/home/hhallow1/.conda/envs/iphop_env/lib/python3.8/multiprocessing/pool.py", line 771, in get
        raise self._value
    ValueError: Wrong number of items passed 3, placement implies 1
    

  8. Simon Roux repo owner

    Hi !

    Sorry, it’s a known bug we recently fixed but have not released in conda yet. It’s an easy fix though: the problem only happens when you have batches of exactly 1,000 sequences or exact multiples of 1,000 (like 2,000 :-) ). So the fix is to use another batch size (e.g. 1,500 or 2,500), and the error should disappear.
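
    To double-check the regenerated batches, one option (just a sketch; the file name pattern is an example, adjust it to however the new batches are named) is to count the fasta headers per file and flag any exact multiple of 1,000:

    # count sequences per batch file; an exact multiple of 1,000 would still trigger the bug
    for f in vOTUs_numbered_split_*.fna; do
        n=$(grep -c '^>' "$f")
        if [ $((n % 1000)) -eq 0 ]; then
            echo "$f: $n sequences - pick another batch size"
        else
            echo "$f: $n sequences - fine"
        fi
    done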

  9. Haley Hallowell reporter

    Ha ha!! Okay, sounds great. I will go ahead and regenerate the files and rerun. Thank you!

  10. Haley Hallowell reporter

    Just wanted to update that this fixed the issue, thank you for all the help! The ticket can be closed.
