AssemblePairs hanging (most likely due to blastn?)

Issue #65 new
Julian Zhou created an issue

Is there a way to catch this kind of behaviour? There's hardly anything worse than waking up in the morning, fully expecting a job to have finished running, only to realize that a core dump had occurred and that the process had been stuck on the AssemblePairs (AP) step forever. Somehow this never triggers a job failure and never gets caught by Slurm (it has happened to me multiple times so far).

SCAN_REVERSE> True
MIN_IDENT> 0.5
EVALUE> 1e-05
MAX_HITS> 100
FILL> False
ALIGNER> blastn
NPROC> 20
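
(For reference, those parameters correspond roughly to an AssemblePairs.py sequential call along these lines; the file names are placeholders rather than the exact command that was run:)

```
AssemblePairs.py sequential \
    -1 sample_R1_primers-pass.fastq \
    -2 sample_R2_primers-pass.fastq \
    -r reference_vdj.fasta \
    --aligner blastn --scanrev \
    --minident 0.5 --evalue 1e-5 --maxhits 100 \
    --nproc 20
```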

PROGRESS> 04:12:26 |                    |   0% (      0) 0.0 min
PROGRESS> 04:14:05 |#                   |   5% ( 37,145) 1.7 min
PROGRESS> 04:15:44 |##                  |  10% ( 74,290) 3.3 min
PROGRESS> 04:17:24 |###                 |  15% (111,435) 5.0 min
PROGRESS> 04:19:04 |####                |  20% (148,580) 6.6 min

Comments (11)

  1. Jason Vander Heiden

    Does it happen on not-farnam?

    You could try using usearch for the aligner instead. It isn't in the image, but you should be able to just put the binary in one of the mount point folders and run it from there.
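
    Something like this, assuming AssemblePairs is being called directly; the binary location is just an example of a bind-mounted path:

    ```
    # drop the usearch binary somewhere the container can see (any bind-mounted folder)
    cp usearch /home/qz93/project/usearch
    chmod +x /home/qz93/project/usearch

    # then point the assembly step at it instead of blastn
    AssemblePairs.py sequential ... \
        --aligner usearch --exec /home/qz93/project/usearch
    ```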

    Not sure that'll help though, because this has a farnam weirdness smell to it...

  2. Julian Zhou reporter

    The last time this happened, I was using usearch on Farnam. This time, this was with blastn on Ruddle.

  3. Julian Zhou reporter

    Also, if I'm seeing this, chances are it has hung again, right? (Not 100% sure because there's no core dump file yet, but it's been sitting like this for over an hour.)

    PROGRESS> 08:48:43 |                    |   0% (      0) 0.0 min
    PROGRESS> 08:48:58 |#                   |   5% (  5,602) 0.3 min
    PROGRESS> 08:49:15 |##                  |  10% ( 11,204) 0.5 min
    PROGRESS> 08:49:30 |###                 |  15% ( 16,806) 0.8 min
    PROGRESS> 08:49:45 |####                |  20% ( 22,408) 1.0 min
    PROGRESS> 08:50:01 |#####               |  25% ( 28,010) 1.3 min
    PROGRESS> 08:50:16 |######              |  30% ( 33,612) 1.6 min
    PROGRESS> 08:50:31 |#######             |  35% ( 39,214) 1.8 min
    PROGRESS> 08:50:46 |########            |  40% ( 44,816) 2.1 min
    PROGRESS> 08:51:03 |#########           |  45% ( 50,418) 2.3 min
    PROGRESS> 08:51:19 |##########          |  50% ( 56,020) 2.6 min
    PROGRESS> 08:51:34 |###########         |  55% ( 61,622) 2.9 min
    PROGRESS> 08:51:50 |############        |  60% ( 67,224) 3.1 min
    PROGRESS> 08:52:05 |#############       |  65% ( 72,826) 3.4 min
    PROGRESS> 08:52:21 |##############      |  70% ( 78,428) 3.6 min
    PROGRESS> 08:52:37 |###############     |  75% ( 84,030) 3.9 min
    PROGRESS> 08:52:53 |################    |  80% ( 89,632) 4.2 min
    PROGRESS> 08:53:08 |#################   |  85% ( 95,234) 4.4 min
    PROGRESS> 08:53:24 |##################  |  90% (100,836) 4.7 min
    PROGRESS> 08:53:39 |################### |  95% (106,438) 4.9 min

  4. Jason Vander Heiden

    Probably? I've never actually seen AssemblePairs hang, but that looks like something that's stuck.

    (BTW - You can use the triple-backtick code fence syntax for large blocks. You don't have to backtick every line.)

  5. Jason Vander Heiden

    If it happens on both farnam and ruddle, then it's probably not an issue with those older CPU systems we have in the kleinstein queue. The m915s, I think? Whichever systems those were that we had to compile R packages for separately.

    You could try restricting the types of nodes used, in case it is a CPU architecture compatibility issue. There are some instructions regarding that in the "Software" and "Compute Hardware" sections here: https://research.computing.yale.edu/support/hpc/clusters/farnam
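
    Something along these lines in the submission, for example; the feature name is a placeholder, the real ones are listed in the cluster docs (or via `sinfo -o "%n %f"`):

    ```
    # pin the job to a single node type to rule out mixed CPU architectures
    sbatch --constraint=haswell run_presto_sample.sh
    ```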

  6. Julian Zhou reporter

    Yep, not on m915. Everything was run on nx360, on both Farnam and Ruddle.

    It seems like when this happens, you just have to keep re-running it until it gets through the step... sometimes re-running multiple times.
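
    (The crash case, at least, can be brute-forced with something like this; the wrapper script name is a placeholder:)

    ```
    # crude retry: resubmit the step until it exits with status 0
    until bash run_presto_sample.sh; do
        echo "AssemblePairs step failed; re-running..." >&2
    done
    ```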

    Just now, a sample finished re-running the AP step, but then this happened:

    25/10/2018 10:23:26
    IDENTIFIER: 9
    DIRECTORY: /home/qz93/project/ellebedy_bulk/presto/sample_9/
    PRESTO VERSION: 0.5.10-2018.10.19
    
    START
       1: AssemblePairs sequential 10:23 10/25/18
    ERROR:
        *** Error in `/ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/bin/python': free(): invalid pointer: 0x00002b2fb75c8120 ***
        ======= Backtrace: =========
        /lib64/libc.so.6(+0x7c619)[0x2b2f8130c619]
        /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/python3.5/site-packages/numpy/core/multiarray.cpython-35m-x86_64-linux-gnu.so(+0x7c236)[0x2b2f8a507236]
        /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/python3.5/site-packages/numpy/core/multiarray.cpython-35m-x86_64-linux-gnu.so(+0x21dfe)[0x2b2f8a4acdfe]
        /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/python3.5/site-packages/pandas-0.17.1-py3.5-linux-x86_64.egg/pandas/lib.cpython-35m-x86_64-linux-gnu.so(+0x75f4a)[0x2b2fb444df4a]
        /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/python3.5/site-packages/pandas-0.17.1-py3.5-linux-x86_64.egg/pandas/lib.cpython-35m-x86_64-linux-gnu.so(+0x7a2cd)[0x2b2fb44522cd]
        /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(PyCFunction_Call+0xe9)[0x2b2f806cd5c9]
        /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/python3.5/site-packages/pandas-0.17.1-py3.5-linux-x86_64.egg/pandas/lib.cpython-35m-x86_64-linux-gnu.so(+0x218fd)[0x2b2fb43f98fd]
        /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/python3.5/site-packages/pandas-0.17.1-py3.5-linux-x86_64.egg/pandas/lib.cpython-35m-x86_64-linux-gnu.so(+0x4b1dd)[0x2b2fb44231dd]
        /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x8696)[0x2b2f807620c6]
        /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(+0x1727a1)[0x2b2f807637a1]
        /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x6661)[0x2b2f80760091]
        /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(+0x1727a1)[0x2b2f807637a1]
        /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x6661)[0x2b2f80760091]
        /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(+0x1727a1)[0x2b2f807637a1]
        /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(PyEval_EvalCodeEx+0x23)[0x2b2f80763893]
        /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(+0xb8cb5)[0x2b2f806a9cb5]
        /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(PyObject_Call+0x6a)[0x2b2f80678e8a]
        /ycga-gpfs/apps/hpc/software/Python/3.5.1-foss-2016b/lib/libpython3.5m.so.1.0(+0xa0954)[0x2b2f80691954]
        /ycga-gpfs/apps/hpc/software/P
    

    (and then it goes on and on like this for pages)

    Like o_O. But then I ran it again and it pushed through to the next step (MaskPrimers with Internal C).

  7. Julian Zhou reporter

    So I'm not sure how much there is to do about this after all, unless you want to add a warning or something that gets issued if the step goes on for more than a certain amount of time (roughly the idea sketched below). Otherwise, feel free to close the issue!
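
    (A sketch only; the 4 h limit and the command are placeholders:)

    ```
    # kill the step if it runs past a deadline and make the failure visible
    timeout --signal=KILL 4h AssemblePairs.py sequential ... \
        || echo "WARNING: AssemblePairs step crashed or did not finish within 4 h" >&2
    ```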

  8. Jason Vander Heiden

    The pipeline script should dump errors to a separate file (logs/pipeline.err).

    This looks like a compilation issue with the scientific Python libraries. Try using the Singularity image? Make sure to specify --cleanenv to singularity exec, and maybe --containall if you still have issues with the image accessing the host environment.
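
    For example (the image name is a placeholder for whatever build you're using):

    ```
    # run the step inside the container with the host environment stripped out
    singularity exec --cleanenv --containall \
        -B /home/qz93/project:/home/qz93/project \
        presto.sif AssemblePairs.py sequential ...
    ```

    Note that --containall also isolates $HOME and /tmp, so everything the step reads or writes needs to live under a bind point.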
