Fatal `EstimateError.py set` error after progress bar reads 100%

Issue #83 resolved
Julian Zhou created an issue

Error:

/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3419: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/opt/conda/lib/python3.8/site-packages/numpy/core/_methods.py:188: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Process Process-17:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/bin/EstimateError.py", line 367, in collectEEQueue
    int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)]) \
ValueError: cannot convert float NaN to integer
ERROR> Exiting due to child process error.
NOTICE> Terminating child processes...  Done.

Console log:

Wed 28 Apr 2021 03:39:16 AM UTC
        START> EstimateError
         FILE> JOIN-uid_cluster-pass.fastq
         MODE> freq
    SET_FIELD> INDEX_UID
    MIN_COUNT> 20
MAX_DIVERSITY> None
        NPROC> 15

PROGRESS> 03:50:48 |                    |   0% (      0) 4.8 min
PROGRESS> 03:54:46 |#                   |   5% (  5,622) 8.8 min
PROGRESS> 03:56:13 |##                  |  10% ( 11,244) 10.3 min
PROGRESS> 03:57:40 |###                 |  15% ( 16,866) 11.7 min
PROGRESS> 03:58:11 |####                |  20% ( 22,488) 12.2 min
PROGRESS> 03:59:42 |#####               |  25% ( 28,110) 13.7 min
PROGRESS> 04:01:13 |######              |  30% ( 33,732) 15.3 min
PROGRESS> 04:01:55 |#######             |  35% ( 39,354) 16.0 min
PROGRESS> 04:02:23 |########            |  40% ( 44,976) 16.4 min
PROGRESS> 04:04:47 |#########           |  45% ( 50,598) 18.8 min
PROGRESS> 04:05:32 |##########          |  50% ( 56,220) 19.6 min
PROGRESS> 04:06:14 |###########         |  55% ( 61,842) 20.3 min
PROGRESS> 04:06:47 |############        |  60% ( 67,464) 20.8 min
PROGRESS> 04:07:26 |#############       |  65% ( 73,086) 21.5 min
PROGRESS> 04:08:11 |##############      |  70% ( 78,708) 22.2 min
PROGRESS> 04:08:41 |###############     |  75% ( 84,330) 22.7 min
PROGRESS> 04:08:50 |################    |  80% ( 89,952) 22.9 min
PROGRESS> 04:09:00 |#################   |  85% ( 95,574) 23.1 min
PROGRESS> 04:09:59 |##################  |  90% (101,196) 24.0 min
PROGRESS> 04:11:08 |################### |  95% (106,818) 25.2 min
PROGRESS> 04:11:10 |####################| 100% (112,428) 25.2 min

Comments (13)

  1. Julian Zhou reporter

    I attached a MRE.

    I hacked presto and asked collectEEQueue in EstimateError.py to output dist and dist_df before line 367, the offending error, during an actual run that was experiencing the error.

    The error can then be reproduced locally as well:

    import pickle
    import pandas as pd
    import numpy as np
    
    with open('debug_EE_dist.pkl', 'rb') as input:
         dist = pickle.load(input)
    
    with open('debug_EE_dist_df.pkl', 'rb') as input:
         dist_df = pickle.load(input)
    
    # line 367
    # error happens
    thresh_df = pd.DataFrame.from_dict({'thresh': {'ALL': dist_df.index[np.argmax(dist) + \
                                               int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)]) \
                                                            if dist[index] == np.min(dist)]))]}
                                                })
    

    😕

  2. Julian Zhou reporter

    Man, that thresh_df line is insane!

    Traced the error down to:

    int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)])
    if dist[index] == np.min(dist)]))

    Apparently,

    `[index for index in np.argsort(dist[:int(len(dist)*0.75)])
    if dist[index] == np.min(dist)]` evaluates to [].

    As a result, int(np.mean([])) complains about ValueError: cannot convert float NaN to integer.

    Looking at the list comprehension,

    np.min(dist) evaluates to 0.0

    The list comprehension evaluates to an empty list because

    for index in np.argsort(dist[:int(len(dist)*0.75)]):
        print(dist[index])
    2204.0
    2844.0
    3120.0
    3887.0
    4270.0
    4771.0
    5883.0
    9013.0
    9371.0
    9694.0
    15772.0
    16762.0
    18100.0
    20350.0
    25545.0
    25753.0
    40180.0
    41889.0
    45282.0
    45910.0
    55368.0
    56283.0
    58865.0
    63356.0
    64246.0
    64574.0
    68575.0
    69425.0
    104577.0
    134327.0
    267810.0
    378157.0
    494295.0
    893743.0
    1436839.0
    2773445.0
    14354059.0
    

    None of which equates np.min(dist)

  3. Julian Zhou reporter

    Not sure if it matters:

    I got this error when

    • dumping the entire 12G of fastq (9+M sequences) to EE with -n 20
    • subsampling to 100000 sequences with -n 5

    I did not get an error when

    • subsampling to 10000 sequences with -n 20 (but with my data that seems useless an estimation because there was only 1 set with >20 sequences that “passed“)

  4. Jason Vander Heiden

    Potentially relevant pandas change in v0.24 that added support for NaN in integer vectors.

    https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support

    (And, yeah, that thresh_df line has some readability issues…)

    Also, potentially relevant as a stop-gap solution, you can specific -f to SplitSeq-sample to randomly sample evenly from sequences with the same value in the specified field. Eg, -f UMI -n 100 will sample up to 100 sequences from each unique UMI. You will change the read distribution, but it’s a good way to compress huge UMI groups that dominate the read counts.

  5. Julian Zhou reporter

    Thanks. I tried the stop-gap, but the same error occurred. The calculation itself (-f INDEX_UID -n 100 ==> total 3444378 sets) was only under 10 min (according to the progress bar). Still, cool trick about SplitSeq sample that I didn’t know about!

    So is int(np.mean([])) supposed to be 0 in such a case? If so, np.argmax(dist) evaluates to 0 too here; and so dist_df.index[np.argmax(dist)+0] evaluates to 0.0, which becomes the value of THRESH: ALL. Does that mean thresh is supposed to be 1-0.0 = 1 here? Which basically means there’s no point to cluster by sequences?

    I’d debug more but I don’t know what exactly thresh_df is trying to do here, and the rest of EE.py is quite complex too..

  6. Jason Vander Heiden

    I’m not sure. I’ll have to actually look at it. EE is a mix of very old code by me and some stuff Roy added later for the UMI collision methods. I’ll take a look a bit later… (day job keeps me pretty busy these days).

    I seems feasible that the mean distance is indeed zero and it’s just not handling it properly. My gut would be some sort of numpy/pandas version issue if that’s case, because that should’ve popped up before.

  7. Julian Zhou reporter

    Phew. I spent some time investigating this as I (thought I) needed this to work for a current project. I don’t think it’s a numpy/pandas version issue. The core problem, if I’m not wrong, lies in the logic written in that thresh_df line. In addition to that, I also identified multiple other errors in the EE-related code. It’s a bit too complicated to explain them here, so I put them in this shared Benchling page.

    Also tagging @ruoyijiangyale since I think this is essentially his code?

  8. Jason Vander Heiden

    Very nice. Yeah, @Roy Jiang , is probably the one to fix this issue. IIRC, there’s an assumption of a binomial distribution for pairwise distances (based on what you would get with random 4 characters strings) and those threshold windows are derived from that.

  9. Roy Jiang

    Hey Julian,
    Welcome to this corner of presto. This looks like a bug in the code that you caught.

    #This...
    int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)])if dist[index] == np.min(dist)]))
    
    #Should be...
    int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)])if dist[index] == np.min(dist[:int(len(dist)*0.75)])]))
    
    #I'm guessing the cases where this was tested on
    np.min(dist) == np.min(dist[:int(len(dist)*0.75)])])
    #so it was missed...
    
    #Probably should've written it...
    #consider the dist up to 0.75 of the dist profile and find index of the min value
    dist_window = dist[:int(len(dist)*0.75)]
    np.argmin(dist_window)
    

    Regarding bigger issues with EstimateError.py set...
    The goal is to solve a 1D minimum threshold identification problem with the lowest tech method possible.

    EstimateError.py set is challenging for the reasons you describe; because sequencing protocols have different error profiles and this was optimized using our protocols based on a few 2d histogram calculations from a long time ago.

    The reason we didn't try so hard is because 0.8 is the widest cluster that CD-HIT-EST can cluster. So there was no point coming up with a sophisticated way to find the threshold when the open source (no vsearch/usearch!) high speed clustering technique had limitations anyways. I know from simulations, this method works though.

    EstimateError.py the original was written before my time for calculating Illumina sequencing error profiles. Then we adapted it for this specific problem.

  10. Log in to comment