Fatal `EstimateError.py set` error after progress bar reads 100%

Issue #83 resolved

Julian Zhou created an issue 2021-04-28

Error:

/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3419: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/opt/conda/lib/python3.8/site-packages/numpy/core/_methods.py:188: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Process Process-17:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/bin/EstimateError.py", line 367, in collectEEQueue
    int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)]) \
ValueError: cannot convert float NaN to integer
ERROR> Exiting due to child process error.
NOTICE> Terminating child processes...  Done.

Console log:

Wed 28 Apr 2021 03:39:16 AM UTC
        START> EstimateError
         FILE> JOIN-uid_cluster-pass.fastq
         MODE> freq
    SET_FIELD> INDEX_UID
    MIN_COUNT> 20
MAX_DIVERSITY> None
        NPROC> 15

PROGRESS> 03:50:48 |                    |   0% (      0) 4.8 min
PROGRESS> 03:54:46 |#                   |   5% (  5,622) 8.8 min
PROGRESS> 03:56:13 |##                  |  10% ( 11,244) 10.3 min
PROGRESS> 03:57:40 |###                 |  15% ( 16,866) 11.7 min
PROGRESS> 03:58:11 |####                |  20% ( 22,488) 12.2 min
PROGRESS> 03:59:42 |#####               |  25% ( 28,110) 13.7 min
PROGRESS> 04:01:13 |######              |  30% ( 33,732) 15.3 min
PROGRESS> 04:01:55 |#######             |  35% ( 39,354) 16.0 min
PROGRESS> 04:02:23 |########            |  40% ( 44,976) 16.4 min
PROGRESS> 04:04:47 |#########           |  45% ( 50,598) 18.8 min
PROGRESS> 04:05:32 |##########          |  50% ( 56,220) 19.6 min
PROGRESS> 04:06:14 |###########         |  55% ( 61,842) 20.3 min
PROGRESS> 04:06:47 |############        |  60% ( 67,464) 20.8 min
PROGRESS> 04:07:26 |#############       |  65% ( 73,086) 21.5 min
PROGRESS> 04:08:11 |##############      |  70% ( 78,708) 22.2 min
PROGRESS> 04:08:41 |###############     |  75% ( 84,330) 22.7 min
PROGRESS> 04:08:50 |################    |  80% ( 89,952) 22.9 min
PROGRESS> 04:09:00 |#################   |  85% ( 95,574) 23.1 min
PROGRESS> 04:09:59 |##################  |  90% (101,196) 24.0 min
PROGRESS> 04:11:08 |################### |  95% (106,818) 25.2 min
PROGRESS> 04:11:10 |####################| 100% (112,428) 25.2 min

‌

Comments (13)

Julian Zhou reporter

attached debug_EE_dist.pkl
attached debug_EE_dist_df.pkl

  <div class="preview-container wiki-content"><!-- loaded via ajax --></div>
  <div class="mask"></div>
</div>

</div> </form>

2021-04-28T08:52:16+00:00

Julian Zhou reporter

I attached a MRE.

I hacked presto and asked collectEEQueue in EstimateError.py to output dist and dist_df before line 367, the offending error, during an actual run that was experiencing the error.

The error can then be reproduced locally as well:

import pickle
import pandas as pd
import numpy as np

with open('debug_EE_dist.pkl', 'rb') as input:
     dist = pickle.load(input)

with open('debug_EE_dist_df.pkl', 'rb') as input:
     dist_df = pickle.load(input)

# line 367
# error happens
thresh_df = pd.DataFrame.from_dict({'thresh': {'ALL': dist_df.index[np.argmax(dist) + \
                                           int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)]) \
                                                        if dist[index] == np.min(dist)]))]}
                                            })

2021-04-28T08:55:39+00:00

Julian Zhou reporter
- changed title to Fatal `EstimateError.py set` error after progress bar reads 100%
- 2021-04-28T08:57:06+00:00
Julian Zhou reporter
Man, that thresh_df line is insane!

Traced the error down to:

int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)])
if dist[index] == np.min(dist)]))

Apparently,

`[index for index in np.argsort(dist[:int(len(dist)*0.75)])
if dist[index] == np.min(dist)]` evaluates to [].

As a result, int(np.mean([])) complains about ValueError: cannot convert float NaN to integer.

Looking at the list comprehension,

np.min(dist) evaluates to 0.0

The list comprehension evaluates to an empty list because
```
for index in np.argsort(dist[:int(len(dist)*0.75)]):
    print(dist[index])
2204.0
2844.0
3120.0
3887.0
4270.0
4771.0
5883.0
9013.0
9371.0
9694.0
15772.0
16762.0
18100.0
20350.0
25545.0
25753.0
40180.0
41889.0
45282.0
45910.0
55368.0
56283.0
58865.0
63356.0
64246.0
64574.0
68575.0
69425.0
104577.0
134327.0
267810.0
378157.0
494295.0
893743.0
1436839.0
2773445.0
14354059.0
```
None of which equates np.min(dist)
- 2021-04-28T09:07:00+00:00
Julian Zhou reporter
Not sure if it matters:

I got this error when
- dumping the entire 12G of fastq (9+M sequences) to EE with -n 20
- subsampling to 100000 sequences with -n 5
I did not get an error when
- subsampling to 10000 sequences with -n 20 (but with my data that seems useless an estimation because there was only 1 set with >20 sequences that “passed“)
‌
- 2021-04-28T15:54:51+00:00
Jason Vander Heiden
Potentially relevant pandas change in v0.24 that added support for NaN in integer vectors.

https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support

(And, yeah, that thresh_df line has some readability issues…)

Also, potentially relevant as a stop-gap solution, you can specific -f to SplitSeq-sample to randomly sample evenly from sequences with the same value in the specified field. Eg, -f UMI -n 100 will sample up to 100 sequences from each unique UMI. You will change the read distribution, but it’s a good way to compress huge UMI groups that dominate the read counts.
- 2021-04-28T16:30:11+00:00
Julian Zhou reporter
Thanks. I tried the stop-gap, but the same error occurred. The calculation itself (-f INDEX_UID -n 100 ==> total 3444378 sets) was only under 10 min (according to the progress bar). Still, cool trick about SplitSeq sample that I didn’t know about!

So is int(np.mean([])) supposed to be 0 in such a case? If so, np.argmax(dist) evaluates to 0 too here; and so dist_df.index[np.argmax(dist)+0] evaluates to 0.0, which becomes the value of THRESH: ALL. Does that mean thresh is supposed to be 1-0.0 = 1 here? Which basically means there’s no point to cluster by sequences?

I’d debug more but I don’t know what exactly thresh_df is trying to do here, and the rest of EE.py is quite complex too..
- 2021-04-28T18:52:33+00:00
Jason Vander Heiden
I’m not sure. I’ll have to actually look at it. EE is a mix of very old code by me and some stuff Roy added later for the UMI collision methods. I’ll take a look a bit later… (day job keeps me pretty busy these days).

I seems feasible that the mean distance is indeed zero and it’s just not handling it properly. My gut would be some sort of numpy/pandas version issue if that’s case, because that should’ve popped up before.
- 2021-04-28T20:03:58+00:00
Julian Zhou reporter
Phew. I spent some time investigating this as I (thought I) needed this to work for a current project. I don’t think it’s a numpy/pandas version issue. The core problem, if I’m not wrong, lies in the logic written in that thresh_df line. In addition to that, I also identified multiple other errors in the EE-related code. It’s a bit too complicated to explain them here, so I put them in this shared Benchling page.

Also tagging @ruoyijiangyale since I think this is essentially his code?
- 2021-04-29T04:21:37+00:00
Jason Vander Heiden
Very nice. Yeah, @Roy Jiang , is probably the one to fix this issue. IIRC, there’s an assumption of a binomial distribution for pairwise distances (based on what you would get with random 4 characters strings) and those threshold windows are derived from that.

‌
- 2021-04-29T16:27:26+00:00
Roy Jiang
Hey Julian,
Welcome to this corner of presto. This looks like a bug in the code that you caught.
```
#This...
int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)])if dist[index] == np.min(dist)]))

#Should be...
int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)])if dist[index] == np.min(dist[:int(len(dist)*0.75)])]))

#I'm guessing the cases where this was tested on
np.min(dist) == np.min(dist[:int(len(dist)*0.75)])])
#so it was missed...

#Probably should've written it...
#consider the dist up to 0.75 of the dist profile and find index of the min value
dist_window = dist[:int(len(dist)*0.75)]
np.argmin(dist_window)
```
Regarding bigger issues with EstimateError.py set...
The goal is to solve a 1D minimum threshold identification problem with the lowest tech method possible.

EstimateError.py set is challenging for the reasons you describe; because sequencing protocols have different error profiles and this was optimized using our protocols based on a few 2d histogram calculations from a long time ago.

The reason we didn't try so hard is because 0.8 is the widest cluster that CD-HIT-EST can cluster. So there was no point coming up with a sophisticated way to find the threshold when the open source (no vsearch/usearch!) high speed clustering technique had limitations anyways. I know from simulations, this method works though.

EstimateError.py the original was written before my time for calculating Illumina sequencing error profiles. Then we adapted it for this specific problem.
- 2021-04-29T19:16:37+00:00
Jason Vander Heiden
I added the code changes from @ruoyijiangyale in 5c2e000.
- 2021-08-09T00:31:48+00:00
Jason Vander Heiden
- changed status to resolved
Assuming we're good after 5c2e000.
- 2021-10-26T18:28:41+00:00
Log in to comment

Assignee: –

Type: bug

Priority: critical

Status: resolved

Votes: 0

Watchers: 3