- attached debug_EE_dist.pkl
- attached debug_EE_dist_df.pkl
Fatal `EstimateError.py set` error after progress bar reads 100%
Error:
/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3419: RuntimeWarning: Mean of empty slice.
return _methods._mean(a, axis=axis, dtype=dtype,
/opt/conda/lib/python3.8/site-packages/numpy/core/_methods.py:188: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
Process Process-17:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/bin/EstimateError.py", line 367, in collectEEQueue
int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)]) \
ValueError: cannot convert float NaN to integer
ERROR> Exiting due to child process error.
NOTICE> Terminating child processes... Done.
Console log:
Wed 28 Apr 2021 03:39:16 AM UTC
START> EstimateError
FILE> JOIN-uid_cluster-pass.fastq
MODE> freq
SET_FIELD> INDEX_UID
MIN_COUNT> 20
MAX_DIVERSITY> None
NPROC> 15
PROGRESS> 03:50:48 | | 0% ( 0) 4.8 min
PROGRESS> 03:54:46 |# | 5% ( 5,622) 8.8 min
PROGRESS> 03:56:13 |## | 10% ( 11,244) 10.3 min
PROGRESS> 03:57:40 |### | 15% ( 16,866) 11.7 min
PROGRESS> 03:58:11 |#### | 20% ( 22,488) 12.2 min
PROGRESS> 03:59:42 |##### | 25% ( 28,110) 13.7 min
PROGRESS> 04:01:13 |###### | 30% ( 33,732) 15.3 min
PROGRESS> 04:01:55 |####### | 35% ( 39,354) 16.0 min
PROGRESS> 04:02:23 |######## | 40% ( 44,976) 16.4 min
PROGRESS> 04:04:47 |######### | 45% ( 50,598) 18.8 min
PROGRESS> 04:05:32 |########## | 50% ( 56,220) 19.6 min
PROGRESS> 04:06:14 |########### | 55% ( 61,842) 20.3 min
PROGRESS> 04:06:47 |############ | 60% ( 67,464) 20.8 min
PROGRESS> 04:07:26 |############# | 65% ( 73,086) 21.5 min
PROGRESS> 04:08:11 |############## | 70% ( 78,708) 22.2 min
PROGRESS> 04:08:41 |############### | 75% ( 84,330) 22.7 min
PROGRESS> 04:08:50 |################ | 80% ( 89,952) 22.9 min
PROGRESS> 04:09:00 |################# | 85% ( 95,574) 23.1 min
PROGRESS> 04:09:59 |################## | 90% (101,196) 24.0 min
PROGRESS> 04:11:08 |################### | 95% (106,818) 25.2 min
PROGRESS> 04:11:10 |####################| 100% (112,428) 25.2 min
Comments (13)
-
reporter -
reporter I attached a MRE.
I hacked
presto
and askedcollectEEQueue
inEstimateError.py
to outputdist
anddist_df
before line 367, the offending error, during an actual run that was experiencing the error.The error can then be reproduced locally as well:
import pickle import pandas as pd import numpy as np with open('debug_EE_dist.pkl', 'rb') as input: dist = pickle.load(input) with open('debug_EE_dist_df.pkl', 'rb') as input: dist_df = pickle.load(input) # line 367 # error happens thresh_df = pd.DataFrame.from_dict({'thresh': {'ALL': dist_df.index[np.argmax(dist) + \ int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)]) \ if dist[index] == np.min(dist)]))]} })
-
reporter - changed title to Fatal `EstimateError.py set` error after progress bar reads 100%
-
reporter Man, that
thresh_df
line is insane!Traced the error down to:
int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)])
if dist[index] == np.min(dist)]))
Apparently,
`
[index for index in np.argsort(dist[:int(len(dist)*0.75)])
if dist[index] == np.min(dist)]
` evaluates to[]
.As a result,
int(np.mean([]))
complains aboutValueError: cannot convert float NaN to integer
.Looking at the list comprehension,
np.min(dist)
evaluates to0.0
The list comprehension evaluates to an empty list because
for index in np.argsort(dist[:int(len(dist)*0.75)]): print(dist[index]) 2204.0 2844.0 3120.0 3887.0 4270.0 4771.0 5883.0 9013.0 9371.0 9694.0 15772.0 16762.0 18100.0 20350.0 25545.0 25753.0 40180.0 41889.0 45282.0 45910.0 55368.0 56283.0 58865.0 63356.0 64246.0 64574.0 68575.0 69425.0 104577.0 134327.0 267810.0 378157.0 494295.0 893743.0 1436839.0 2773445.0 14354059.0
None of which equates
np.min(dist)
-
reporter Not sure if it matters:
I got this error when
- dumping the entire 12G of fastq (9+M sequences) to EE with
-n 20
- subsampling to 100000 sequences with
-n 5
I did not get an error when
- subsampling to 10000 sequences with
-n 20
(but with my data that seems useless an estimation because there was only 1 set with >20 sequences that “passed“)
- dumping the entire 12G of fastq (9+M sequences) to EE with
-
Potentially relevant pandas change in v0.24 that added support for
NaN
in integer vectors.https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support
(And, yeah, that
thresh_df
line has some readability issues…)Also, potentially relevant as a stop-gap solution, you can specific
-f
to SplitSeq-sample to randomly sample evenly from sequences with the same value in the specified field. Eg,-f UMI -n 100
will sample up to 100 sequences from each unique UMI. You will change the read distribution, but it’s a good way to compress huge UMI groups that dominate the read counts. -
reporter Thanks. I tried the stop-gap, but the same error occurred. The calculation itself (-f INDEX_UID -n 100 ==> total 3444378 sets) was only under 10 min (according to the progress bar). Still, cool trick about
SplitSeq sample
that I didn’t know about!So is int(np.mean([])) supposed to be
0
in such a case? If so,np.argmax(dist)
evaluates to0
too here; and sodist_df.index[np.argmax(dist)+0]
evaluates to0.0
, which becomes the value ofTHRESH: ALL
. Does that mean thresh is supposed to be 1-0.0 = 1 here? Which basically means there’s no point to cluster by sequences?I’d debug more but I don’t know what exactly
thresh_df
is trying to do here, and the rest of EE.py is quite complex too.. -
I’m not sure. I’ll have to actually look at it. EE is a mix of very old code by me and some stuff Roy added later for the UMI collision methods. I’ll take a look a bit later… (day job keeps me pretty busy these days).
I seems feasible that the mean distance is indeed zero and it’s just not handling it properly. My gut would be some sort of numpy/pandas version issue if that’s case, because that should’ve popped up before.
-
reporter Phew. I spent some time investigating this as I (thought I) needed this to work for a current project. I don’t think it’s a numpy/pandas version issue. The core problem, if I’m not wrong, lies in the logic written in that
thresh_df
line. In addition to that, I also identified multiple other errors in the EE-related code. It’s a bit too complicated to explain them here, so I put them in this shared Benchling page.Also tagging @ruoyijiangyale since I think this is essentially his code?
-
Very nice. Yeah, @Roy Jiang , is probably the one to fix this issue. IIRC, there’s an assumption of a binomial distribution for pairwise distances (based on what you would get with random 4 characters strings) and those threshold windows are derived from that.
-
Hey Julian,
Welcome to this corner of presto. This looks like a bug in the code that you caught.#This... int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)])if dist[index] == np.min(dist)])) #Should be... int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)])if dist[index] == np.min(dist[:int(len(dist)*0.75)])])) #I'm guessing the cases where this was tested on np.min(dist) == np.min(dist[:int(len(dist)*0.75)])]) #so it was missed... #Probably should've written it... #consider the dist up to 0.75 of the dist profile and find index of the min value dist_window = dist[:int(len(dist)*0.75)] np.argmin(dist_window)
Regarding bigger issues with EstimateError.py set...
The goal is to solve a 1D minimum threshold identification problem with the lowest tech method possible.EstimateError.py set is challenging for the reasons you describe; because sequencing protocols have different error profiles and this was optimized using our protocols based on a few 2d histogram calculations from a long time ago.
The reason we didn't try so hard is because 0.8 is the widest cluster that CD-HIT-EST can cluster. So there was no point coming up with a sophisticated way to find the threshold when the open source (no vsearch/usearch!) high speed clustering technique had limitations anyways. I know from simulations, this method works though.
EstimateError.py the original was written before my time for calculating Illumina sequencing error profiles. Then we adapted it for this specific problem.
-
I added the code changes from @ruoyijiangyale in 5c2e000.
-
- changed status to resolved
Assuming we're good after 5c2e000.
- Log in to comment
</div> </form>