In the proper nfer semantics, pools are sets of intervals. In the code, however, pools are really just fancy lists. This difference results in more intervals being generated than we might like when minimality is turned off (or sometimes when it’s on, but this is less likely). There is a reasonable argument to be made about whether we should care about duplicate intervals and right now, obviously, we do not. However, to really match the language semantics it should not be possible to have more than one identical interval output.
To solve this generally might be a bit much since the operational semantics differ a bit due to the online addition of intervals, but it would be mostly solved by modifying pools so that they were really sets.
This could be done in one of maybe three ways:
- sort_pool (or maybe purge_pool) could be modified to throw out duplicates. This has the advantage that it doesn’t affect the cost of adding an interval.
- pools could be kept ordered at all times and duplicates not allowed to be added. If we could keep the cost of keeping pools sorted down, then this might be the best option. It would be nice not to have to call sort explicitly. This might actually slow things down overall, though.
- pools continue to be unordered and duplicates not allowed to be added. It is pretty arguable why this would make more sense than option 2, but one could imagine a hash key for an interval that doesn’t necessarily keep a natural ordering. Compute the hash key, check to see if it is already in the pool, don’t add it if so.
A natural hash key would multiply end times a large prime, begin times a smaller prime, etc. The point is that this would create a natural ordering. However, timestamps may already become very large and so large multiplication might overflow. It’s not clear how to reconcile that exactly. Maybe it wouldn’t be necessary if keeping things sorted. Just have a function that compares them as it is done now and then if the times are equal check the map, which can keep a hashkey.
It wouldn’t be hard to keep a pointer to the middle of the pool to then search for an insertion point if computing the comparison is cheap.