Memory needed to create the 10 million point array is the baseline, so the hood and results
figures are the amount of memory over this.
+Fixed the incorrectness issues. Create time for uint8 is not <1 sec, but I think
+it can still be reduced. Want to profile different sections first.
+ -- with 255 chunks in uint8
+ Create neighborhood: 1.978 sec
+ Search complete 638857: 0.080 sec
+ -- with 65k chunks in uint16
+ Create neighborhood: 4.037 sec
+ Search complete 638857: 0.056 sec
+ -- with 100k chunks in uint32
+ Create neighborhood: 4.370 sec
+ Search complete 638857: 0.062 sec
+The init could still be threaded to cut it down more.
+Actually, with the chunking the search and create could be easily threaded by chunk.
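A rough sketch of what threading by chunk could look like, assuming the chunks really are independent of each other. The helper name forEachChunkThreaded and the strided split of chunks across threads are made up for illustration; the per-chunk create or search call would go inside the callback.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not from the existing code): calls work(chunkIndex)
// exactly once for every chunk, spreading the chunks over threadCount threads.
// Assumes work() only touches data that belongs to its own chunk.
template <typename Work>
void forEachChunkThreaded(std::size_t chunkCount, unsigned threadCount, Work work)
{
    assert(threadCount > 0);
    std::vector<std::thread> threads;
    threads.reserve(threadCount);
    for (unsigned t = 0; t < threadCount; ++t) {
        threads.emplace_back([=]() mutable {
            // Strided assignment: thread t handles chunks t, t+N, t+2N, ...
            for (std::size_t c = t; c < chunkCount; c += threadCount)
                work(c);
        });
    }
    // Wait for every thread before the caller reads any per-chunk results.
    for (auto& th : threads)
        th.join();
}
```

Each thread would write its results into a per-chunk slot, so no locking should be needed as long as chunks do not share output.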
+With the uint16 indices (and 32k chunks)
+10r = Search complete 10034: 0.010 sec
+20r = Search complete 79836: 0.025 sec
+40r = Search complete 638857: 0.055 sec
+80r = Search complete 5118520: 0.100 sec
+Next up: speed up the create. Is the cost in filling the initial numbered array? If so, that may
+be sped up by using memcpy to duplicate each chunk.
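A minimal sketch of the memcpy idea, assuming the numbered array holds chunk-local uint16 indices (0, 1, 2, ... restarting in every chunk) and that all chunks are the same size. Both are assumptions about the layout, and makeLocalIndices is a made-up name. Only the first chunk is numbered in a loop; the rest are copies of it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <numeric>
#include <vector>

// Hypothetical helper: builds chunkCount back-to-back index arrays, each
// holding 0..chunkSize-1. Assumes equal-sized chunks (no short last chunk).
std::vector<std::uint16_t> makeLocalIndices(std::size_t chunkCount, std::size_t chunkSize)
{
    // uint16 can only address 65536 entries per chunk.
    assert(chunkSize > 0 && chunkSize <= 65536);
    std::vector<std::uint16_t> indices(chunkCount * chunkSize);
    if (chunkCount == 0)
        return indices;

    // Number the first chunk the slow way: 0, 1, 2, ...
    std::iota(indices.begin(),
              indices.begin() + static_cast<std::ptrdiff_t>(chunkSize),
              std::uint16_t{0});

    // Every other chunk starts out identical, so duplicate the first one.
    for (std::size_t c = 1; c < chunkCount; ++c) {
        std::memcpy(indices.data() + c * chunkSize,
                    indices.data(),
                    chunkSize * sizeof(std::uint16_t));
    }
    return indices;
}
```

Whether this helps depends on the fill actually being the slow part of the create, which still needs profiling.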
+If I break the data into chunks AFTER sorting, could that help things? PERHAPS!
+Currently all chunks are finding 20k points on the min axis and resulting in about 4k.
+That is from the 65k chunks of uint16.
+Perhaps optimize the uint8 case by finding quick early outs when scanning sections?
+Moving uint16 from 65k chunks to 32k sped up the larger search radii, but had no
+effect on the 10r and 20r searches.