nrnr / perfnotes.txt

File perfnotes.txt

 Memory needed to create the array of 10 million points (baseline). The neighborhood and
 results memory is whatever a run uses over this.
     470832maxresident
+
+
+
+Fixed the incorrectness issues. Create time for uint8 is not <1 sec. But I think
+it can still be reduced. Want to profile the different sections first.
+
+    -- with 255 chunks in uint8
+    Create neighborhood: 1.978 sec
+    Search complete 638857: 0.080 sec
+    Cleanup: 0.009 sec
+    725264maxresident
+
+    -- with 65k chunks in uint16
+    Create neighborhood: 4.037 sec
+    Search complete 638857: 0.056 sec
+    Cleanup: 0.015 sec
+    960112maxresident
+
+    -- with 100k chunks in uint32
+    Create neighborhood: 4.370 sec
+    Search complete 638857: 0.062 sec
+    Cleanup: 0.022 sec
+    1428320maxresident
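Subtracting the baseline from each run gives the "memory over this" that the baseline note refers to. A small sketch of that arithmetic, assuming the maxresident figures come from GNU time and are therefore in kilobytes; `overhead_kb` is a made-up name for illustration:

```c
#include <assert.h>

/* Baseline from the notes: maxresident for just the 10 million point array.
   Assumed to be in kilobytes, as GNU time reports it. */
#define BASELINE_KB 470832L

/* Hypothetical helper: memory a run used beyond the baseline, in KB. */
long overhead_kb(long maxresident_kb)
{
    assert(maxresident_kb >= BASELINE_KB);
    return maxresident_kb - BASELINE_KB;
}
```

With the figures above this gives 254432 for uint8, 489280 for uint16 and 957488 for uint32, so each step up in index width roughly doubles the overhead.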
+
+
+The init could still be threaded to cut it down more.
+Actually, with the chunking, both the search and the create could easily be threaded by chunk.
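If the chunks really are independent, threading by chunk mostly comes down to handing each worker its own disjoint range of chunks. A minimal sketch of that split, assuming contiguous ranges and leaving the actual thread creation out; `chunk_range_t` and `chunk_range` are hypothetical names, not taken from the code:

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of chunk indices owned by one worker. */
typedef struct {
    size_t begin;
    size_t end;
} chunk_range_t;

/* Split num_chunks as evenly as possible across num_workers.
   The first (num_chunks % num_workers) workers get one extra chunk. */
chunk_range_t chunk_range(size_t num_chunks, size_t num_workers, size_t worker)
{
    assert(num_workers > 0 && worker < num_workers);
    size_t base = num_chunks / num_workers;
    size_t extra = num_chunks % num_workers;
    chunk_range_t r;
    r.begin = worker * base + (worker < extra ? worker : extra);
    r.end = r.begin + base + (worker < extra ? 1 : 0);
    return r;
}
```

Each worker would then run the create or the search over its own range only, so no two workers touch the same chunk.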
+
+With the uint16 indices (and 32k chunks)
+10r = Search complete 10034: 0.010 sec
+20r = Search complete 79836: 0.025 sec
+40r = Search complete 638857: 0.055 sec
+80r = Search complete 5118520: 0.100 sec
+
+
+Next up, speed up the create. Is it the filling of the initial numbered array? If so that may
+be sped up using memcpy to duplicate each chunk?
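If the fill does turn out to be the slow part, the memcpy idea could look roughly like this: number the first chunk with a loop, then copy it over every later chunk. This is only a sketch and assumes every chunk starts out with the same local indices 0..chunk_size-1 in one contiguous uint16 array; `fill_chunk_indices` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill num_chunks consecutive chunks, each with local indices 0..chunk_size-1.
   Only the first chunk is numbered by the loop; the rest are memcpy copies. */
void fill_chunk_indices(uint16_t *indices, size_t chunk_size, size_t num_chunks)
{
    assert(chunk_size <= 65536); /* local indices must fit in uint16 */
    if (num_chunks == 0)
        return;
    for (size_t i = 0; i < chunk_size; i++)
        indices[i] = (uint16_t)i;
    for (size_t c = 1; c < num_chunks; c++)
        memcpy(indices + c * chunk_size, indices, chunk_size * sizeof *indices);
}
```

Whether this beats the plain loop would need to be measured, since the fill may already be limited by memory bandwidth.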
+
+
+If I break the data into chunks AFTER sorting, could that help things? PERHAPS!
+Currently every chunk finds about 20k points on the min axis, which results in about 4k.
+That is with the 65k chunks of uint16.
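For reference, a sketch of the kind of scan these numbers suggest: within a chunk sorted along one axis, two binary searches bracket the candidates inside center +/- radius (the 20k), and only those are then checked against the other axes (the 4k). This is an assumption about how the search works, not taken from the code; `lower_bound`, `upper_bound` and `candidate_range` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* First index in sorted[0..n) whose value is >= key (n if none). */
size_t lower_bound(const float *sorted, size_t n, float key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* First index in sorted[0..n) whose value is > key (n if none). */
size_t upper_bound(const float *sorted, size_t n, float key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] <= key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Half-open index range [*begin, *end) of values within center +/- radius
   along the sorted axis. Points in this range still need the full check. */
void candidate_range(const float *sorted, size_t n, float center, float radius,
                     size_t *begin, size_t *end)
{
    assert(radius >= 0.0f);
    *begin = lower_bound(sorted, n, center - radius);
    *end = upper_bound(sorted, n, center + radius);
}
```

Under this reading, chunking after the sort would make each chunk a contiguous slab along the sorted axis, so a search could skip whole chunks whose slab lies outside the range.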
+
+Perhaps optimize the uint8 case by finding quick early outs when scanning sections?
+
+
+Moving uint16 from 65k chunks to 32k gave a speedup at the larger search radii, but had no
+effect on the 10r and 20r searches.