nrnr / perfnotes.txt

nrnr
Create neighborhood: 15.615 sec
80r = Search block 5120872: 0.296 sec
40r = Search block 639449: 0.114 sec
20r = Search block 79971: 0.047 sec
10r = Search block 20031: 0.021 sec
Cleanup: 0.012 sec
3282352maxresident

40r 4d: sig11 (segfault)
40r 3d: .119 sec
40r 2d: .057 sec
40r 1d: .023 sec

kdtree
inserting 10000000 random vectors... 29.000 sec
80r = range query returned 2680049 items in 1.02700 sec
40r = range query returned 334901 items in 0.16200 sec
20r = range query returned 41902 items in 0.02000 sec
10r = range query returned 5175 items in 0.00300 sec
cleaned up in 2.32100 sec
3169120maxresident

nanoflann
Create Kdtree: 12.735sec
80r = Search: 2680049, 0.512sec
40r = Search: 334901, 0.076sec  ! crazy, similar speed between 20 and 40 unit radius !
20r = Search: 41902, 0.07sec
10r = Search: 5175, 0.001sec
Cleanup: 0.0279999sec
1180688maxresident



On the creation of the neighborhood: reduce the 'inverse' to a single array instead
of one per dimension. This shortens create time from 16.5s to 12.0s and reduces
memory by 76MB (sketch after the numbers below).
    Create neighborhood: 12.069 sec
    Search block 639983: 0.116 sec
    Cleanup: 0.011 sec
    2658256maxresident
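
Roughly what the single-array layout looks like, as a sketch only (the names
InverseIndex/numPoints and the size_t element type are assumptions, not the actual
nrnr code):

    #include <cstddef>
    #include <vector>

    // One flat inverse array instead of one per dimension: index it as
    // dim * numPoints + pointIdx. One allocation instead of three, and the
    // base pointer for an axis can be computed once per query.
    struct InverseIndex {
        size_t numPoints = 0;
        std::vector<size_t> data;                    // size = 3 * numPoints

        explicit InverseIndex(size_t n) : numPoints(n), data(3 * n) {}

        size_t& at(int dim, size_t pointIdx)
            { return data[static_cast<size_t>(dim) * numPoints + pointIdx]; }
        size_t  at(int dim, size_t pointIdx) const
            { return data[static_cast<size_t>(dim) * numPoints + pointIdx]; }
    };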



The volume of a sphere relative to its bounding cube is ~0.52 (pi/6). Validated by
comparing the number of items found by the rectangular search against the spherical
searches of kdtree and nanoflann.
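
Quick check using the 40r counts above (spherical kdtree/nanoflann vs rectangular nrnr):

    sphere / bounding-cube volume = (4/3)*pi*r^3 / (2r)^3 = pi/6 ~= 0.5236
    334901 / 639449 ~= 0.5237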



Minimize the code inside the search loop. Gives about a 20% performance gain.
Not digging in the right place, but it gets the 40-radius test to .1 second
(sketch after the numbers below).
    Create neighborhood: 12.076 sec
    Search block 639983: 0.09 sec
    Cleanup: 0.010 sec
    2657216maxresident
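
The flavor of change this refers to, sketched with assumed names (scanAxis, sorted,
and minRank/maxRank are illustrative, not the real loop): hoist per-query work out of
the loop body and collapse the two range compares into one.

    #include <cstddef>
    #include <vector>

    // Hot-loop sketch: the per-dimension base pointer and the range span are
    // computed once, so the body is a load, one unsigned compare, and a
    // conditional store. The unsigned subtract makes values below minRank wrap
    // to a huge number, so a single compare covers both bounds.
    static size_t scanAxis(const std::vector<size_t>& inverse, size_t numPoints, int dim,
                           const size_t* sorted, size_t lo, size_t hi,
                           size_t minRank, size_t maxRank, size_t* results)
    {
        const size_t* inv = inverse.data() + static_cast<size_t>(dim) * numPoints; // hoisted
        const size_t span = maxRank - minRank;                                     // hoisted
        size_t count = 0;
        for (size_t i = lo; i < hi; ++i) {
            const size_t p = sorted[i];
            if (inv[p] - minRank <= span)
                results[count++] = p;
        }
        return count;
    }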




Switched the code to work in 'chunks'. This breaks the array of point numbers
into small sequential ranges. The algorithm wins with these smaller sections.
The other win is that all the array offsets now only index within a chunk, so by
making the chunk size small enough the indices fit into a uint8 or a uint16. Cuts a
good 40% off of runtime and really drops memory. We could still use significantly
less memory by storing the results in some form of stack; currently a size_t array
of numPoints size is allocated, which means a good 40MB.
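
A minimal sketch of what the chunked layout implies (Chunk, firstPoint, and localRank
are assumed names, not the real structures): every index stored inside a chunk is
relative to the chunk start, so a small enough chunk lets it fit in a uint8_t or uint16_t.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Chunked index storage: global point numbers are split into small
    // sequential ranges. Within a chunk only the offset from firstPoint is
    // stored, so 255-point chunks need 1 byte per entry and 65k-point chunks
    // need 2 bytes, instead of 8 bytes for a global size_t.
    template <typename LocalIdx>
    struct Chunk {
        size_t firstPoint = 0;             // global index of the chunk's first point
        std::vector<LocalIdx> localRank;   // chunk-relative entries
    };

    using Chunk8  = Chunk<uint8_t>;        // the "uint8 index" runs below
    using Chunk16 = Chunk<uint16_t>;       // the "uint16 index" runs below

    // Recovering a global point number from a chunk-local one:
    inline size_t globalIndex(size_t firstPoint, uint16_t local) { return firstPoint + local; }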

Notice that the result counts vary here, so there is some problem which could render
these results irrelevant. I am highly dubious, since these runs are getting about half
the results of the unchunked version.

Wow, look at the time needed to build the hood. Super fast, and that is (?) valid time.
(Hulk Hogan eat your heart out!)

    -- with 10k chunks in size_t index
    Create neighborhood: 3.332 sec
    Search complete 640842: 0.066 sec
    Cleanup: 0.008 sec
    2366256maxresident

    -- with 10k chunks in uint16 index
    Create neighborhood: 2.942 sec
    Search complete 592354: 0.054 sec
    Cleanup: 0.010 sec
    926848maxresident

    -- with 255 chunks in uint8 index
    Create neighborhood: 0.887 sec
    Search complete 332549: 0.053 sec
    Cleanup: 0.008 sec
    715744maxresident



Memory needed to create the 10-million-point array (baseline). So the hood and the
results account for the memory above this.
    470832maxresident



Fixed the incorrectness issues. Create time for uint8 is no longer under 1 sec, but I
think it can still be reduced. Want to profile the different sections first.

    -- with 255 chunks in uint8
    Create neighborhood: 1.978 sec
    Search complete 638857: 0.080 sec
    Cleanup: 0.009 sec
    725264maxresident

    -- with 65k chunks in uint16
    Create neighborhood: 4.037 sec
    Search complete 638857: 0.056 sec
    Cleanup: 0.015 sec
    960112maxresident

    -- with 100k chunks in uint32
    Create neighborhood: 4.370 sec
    Search complete 638857: 0.062 sec
    Cleanup: 0.022 sec
    1428320maxresident


The init could still be threaded to cut it down more.
Actually, with the chunking the search and create could be easily threaded by chunk.
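
A sketch of the per-chunk threading idea (forEachChunkParallel and searchChunk are made
up for illustration): chunks are independent, so each thread can own a disjoint set of
chunks for both create and search, writing into per-chunk storage with no locking.

    #include <cstddef>
    #include <thread>
    #include <vector>

    // Run perChunk(c) for every chunk index, spread over numThreads threads.
    // Strided assignment keeps the load roughly even without a work queue;
    // perChunk must be safe to call concurrently.
    template <typename Fn>
    void forEachChunkParallel(size_t numChunks, unsigned numThreads, Fn perChunk)
    {
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < numThreads; ++t) {
            workers.emplace_back([&perChunk, numChunks, numThreads, t] {
                for (size_t c = t; c < numChunks; c += numThreads)
                    perChunk(c);
            });
        }
        for (auto& w : workers) w.join();
    }

    // e.g. forEachChunkParallel(numChunks, std::thread::hardware_concurrency(),
    //                           [&](size_t c) { searchChunk(c); });  // searchChunk is hypothetical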

With the uint16 indices (and 32k chunks)
10r = Search complete 10034: 0.010 sec
20r = Search complete 79836: 0.025 sec
40r = Search complete 638857: 0.055 sec
80r = Search complete 5118520: 0.100 sec


Next up, speed up the create. Is it the filling of the initial numbered array? If so,
that might be sped up by using memcpy to duplicate each chunk.
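
Sketch of the memcpy idea, assuming the initial numbered array holds chunk-local values
0..chunkSize-1 repeated for every chunk (fillChunkIndices is illustrative, not the real
code): write the first chunk once, then memcpy it over every later chunk instead of
rerunning the counting loop.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Fill a flat per-chunk index array where every chunk restarts at 0.
    void fillChunkIndices(std::vector<uint16_t>& indices, size_t chunkSize)
    {
        if (indices.empty() || chunkSize == 0) return;
        const size_t first = std::min(chunkSize, indices.size());
        for (size_t i = 0; i < first; ++i)
            indices[i] = static_cast<uint16_t>(i);                         // one template chunk
        for (size_t off = first; off < indices.size(); off += chunkSize) {
            const size_t n = std::min(chunkSize, indices.size() - off);
            std::memcpy(&indices[off], &indices[0], n * sizeof(uint16_t)); // duplicate it
        }
    }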


If I break the data into chunks AFTER sorting, could that help things? PERHAPS!
Currently every chunk finds about 20k points on the min axis but ends up with only
about 4k results. That is from the 65k uint16 chunks.

Perhaps optimize the uint8 case by finding quick early-outs for scanning sections?


Moving uint16 from 65k chunks to 32k gave a speedup at the larger search radii, but had
no effect on the 10r and 20r searches.

97% of the search time is spent in the main search loop.
No need to focus on optimizing the bisects.

Searching 2 out of 3 dimensions still takes up most of the time.
This is good, because it confirms that intersecting the first two axes
takes the majority of the time.
Single axis finds: 3976488  .017 seconds
Second axis finds: 1594896  .042 seconds
All three finds:    638857  .054 seconds
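
A sketch of the staged filtering these counts describe (assumed shape; inRange stands in
for the per-axis window test and is not real code): axis 0 yields ~3.98M candidates, the
axis-1 test cuts that to ~1.59M, and the axis-2 test to ~639k, which is why the first
two stages carry most of the cost.

    #include <cstddef>
    #include <vector>

    // Progressive per-axis narrowing: later stages see far fewer candidates.
    template <typename InRange>
    size_t intersectAxes(const std::vector<size_t>& axis0Hits,   // stage 1 output
                         InRange inRange,                        // inRange(dim, p): rank-in-window test
                         std::vector<size_t>& out)
    {
        out.clear();
        for (size_t p : axis0Hits) {
            if (!inRange(1, p)) continue;   // stage 2: most of the remaining work
            if (!inRange(2, p)) continue;   // stage 3: few candidates left by now
            out.push_back(p);
        }
        return out.size();
    }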


With uint8 and 255 chunks, .026 of the time is spent bisecting and .048 searching (64%).
With uint8 and 64 chunks,  .052 of the time is spent bisecting and .048 searching (48%).
Note that there are NO (0) chunks that end up with 0 points on any axis. LAMEO! (40 radius)