Performance degradation with NearestNeighbors

Issue #22 resolved
Joshua Anderson created an issue

There is a major TBB performance degradation when using the NearestNeighbors class. NearestNeighbors passes out boost::shared_arrays, and classes like HexOrderParameter request these shared arrays in a tight loop in every thread.

boost::shared_array has a mutex in the copy constructor and assignment operator. This makes the reference counting thread safe, but unbearably slow. On warhol, HexOrderParameter is bogged down to about single-core performance, even on 1 million particle data.

To resolve this, avoid boost::shared_array in the hot path. See LinkCell, where I had to make these changes once before (LinkCell used shared_array until I discovered this performance issue); a sketch of that pattern is below.
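
For reference, a minimal sketch of the pattern I mean (class and member names here are made up, not the actual freud code): copy the boost::shared_array once outside the parallel region and hand the TBB worker a raw pointer, so no ref-counted copies happen inside the per-particle loop.

    #include <boost/shared_array.hpp>
    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>

    // Hypothetical worker functor: holds a raw pointer instead of a
    // boost::shared_array, so iterating neighbors costs no ref-count traffic.
    class ComputeExample
    {
    public:
        ComputeExample(const unsigned int* neighbor_list, unsigned int num_neigh)
            : m_neighbor_list(neighbor_list), m_num_neigh(num_neigh)
        {
        }

        void operator()(const tbb::blocked_range<size_t>& r) const
        {
            for (size_t i = r.begin(); i != r.end(); ++i)
            {
                for (unsigned int k = 0; k < m_num_neigh; ++k)
                {
                    // plain pointer read; the owning shared_array outlives the parallel_for
                    unsigned int j = m_neighbor_list[i * m_num_neigh + k];
                    (void)j; // ... accumulate the order parameter for pair (i, j) here ...
                }
            }
        }

    private:
        const unsigned int* m_neighbor_list;
        unsigned int m_num_neigh;
    };

    void compute(boost::shared_array<unsigned int> neighbors, size_t Np, unsigned int num_neigh)
    {
        // take the single ref-count hit here, then parallelize over raw memory
        tbb::parallel_for(tbb::blocked_range<size_t>(0, Np),
                          ComputeExample(neighbors.get(), num_neigh));
    }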

Comments (6)

  1. Joshua Anderson reporter
    • changed status to open

    No, not really taken care of. I still only get ~5 threads running on a vis lab machine when using HexOrderParameter with 1024^2 particles.

  2. Joshua Anderson reporter

    Here is a profile trace:

    About half the time was spent loading the file; as you can see, the other half was spent in the nearest-neighbors code. Compared to the hex order code from before the nearest-neighbors update, it is ungodly slow.

    I didn't run the profiler with line-by-line info, but I see a couple of potential issues just from reading the code. One, several of the arrays use atomics even though they are never written to concurrently. Two, there are memory allocations/deallocations inside the worker thread (the neighbors vector). A sketch of one way to address both points follows the trace below.

    $ opreport -l -t 0.5
    Using /nfs/glotzer/projects/hexatic-poly/scan/analyzer/oprofile_data/samples/ for samples directory.
    warning: /no-vmlinux could not be found.
    warning: [vdso] (tgid:13353 range:0x7fff489ff000-0x7fff489fffff) could not be found.
    CPU: Intel Architectural Perfmon, speed 3.6e+06 MHz (estimated)
    Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
    samples  %        image name               app name                 symbol name
    4108721  57.2678  _freud.so                python                   tbb::interface6::internal::start_for<tbb::blocked_range<unsigned long>, freud::locality::ComputeNearestNeighbors, tbb::auto_partitioner const>::execute()
    853118   11.8908  no-vmlinux               python                   /no-vmlinux
    668543    9.3182  libc-2.19.so             python                   __memcpy_sse2_unaligned
    208276    2.9030  libc-2.19.so             python                   malloc
    179836    2.5066  _freud.so                python                   freud::locality::compareRsqVectors(std::pair<float, unsigned int> const&, std::pair<float, unsigned int> const&)
    155601    2.1688  libm-2.19.so             python                   atanf
    136928    1.9085  _freud.so                python                   tbb::interface6::internal::start_for<tbb::blocked_range<unsigned long>, freud::order::ComputeHexOrderParameter, tbb::auto_partitioner const>::execute()
    134388    1.8731  libc-2.19.so             python                   _int_free
    126690    1.7658  libm-2.19.so             python                   sincosf
    102808    1.4329  _freud.so                python                   freud::locality::LinkCell::computeCellNeighbors()
    77892     1.0857  libc-2.19.so             python                   _int_malloc
    66735     0.9302  libm-2.19.so             python                   cexpf
    61831     0.8618  _freud.so                python                   freud::locality::LinkCell::computeCellList(freud::trajectory::Box&, vec3<float> const*, unsigned int)
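
    A sketch of one way to address both points (names are hypothetical, not the actual freud code): keep a per-thread scratch vector with tbb::enumerable_thread_specific so nothing is allocated or freed per particle, and write each particle's output slots with plain stores, since a given index is only ever touched by one thread.

        #include <tbb/parallel_for.h>
        #include <tbb/blocked_range.h>
        #include <tbb/enumerable_thread_specific.h>
        #include <algorithm>
        #include <cstddef>
        #include <utility>
        #include <vector>

        typedef std::pair<float, unsigned int> RsqIndex;

        void compute_neighbors(unsigned int* out_neighbors, size_t Np, unsigned int num_neigh)
        {
            // one scratch vector per thread, created lazily and reused across particles
            tbb::enumerable_thread_specific< std::vector<RsqIndex> > scratch;

            tbb::parallel_for(tbb::blocked_range<size_t>(0, Np),
                [&](const tbb::blocked_range<size_t>& r)
                {
                    std::vector<RsqIndex>& candidates = scratch.local();
                    for (size_t i = r.begin(); i != r.end(); ++i)
                    {
                        candidates.clear(); // reuse capacity: no malloc/free per particle
                        // ... gather candidate (r^2, index) pairs for particle i ...
                        size_t n = std::min<size_t>(num_neigh, candidates.size());
                        std::partial_sort(candidates.begin(), candidates.begin() + n,
                                          candidates.end());
                        for (size_t k = 0; k < n; ++k)
                        {
                            // each i belongs to exactly one thread, so no atomics needed here
                            out_neighbors[i * num_neigh + k] = candidates[k].second;
                        }
                    }
                });
        }

    Using partial_sort on the reused buffer also avoids fully sorting the candidate list when only the first num_neigh entries are needed.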
    
  3. Eric Harper

    I think I have this fixed now. The iterator appears to be working correctly and I get the full 20 cores. The profiler isn't especially helpful with random data, since everything lands in a single cell, so most if not all of the time is spent inside the sort call, but I do get the full 20 cores.
