- changed status to resolved
Performance degradation with NearestNeighbors
There is a major TBB performance degradation when using the NearestNeighbors class. NearestNeighbors passes out boost::shared_arrays, and classes like HexOrderParameter request these shared arrays in a tight loop on every thread.
boost::shared_array has a mutex in its copy constructor and assignment operator. This makes the reference counting thread-safe, but unbearably slow. On warhol, HexOrderParameter is bogged down to roughly single-core performance, even on 1-million-particle data.
To resolve this, avoid boost::shared_array in the hot path. See LinkCell, where I had to make these changes once before (LinkCell used shared_array until I discovered this performance issue).
Comments (6)
-
-
reporter - changed status to open
No, not really taken care of. I still only get ~5 threads running on a vis lab machine when using HexOrderParameter with 1024^2 particles.
-
reporter Here is a profile trace:
About half the time was spent loading the file. As you can see, the other half was spent in the nearest-neighbors code. Compared to the hex order code from before the nearest-neighbors update, it is ungodly slow.
I didn't run the profiler with line-by-line info, but I see a lot of potential issues just from reading the code. One, you use atomics for several of the arrays, which are never written to concurrently. Two, there are memory allocations/deallocations within the worker thread (the neighbors vector).
$ opreport -l -t 0.5
Using /nfs/glotzer/projects/hexatic-poly/scan/analyzer/oprofile_data/samples/ for samples directory.
warning: /no-vmlinux could not be found.
warning: [vdso] (tgid:13353 range:0x7fff489ff000-0x7fff489fffff) could not be found.
CPU: Intel Architectural Perfmon, speed 3.6e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name     app name  symbol name
4108721  57.2678  _freud.so      python    tbb::interface6::internal::start_for<tbb::blocked_range<unsigned long>, freud::locality::ComputeNearestNeighbors, tbb::auto_partitioner const>::execute()
853118   11.8908  no-vmlinux     python    /no-vmlinux
668543    9.3182  libc-2.19.so   python    __memcpy_sse2_unaligned
208276    2.9030  libc-2.19.so   python    malloc
179836    2.5066  _freud.so      python    freud::locality::compareRsqVectors(std::pair<float, unsigned int> const&, std::pair<float, unsigned int> const&)
155601    2.1688  libm-2.19.so   python    atanf
136928    1.9085  _freud.so      python    tbb::interface6::internal::start_for<tbb::blocked_range<unsigned long>, freud::order::ComputeHexOrderParameter, tbb::auto_partitioner const>::execute()
134388    1.8731  libc-2.19.so   python    _int_free
126690    1.7658  libm-2.19.so   python    sincosf
102808    1.4329  _freud.so      python    freud::locality::LinkCell::computeCellNeighbors()
77892     1.0857  libc-2.19.so   python    _int_malloc
66735     0.9302  libm-2.19.so   python    cexpf
61831     0.8618  _freud.so      python    freud::locality::LinkCell::computeCellList(freud::trajectory::Box&, vec3<float> const*, unsigned int)
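The second issue flagged above (allocations inside the worker thread) has a standard remedy: allocate the scratch buffer once per task body and clear it between particles, so the tight loop never touches malloc/free. A minimal sketch with hypothetical names; freud's actual worker is a TBB functor, but the buffer-reuse pattern is the same:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-thread worker: `neighbors` is allocated once per task,
// not once per particle. clear() keeps the capacity, so iterations after
// the first perform no heap allocation at all.
void process_range(const float* rsq, std::size_t begin, std::size_t end,
                   std::size_t n_candidates, std::vector<float>& out)
{
    std::vector<float> neighbors;        // one allocation per task body
    neighbors.reserve(n_candidates);
    for (std::size_t i = begin; i < end; ++i) {
        neighbors.clear();               // reuse capacity; no dealloc/realloc
        for (std::size_t j = 0; j < n_candidates; ++j)
            neighbors.push_back(rsq[i * n_candidates + j]);
        std::sort(neighbors.begin(), neighbors.end());
        out[i] = neighbors.front();      // e.g. record the nearest distance
    }
}
```

Inside a tbb::parallel_for, the vector would simply be a local of the functor's operator(), giving each worker its own buffer with no atomics or sharing needed.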
-
I need a break from all the work I've done getting CUDA into freud, so I'll take a look at it.
-
I think I have this fixed now. The iterator appears to be working correctly and I get a full 20 cores. The profiler isn't very helpful with random data, since everything lands in a single cell, so most if not all of the time is spent inside the sort call.
-
- changed status to resolved
Resolves issue #22

Merged in NeighborListFix (pull request #19)
Actually fixed the neighbor list this time
→ <<cset 98dfb22ed19d>>
Taken care of