CustomVector is slow
Hi Klaus,
This is a revisit of my dislike of shared_ptr in CustomVector. Hopefully with some numbers I can convince you that CustomVector as is has rather limited use :)
First, I've done a lot of measuring and optimizing performance in our code without Blaze, and one thing I noticed in general is that std::shared_ptr is completely horrible for performance. Copying a shared_ptr is 14-15 times slower than copying a raw pointer on my system (Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz). Because of this I have redesigned a lot of the data structures to no longer use shared_ptr, trading the safety it might add for performance.
Yesterday I added Blaze into our core data structure. This data structure can either own a vector or point to a vector. Our design uses a raw pointer because this is very fast and lightweight. Adding Blaze and re-running the exact same simulation (the only change is replacing the raw pointer with a CustomVector) adds a 25% performance overhead. Hopefully you can understand how this is unacceptable for us.
I'm interested to hear your thoughts on this!
Best, Nils
-
Hi Nils!
Please update Blaze to the latest repository version. In commit 8ebba75 we have implemented a slight change in CustomVector to minimize the setup times for unmanaged custom vectors. This change has a very positive side effect on copy operations for unmanaged custom vectors.
We hope that with this change the copy overhead is acceptable. Copying a CustomVector will still be more expensive than copying a raw pointer, but it will no longer result in the expensive std::shared_ptr copy operation. In our tests we could only make out a difference between pass-by-value and pass-by-reference for small vectors (N < 100). Still, we would recommend passing by reference to guarantee zero overhead.
Best regards,
Klaus!
-
Hi Nils!
Some more measurements have made me realize that you might have been a victim of the setup times of CustomVector. In that case I apologize for my initial wrong analysis and admit that you had every right to complain about the performance of CustomVector. The good news is that commit 8ebba75 should indeed solve the problem.
I would appreciate brief feedback on whether the overhead is gone now and whether you are now able to work with CustomVector in your application. If you are unhappy about how I handled this issue I understand (again, I apologize). I will mark this issue as resolved in about a day.
Best regards,
Klaus!
-
reporter Hi Klaus,
I've been looking into these things more and benchmarking more. I'll give a brief reply now and in a couple days (maybe sooner) something with more data and also an implementation of what I was hoping to use.
Indeed, I gave the wrong impression in my initial post: it is the construction of a shared_ptr that is already very expensive. I did some measurements with Google Benchmark:
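    #include <benchmark/benchmark.h>
    #include <memory>

    // Construct a non-owning shared_ptr (no-op deleter) in every iteration.
    static void bench_shared_ptr(benchmark::State &state) {
      int temp = 5;
      while (state.KeepRunning()) {
        std::shared_ptr<int> ptr(&temp, [](auto) {});
        benchmark::DoNotOptimize(ptr.get());
      }
    }
    BENCHMARK(bench_shared_ptr);

    // Baseline: take the address as a raw pointer in every iteration.
    static void bench_ptr(benchmark::State &state) {
      int temp = 5;
      while (state.KeepRunning()) {
        int* ptr = &temp;
        benchmark::DoNotOptimize(ptr);
      }
    }
    BENCHMARK(bench_ptr);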
and here are the results:
    --------------------------------------------------------
    Benchmark                 Time           CPU Iterations
    --------------------------------------------------------
    bench_shared_ptr         38 ns         38 ns   18483196
    bench_ptr                 2 ns          2 ns  325625176
Note that in this case shared_ptr is NOT copied. The behavior is somewhat different between the two pointer cases. I am well aware of the difference, but I never encounter this difference and so I don't want to pay the overhead of shared_ptr.
With your commit 8ebba75 I see only a 12% regression (thank you very much for the quick fix!), which is not great, but I haven't yet looked into what is happening. I've managed to make pretty good progress on getting a class together that couples into the expression templates of Blaze but is just a raw pointer and a size_t, so it has zero overhead.
Best,
Nils
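For reference, here is a rough sketch of what differs between the two benchmarked cases. The helper names are hypothetical and only for illustration. On typical standard library implementations, constructing a shared_ptr from a raw pointer plus a deleter sets up a separate control block, and every copy updates a reference count, whereas a raw pointer is just an address.

```cpp
#include <cassert>
#include <memory>

// Hypothetical helper: builds a non-owning shared_ptr in the same spirit as
// the benchmark above, copies it once, and reports the reference count.
inline long use_count_after_one_copy() {
    int temp = 5;
    // Construction from a raw pointer plus deleter needs a control block,
    // even though the no-op deleter never frees anything.
    std::shared_ptr<int> ptr(&temp, [](int*) {});
    assert(ptr.use_count() == 1);
    // A copy shares the control block and increments the reference count.
    std::shared_ptr<int> copy = ptr;
    return copy.use_count();
}

// Hypothetical helper: a raw pointer copy just copies the address.
inline bool raw_pointer_copy_is_same_address() {
    int temp = 5;
    int* ptr = &temp;
    int* copy = ptr;
    return copy == ptr;
}
```

The sketch shows only the bookkeeping involved, not its cost; the timings above are the actual measurement.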
-
reporter Hi Klaus,
I found an issue with my code and was able to bring the regression down to ~3% with 8ebba75. I'm not sure if you expected zero overhead or not. I did implement a PointerVector that plugs into Blaze and has zero overhead. Would you be interested in this? It's rather simple: it's basically the same as CustomVector except that it only has the raw pointer and a size. I think having this in Blaze could be useful to other people, and naming it something like PointerVector, with additional documentation stating that the class is a zero-overhead pointer wrapper that adds expression templates to pointers, should make the associated risks clear. This is ultimately what I was looking for from Blaze and was hoping CustomVector would do, though I think there might be room for both (unless CustomVector can be made zero overhead while retaining the value semantics).
What are your thoughts?
Cheers,
Nils
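To illustrate the idea of a wrapper that holds only a raw pointer and a size, a minimal sketch might look like the following. This is an assumption-laden illustration, not the PointerVector mentioned above: the name PointerVectorSketch and all of its members are hypothetical, and it does not hook into Blaze's expression templates.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical non-owning view: just a raw pointer and a size.
// It never allocates or frees; the caller must keep the memory alive.
template <typename T>
class PointerVectorSketch {
public:
    PointerVectorSketch(T* data, std::size_t size) noexcept
        : data_(data), size_(size) {}

    std::size_t size() const noexcept { return size_; }
    T* data() const noexcept { return data_; }
    T* begin() const noexcept { return data_; }
    T* end() const noexcept { return data_ + size_; }

    // Bounds are checked only in debug builds (assert is removed by NDEBUG).
    T& operator[](std::size_t i) const noexcept {
        assert(i < size_);
        return data_[i];
    }

private:
    T* data_;
    std::size_t size_;
};

// Example use: sum a slice of a larger contiguous buffer through the view.
// Passing the view by value copies only the pointer and the size.
inline double sum(PointerVectorSketch<const double> v) {
    double total = 0.0;
    for (double x : v) total += x;
    return total;
}
```

Because the view does not own anything, copying it never touches the underlying data, which is also why the lifetime of the pointed-to memory becomes the caller's responsibility.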
-
Hi Nils!
Thanks for your reply. I'm glad that the regression is much smaller, but surprised that it is not entirely gone (i.e. 0% overhead). I would like to understand this in more detail. Therefore I would like to ask you for a favor (since I don't have your application): Please update CustomVector in two steps to give me a better understanding of the problem in your case:
Step 1: Please comment out or remove (without ill effect) all lines containing mv_ in CustomVector (that should be 30 lines). My expectation is that this should not change the regression. If it does, I would be interested in the compiler you are using.
Step 2: Please comment out or remove the if conditions in the constructors on line 787 (for an unpadded CustomVector) and line 3216 (for a padded CustomVector). I expect these checks to cause the overhead.
Also, I would be very much interested to learn how frequently you create a CustomVector and which operations you use it for after it has been created. I hope I'm not asking too much. Thanks a lot,
Best regards,
Klaus!
-
reporter Hi Klaus,
I just checked both of your suggestions and I needed to remove the shared_ptr from CustomVector to get zero overhead. I'm not sure why this is an issue: possibly class size, or some other aspect that allows the compiler to more aggressively optimize a class that contains only two fundamental types. Let me know if there's anything else you'd like me to check!
We are creating this fairly frequently, several times per time step. We use a large contiguous vector to hold data for all our evolved variables and then point into that vector. Creating temporaries, etc. means this happens at least once per time step, depending on the equations we're solving.
Cheers,
Nils
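As a rough illustration of the "class size" and "two fundamental types" points, the following sketch compares two hypothetical stand-in structs (neither is the real CustomVector layout). It only shows two observable differences, size and trivial copyability, which may or may not be what the compiler is reacting to in the actual application.

```cpp
#include <cstddef>
#include <memory>
#include <type_traits>

// Hypothetical stand-in for a view that also carries shared ownership.
struct ViewWithSharedPtr {
    double* data;
    std::size_t size;
    std::shared_ptr<double> owner;
};

// Hypothetical stand-in for a view made of two fundamental types only.
struct ViewRawOnly {
    double* data;
    std::size_t size;
};

// The raw-only view can be copied like plain bytes.
static_assert(std::is_trivially_copyable<ViewRawOnly>::value,
              "pointer + size is trivially copyable");

// The shared_ptr member adds a non-trivial copy constructor and destructor.
static_assert(!std::is_trivially_copyable<ViewWithSharedPtr>::value,
              "a shared_ptr member makes the type non-trivially copyable");

// The shared_ptr member also makes the object larger.
static_assert(sizeof(ViewWithSharedPtr) > sizeof(ViewRawOnly),
              "the shared_ptr member increases the object size");
```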
-
Hi Nils!
Thanks for your results. Please allow me two more questions: Which compiler do you usually use? And which template parameters do you pass to CustomVector (ptr1, ptr2, ptr3, or ptr4)?
    blaze::CustomVector<T,aligned,padded> ptr1;      // Creating an aligned, padded custom vector
    blaze::CustomVector<T,aligned,unpadded> ptr2;    // Creating an aligned, unpadded custom vector
    blaze::CustomVector<T,unaligned,padded> ptr3;    // Creating an unaligned, padded custom vector
    blaze::CustomVector<T,unaligned,unpadded> ptr4;  // Creating an unaligned, unpadded custom vector
Best regards,
Klaus!
-
reporter Hey Klaus,
Ask as many as you need :) We're currently moving the code over to GitHub and open sourcing it, so as soon as that's done it'll be easier for us to compare things.
I'm using:
    blaze::CustomVector<double, blaze::unaligned, blaze::unpadded>
and GCC 7.1.1 with the following flags:
    -std=c++14 -march=native -O3 -g -DNDEBUG
(and a bunch of warning flags)
Cheers,
Nils
-
Hi Nils!
We have run multiple test scenarios and can confirm your findings. Despite commit 8ebba75, which seems to have fixed all initialization problems, we have found that the size of CustomVector can indeed influence the performance. Also, this does not seem to be a compiler-specific problem, since both GCC and Clang show it very clearly.
This is to let you know that we take the problem seriously and are working on a solution. We are still deciding what the best course of action will be to solve the issue.
Best regards,
Klaus!
-
reporter Hi Klaus!
I'm glad to hear you are able to reproduce what I'm finding. I'm interested in seeing what you come up with as a solution! I have observed it with Clang too.
Best,
Nils
-
- changed status to open
-
- changed status to resolved
The performance problems with CustomVector have been resolved. With commit 32cd60f, CustomVector has been stripped of its shared ownership capabilities. For now it is no longer possible to transfer the responsibility for some memory to a Blaze vector, but the feature will be retrofitted in a different form in a later commit. The same refactoring will also be applied to CustomMatrix.
The fix is immediately available by cloning the Blaze repository and will be officially released in Blaze 3.2.
-
reporter Hi Klaus,
Thanks for the fix!
Nils
-
Hi Nils!
Thanks for raising this issue again. Can you please show what the data structure you mention looks like and give an example of how it is used? From your statement "... the only change is replacing the raw pointer with a CustomVector ..." I currently conclude that you are copying a CustomVector just as you would copy a raw pointer. In this case I completely agree: in this scenario CustomVector will cause a significant performance overhead (you are lucky that it is just 25% overhead) and I agree that this is unacceptable for any kind of scientific code.
Assuming that you copy CustomVector a lot, the problem is that conceptually a CustomVector is not a pointer, it is a vector (hence the name). You have taken a look at the implementation details, have seen that it uses a std::shared_ptr, perceive it as a pointer, and now argue that the implementation causes overhead in comparison to a raw pointer. But this is a very unfair comparison: copying any kind of vector (blaze::StaticVector, blaze::DynamicVector, std::vector, ...) will be (much) more expensive than copying a pointer. This is to be expected and is in the nature of a value type.
If my assumption is correct, I don't see the problem on our side or in the implementation of CustomVector, but on your side: Please don't perceive a CustomVector as a pointer, but consider it to be a vector. Please don't pass a CustomVector by value, but by reference (as you should pass any user-defined type in C++ unless you have specific guarantees). In this way it will not cause any kind of overhead and will be as efficient as the other vector types. If my assumption is incorrect, please give me some idea of how you use CustomVector and how the overhead is caused.
Best regards,
Klaus!
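The pass-by-value versus pass-by-reference point can be sketched with a small standalone example. CountingVector is a hypothetical value type invented for this illustration (it is not a Blaze class); it simply counts how often its copy constructor runs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical value type that counts how often it is copied.
struct CountingVector {
    std::vector<double> values;

    explicit CountingVector(std::size_t n) : values(n, 1.0) {}
    CountingVector(const CountingVector& other) : values(other.values) {
        ++copies();
    }

    // Global copy counter, stored in a function-local static.
    static int& copies() {
        static int count = 0;
        return count;
    }
};

// Pass by value: the argument is copied on every call.
inline double sum_by_value(CountingVector v) {
    double total = 0.0;
    for (double x : v.values) total += x;
    return total;
}

// Pass by reference: no copy is made.
inline double sum_by_reference(const CountingVector& v) {
    double total = 0.0;
    for (double x : v.values) total += x;
    return total;
}

// Hypothetical demo: returns how many copies one call of each kind causes.
inline int copies_made_by_demo() {
    CountingVector v(4);
    CountingVector::copies() = 0;
    const double by_ref = sum_by_reference(v);
    const double by_val = sum_by_value(v);
    assert(by_ref == by_val);
    return CountingVector::copies();
}
```

In this sketch only the by-value call triggers the copy constructor, which is the cost that passing by reference avoids.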