CustomVector is slow

Issue #108 resolved
Nils Deppe created an issue

Hi Klaus,

This is a revisit of my dislike of shared_ptr in CustomVector. Hopefully with some numbers I can convince you that CustomVector, as is, has rather limited use :) First, I've done a lot of measuring and optimizing of performance in our code without Blaze, and one thing I noticed in general is that std::shared_ptr is completely horrible for performance. Copying a shared_ptr is 14-15 times slower than copying a raw pointer on my system (Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz). Because of this I have redesigned a lot of the data structures to no longer use shared_ptr, trading the safety it might add for performance.

Yesterday I added Blaze into our core data structure. This data structure can either own a vector or point to a vector. Our design uses a raw pointer because this is very fast and lightweight. Adding Blaze and re-running the exact same simulation (the only change is replacing the raw pointer with a CustomVector) adds a 25% performance overhead. Hopefully you can understand how this is unacceptable for us.

I'm interested to hear your thoughts on this!

Best, Nils

Comments (15)

  1. Klaus Iglberger

    Hi Nils!

    Thanks for raising this issue again. Can you please show what the data structure you mention looks like and give an example of how it is used? From your statement "... the only change is replacing the raw pointer with a CustomVector..." I currently conclude that you are copying a CustomVector just as you would copy a raw pointer. In this case I completely agree: In this scenario CustomVector will cause a significant performance overhead (you are lucky that it is just 25% overhead) and I agree that this is unacceptable for any kind of scientific code.

    Assuming that you copy CustomVector a lot, the problem is that conceptually a CustomVector is not a pointer, it is a vector (therefore the name). You have taken a look at the implementation details, have seen that it uses a std::shared_ptr, perceive it as a pointer, and now argue that the implementation causes overhead in comparison to a raw pointer. But this is a very unfair comparison: copying any kind of vector (blaze::StaticVector, blaze::DynamicVector, std::vector, ...) will be (much) more expensive than copying a pointer. This is to be expected and is in the nature of a value type.

    If my assumption is correct, I don't see the problem on our side or in the implementation of CustomVector, but on your side: Please don't perceive a CustomVector as a pointer, but consider it to be a vector. Please don't pass a CustomVector by value, but by reference (as you should pass any user-defined type in C++ unless you have specific guarantees). In this way it will not cause any kind of overhead and will be as efficient as the other vector types. If my assumption is incorrect, please give me some idea of how you use CustomVector and how the overhead is caused.

    Best regards,

    Klaus!
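
A minimal sketch of the pass-by-value versus pass-by-reference point made in this comment. `SharedView`, `make_view`, `owners_by_value` and `owners_by_reference` are hypothetical names, not part of Blaze; the struct only assumes, as the thread describes, that the vector type keeps a std::shared_ptr internally.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a vector type that keeps its memory alive
// through a std::shared_ptr. This is NOT Blaze's actual CustomVector.
struct SharedView {
  std::shared_ptr<double> data;
  std::size_t size;
};

// Pass-by-value: the parameter is a copy, so the reference count is
// incremented on entry and decremented on exit.
long owners_by_value(SharedView v) {
  assert(v.data != nullptr);
  return v.data.use_count();
}

// Pass-by-reference: no copy is made and the reference count is untouched.
long owners_by_reference(const SharedView& v) {
  assert(v.data != nullptr);
  return v.data.use_count();
}

// Allocate n zero-initialized doubles and hand them to a shared_ptr
// with an array deleter.
SharedView make_view(std::size_t n) {
  return SharedView{
      std::shared_ptr<double>(new double[n](), std::default_delete<double[]>()),
      n};
}
```

Given an lvalue `SharedView v = make_view(8);`, `owners_by_reference(v)` sees one owner while `owners_by_value(v)` sees two, because the by-value parameter is a temporary second owner. That extra reference-count traffic is the copy cost under discussion.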

  2. Klaus Iglberger

    Hi Nils!

    Please update Blaze to the latest repository version. In commit 8ebba75 we have implemented a slight change in CustomVector to minimize the setup times for unmanaged custom vectors. This change has a very positive side effect on copy operations for unmanaged custom vectors.

    We hope that with this change the copy overhead is acceptable. Copying a CustomVector will still be more expensive than copying a raw pointer, but it will not result in the expensive std::shared_ptr copy operation. In our tests we could only make out a difference between pass-by-value and pass-by-reference for small vectors (N < 100). Still, we recommend passing by reference to guarantee zero overhead.

    Best regards,

    Klaus!

  3. Klaus Iglberger

    Hi Nils!

    Some more measurements have made me realize that you might have been a victim of the setup times of CustomVector. In that case I apologize for my initial wrong analysis and admit that you had every right to complain about the performance of CustomVector. The good news is that commit 8ebba75 should indeed solve the problem.

    I would appreciate brief feedback on whether the overhead is gone now and whether you are now able to work with CustomVector in your application. If you are unhappy about how I handled this issue, I understand (again, I apologize). I will mark this issue as resolved in about a day.

    Best regards,

    Klaus!

  4. Nils Deppe reporter

    Hi Klaus,

    I've been looking into these things more and benchmarking more. I'll give a brief reply now and in a couple days (maybe sooner) something with more data and also an implementation of what I was hoping to use.

    Indeed, I gave the wrong impression in my initial post: it is the construction of a shared_ptr that is already very expensive. I did some measurements with Google Benchmark:

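    #include <benchmark/benchmark.h>
    #include <memory>

    // Construct (never copy) a shared_ptr with a no-op deleter around memory it does not own.
    static void bench_shared_ptr(benchmark::State &state) {
      int temp = 5;
      while (state.KeepRunning()) {
        std::shared_ptr<int> ptr(&temp, [](auto){});  // generic lambda, needs C++14
        benchmark::DoNotOptimize(ptr.get());
      }
    }
    BENCHMARK(bench_shared_ptr);

    // Baseline: take the address as a raw pointer.
    static void bench_ptr(benchmark::State &state) {
      int temp = 5;
      while (state.KeepRunning()) {
        int* ptr = &temp;
        benchmark::DoNotOptimize(ptr);
      }
    }
    BENCHMARK(bench_ptr);

    BENCHMARK_MAIN();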
    

    and here are the results:

    --------------------------------------------------------
    Benchmark                 Time           CPU Iterations
    --------------------------------------------------------
    bench_shared_ptr         38 ns         38 ns   18483196
    bench_ptr                 2 ns          2 ns  325625176
    

    Note that in this case shared_ptr is NOT copied. The behavior is somewhat different between the two pointer cases. I am well aware of the difference, but I never encounter this difference and so I don't want to pay the overhead of shared_ptr.

    With your commit 8ebba75 I see only a 12% regression (thank you very much for the quick fix!), which is not great but I haven't looked yet as to what is happening. I've managed to make pretty good progress on getting a class together that couples into the expression templates of Blaze but is just a raw pointer and a size_t, so it has zero overhead.

    Best,

    Nils
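
One way to see why the two pointer cases in this comment behave differently: wrapping memory in a std::shared_ptr, even with a no-op deleter, needs a control block for the reference count and the deleter. The sketch below makes that allocation visible by using the constructor overload that takes an allocator. `CountingAllocator`, `g_allocations` and `count_allocations_for_wrapping` are hypothetical names for illustration; the benchmark above uses the default allocation path instead.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Number of allocations requested through CountingAllocator.
static int g_allocations = 0;

// Minimal allocator (hypothetical, for illustration) that counts allocations
// and forwards the actual work to std::allocator.
template <class T>
struct CountingAllocator {
  using value_type = T;
  CountingAllocator() = default;
  template <class U>
  CountingAllocator(const CountingAllocator<U>&) {}
  T* allocate(std::size_t n) {
    ++g_allocations;
    return std::allocator<T>().allocate(n);
  }
  void deallocate(T* p, std::size_t n) { std::allocator<T>().deallocate(p, n); }
};
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) {
  return true;
}
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) {
  return false;
}

// Wrap memory we do not own in a shared_ptr with a no-op deleter and report
// how many allocations the wrapping itself triggered.
int count_allocations_for_wrapping() {
  g_allocations = 0;
  int temp = 5;
  {
    std::shared_ptr<int> ptr(&temp, [](int*) {}, CountingAllocator<int>());
    assert(ptr.get() == &temp);
  }
  return g_allocations;
}
```

`count_allocations_for_wrapping()` should report at least one allocation for the control block, whereas `int* ptr = &temp;` allocates nothing. That would be consistent with the gap in the measurements above, although the sketch does not measure time.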

  5. Nils Deppe reporter

    Hi Klaus,

    I found an issue with my code and was able to bring the regression down to ~3% with 8ebba75. I'm not sure if you expected zero overhead or not. I did implement a PointerVector that plugs into Blaze which is zero overhead. Would you be interested in this? It's rather simple. It's basically the same as CustomVector except that it only has the raw pointer and a size. I think having this in Blaze could potentially be useful to other people and naming it something like PointerVector with additional documentation that the class is a zero-overhead pointer wrapper that adds expression templates to pointers should make the associated risks clear. This is ultimately what I was looking for from Blaze and was hoping CustomVector would do, though I think there might be room for both (unless CustomVector can be made zero overhead while retaining the value semantics).

    What are your thoughts?

    Cheers,

    Nils
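
A rough sketch of the kind of class described in this comment: a raw pointer plus a size, with no ownership. `PointerView` is a hypothetical name and is not the PointerVector mentioned above; in particular it does not hook into Blaze's expression templates. It only illustrates the data layout and the aliasing risk that the proposed documentation would need to spell out.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical non-owning view over contiguous doubles.
// It never allocates or frees; the caller keeps the memory alive.
class PointerView {
 public:
  PointerView(double* data, std::size_t size) : data_(data), size_(size) {}

  std::size_t size() const { return size_; }

  double& operator[](std::size_t i) {
    assert(i < size_);
    return data_[i];
  }
  const double& operator[](std::size_t i) const {
    assert(i < size_);
    return data_[i];
  }

  // Multiply every element in place by `factor`.
  void scale(double factor) {
    for (std::size_t i = 0; i < size_; ++i) data_[i] *= factor;
  }

 private:
  double* data_;      // not owned
  std::size_t size_;  // number of elements
};
```

Copying a `PointerView` copies only the pointer and the size, so both copies alias the same memory: writing through one is visible through the other, and neither keeps the buffer alive. That is pointer semantics, not the value semantics Klaus describes for CustomVector.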

  6. Klaus Iglberger

    Hi Nils!

    Thanks for your reply. I'm glad that the regression is much smaller, but surprised that it is not entirely gone (i.e. 0% overhead). I would like to understand this in more detail. Therefore I would like to ask you a favor (since I don't have your application): Please update CustomVector in two steps to give me a better understanding of the problem in your case:

    Step 1: Please comment out or remove (without ill effect) all lines containing mv_ in CustomVector (that should be 30 lines). My expectation is that this should not change the regression. If it does, I would be interested in the compiler you are using.

    Step 2: Please comment out or remove the if conditions in the constructors on line 787 (for an unpadded CustomVector) and line 3216 (for a padded CustomVector). I expect these checks to cause the overhead.

    Also, I would be very much interested to learn how frequently you create a CustomVector and which operations you use it for after it has been created. I hope I'm not asking too much. Thanks a lot,

    Best regards,

    Klaus!

  7. Nils Deppe reporter

    Hi Klaus,

    I just checked both of your suggestions and I needed to remove the shared_ptr from CustomVector to get zero overhead. I'm not sure why this is an issue, possibly class size, or some other aspect that allows the compiler to more aggressively optimize a class that contains only two fundamental types. Let me know if there's anything else you'd like me to check!

    We are creating this fairly frequently, several times per time step. We use a large contiguous vector to hold data for all our evolved variables and then point into that vector. Creating temporaries, etc. means this happens at least once per time step, depending on the equations we're solving.

    Cheers,

    Nils
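
The usage pattern described in this comment (one large contiguous vector, with per-variable views pointing into it) might look roughly like the following. `VariableView` and `make_variable_views` are hypothetical names, and the equal-sized layout is an assumption made for the example; the actual application code is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view: a raw pointer into shared storage plus a length.
struct VariableView {
  double* data;
  std::size_t size;
};

// Split one contiguous buffer into `num_vars` equally sized, non-overlapping
// views. Assumes storage.size() is a multiple of num_vars.
std::vector<VariableView> make_variable_views(std::vector<double>& storage,
                                              std::size_t num_vars) {
  assert(num_vars > 0 && storage.size() % num_vars == 0);
  const std::size_t points = storage.size() / num_vars;
  std::vector<VariableView> views;
  views.reserve(num_vars);
  for (std::size_t v = 0; v < num_vars; ++v) {
    views.push_back(VariableView{storage.data() + v * points, points});
  }
  return views;
}
```

Creating such views is just pointer arithmetic, which is why any per-construction cost in the view type shows up when views are rebuilt every time step. The views become dangling if `storage` is resized or destroyed.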

  8. Klaus Iglberger

    Hi Nils!

    Thanks for your results. Please allow me two more questions: Which compiler do you usually use? And which template parameters do you pass to CustomVector (ptr1, ptr2, ptr3, or ptr4)?

    blaze::CustomVector<T,aligned,padded> ptr1;      // Creating an aligned, padded custom vector
    blaze::CustomVector<T,aligned,unpadded> ptr2;    // Creating an aligned, unpadded custom vector
    blaze::CustomVector<T,unaligned,padded> ptr3;    // Creating an unaligned, padded custom vector
    blaze::CustomVector<T,unaligned,unpadded> ptr4;  // Creating an unaligned, unpadded custom vector
    

    Best regards,

    Klaus!

  9. Nils Deppe reporter

    Hey Klaus,

    Ask as many as you need :) We're currently moving the code over to GitHub and open-sourcing it, so as soon as that's done it'll be easier for us to compare things.

    I'm using:

    blaze::CustomVector<double, blaze::unaligned, blaze::unpadded>
    

    and GCC 7.1.1 with the following flags: -std=c++14 -march=native -O3 -g -DNDEBUG (and a bunch of warning flags)

    Cheers,

    Nils

  10. Klaus Iglberger

    Hi Nils!

    We have run multiple test scenarios and can confirm your findings. Despite commit 8ebba75, which seems to have fixed all initialization problems, we have found that the size of CustomVector can indeed influence the performance. Also, this does not seem to be a compiler-specific problem, since both gcc and clang show it very clearly.

    This is to let you know that we take the problem seriously and work on a solution. We are still deciding what will be the best course of action to solve the issue.

    Best regards,

    Klaus!
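
To make the size observation concrete, the sketch below compares a pointer-plus-size layout with one that additionally carries a std::shared_ptr. Both structs are hypothetical and neither reproduces the real CustomVector layout. The checks only show that the extra member makes the type larger and no longer trivially copyable; they do not prove which of the two properties causes the measured overhead.

```cpp
#include <cstddef>
#include <memory>
#include <type_traits>

// Hypothetical layout with only two fundamental members.
struct LeanView {
  double* data;
  std::size_t size;
};

// Hypothetical layout that additionally carries shared ownership.
struct SharedOwnerView {
  double* data;
  std::size_t size;
  std::shared_ptr<double> owner;
};

// The shared_ptr member makes the object larger ...
static_assert(sizeof(SharedOwnerView) > sizeof(LeanView),
              "shared ownership adds bytes to every view");

// ... and changes how it may be copied: LeanView can be copied bitwise,
// SharedOwnerView needs a non-trivial copy that updates the reference count.
static_assert(std::is_trivially_copyable<LeanView>::value,
              "pointer + size is trivially copyable");
static_assert(!std::is_trivially_copyable<SharedOwnerView>::value,
              "a shared_ptr member prevents trivial copies");

// Extra bytes carried by each view because of the ownership member.
constexpr std::size_t extra_bytes() {
  return sizeof(SharedOwnerView) - sizeof(LeanView);
}
```

The exact value of `extra_bytes()` depends on the platform and standard library, so it is best printed on the machine in question rather than assumed.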

  11. Nils Deppe reporter

    Hi Klaus!

    I'm glad to hear you are able to reproduce what I'm finding. I'm interested in seeing what you come up with as a solution! I have observed it with Clang too.

    Best,

    Nils

  12. Klaus Iglberger

    The performance problems with CustomVector have been resolved. With commit 32cd60f CustomVector has been stripped of its shared ownership capabilities. For now it is no longer possible to transfer the responsibility for some memory to a Blaze vector, but the feature will be retrofitted in a different form in a later commit. The same refactoring will also be applied to CustomMatrix.

    The fix is immediately available via cloning the Blaze repository and will be officially released in Blaze 3.2.
