Generated views

Issue #133 new
john bachan created an issue

Having to write an iterator because none exists to traverse your data layout is a huge burden. How would one build a column view of a float[10][10] matrix? They could create a strided_ptr<T> pointer-like iterator that knows to jump by a certain number of bytes when doing pointer arithmetic, but that's a lot of code to write. Providing upcxx::strided_ptr would probably service a lot of needs, but I'd like to give users something even more general: a view that works with an explicit length and a generating function instead of an iterator pair.

float matrix[10][10];

for(int col=0; col < 10; col++) {
  upcxx::rpc(rank, 
    [](view<float> column) {
      for(int row=0; row < 10; row++)
        consume(column[row]);
    },
    upcxx::make_view(
      /*length=*/10,
      [&](upcxx::view_emitter<float> emit) {
        for(int row=0; row < 10; row++)
          emit(matrix[row][col]);
      }
    )
  );
}

For trivial types T, we should be able to make the above code work and perform as expected. Performance gets trickier with non-trivial T because serialization requires that a type knows how to compute an upper bound on the buffer space it will need on the wire before it is told to serialize. For instance, serializing a std::vector<std::string> is two-pass: one pass to sum up the needed buffer space, a second to fill in the buffer (this is expected to perform better than a single pass over a growable buffer). Iterated views also do this two-pass, and since the iterator is Forward we know that's safe. Accordingly, generated views would also need to be two-pass, meaning that the user's lambda would be invoked twice for different purposes.

For maximum performance we would want the lambda to be specialized (in terms of code-gen) for each context: the first where it's summing buffer upper bounds of each T, the second where it's recursively invoking serialization of each T into the buffer. Without specialization, we would have to hide a flag in view_emitter<T> that records whether it's summing or serializing. That might perform fine, since the compiler could conceivably hoist the void emit(T x) { if(summing_not_serializing) {...} else {...} } condition out of the user's emit-loop, thus building our specialized contexts for us, but that involves trusting the compiler. To guarantee specialization we would like the user lambda to be templated on the emitter type. C++14 allows templated lambdas via the auto keyword on the arguments. C++11 users are just SOL: they would either have to write their own functor class with a templated operator(), or fall back to a non-template-specialized lambda and hope the compiler's inlining and hoisting is aggressive enough.
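The two-pass pattern described above can be sketched for std::vector<std::string>. This is illustrative only, not the actual UPC++ serializer, and the wire format (64-bit length prefixes) is an assumption:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Pass 1: sum an upper bound on the wire size without writing anything.
std::size_t ubound(const std::vector<std::string>& v) {
  std::size_t n = sizeof(std::uint64_t);            // element count
  for (const auto& s : v)
    n += sizeof(std::uint64_t) + s.size();          // length prefix + bytes
  return n;
}

// Pass 2: fill a buffer allocated once at the upper bound.
std::vector<char> serialize(const std::vector<std::string>& v) {
  std::vector<char> buf(ubound(v));                 // single allocation
  char* p = buf.data();
  std::uint64_t n = v.size();
  std::memcpy(p, &n, sizeof n); p += sizeof n;
  for (const auto& s : v) {
    std::uint64_t len = s.size();
    std::memcpy(p, &len, sizeof len); p += sizeof len;
    std::memcpy(p, s.data(), len);    p += len;
  }
  return buf;
}
```

The key property is that both passes traverse the same sequence, which is why a generated view's lambda would be invoked once per pass.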

Proposed view_emitter<T>:

////////////////////////////////////////////////////////////////////////////////////
// in upcxx
struct view_emitter_unknown;
struct view_emitter_ubound;
struct view_emitter_serialize;

template<typename T, typename Mode = view_emitter_unknown>
class view_emitter;

template<typename T>
class view_emitter<T, view_emitter_unknown> {
  bool ubound_not_serialize_;
public:
  void operator()(T const &x) {
    if(ubound_not_serialize_) {...}
    else {...}
  }
};

template<typename T>
class view_emitter<T, view_emitter_ubound> {
public:
  void operator()(T const &x) {...}
};

template<typename T>
class view_emitter<T, view_emitter_serialize> {
public:
  void operator()(T const &x) {...}
};

////////////////////////////////////////////////////////////////////////////////////
// user

// C++14 user
upcxx::rpc(target,
  [](upcxx::view<T> foo) {...},
  upcxx::make_view(
    length,
    [](auto emit) { // emit deduced as upcxx::view_emitter<T, Mode>
      for(...) emit(...);
    }
  )
);

// C++11 user
upcxx::rpc(target,
  [](upcxx::view<T> foo) {...},
  upcxx::make_view(
    length,
    [](upcxx::view_emitter<T> emit) {
      for(...) emit(...);
    }
  )
);

// User who wants the best without knowing their C++ version,
// we have #define'd UPCXX_VIEW_EMITTER to use auto or not
// depending on the c++ version detected.
upcxx::rpc(target,
  [](upcxx::view<T> foo) {...},
  upcxx::make_view(
    length,
    [](UPCXX_VIEW_EMITTER(T) emit) {
      for(...) emit(...);
    }
  )
);
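A minimal self-contained sketch of how the proposed UPCXX_VIEW_EMITTER macro could be defined. The upcxx::view_emitter here is a stub standing in for the type proposed above; both it and the macro are hypothetical:

```cpp
#include <cassert>

// Stub standing in for the proposed upcxx::view_emitter; illustrative only.
namespace upcxx {
  template<typename T>
  struct view_emitter {
    int calls = 0;
    void operator()(T const&) { ++calls; }
  };
}

// Select the generic-lambda form when the compiler supports C++14,
// otherwise fall back to the concrete (runtime-flag) emitter type.
#if __cplusplus >= 201402L
  #define UPCXX_VIEW_EMITTER(T) auto
#else
  #define UPCXX_VIEW_EMITTER(T) upcxx::view_emitter<T>
#endif
```

Under C++14 the macro expands to auto, making the lambda generic so it can be instantiated separately for the ubound and serialize emitter types; under C++11 it expands to the single runtime-flag emitter.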

Comments (7)

  1. Dan Bonachea

    I understand your use case, but not the details of the proposed solution.

    What is the user contract for this new overload?

    make_view(size_t count, [](view_emitter<T> emit))
    

    Based on your discussion it sounds like the runtime would call the lambda one or more times with a polymorphic function object emit, and that upon each call the lambda is required to call emit.operator() exactly count times passing a T reference to each?

    This seems like an awkward interface from the user perspective. Why not something simpler like:

    make_view(size_t count, []() -> T&)
    

    Where the runtime calls the lambda exactly count times and each returns a reference to the appropriate T in the sequence? Ie for your column example, something like:

        int row = 0, col = 0;
        upcxx::make_view(
          /*length=*/10,
          [&]() -> float& { return matrix[row++][col]; }
        );
    

    This seems both easier to understand and less error-prone, although it does not support multi-pass - so types with non-trivial serialization may need to use a dynamically grown buffer.

    I don't necessarily buy the argument that a two-pass algorithm is superior to a dynamically grown serialization buffer - especially for types where the serialization logic for T might involve lots of cache misses and/or computation more expensive than a data copy. I definitely don't like the idea of adopting a more complicated user interface motivated solely by the (theoretical) performance benefit to non-trivial types, which should be the uncommon case for most HPC codes anyhow.

    If we really think multi-pass is crucial, then here's another alternative:

        int col = 0;
        upcxx::make_view(
          /*length=*/10,
          [&](int row) -> float& { return matrix[row][col]; }
        );
    

    where the lambda argument is the 0-based index in the sequence.
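    As a sketch, such an index-based generator could be driven by the runtime like this (hypothetical driver, names illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical driver illustrating why the index-based form supports
// multi-pass: because gen(i) depends only on i, the runtime may replay
// any pass (here, simply materializing the sequence) as often as needed.
template<typename T, typename Gen>
std::vector<T> materialize(std::size_t count, Gen gen) {
  std::vector<T> out;
  out.reserve(count);
  for (std::size_t i = 0; i < count; ++i)
    out.push_back(gen(i));
  return out;
}
```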

  2. john bachan reporter

    Your alternatives only seem more attractive because the running example has an elegant mapping to/from the 1D index space. Things break down when we look at sending an xz plane of a 3D array.

    float cube[10][10][10];
    int y = ...;
    
    // Dan #1
    int x = 0;
    int z = 0;
    make_view(100,
      // mutable, but with x,z captured by value, allows multi-pass.
      [&,x,z]() mutable -> float { 
        float ans = cube[z][y][x];
        x++;
        if(x == 10) { x = 0; z++; }
        return ans;
      }
    )
    
    // Dan #2
    make_view(100,
      [&](int i) -> float {
        int x = i % 10;
        int z = i / 10;
        return cube[z][y][x];
      }
    )
    

    Neither of these is as intuitive or as performant as the loop nest mine would use:

    make_view(100,
      [&](view_emitter<float> emit) {
        for(int z=0; z < 10; z++)
          for(int x=0; x < 10; x++)
            emit(cube[z][y][x]);
      }
    )
    

    I'm open to dropping the multi-pass implementation, but I don't think that would lift any constraints from the user, since this lambda goes into serialization logic, which must be side-effect free.

    I agree that the dynamic buffer guards better against worst-case performance when the data has poor cache locality. It's also silly and potentially harmful to ask for the length when using a dynamic buffer, because that could cause the user to pointlessly traverse their data structure to get a population count that we then use to no benefit. How about we split the API into a trivial-only generated make_view and a dynamic generated make_view:

    struct view_emitter_bounded_tag;
    struct view_emitter_dynamic_tag;
    
    template<typename T, typename Tag>
    class view_emitter;
    
    template<typename T>
    using view_emitter_bounded = view_emitter<T, view_emitter_bounded_tag>;
    template<typename T>
    using view_emitter_dynamic = view_emitter<T, view_emitter_dynamic_tag>;
    
    // 3 popular use-cases, not necessarily function overloads. Case 1 and 3
    // would probably be the same overload.
    
    // 1. User knows type is trivial and the length was easy to compute.
    // static_assert's that std::is_trivial<T>::value.
    make_view(size_t n, [](view_emitter_bounded<T> emit)->void);
    
    // 2. User knows nothing helpful.
    make_view([](view_emitter_dynamic<T> emit)->void);
    
    // 3. Generic code that doesn't know if T is trivial but does have easy
    // access to `n`. The lambda should be type-parametric in the `emit`
    // argument.
    // `auto` would be one of: bounded_tag | dynamic_tag | generic_tag
    make_view(size_t n, [](view_emitter<T,auto> emit)->void);
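    For case 2, a minimal sketch of what the dynamic emitter could look like internally, assuming trivially copyable T and a growable byte buffer (illustrative only, not the proposed implementation):

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Illustrative sketch: a dynamic emitter appends each element to a
// growable byte buffer, so no element count needs to be supplied up front.
template<typename T>
class dyn_emitter {
  static_assert(std::is_trivially_copyable<T>::value,
                "this sketch handles trivially copyable T only");
  std::vector<char> buf_;
public:
  void operator()(T const& x) {   // emit(x)
    const char* p = reinterpret_cast<const char*>(&x);
    buf_.insert(buf_.end(), p, p + sizeof(T));
  }
  std::size_t bytes() const { return buf_.size(); }
};
```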
    
  3. Amir Kamil

    Do we have a motivating use case for this? It seems to me that VIS would be sufficient for most applications that communicate strided data. Even if an application would need to accumulate, it's unclear to me that an RPC over a view would be a win over allocating a landing zone and doing a VIS with RPC completion.

    I would like to see a compelling reason to provide this interface before we spend the effort to spec and implement it.

  4. john bachan reporter

    @akamil a (possibly poor) use case to consider is sending ghost zones in AMR. With VIS, you need a separate rput_strided per face. With an rpc you can deliver multiple faces in one message. It's then possible that rpc could be faster since it sends fewer messages. I called this a poor use case because AMR likes to have big boxes, resulting in few faces, meaning that the VIS message count wouldn't be very high in the first place. But I think this is still enough to motivate development of the feature.

  5. Amir Kamil

    I do not think that this case is sufficiently motivating. I also am in general opposed to adding new features without a high likelihood that someone will use them. That puts us on the hook for developing, optimizing, and supporting them, and I don't think we should take that on without a cost/benefit analysis.

  6. Scott Baden

    I agree with Amir and with Dan. We already have a feature (VIS) that is designed to do the job. For the scenarios we are aware of, VIS should more than meet our needs. Dan and Paul put considerable effort into the GASNet optimizer and we very much need to qualify VIS performance, to understand the benefits and any limitations, or any need for performance tuning.

  7. Dan Bonachea

    This issue was triaged at the 2018-06-13 Pagoda meeting and assigned a new milestone/priority.

    We have some initial experience with user implementation of iterators and need some dedicated meeting time to discuss the topic. Currently we lack the resources to commit to such a feature.
