Most of the Fault-Free logic, a small part of the Fault Tolerant logic, and a good part of the mechanical part for the ERA.
Allow for the new agreement prototype: the value is now a couple of 32-bit values, bitwise ANDed, plus a return value in the OMPI_ERR realm.
Fault Free logic + some debugging.
Silence an extremely annoying warning.
Create component initialization/finalization for persistent consensus information when using ERA.
Remove ft_data field from communicator, fix Failure-Free case for ERA, and implement a couple of topologies (FF case). FIXME: still using wrong approach to find communicator-specific & module-specific data from the contextid.
Remove ft_data field from communicator.
Keep the module information in a separate database as it is not necessarily inside coll.coll_agreement_module.
Rename epoch to c_epoch, remove warning and update comments
Fix module localization issue: module is now stored in hash table. Fix some collision bugs. WIP.
Fix a few minor issues.
Add the benchmark for agree (temporarily)
Working around buggy / error-prone hash table implementation. See comment in code.
Huge simplification of the data structures: now only 3 hash tables: one for the past agreements, one for the ongoing ones, and one for the EPOCH management. Nothing more. ERA-agreement-specific data is stored in the hash table, not in the ft_module.
Don't free MPI_COMM_WORLD: it doesn't like it.
Add callback per failure and communicator in comm_ft.c
Allow new behavior in the agreement test program.
Add a list of globally known process failures to reduce agreements message overheads in case of ERA.
Rewrite of the ERA to fit the new spec of ERR_PROC_FAILED in agreement. WIP.
ERA seems to work, as long as failures happen before entering the agreement. WIP.
Add more detailed comments; make the binary tree work; automatically update the list of agreed-upon failed ranks; fix a bug in list iterator usage
Implement missing cases, change the way we handle message sends, fix multiple bugs; starts to work better on the simple test.
Captured corner case: simultaneous failures triggering duplicate UP messages. See comment in code.
First working version: passes simpletest on 1024 processes, 16 nodes.
Makes simpleexample less simple. Allow for failures during the agreement. This created race conditions, and changes in the ERA code to handle these races. Fixed many small issues too.
Correctly handle the return from the MPI level functions.
I'm not really sure what the meaning of this code was, but things work
Add the benchmark for the agreement. Work in progress.
Be a little paranoid on asserts when looping on agreements: either I should discover new deaths, or I should return success.
If NPROCS == -1 (construction of a normal communicator), one needs to allocate the agreed_failed_rank group.
Adapt the test: use the debugging functionality of ERA; DUP can fail because of previous errors that let the previous agreement succeed, so all processes must agree after a dup (and shrink if needed).
Fix the way new_dead messages are merged; Use the agreed_failed_ranks group to provide a consistent output to the caller; add checks in message treatment; add assertion on communicators status
We return objects that have been retained in the ERA, so don't assume you can free the group: you must release the object.
Removed an assert that was too strong: when multiple failures happen, they can make the agreement return FAILED without discovering globally a new failure.
Fix bug in list of dead ranks: they must be fully ordered to have linear merge. Add asserts to check that they are fully ordered. Print only merge information when it is not trivial.
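The linear merge mentioned above can be sketched as follows. This is only an illustration, not the ERA code: `ranks_strictly_sorted` and `merge_dead_ranks` are hypothetical names, and the sketch assumes each list is a plain array of strictly increasing ranks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if ranks[0..n) is strictly increasing. */
static int ranks_strictly_sorted(const int *ranks, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (ranks[i - 1] >= ranks[i]) {
            return 0;
        }
    }
    return 1;
}

/* Hypothetical helper: merge two strictly increasing rank lists into out
 * (capacity >= na + nb), dropping duplicates. Runs in O(na + nb), which is
 * only valid because both inputs are fully ordered.
 * Returns the number of ranks written to out. */
static size_t merge_dead_ranks(const int *a, size_t na,
                               const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, k = 0;

    assert(ranks_strictly_sorted(a, na));
    assert(ranks_strictly_sorted(b, nb));

    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            out[k++] = a[i++];
        } else if (b[j] < a[i]) {
            out[k++] = b[j++];
        } else {            /* same rank in both lists: keep one copy */
            out[k++] = a[i++];
            j++;
        }
    }
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
    return k;
}
```

If either input is not fully ordered, the asserts fire instead of silently producing a list with missing or duplicated ranks.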
Notifications about processes that have been dead for a long time can happen and should not create an error. TODO: handle opposite race condition.
Have the communicator compute the correct global epoch.
Remove the local epoch hash table from the ERA: the correct logic is now in comm_cid.c. Add high-level debug messages and reorganize verbose printing.
Add two test cases for ERA: simpleagree and agree4ever (not in the makefile, to be removed once debugging is done).
agree4ever stress test
Increase the number of bits that are common in flag to check correctness of agreement.
Remove warning and tune the debug level of some messages to get a quick overview at low ompi_ftmpi_verbose values
Allow multiple failures in agree benchmark
Committed a little early
Fix issue #5 for Binary tree topology. Star and String topologies were already oblivious to the issue. Use a rank translation array to reason on the 'alive' group (c_local_group \ agreed_failed_rank) when computing the parent / children relationship.
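A rough sketch of the rank translation idea, under the assumption that the agreed-upon failed ranks are available as a sorted array. The names `build_alive_ranks`, `tree_parent` and `tree_child` are hypothetical and do not come from the ERA code.

```c
#include <assert.h>

/* Hypothetical helper: fill alive[] with the ranks of a communicator of
 * size comm_size that are NOT in the sorted failed[] list. alive[i] then
 * translates position i in the 'alive' group back to a communicator rank.
 * Returns the number of alive ranks. */
static int build_alive_ranks(int comm_size, const int *failed, int nfailed,
                             int *alive)
{
    int f = 0, n = 0;
    for (int r = 0; r < comm_size; r++) {
        if (f < nfailed && failed[f] == r) {
            f++;            /* rank r is agreed dead: skip it */
            continue;
        }
        alive[n++] = r;
    }
    assert(f == nfailed);   /* failed[] must be sorted and in range */
    return n;
}

/* Parent of alive position i in a binary tree rooted at position 0,
 * expressed as a communicator rank; -1 for the root. */
static int tree_parent(const int *alive, int i)
{
    return (i == 0) ? -1 : alive[(i - 1) / 2];
}

/* Child number which (0 or 1) of alive position i, as a communicator
 * rank; -1 if that child does not exist. */
static int tree_child(const int *alive, int nalive, int i, int which)
{
    int c = 2 * i + 1 + which;
    return (c < nalive) ? alive[c] : -1;
}
```

Because the tree is computed on positions in the alive group, a dead rank can never appear as a parent or a child.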
Moving forward to an OP-specific operation in the agreement, to get improved shrink performance and reliability.
Implement Long Messages communication through fragmented comm (assuming FIFO ordering of fragments). This is finally needed to handle variable reduction operations, and to remove inefficiencies w.r.t. the handling of rank ids. More memory pressure, but messages will be much more efficient.
-F 1 introduces a failure, not 0
Replace the logic to decide the return code by an inline solution. Remember the current acknowledged array of ranks, but don't remember the participation of each child. This is made possible by the fact that we receive all the acknowledge arrays in a single message.
Implement feature in issue #3: message of variable size, no constant to define the number of acknowledged, newly_discovered dead, and no constant to limit the size of the agreed upon value.
Can still be improved: there are duplications in the acknowledged / newly_agreed failures list, and the way the message is packetized requires doubling the memory allocation (temporarily).
Allow for BTL to not be FIFO.
Remove message allocation and free when sending a message: instead, use iovecs to create the fragments.
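The idea of describing fragments with iovecs instead of copying them can be sketched as below. This is an assumption-laden illustration: it uses the POSIX `struct iovec` rather than Open MPI's internal descriptors, and `frag_hdr_t` and `build_fragment_iov` are hypothetical names, not the ERA wire format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical fragment header, for illustration only. */
typedef struct {
    uint32_t frag_index;   /* position of this fragment */
    uint32_t frag_count;   /* total number of fragments */
    uint32_t frag_len;     /* payload bytes in this fragment */
} frag_hdr_t;

/* Describe fragment idx of payload as two iovec entries: the header, then
 * a slice of the original buffer. The payload is never copied or
 * reallocated. Returns the number of iovec entries filled (2), or 0 if
 * idx is out of range. */
static int build_fragment_iov(const void *payload, size_t payload_len,
                              size_t max_frag, uint32_t idx,
                              frag_hdr_t *hdr, struct iovec iov[2])
{
    assert(max_frag > 0);
    size_t count = (payload_len + max_frag - 1) / max_frag;
    if (idx >= count) {
        return 0;
    }
    size_t offset = (size_t)idx * max_frag;
    size_t len = payload_len - offset;
    if (len > max_frag) {
        len = max_frag;
    }

    hdr->frag_index = idx;
    hdr->frag_count = (uint32_t)count;
    hdr->frag_len   = (uint32_t)len;

    iov[0].iov_base = hdr;
    iov[0].iov_len  = sizeof(*hdr);
    /* iov_base is a non-const pointer; the send path only reads from it. */
    iov[1].iov_base = (char *)payload + offset;
    iov[1].iov_len  = len;
    return 2;
}
```

Carrying the fragment index in each header is also what lets a receiver reassemble fragments that arrive out of order.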
Partial fix for #9. Push the ompi_op_t / ompi_datatype_t interface higher to the ompi_coll_agreement level, to allow for a better-performing shrink. Ported this interface to ETA and ERA. Ancillary agreements will be hard to port to this interface; we need to see if we continue supporting them.
Allow ETA (and potentially other agreements) to use the same way to inject failures at critical spots for testing.
Fix an issue if sizeof(msg_t) % 8 != 0, and introduce garbage collection mechanisms in ERA.
Garbage Collection: define functions to synchronize at free and destroy all the remaining agreement information.
Garbage Collection Performance fix.
Remove restriction on the operation that can be applied in the allreduce used in nextcid
Fix the NextCID in the newop strategy: the lists may not be ordered after the reduction.
Change the Shrink method: use the agreement for the allreduce, remove unnecessary steps. Remove abusive assert in ERA.
Be less verbose.
no need to include mpiext.h from the ft directory anymore, removing from the examples
Replace the agreed_dead_ranks group with a sorted rank array in order to remove the O(n^2) computation per agreement.
An explicit representation of the agreement tree is now created (linear time and memory, where the previous algorithm used linear memory but comm_size^2 computation to create the translation array), and maintained during the agreement to adapt to failures.
Remove abusive assert: because of the garbage collection mechanism for old agreements, when receiving this DOWN message, it is possible that the corresponding agreement has already been forgotten about.
A faster group_intersection. It might allocate some extra temporary memory, but it only parses the groups once.
Use the right "opal_enable_debug" #if
Missing a param
Factor the tree building algorithm, to do it only when starting a new agreement after a new failure has been discovered.
Second part of the code init simplification: no more group intersection, copy on write for the tree
Topology Awareness. This code should be compatible with any kind of tree. It allocates temporary memory in O( nb_nodes ) to store a hash table and a temporary tree, builds the tree of representatives following the currently defined tree, then glues the non-representatives back to their representatives and frees the temporary memory.
Fix some bugs introduced with the factorization of tree construction; Add an MCA-parameter to dynamically select the tree. Some rare bugs still present.
Correct benchagree to reduce measurement noise
Missing i in the STABILIZE printf of benchagree
Previous commit incorrect.
Add the dead processes detected during the agreement to the list of detected failed processes so that the next round of ACK will catch them
Be more verbose when failing.
Merge, avoid looping over the get_value_uint64, and re-enable hierarchy in local trees, when using locality heuristics for tree shapes.
Adding optional rebuild of the agree tree
Do not rebuild the tree when the AFR is dirty but not the tree
Adding a multi-failure injection strategy
A bug when the root dies.
Special case when revoke msg is received during COMM_NEXT_CID
Integrate the simpleagree benchmark in the examples
Start removing unsupported/broken code
While we're at it
Add the agree4ever test in the compiled tests
Implement best CID allocation strategy in the ULFM CID allocator, to enable faster shrink time
Cleanup, and prepare interface for iagree
cleanup and prepare interface for iagree
Split agree call in three steps: prepare, wait, complete, for integration of iagree
Iagree for ERA. This should be it... Please review :)
Import a few changes from 1.6.5 into ULFM.
Most of them are related to IB support.