CartToJnt sometimes returns solutions outside of given bounds (caused by shared references and thread safety issues)

Issue

We have a unit test in our manipulation pipeline that verifies IK solutions match FK models to within expected tolerances. It was occasionally failing, and I determined that joint states returned from TRAC_IK::CartToJnt did not produce the given target within the bounds that were passed.
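For context, the kind of tolerance check described above can be sketched roughly as follows. This is not our actual test code: `Twist6`, `withinBounds` and the `eps` slack are hypothetical names and values chosen for illustration, and the pose error is assumed to be given as six components (three translational, three rotational) compared against symmetric per-component bounds.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical 6-component pose error: translation (x, y, z), then rotation (rx, ry, rz).
using Twist6 = std::array<double, 6>;

// Returns true when every component of `error` lies within the matching
// symmetric bound. `eps` is an assumed small slack added to each bound.
bool withinBounds(const Twist6& error, const Twist6& bounds, double eps = 1e-5) {
    for (std::size_t i = 0; i < error.size(); ++i) {
        assert(bounds[i] >= 0.0);  // bounds are expected to be non-negative
        if (std::abs(error[i]) > bounds[i] + eps) {
            return false;
        }
    }
    return true;
}
```

In a test along these lines, `error` would be the difference between the target pose and the FK result for the joint state returned by the IK call.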

Cause

After much digging I discovered that nl_solver and iksolver each have their own FK solvers, but those solvers are initialized to share the same KDL::Chain object. KDL::Chain has a non-thread-safe internal cache, which caused the two solvers to occasionally return results that mix joint values from each other. In theory, simultaneous reads and writes could return wildly different or even invalid values; in practice (perhaps in part because the solvers are evaluating similar solutions) the results are relatively close to the expected values. This also only affects some iterations of the search. In our usage, the final result from CartToJnt was typically only slightly outside of the given bounds, which made it look like a rounding issue at first glance, but in theory the final error could be arbitrarily large.

It looks like this issue was anticipated in the NLOPT_IK and ChainIkSolverPos_TL classes, which each create their own copies of the KDL::Chain and FK solvers. Unfortunately, the original _chain argument is passed to their FK solvers (not the member that is copied from it), so they end up storing a reference to the same object. (In the current implementation that object is TRAC_IK::chain. If a temporary were passed to either constructor, it could be destroyed while the FK solvers continued to refer to it, leaving them with a dangling reference.)

This might have worked originally and been broken when KDL::ChainFkSolverPos_recursive was changed to use a reference to the chain instead of a copy (5 years ago, but after relevant code in trac_ik was written): https://github.com/orocos/orocos_kinematics_dynamics/commit/2a903664c01141b8c84e54c8306f98830d9b878b

To summarize, nl_solver.fksolver.chain, iksolver.fksolver.chain and iksolver.vik_solver.chain all refer to the same object, which is read and written concurrently from multiple threads, causing the solvers to return junk results.
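The aliasing described above can be reproduced with stand-in types. The sketch below does not use KDL: `Chain`, `FkSolver`, `SharedIk` and `OwnedIk` are hypothetical simplifications, assuming only that the FK solver stores a reference to the chain it is given and that the owning class keeps its own copy of the chain as a member.

```cpp
#include <cassert>

// Hypothetical stand-in for KDL::Chain (not the real type).
struct Chain {
    int nr_of_joints = 0;
};

// Mirrors a solver that keeps a reference to the chain instead of a copy.
struct FkSolver {
    explicit FkSolver(const Chain& c) : chain(c) {}
    const Chain& chain;
};

// Problem pattern: a member copy exists, but the solver is bound to the argument.
struct SharedIk {
    explicit SharedIk(const Chain& _chain) : chain(_chain), fksolver(_chain) {}
    Chain chain;        // declared before fksolver
    FkSolver fksolver;
};

// Intended pattern: the solver is bound to this object's own copy.
struct OwnedIk {
    explicit OwnedIk(const Chain& _chain) : chain(_chain), fksolver(this->chain) {}
    Chain chain;        // declared before fksolver
    FkSolver fksolver;
};

// True when the solver inside `ik` refers to `ik`'s own chain copy.
template <typename Ik>
bool usesOwnChain(const Ik& ik) {
    return &ik.fksolver.chain == &ik.chain;
}

// Checks which chain object each solver ends up referring to.
void checkAliasing() {
    Chain original;
    original.nr_of_joints = 6;

    SharedIk a(original), b(original);
    assert(&a.fksolver.chain == &original);          // bound to the caller's object
    assert(&a.fksolver.chain == &b.fksolver.chain);  // both solvers share one chain

    OwnedIk c(original), d(original);
    assert(usesOwnChain(c) && usesOwnChain(d));      // each bound to its own copy
    assert(&c.fksolver.chain != &d.fksolver.chain);
}
```

With the first pattern, every instance built from the same argument ends up referring to that one object; with the second, each instance has an independent chain. In this sketch the implicitly generated copy constructor of `OwnedIk` would rebind the reference to the source object's chain, so copying such an object would need the same care.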

Suggested solution

Pass the member chain when initializing the internal FK solvers instead of the original _chain argument. To make the intent clearer, it could be useful to write this->chain instead of just chain, though both should be equivalent in this case. I’ve attached a patch file that makes the following changes:

In trac_ik_lib/src/nlopt_ik.cpp change constructor to start with:

NLOPT_IK::NLOPT_IK(const KDL::Chain& _chain, const KDL::JntArray& _q_min, const KDL::JntArray& _q_max, double _maxtime, double _eps, OptType _type):
  chain(_chain), fksolver(this->chain), maxtime(_maxtime), eps(std::abs(_eps)), TYPE(_type)

In trac_ik_lib/src/kdl_tl.cpp change constructor to start with:

ChainIkSolverPos_TL::ChainIkSolverPos_TL(const Chain& _chain, const JntArray& _q_min, const JntArray& _q_max, double _maxtime, double _eps, bool _random_restart, bool _try_jl_wrap):
  chain(_chain), q_min(_q_min), q_max(_q_max), vik_solver(this->chain), fksolver(this->chain), delta_q(this->chain.getNrOfJoints()),

(The argument to delta_q doesn’t necessarily need to be changed but seems nice from a consistency perspective.)

Note: Non-static data members are initialized in the order they are declared in the class definition, not the order in which they appear in the member initializer list of the constructor (though you may get a compiler warning if the orders don’t match). This means that to safely refer to another member (i.e. this->chain) inside the member initializer list, that member needs to be declared first in the class definition (e.g. in the header file); the order in the constructor definition isn’t meaningful. This is not a problem for the suggested patch, but could cause sneaky issues if either of these solver classes were refactored.
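The initialization-order rule in the note can be checked with a small sketch. `Tracer`, `Widget` and `constructionOrder` are hypothetical names for illustration only. The initializer list is deliberately written in the "wrong" order, so the GCC/Clang reorder warning is silenced locally.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records its name in a log when constructed.
struct Tracer {
    Tracer(std::vector<std::string>& log, const std::string& name) { log.push_back(name); }
};

#if defined(__GNUC__)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wreorder"  // the mismatch below is deliberate
#endif

// The initializer list names `second` before `first`, but `first` is
// declared first and is therefore constructed first.
struct Widget {
    explicit Widget(std::vector<std::string>& log)
        : second(log, "second"), first(log, "first") {}
    Tracer first;
    Tracer second;
};

#if defined(__GNUC__)
#pragma GCC diagnostic pop
#endif

// Returns the names of Widget's members in the order they were constructed.
std::vector<std::string> constructionOrder() {
    std::vector<std::string> log;
    Widget w(log);
    assert(log.size() == 2);  // one entry per member
    return log;
}
```

Here `constructionOrder()` returns `{"first", "second"}`, following the declaration order rather than the initializer list. Applied to the solver classes, this is why `chain` has to be declared before any solver member that is initialized from `this->chain`.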

Conclusion

Please let me know if there’s anything else I can do to help resolve this issue.
I don’t have a self-contained example for reproducing this issue, but could probably put one together if that would be helpful. I could also send you some of the debug code that I used to track down the issue.
I did debugging on Ubuntu 20.04 and ROS Noetic, but I believe this issue should affect a wide range of systems.
Thanks!
