CartToJnt sometimes returns solutions outside of given bounds (caused by shared references and thread safety issues)

Issue #71 resolved
Forrest Green created an issue

Issue

We have a unit test in our manipulation pipeline that verifies IK solutions match FK models to within expected tolerances. It was occasionally failing, and I determined that joint states returned from TRAC_IK::CartToJnt did not produce the given target within the bounds that were passed.

Cause

After much digging I discovered that nl_solver and iksolver each have their own FK solvers, but those solvers are initialized to share the same KDL::Chain object. KDL::Chain has a non-threadsafe internal cache that was causing the two solvers to occasionally return results mixing some joints values with each other. In theory simultaneous reads and writes could return wildly different or even invalid values, however it seems like in practice (perhaps in part because solvers are evaluating similar solutions) results are relatively close to the expected values. This is also only affects some iterations of the search. In our usage, the final result from CartToJnt was typically only slightly outside of the given bounds making it seem like a rounding issue at first glance, however in theory the final error could be arbitrarily large.

It looks like this issue was anticipated in the NLOPT_IK and ChainIkSolverPos_TL classes which each create their own copies of the KDL::Chain and FK solvers. Unfortunately, the original _chain argument is passed to their FK solvers (not the member that is copied from it) and they end up storing a reference to the same object. (That object points to TRAC_IK::chain in current implementation. If a temporary was passed to either constructor that object could get destroyed while FK solvers continued to refer to it which would be bad)

This might have worked originally and been broken when KDL::ChainFkSolverPos_recursive was changed to use a reference to the chain instead of a copy (5 years ago, but after relevant code in trac_ik was written): https://github.com/orocos/orocos_kinematics_dynamics/commit/2a903664c01141b8c84e54c8306f98830d9b878b

To summarize, nl_solver.fksolver.chain, iksolver.fksolver.chain and iksolver.vik_solver.chain, all refer to the same object which is being read and written concurrently from multiple threads causing solver to return junk results.

Suggested solution

Pass the member chain when initializing internal FK solvers instead of the original _chain argument. To make intent clearer, it could be useful to specify this->chain instead of just chain though both should be equivalent in this case. I’ve attached a patch file that makes the following changes:

In trac_ik_lib/src/nlopt_ik.cpp change constructor to start with:

NLOPT_IK::NLOPT_IK(const KDL::Chain& _chain, const KDL::JntArray& _q_min, const KDL::JntArray& _q_max, double _maxtime, double _eps, OptType _type):
  chain(_chain), fksolver(this->chain), maxtime(_maxtime), eps(std::abs(_eps)), TYPE(_type)

In trac_ik_lib/src/kdl_tl.cpp change constructor to start with:

ChainIkSolverPos_TL::ChainIkSolverPos_TL(const Chain& _chain, const JntArray& _q_min, const JntArray& _q_max, double _maxtime, double _eps, bool _random_restart, bool _try_jl_wrap):
  chain(_chain), q_min(_q_min), q_max(_q_max), vik_solver(this->chain), fksolver(this->chain), delta_q(this->chain.getNrOfJoints()),

(The argument to delta_q doesn’t necessarily need to be changed but seems nice from a consistency perspective.)

Note: The initialization order of non-static data members is based on the order in the class definition, not the order in the member initializer list as written in the constructor (though you may get a compiler warning if orders don’t match). This means that to safely refer to another member (i.e. this->chain) inside of the member initializer list that member needs to be defined first in, e.g., the header file and order in the constructor definition isn’t meaningful. This is not a problem for the suggested patch, but could cause a sneaky issues if either of these solver classes were refactored.

Conclusion

Please let me know if there’s anything else I can do to help resolve this issue.
I don’t have a self contained example for reproducing this issue, but could probably put one together if that would be helpful. I could also send you some of the debug code that I used to track down the issue.
I did debugging on Ubuntu 20.04 and ROS Noetic, but I believe this issue should affect a wide range of systems.
Thanks!

Comments (5)

  1. Forrest Green reporter

    I created a fork with a few branches that should help reproduce the issue.

    The feature/add-accuracy-tests branch adds a modified version of the ik_tests example that detects the issue. It’s mostly copy pasted from the existing ik_tests.cpp and should probably be cleaned up if you were going to add it to master (I believe this was related to Noetic compatibility, but I had to modify launch file to swap the xacro.py for just xacro. If running on Kinetic you might need to switch that back. I also needed to install pr2 description which I did via sudo apt-get install ros-noetic-pr2-common):
    https://bitbucket.org/fgreen-diligentrobots/trac_ik/src/2f6cf04666d5825c124d19339e07b6782e1be8d3/?at=feature%2Fadd-accuracy-tests

    I also pushed a possible fix to hotfix/chain-initialization. This branch should be ready to be pulled into master:
    https://bitbucket.org/fgreen-diligentrobots/trac_ik/src/f1c2eb16e5e896d4727b5dc6cd5eb00e84f85e5d/?at=hotfix%2Fchain-initialization

    To test the fix, I merged the hotfix and the new test into a single feature/accuracy-fix-demo branch:
    https://bitbucket.org/fgreen-diligentrobots/trac_ik/src/808e35e7b71cf1f68465c11c05454fb3ed2e7433/?at=feature%2Faccuracy-fix-demo

    Expected output of running roslaunch trac_ik_examples accuracy_test.launch fromfeature/add-accuracy-tests(completion % output got messed up because I changed type of num_samples to an int and didn’t add cast when computing percentage):

    [ INFO] [1621359680.530858387]: Using 7 joints
    [ INFO] [1621359680.533615463]: *** Testing TRAC-IK with 1000 random samples
    [ INFO] [1621359680.538908898]: 0% done
    [ INFO] [1621359680.886096924]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -0.000130515, 0.00167781, 0.00283975, -0.0097916, -0.0110916, 0.000168132
    [ INFO] [1621359680.992893219]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 0.00295127, -0.00473734, 0.00278524, 0.00783257, 0.00074753, -0.00702595
    [ INFO] [1621359681.083380094]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 0.000297583, -0.00029467, -0.000658763, 0.0014316, -9.71204e-05, 0.000690127
    [ INFO] [1621359681.108927155]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 0.0018182, -0.00159702, -0.000687726, 0.000611296, 0.00223248, -0.00356802
    [ INFO] [1621359681.225176178]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -0.00754522, -0.00724632, -0.00217109, 0.00412, -0.0014131, -0.000679593
    [ INFO] [1621359681.317072333]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -0.00413459, -0.00906372, -0.00849037, -0.00126561, 0.000846936, -0.000841033
    [ INFO] [1621359681.332475262]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -1.4397e-11, -9.19348e-11, -1.40304e-11, 2.21508e-06, -0.0346553, 0.128073
    [ INFO] [1621359681.426103774]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -9.31995e-09, -3.71199e-08, 4.79129e-09, 5.04884e-06, -0.0449854, 0.0434024
    [ INFO] [1621359681.573717948]: 0% done
    [ INFO] [1621359681.686052295]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 0.00464819, -0.00999963, -0.00483621, -0.00261382, -0.000937224, 0.000595205
    [ INFO] [1621359681.706406116]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 0.000329987, -5.13272e-05, -2.05355e-05, -0.000254142, -0.00153197, -0.000254105
    [ INFO] [1621359681.716655820]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -6.26712e-11, 3.83068e-11, -7.81169e-11, -3.40427e-06, -0.0801299, -0.019247
    [ INFO] [1621359681.737123704]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -2.14214e-10, 2.17404e-10, -3.89865e-10, -4.87086e-06, -0.141991, -0.075881
    [ INFO] [1621359681.780529134]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 0.00425847, -0.00190411, -0.00655892, 0.0111184, -0.0042301, 0.00954286
    [ INFO] [1621359681.791033957]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -1.94192e-06, 3.5922e-06, 3.34404e-06, 7.35456e-05, -0.0327803, -0.0528049
    [ INFO] [1621359681.826708669]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -3.89032e-07, -1.51428e-08, -9.9111e-07, -0.000381669, -0.339132, 0.0964276
    [ INFO] [1621359681.980768584]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 0.00348556, -0.00104808, -0.0017686, 0.00244078, -0.000291247, 0.00498266
    [ INFO] [1621359682.098871511]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 3.98976e-05, -0.000106575, -2.76724e-05, 0.00153456, -0.000212752, 0.00303999
    [ INFO] [1621359682.153377110]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -2.00787e-09, 6.36442e-09, -4.35762e-09, -2.09257e-05, -0.114785, -0.139061
    [ INFO] [1621359682.565921166]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 0.0028337, 0.00560319, 0.0102894, 0.00139961, -5.16799e-07, -9.06684e-08
    [ INFO] [1621359682.602643690]: 0% done
    [ INFO] [1621359682.726207177]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -2.49809e-10, 3.60848e-12, -7.35316e-11, 5.44306e-08, -0.0161748, -0.00431897
    [ INFO] [1621359683.252884940]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -6.99129e-07, 3.32627e-07, -9.06859e-07, -0.000272903, -0.249825, -0.0746328
    [ INFO] [1621359683.355239092]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -2.82546e-09, -3.23672e-09, 1.35708e-09, 1.07414e-07, -0.00909443, 0.016162
    [ INFO] [1621359683.360377744]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -0.000543254, 0.00118061, -0.00755321, -0.0120719, 0.00418721, 0.00152395
    [ INFO] [1621359683.402985099]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 0.0115292, -0.0153958, -0.00990972, 0.00187452, 3.46971e-05, -0.000199847
    [ INFO] [1621359683.632512617]: 0% done
    [ INFO] [1621359683.679573458]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -0.00034693, 0.00292237, -0.00103206, 0.00294658, 0.030834, -0.00163692
    [ INFO] [1621359683.684834609]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -0.00119296, -0.000384559, 0.00636925, 0.00794774, -0.00423045, 0.000534917
    [ INFO] [1621359683.730970753]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -1.88015e-08, 6.65678e-09, -1.055e-08, -6.83571e-05, -0.229343, -0.0544465
    [ INFO] [1621359683.882347052]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -0.00299882, -0.00923753, -0.000301827, 0.00709416, -0.00163897, -0.0203279
    [ INFO] [1621359683.951818824]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 0.00526931, -0.00436756, -0.00997993, -0.000622313, -0.00285304, -0.000996443
    [ INFO] [1621359683.993124785]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -8.48957e-11, 1.24504e-11, -3.04387e-12, -7.93566e-07, -0.0190936, -0.0573093
    [ INFO] [1621359684.214485602]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 0.0100013, 0.00584962, -0.00127719, 0.00324574, 0.00648986, -0.00208104
    [ INFO] [1621359684.301384922]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -0.00272662, 0.00916738, -0.00999164, -0.000295948, -0.00200103, -0.000469332
    [ INFO] [1621359684.425399821]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -0.00858696, -0.00998656, -0.00976393, 0.00101155, -0.00183589, 0.0013062
    [ INFO] [1621359684.441373014]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -0.00823022, -0.00983322, 0.0095643, -0.000635101, 0.0026309, 0.00242592
    [ INFO] [1621359684.545759411]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 0.00176324, -0.00420518, 0.00088341, 0.00440586, -0.000851601, -0.000255639
    [ INFO] [1621359684.592059532]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -0.00988901, -0.00178981, -0.00383817, 0.0013105, -0.00306774, -0.00209313
    [ INFO] [1621359684.669432612]: 0% done
    [ INFO] [1621359684.705235649]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -0.0103209, -0.000155379, -0.00949502, 0.000131215, -1.24085e-05, -0.000796623
    [ INFO] [1621359684.730756169]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -8.95785e-07, 1.28947e-06, -2.96719e-06, -6.74488e-06, -0.0308204, 0.0143409
    [ INFO] [1621359685.022539185]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 0.00760392, 0.00891699, 0.0068126, 0.00172978, -0.00133551, -0.00148961
    [ INFO] [1621359685.148433565]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 3.77506e-07, 1.08707e-06, 1.81176e-06, -2.37715e-06, 0.00748123, -0.00947792
    [ INFO] [1621359685.212597605]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 0.00705778, 0.00192035, -0.00929543, 0.000485687, 0.000474038, 0.00106924
    [ INFO] [1621359685.306300244]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 0.00141394, 0.00401239, -0.00636189, 0.000303582, 0.0030584, 0.00481037
    [ INFO] [1621359685.343308101]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -1.26085e-09, 6.03029e-10, 4.86315e-11, -3.43893e-05, -0.107399, -0.0427835
    [ INFO] [1621359685.402218066]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 0.00157358, -2.0234e-05, -0.00174841, 0.00261556, 0.000263408, 0.00235063
    [ INFO] [1621359685.438074816]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -2.65205e-06, -2.13216e-05, 7.899e-05, -0.000183758, 0.070289, -0.0126507
    [ INFO] [1621359685.449196456]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: 0.00465665, -0.000615977, -0.00777632, -0.00091093, 0.00155984, 0.000520935
    [ INFO] [1621359685.612604753]: Solution was out of target bounds!
     target_error: 0.01, 0.01, 0.01, 0.001, 0.001, 0.001
     actual_error: -0.00783572, -0.00597617, -0.0139031, -0.00698134, 0.00805054, 0.00442015
    [ INFO] [1621359685.699898339]: TRAC-IK found 988 solutions (98.8%)
    [ INFO] [1621359685.699962405]: TRAC-IK found 941 solutions (94.1%) within given tolerances
    [trac_ik_tests-2] process has finished cleanly
    

    Expected output of running roslaunch trac_ik_examples accuracy_test.launch from feature/accuracy-fix-demo(completion % output still messed up because it’s the same test):

    [ INFO] [1621359144.824879651]: Using 7 joints
    [ INFO] [1621359144.826136102]: *** Testing TRAC-IK with 1000 random samples
    [ INFO] [1621359144.831494786]: 0% done
    [ INFO] [1621359145.853058954]: 0% done
    [ INFO] [1621359146.872913405]: 0% done
    [ INFO] [1621359147.893346821]: 0% done
    [ INFO] [1621359148.913396566]: 0% done
    [ INFO] [1621359149.918740968]: 0% done
    [ INFO] [1621359150.427197776]: TRAC-IK found 998 solutions (99.8%)
    [ INFO] [1621359150.427342698]: TRAC-IK found 998 solutions (99.8%) within given tolerances
    

    Note: The test I added uses solvetype == TRAC_IK::Distance which I’d thought might make errors more frequent, but it seems like failures also occur when using the default Speed mode. It might be preferable to use Speed for any automated testing pipeline.

    I hope that helps! Let me know if you run into any issues.

  2. Log in to comment