PERFORMANCE_transport_stress test times out on gazebo8/default on xenial

Issue #2320 new
Steven Peters
created an issue

Until recently, I was building the gazebo7 branch every night on my trusty machine and running make test in a loop. I recently switched to running gazebo8 on xenial and have noticed that the PERFORMANCE_transport_stress test fails to complete with the gazebo8 / xenial combination.

Furthermore, I have noticed that this test also fails in jenkins with default / xenial. The current oldest build is from Feb 11, 2017, and the most recent build is from July 18, 2017, and the transport stress test failed in both cases.

The last time I ran gazebo7 / trusty was July 11, 2017, and the test passed in 8.667 seconds with the following JUnit result file:

<?xml version="1.0" encoding="UTF-8"?>
<testsuites tests="2" failures="0" disabled="0" errors="0" timestamp="2017-07-12T05:32:52" time="8.667" name="AllTests">
  <testsuite name="TransportStressTest" tests="2" failures="0" disabled="0" errors="0" time="8.667">
    <testcase name="LocalPublish" status="run" time="3.302" classname="TransportStressTest" />
    <testcase name="ManyNodes" status="run" time="5.364" classname="TransportStressTest" />

I just ran the test manually without the make test invocation so that it isn't subject to the timeout, and it took over 30 minutes. I'll try to do some profiling to identify what is causing the slowdown.

Comments (6)

  1. Steven Peters reporter

    Here's the console output from that 26 minute run:

    [==========] Running 2 tests from 1 test case.
    [----------] Global test environment set-up.
    [----------] 2 tests from TransportStressTest
    [ RUN      ] TransportStressTest.LocalPublish
    [Msg] Waiting for master.
    [Msg] Connected to gazebo master @
    [Msg] Publicized address:
    [Dbg] [] ServerFixture load in 1.1 seconds, timeout after 600 seconds
    [Msg] Publish Complete
    /data_fast/scpeters/ws/tmp/gazebo_build/src/gazebo/test/performance/ Failure
    Expected: (diff.sec) < (g_localPublishMessageCount * 0.0008), actual: 1640 vs 800
    [Msg] Time to publish 1000000 messages = 1640 518320016
    [Dbg] [] ServerFixture::Unload
    AL lib: (WW) alcCloseDevice: Releasing context 0x7f115152d0d0
    [  FAILED  ] TransportStressTest.LocalPublish (1641800 ms)
    [ RUN      ] TransportStressTest.ManyNodes
    [Msg] Waiting for master.
    [Msg] Connected to gazebo master @
    [Msg] Publicized address:
    [Dbg] [] ServerFixture load in 1.2 seconds, timeout after 600 seconds
    [Msg] Publish complete
    /data_fast/scpeters/ws/tmp/gazebo_build/src/gazebo/test/performance/ Failure
    Expected: (pubDiff.sec) <= (2), actual: 12 vs 2
    [Msg] Time to publish 10000 = 12 312717664
    [Msg] Time to receive 10000000 = 12 314618631
    [Dbg] [] ServerFixture::Unload
    AL lib: (WW) alcCloseDevice: Releasing context 0x7f1150206140
    [  FAILED  ] TransportStressTest.ManyNodes (15332 ms)
    [----------] 2 tests from TransportStressTest (1657132 ms total)
    [----------] Global test environment tear-down
    [==========] 2 tests from 1 test case ran. (1657132 ms total)
    [  PASSED  ] 0 tests.
    [  FAILED  ] 2 tests, listed below:
    [  FAILED  ] TransportStressTest.LocalPublish
    [  FAILED  ] TransportStressTest.ManyNodes

    The bulk of the time was spent in the LocalPublish test, and profiling showed that it spent more than 50% of its time allocating and freeing memory.

  2. Steven Peters reporter

    It could be the difference in protobuf version. I did some profiling, and it seems to be spending most of its time copying the image message in order to call Publish

  3. Log in to comment