Problems with Cloudsim

Issue #339 new
Sophisticated Engineering created an issue

We have problems with cloudsim on the portal.

Tests from last night have been running for a very long time and show the status “Deleting Pods”. Some additionally show “Error: Admin Review”.

I’ve sent an email to subt-help.

Is this problem also seen by other teams?

Comments (29)

  1. Arthur Schang

    It seems CloudSim was under a very heavy load last night. Please try resubmitting a couple of your runs.

  2. Hector Escobar

    I just uploaded my images like 30 min ago and got the same Error:Admin Review. Now it changed to “Error: InitializationFailed”. My images run OK with docker-compose. Is this due to the load on the cloud?

  3. Martin Dlouhy

    The same problem (Terminated, Error: InitializationFailed) for all robotika latest simulations (ver55). I would almost change the priority from “major” to “blocker”. We are trying to create a workaround for missing messages, and there is no way to test it now …

  4. Malcolm Stagg

    Does anyone see any update on this? My recent runs seem to have been restarted. Most are pending, one is LaunchingPods. That one only has real-time logs (nice feature btw) for one robot, which shows:

    ROS_MASTER_URI=http://10.46.56.2:11311
    /home/developer/subt_ws/install/share/subt_ros/launch/x2_description.launch http://10.46.56.2:11311
    No processes to monitor
    shutting down processing monitor...
    ... shutting down processing monitor complete
    

    Not sure that’s a good sign…

    [Update] it failed, admin review

  5. Arthur Schang

    I believe part of the initialization failure/error messages is caused by submitting runs in rapid succession. Please try resubmitting more slowly and space your simulation submissions apart. I don't have a good answer for how long to wait between submissions, but waiting until the first submission is Running before submitting the next should be sufficient. If you're using the CLI and submitting a handful of submissions at once, please add a lengthy sleep (5 minutes is a good spacing) between each submission as a temporary hands-off fix, or monitor your submissions on the web portal and submit each subsequent simulation after the prior one has the status Running.

    This is related to CloudSim hitting an AWS limit when a large number of simulation requests are loaded at the same time. We're working on a more permanent fix that does not require competitors to alter their submission workflow.
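
The spacing workaround described above could be scripted roughly as follows. This is only a sketch: `submit` and `get_status` are hypothetical callables that you would have to wrap around whatever CLI or portal access you actually use (nothing here is a CloudSim API), and `max_wait_s` is an assumed safety cap so the script does not wait forever on a run that never reaches Running.

```python
import time
from typing import Callable, Iterable, List, Optional


def submit_spaced(
    submissions: Iterable[str],
    submit: Callable[[str], str],                      # hypothetical: starts one run, returns its id
    get_status: Optional[Callable[[str], str]] = None,  # hypothetical: returns the status text of a run
    spacing_s: float = 300.0,                          # fixed gap, 5 minutes as suggested above
    poll_s: float = 30.0,                              # assumed polling interval
    max_wait_s: float = 3600.0,                        # assumed cap on waiting for "Running"
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Submit runs one at a time instead of all at once."""
    items = list(submissions)
    run_ids: List[str] = []
    for index, item in enumerate(items):
        run_id = submit(item)
        run_ids.append(run_id)
        if index == len(items) - 1:
            break  # nothing left to submit, so no need to wait
        if get_status is None:
            # Hands-off variant: fixed sleep between submissions.
            sleep(spacing_s)
        else:
            # Monitoring variant: wait until the previous run is Running,
            # but give up after max_wait_s of polling.
            waited = 0.0
            while get_status(run_id) != "Running" and waited < max_wait_s:
                sleep(poll_s)
                waited += poll_s
    return run_ids
```

Called with only `submit`, it sleeps `spacing_s` seconds between runs. Passing `get_status` as well switches to waiting for the previous run to reach Running before the next one is submitted.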

  6. Sophisticated Engineering reporter

    Two of our tests are now “Running”. The others are “Terminated GazeboError”.

  7. Sarah Kitchen

    @Arthur Schang It appears that there is an auto-restart process that happens in some cases. Specifically, when I receive an error but the state is not given as Terminated, the run sometimes relaunches after a few hours. This is really nice to have, but could it also be affecting how often we are seeing the Initialization Failures in the last 24 hours?

  8. Arthur Schang

    Sarah, I am not aware of the logic behind the restart process. I will defer to someone else to answer that question. If a batch of runs is restarted all at the same time, it will almost certainly result in another initialization error/failure at the moment.

  9. Malcolm Stagg

    I can confirm I had 6 runs restart (all at the same time, though 5 were initially “pending”) and all 6 failed with initialization error/admin review. Now that those are all done I’m going to carefully try just one, hoping for the best…

  10. Malcolm Stagg

    The portal is now displaying “Unknown Error” for me and is not displaying any simulation results. Is anyone else seeing that? I tried logging out and in again.

    I was just going to start a new simulation but maybe I’d better wait a bit first.

    [Edit] Looks ok now, maybe just a temporary server issue

  11. Chris Fotache

    Same here. Don’t worry, it’s still gonna be Jan 30 somewhere on Earth for the next 12 hours, so keep an eye on it, load up on Red Bulls and don’t plan any sleep.

  12. Arthur Schang

    For future circuits, would a submission process that allowed for multiple submissions for the final circuit solution be something that would ease tensions around the deadline? That would allow for a competitor to submit intermediate solutions before submitting their final solution. In this case, if something drastic did happen, your submission would fall back on your intermediate solution.

  13. Sophisticated Engineering reporter

    I had the unknown error yesterday; see issue #340. It disappeared after about an hour.

    Allowing multiple submissions for the final solution would be a really good feature!

  14. Nate Koenig

    A submission may go through many different states, and you should not worry. If we hit an AWS limit, then we'll retry. Same goes for a gazebo crash. Please wait for your email summary.

  15. Sarah Kitchen

    Should we refrain from doing tests with practice runs in CloudSim while scoring for the Urban Circuit is underway? I.e. are we at risk of overloading the system and affecting scores?

  16. Arthur Schang

    You are clear to continue testing. If issues do arise, we will communicate with teams and take steps to ensure all runs are consistently scored.

  17. Martin Dlouhy

    @Arthur Schang are you sure this is a good idea? Note that all solutions degrade under heavy load (see https://bitbucket.org/osrf/subt/issues/261/cloudsim-stops-sending-some-topics#comment-55929641), and I believe that this was also an issue during the Tunnel Circuit finals. It would be fairer to close testing for the next two weeks (?) and run only the contest solutions, sequentially. Thanks.

    P.S. I sent a question similar to Sarah's to subt-help@ yesterday …

  18. Arthur Schang

    I am aware of the situation and results raised in issue #261. If CloudSim practice runs are to be temporarily discontinued during UC evaluation, a formal announcement or infrastructural block on continued submission of practice runs will be issued.
