IDS training may lead to memory corruption

Issue #32 resolved
created an issue

If the IDS learning phase exceeds the time limit for the first depth step of a training state, memory will get corrupted, which may lead to a segfault later on in the search.

This bug appeared in 1 of 6000 runs on the grid (with IPPC2011 as configuration).

The bug is reproducible with ./prost-debug ../../testbed/benchmarks/ippc-all/prost/game_of_life_inst_mdp__10 [PROST -s 1 -se [IPPC2011 -init [Expand -h [IDS -st 0.0000002]]]] This will however not lead to a segfault.

I was able to reproduce the segfault on my machine with a small hack in, which reproduces the behaviour of the grid. I attached the modified as well as the game of life problem file for which the segfault occurs (of course this bug only appeared with the specific training set of the grid file and not with the training set generated on my machine...).

With these files the following command should lead to a segfault in the first few search steps: ./prost game_of_life_inst_mdp__10 [PROST -s 1 -se [IPPC2011]]

I've tracked down the bug and the reason is that if the timeout already occurs for the first search step (i.e. depth 2) the elapsedTime vector will be resized with 2 and maxSearchDepth is set to 1 (line 269). If the next training state is selected, estimateQValues starts a DFS with a depth of 2 (line 201/202). MoreIterations is called with a depth of 2 and will call line 264: elapsedTime[stepsToGo].push_back(time);, where stepsToGo is 2, but the size of elapsedTime is 2 (because of elapsedTime.resize(2)), i.e. an out of bounds call.

I think the memory corruption only shows its effect (at least the segfault) when some caches grow too large and are rehashed and specific cache indices are called afterwards, which is why the debug version was not able to detect this segfault (this was fun to find out).

Comments (4)

  1. tkeller repo owner

    The problem can be solved by running a depth first search from stepsToGo == 1 instead of 2. However, this is currently not supported by the DepthFirstSearch class.

    As all search engines should be such that they can be executed by themselves as well (and not only as a heuristic for THTS), we decided to change this behaviour, which should also fix the bug.

  2. tkeller repo owner

    I could not reproduce the bug with the altered DFS, but the violated assertion with the debug version.

    Code looks fine so we can merge and close this issue.

  3. Log in to comment