Use of RPC_FTCP is not thread-safe
There are several places in mfe.c and midas.c that set the FAST TCP RPC option: rpc_set_option(-1, RPC_OTRANSPORT, RPC_FTCP)
. Unfortunately this is done without having the RPC mutex. This means that another thread can call something like db_get_key()
while the option is still RPC_FTCP. This causes rpc_call()
to return early, and not bother waiting for a response. However rpc_call()
still returns SUCCESS, and so db_get_key()
will also return SUCCESS, even though the KEY structure hasn't been populated! Similar issues will affect many other functions.
I assume this behaviour (all multi-threaded access to the ODB is unsafe on remote connections) is unintentional. I see the issue regularly in one of our unlucky frontends for SuperCDMS.
I think the solution is that everywhere that sets the RPC_FTCP option on the server connection needs to grab the RPC mutex first.
E.g. change
rpc_set_option(-1, RPC_OTRANSPORT, RPC_FTCP);
db_send_changed_records();
rpc_set_option(-1, RPC_OTRANSPORT, RPC_TCP);
to
get_the_rpc_mutex();
rpc_set_option(-1, RPC_OTRANSPORT, RPC_FTCP);
db_send_changed_records();
rpc_set_option(-1, RPC_OTRANSPORT, RPC_TCP);
release_the_rpc_mutex();
Comments (7)
-
-
Ok. It looks simple.
RPC_FTCP tells rpc_call() and rpc_client_call() to skip receiving the RPC reply. It also sets the TCP_FAST bit in the RPC routine_id which tells the RPC handler to skip sending the reply.
The only place where this is used in midas.c is in cm_shutdown() to shutdown a remote client via rpc_client_disconnect() and rpc_client_call(RPC_ID_SHUTDOWN or RPC_ID_EXIT).
The only place it is used elsewhere is mfe.c to speedup db_send_changed_records() by not waiting for the RPC response (one way command).
But this is redundant, inside db_send_changed_records() before making the db_set_record() call, the RPC_FTCP mode is set again.
So I think a) the extra calls to set RPC_FTCP can be removed from mfe.c. b) db_send_changed_records() can be changed to read: if (rpc_is_remote()) { rpc_call(RPC_DB_SET_RECORD|TCP_FAST,...) } else { db_set_record(...) }
Then as a bonus: - change rpc_client_disconnect() to read rpc_client_call(hConn, bShutdown ? (RPC_ID_SHUTDOWN|TCP_FAST) : (RPC_ID_EXIT|TCP_FAST)); - RPC_FTCP becomes unused, remove it and the corresponding rpc_set_option() - rename TCP_FAST to RPC_NO_REPLY
Let's see if Stefan is okey with this.
K.O.
-
Stefan seems to be okey with this. I will implement these changes on a branch. K.O.
-
Implemented on branch feature/RPC_NO_REPLY. https://bitbucket.org/tmidas/midas/commits/branch/feature/RPC_NO_REPLY
Ben, please try this? It should fix your mfe.c multithreading crash.
K.O.
-
reporter Sorry for the delay - BitBucket emails were ending up in junk so I hadn't realised you'd worked on this.
I've tried the new branch and it works great. The problem is fixed, and I don't see any new issues. Thanks!
-
I reviewed KO's modifications and fully approve them. It's a clever idea to put the RPC_NO_REPLY as a flag directly into the routine_id. The branch should be merged.
-
- changed status to resolved
merge d5ce536. K.O.
- Log in to comment
Yikes, what a bug. I never understood this RPC_FTCP business, now I guess I will have to look into it... (Having global settings like this is definitely bad for multithread uses... I much prefer to specify special behaviour per-call exlicitely). K.O.