aboutsummaryrefslogtreecommitdiffstats
path: root/lib/os_mon/src/os_sup.erl
diff options
context:
space:
mode:
authorLuca Favatella <[email protected]>2014-08-26 12:09:54 +0100
committerLuca Favatella <[email protected]>2014-08-29 20:45:40 +0100
commit5ed225662a6e48877a2ac0bef7c080297d64561e (patch)
treec8bd9c9b9c42712be6f4d33d7b91b51bf714a946 /lib/os_mon/src/os_sup.erl
parent83113d9f1df48dc73fbff82705e8aa3fb3af6723 (diff)
downloadotp-5ed225662a6e48877a2ac0bef7c080297d64561e.tar.gz
otp-5ed225662a6e48877a2ac0bef7c080297d64561e.tar.bz2
otp-5ed225662a6e48877a2ac0bef7c080297d64561e.zip
Clarify error for slow `cpu_sup` port init
I noticed that running an R16B03-1 node on an overloaded host produced log entries like the following ones: ``` 2014-08-22 21:52:31 =ERROR REPORT==== Error in process <0.24112.3> on node '[email protected]' with exit value: {{case_clause,{data,4711}},[{cpu_sup,get_uint32_measurement,2,[{file,"cpu_sup.erl"},{line,227}]},{cpu_sup,measurement_server_loop,1,[{file,"cpu_sup.erl"},{line,585}]}]} ``` ``` ===== ALIVE Fri Aug 22 21:50:14 CEST 2014 [os_mon] cpu supervisor port (cpu_sup): Erlang has closed [os_mon] cpu supervisor port (cpu_sup): Erlang has closed [os_mon] cpu supervisor port (cpu_sup): Erlang has closed [os_mon] cpu supervisor port (cpu_sup): Erlang has closed ===== ALIVE Fri Aug 22 22:07:46 CEST 2014 ``` I performed a code inspection on the `cpu_sup` module and I concluded that the `case_clause` error shows a small issue in the `cpu_sup` module, that happens when the port used in `cpu_sup` is slow to start - as it may happen on an overloaded node. The `cpu_sup` `gen_server` process keeps in its state the pid of an unlinked process (called "measurement server" - see `cpu_sup:init/1`), in order to do dirty stuff (e.g. reading the filesystem, running OS commands) and start_link-ing & managing the connected process to a port (called "port server" - see `measurement_server_init/0`). So the process organization looks like this: ``` cpu_sup - measurement server - port server - port ``` When the measurement server start_links the port server (see `port_server_start/0`) it sends a `{self(), ?ping}` message to it and expects an answer within 6s, otherwise it returns `{error, timeout}` rather than the pid. This has two issues: * The measurement server keeps `{error, ...}` in its state as if it were a pid - that makes no sense; * A late `{Pid, {data,4711}}` response may arrive in the mailbox of the measurement server, that will believe it to be a request to be processed, causing a `case_clause` error. This commit teaches the measurement server to check the success of the initialization of the port server by matching on the return value of `port_server_start/0` (renamed to `port_server_start_link/0` for the sake of clarity) in order to fail earlier and with an error clearer than `{case_clause,{data,4711}}`. In such case I expect the measurement server to be restarted by the `cpu_sup` `gen_server` (see `handle_call/3`) - as before. BTW It is not clear to me when the `handle_info({'EXIT', _Port, Reason}, State)` may be called (the `cpu_sup` `gen_server` does not link to the measurement server) but I am leaving it.
Diffstat (limited to 'lib/os_mon/src/os_sup.erl')
0 files changed, 0 insertions, 0 deletions