From 1bcacc64d5f33d674fa50c85ed9a982ade380bd5 Mon Sep 17 00:00:00 2001 From: Richard Carlsson Date: Fri, 16 Dec 2016 11:40:38 +0100 Subject: Add design principles restart intensity howto --- system/doc/design_principles/sup_princ.xml | 63 ++++++++++++++++++++++++++++++ 1 file changed, 63 insertions(+) (limited to 'system/doc/design_principles/sup_princ.xml') diff --git a/system/doc/design_principles/sup_princ.xml b/system/doc/design_principles/sup_princ.xml index c24177d842..478d1bf714 100644 --- a/system/doc/design_principles/sup_princ.xml +++ b/system/doc/design_principles/sup_princ.xml @@ -175,6 +175,69 @@ SupFlags = #{intensity => MaxR, period => MaxT, ...}

The keys intensity and period are optional in the supervisor flags map. If they are not given, they default to 1 and 5, respectively.

+
+ Tuning the intensity and period +

The default values are 1 restart per 5 seconds. This was chosen to + be safe for most systems, even with deep supervision hierarchies, + but you will probably want to tune the settings for your particular + use case.

+

First, the intensity decides how big bursts of restarts you want + to tolerate. For example, you might want to accept a burst of at + most 5 or 10 attempts, even within the same second, if it results + in a successful restart.

+

Second, you need to consider the sustained failure rate, if + crashes keep happening but not often enough to make the supervisor + give up. If you set intensity to 10 and set the period as low as 1, + the supervisor will allow child processes to keep restarting up to + 10 times per second, forever, filling your logs with crash reports + until someone intervenes manually.

+

You should therefore set the period to be long enough that you can + accept that the supervisor keeps going at that rate. For example, + if you have picked an intensity value of 5, then setting the period + to 30 seconds will give you at most one restart per 6 seconds for + any longer period of time, which means that your logs won't fill up + too quickly, and you will have a chance to observe the failures and + apply a fix.

+

These choices depend a lot on your problem domain. If you don't + have real time monitoring and ability to fix problems quickly, for + example in an embedded system, you might want to accept at most + one restart per minute before the supervisor should give up and + escalate to the next level to try to clear the error automatically. + On the other hand, if it is more important that you keep trying + even at a high failure rate, you might want a sustained rate of as + much as 1-2 restarts per second.

+

Avoiding common mistakes: + + +

Do not forget to consider the burst rate. If you set intensity + to 1 and period to 6, it gives the same sustained error rate as + 5/30 or 10/60, but will not allow even 2 restart attempts in + quick succession. This is probably not what you wanted.

+ + +

Do not set the period to a very high value if you want to + tolerate bursts. If you set intensity to 5 and period to 3600 + (one hour), the supervisor will allow a short burst of 5 + restarts, but then gives up if it sees another single restart + almost an hour later. You probably want to regard those crashes + as separate incidents, so setting the period to 5 or 10 minutes + will be more reasonable.

+
+ +

If your application has multiple levels of supervision, then + do not simply set the restart intensities to the same values on + all levels. Keep in mind that the total number of restarts + (before the top level supervisor gives up and terminates the + application) will be the product of the intensity values of all + the supervisors above the failing child process.

+

For example, if the top level allows 10 restarts, and the next + level also allows 10, a crashing child below that level will be + restarted 100 times, which is probably excessive. Allowing at + most 3 restarts for the top level supervisor might be a better + choice in this case.

+
+

+
-- cgit v1.2.3