diff options
-rw-r--r-- | system/doc/design_principles/sup_princ.xml | 63 |
1 files changed, 63 insertions, 0 deletions
diff --git a/system/doc/design_principles/sup_princ.xml b/system/doc/design_principles/sup_princ.xml index c24177d842..478d1bf714 100644 --- a/system/doc/design_principles/sup_princ.xml +++ b/system/doc/design_principles/sup_princ.xml @@ -175,6 +175,69 @@ SupFlags = #{intensity => MaxR, period => MaxT, ...}</code> <p>The keys <c>intensity</c> and <c>period</c> are optional in the supervisor flags map. If they are not given, they default to <c>1</c> and <c>5</c>, respectively.</p> + <section> + <title>Tuning the intensity and period</title> + <p>The default values are 1 restart per 5 seconds. This was chosen to + be safe for most systems, even with deep supervision hierarchies, + but you will probably want to tune the settings for your particular + use case.</p> + <p>First, the intensity decides how big bursts of restarts you want + to tolerate. For example, you might want to accept a burst of at + most 5 or 10 attempts, even within the same second, if it results + in a successful restart.</p> + <p>Second, you need to consider the sustained failure rate, if + crashes keep happening but not often enough to make the supervisor + give up. If you set intensity to 10 and set the period as low as 1, + the supervisor will allow child processes to keep restarting up to + 10 times per second, forever, filling your logs with crash reports + until someone intervenes manually.</p> + <p>You should therefore set the period to be long enough that you can + accept that the supervisor keeps going at that rate. For example, + if you have picked an intensity value of 5, then setting the period + to 30 seconds will give you at most one restart per 6 seconds for + any longer period of time, which means that your logs won't fill up + too quickly, and you will have a chance to observe the failures and + apply a fix.</p> + <p>These choices depend a lot on your problem domain. If you don't + have real time monitoring and ability to fix problems quickly, for + example in an embedded system, you might want to accept at most + one restart per minute before the supervisor should give up and + escalate to the next level to try to clear the error automatically. + On the other hand, if it is more important that you keep trying + even at a high failure rate, you might want a sustained rate of as + much as 1-2 restarts per second.</p> + <p>Avoiding common mistakes: + <list type="bulleted"> + <item> + <p>Do not forget to consider the burst rate. If you set intensity + to 1 and period to 6, it gives the same sustained error rate as + 5/30 or 10/60, but will not allow even 2 restart attempts in + quick succession. This is probably not what you wanted.</p> + </item> + <item> + <p>Do not set the period to a very high value if you want to + tolerate bursts. If you set intensity to 5 and period to 3600 + (one hour), the supervisor will allow a short burst of 5 + restarts, but then gives up if it sees another single restart + almost an hour later. You probably want to regard those crashes + as separate incidents, so setting the period to 5 or 10 minutes + will be more reasonable.</p> + </item> + <item> + <p>If your application has multiple levels of supervision, then + do not simply set the restart intensities to the same values on + all levels. Keep in mind that the total number of restarts + (before the top level supervisor gives up and terminates the + application) will be the product of the intensity values of all + the supervisors above the failing child process.</p> + <p>For example, if the top level allows 10 restarts, and the next + level also allows 10, a crashing child below that level will be + restarted 100 times, which is probably excessive. Allowing at + most 3 restarts for the top level supervisor might be a better + choice in this case.</p> + </item> + </list></p> + </section> </section> <section> |