From 1bcacc64d5f33d674fa50c85ed9a982ade380bd5 Mon Sep 17 00:00:00 2001
From: Richard Carlsson <richardc@klarna.com>
Date: Fri, 16 Dec 2016 11:40:38 +0100
Subject: Add design principles restart intensity howto

---
 system/doc/design_principles/sup_princ.xml | 63 ++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/system/doc/design_principles/sup_princ.xml b/system/doc/design_principles/sup_princ.xml
index c24177d842..478d1bf714 100644
--- a/system/doc/design_principles/sup_princ.xml
+++ b/system/doc/design_principles/sup_princ.xml
@@ -175,6 +175,69 @@ SupFlags = #{intensity => MaxR, period => MaxT, ...}</code>
     <p>The keys <c>intensity</c> and <c>period</c> are optional in the
       supervisor flags map. If they are not given, they default
       to <c>1</c> and <c>5</c>, respectively.</p>
+    <section>
+      <title>Tuning the intensity and period</title>
+      <p>The defaults allow 1 restart per 5 seconds. These values were
+        chosen to be safe for most systems, even with deep supervision
+        hierarchies, but you will probably want to tune the settings for
+        your particular use case.</p>
+      <p>First, the intensity determines how large a burst of restarts you
+        are willing to tolerate. For example, you might want to accept a
+        burst of at most 5 or 10 attempts, even within the same second, if
+        it results in a successful restart.</p>
+      <p>Second, you need to consider the sustained failure rate, which
+        matters if crashes keep happening but not often enough to make the
+        supervisor give up. If you set intensity to 10 and the period as low
+        as 1, the supervisor will allow child processes to keep restarting
+        up to 10 times per second, forever, filling your logs with crash
+        reports until someone intervenes manually.</p>
+      <p>You should therefore set the period to be long enough that you can
+        accept the supervisor carrying on at that rate. For example, if you
+        have picked an intensity value of 5, then setting the period to 30
+        seconds will give you at most one restart per 6 seconds over any
+        longer stretch of time, which means that your logs won't fill up
+        too quickly, and you will have a chance to observe the failures and
+        apply a fix.</p>
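+      <p>As a minimal sketch of the 5/30 setting described above, a
+        supervisor callback module could return the flags as follows. The
+        module name <c>my_sup</c>, the child module <c>my_worker</c>, and
+        the <c>one_for_one</c> strategy are assumptions made for
+        illustration only:</p>
+      <code type="none">
+-module(my_sup).
+-behaviour(supervisor).
+-export([start_link/0, init/1]).
+
+start_link() ->
+    supervisor:start_link({local, ?MODULE}, ?MODULE, []).
+
+init([]) ->
+    %% Tolerate at most 5 restarts within any 30-second window, that is,
+    %% a sustained rate of at most one restart per 6 seconds.
+    SupFlags = #{strategy => one_for_one, intensity => 5, period => 30},
+    %% my_worker is a hypothetical child module, not part of OTP.
+    ChildSpecs = [#{id => my_worker,
+                    start => {my_worker, start_link, []}}],
+    {ok, {SupFlags, ChildSpecs}}.</code>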
+      <p>These choices depend a lot on your problem domain. If you don't
+        have real-time monitoring and the ability to fix problems quickly,
+        for example in an embedded system, you might want to accept at most
+        one restart per minute before the supervisor gives up and
+        escalates to the next level to try to clear the error automatically.
+        On the other hand, if it is more important that you keep trying
+        even at a high failure rate, you might want a sustained rate of as
+        many as 1-2 restarts per second.</p>
+      <p>Avoiding common mistakes:
+      <list type="bulleted">
+        <item>
+          <p>Do not forget to consider the burst rate. If you set intensity
+            to 1 and period to 6, it gives the same sustained error rate as
+            5/30 or 10/60, but will not allow even 2 restart attempts in
+            quick succession. This is probably not what you wanted.</p>
+        </item>
+        <item>
+          <p>Do not set the period to a very high value if you want to
+            tolerate bursts. If you set intensity to 5 and period to 3600
+            (one hour), the supervisor will allow a short burst of 5
+            restarts, but will then give up if it sees a single further
+            restart almost an hour later. You probably want to regard those
+            crashes as separate incidents, so setting the period to 5 or 10
+            minutes will be more reasonable.</p>
+        </item>
+        <item>
+          <p>If your application has multiple levels of supervision, then
+            do not simply set the restart intensities to the same values on
+            all levels. Keep in mind that the total number of restarts
+            (before the top level supervisor gives up and terminates the
+            application) will be the product of the intensity values of all
+            the supervisors above the failing child process.</p>
+          <p>For example, if the top level allows 10 restarts, and the next
+            level also allows 10, a crashing child below that level will be
+            restarted 100 times, which is probably excessive. Allowing at
+            most 3 restarts for the top level supervisor might be a better
+            choice in this case.</p>
+        </item>
+      </list></p>
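+      <p>To illustrate the last item, the following sketch shows supervisor
+        flags for a two-level hierarchy. The variable names, the
+        <c>one_for_one</c> strategy, and the period values are assumptions
+        chosen for illustration, not recommendations. By the reasoning
+        above, a crashing child below the second level would be restarted
+        on the order of 3 x 10 = 30 times before the top level gives
+        up:</p>
+      <code type="none">
+%% Top level supervisor: tolerate only a few restarts of the
+%% supervisors directly below it.
+TopSupFlags = #{strategy => one_for_one, intensity => 3, period => 60}.
+
+%% Next level supervisor: tolerate larger bursts from its own children.
+SubSupFlags = #{strategy => one_for_one, intensity => 10, period => 60}.</code>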
+    </section>
   </section>
 
   <section>
-- 
cgit v1.2.3