13 files changed, 309 insertions, 229 deletions
diff --git a/system/doc/efficiency_guide/advanced.xml b/system/doc/efficiency_guide/advanced.xml
index e1760d0ded..7f719849cc 100644
--- a/system/doc/efficiency_guide/advanced.xml
+++ b/system/doc/efficiency_guide/advanced.xml
@@ -151,7 +151,7 @@
       <row>
         <cell>Processes</cell>
         <cell>The maximum number of simultaneously alive Erlang processes
-	is by default 32,768. This limit can be configured at startup.
+	is by default 262,144. This limit can be configured at startup.
 	For more information, see the
 	<seealso marker="erts:erl#max_processes"><c>+P</c></seealso>
 	command-line flag in the
@@ -264,21 +264,26 @@
       </row>
       <row>
         <cell><marker id="unique_integers"/>Unique Integers on a Runtime System Instance</cell>
-        <cell>There are two types of unique integers both created using the
-        <seealso marker="erts:erlang#unique_integer/1">erlang:unique_integer()</seealso>
-        BIF. Unique integers created:
-         <taglist>
-	  <tag>with the <c>monotonic</c> modifier</tag>
-	  <item>consist of a set of <c>2⁶⁴ - 1</c> unique integers.</item>
-	  <tag>without the <c>monotonic</c> modifier</tag>
-	  <item>consist of a set of <c>2⁶⁴ - 1</c> unique integers per scheduler
-	  thread and a set of <c>2⁶⁴ - 1</c> unique integers shared by
-	  other threads. That is the total amount of unique integers without
-	  the <c>monotonic</c> modifier is <c>(NoSchedulers + 1) * (2⁶⁴ - 1)</c></item>
-        </taglist>
-      If a unique integer is created each nano second, unique integers
-      will at earliest be reused after more than 584 years. That is, for
-	the foreseeable future they are unique enough.</cell>
+        <cell>
+	    There are two types of unique integers both created using the
+            <seealso marker="erts:erlang#unique_integer/1">erlang:unique_integer()</seealso>
+            BIF:
+	    <br/><br/>
+	    <em>1.</em> Unique integers created <em>with</em> the
+	    <c>monotonic</c> modifier consist of a set of <c>2⁶⁴ - 1</c>
+	    unique integers.
+	    <br/><br/>
+	    <em>2.</em> Unique integers created <em>without</em> the
+	    <c>monotonic</c> modifier consist of a set of <c>2⁶⁴ - 1</c>
+	    unique integers per scheduler thread and a set of <c>2⁶⁴ - 1</c>
+	    unique integers shared by other threads. That is, the total
+	    amount of unique integers without the <c>monotonic</c> modifier
+	    is <c>(NoSchedulers + 1) × (2⁶⁴ - 1)</c>.
+	    <br/><br/>
+	    If a unique integer is created every nanosecond, unique integers
+	    will at the earliest be reused after more than 584 years. That is, for
+	    the foreseeable future they are unique enough.
+	</cell>
       </row>	
       <tcaption>System Limits</tcaption>
     </table>
diff --git a/system/doc/efficiency_guide/bench.erl b/system/doc/efficiency_guide/bench.erl
index 1f60e858f6..a1be24b051 100644
--- a/system/doc/efficiency_guide/bench.erl
+++ b/system/doc/efficiency_guide/bench.erl
@@ -355,7 +355,7 @@ create_html_report(ResultList) ->
 
     {ok, OutputFile} = file:open("index.html", [write]),
 
-    %% Create the begining of the result html-file.
+    %% Create the beginning of the result html-file.
     Head = Title = "Benchmark Results",
     io:put_chars(OutputFile, "<html>\n"),
     io:put_chars(OutputFile, "<head>\n"),
diff --git a/system/doc/efficiency_guide/binaryhandling.xml b/system/doc/efficiency_guide/binaryhandling.xml
index 0295d18644..19f40c9abe 100644
--- a/system/doc/efficiency_guide/binaryhandling.xml
+++ b/system/doc/efficiency_guide/binaryhandling.xml
@@ -5,7 +5,7 @@
   <header>
     <copyright>
       <year>2007</year>
-      <year>2016</year>
+      <year>2017</year>
       <holder>Ericsson AB, All Rights Reserved</holder>
     </copyright>
     <legalnotice>
@@ -32,12 +32,9 @@
     <file>binaryhandling.xml</file>
   </header>
 
-  <p>In R12B, the most natural way to construct and match binaries is
-  significantly faster than in earlier releases.</p>
+  <p>Binaries can be efficiently built in the following way:</p>
 
-  <p>To construct a binary, you can simply write as follows:</p>
-
-  <p><em>DO</em> (in R12B) / <em>REALLY DO NOT</em> (in earlier releases)</p>
+  <p><em>DO</em></p>
   <code type="erl"><![CDATA[
 my_list_to_binary(List) ->
     my_list_to_binary(List, <<>>).
@@ -47,21 +44,13 @@ my_list_to_binary([H|T], Acc) ->
 my_list_to_binary([], Acc) ->
     Acc.]]></code>  
 
-  <p>In releases before R12B, <c>Acc</c> is copied in every iteration.
-  In R12B, <c>Acc</c> is copied only in the first iteration and extra
-  space is allocated at the end of the copied binary. In the next iteration,
-  <c>H</c> is written into the extra space. When the extra space runs out,
-  the binary is reallocated with more extra space. The extra space allocated
-  (or reallocated) is twice the size of the
-  existing binary data, or 256, whichever is larger.</p>
-
-  <p>The most natural way to match binaries is now the fastest:</p>
+  <p>Binaries can be efficiently matched like this:</p>
 
-  <p><em>DO</em> (in R12B)</p>
+  <p><em>DO</em></p>
   <code type="erl"><![CDATA[
 my_binary_to_list(<<H,T/binary>>) ->
     [H|my_binary_to_list(T)];
-my_binary_to_list(<<>>) -> [].]]></code>  
+my_binary_to_list(<<>>) -> [].]]></code>
 
   <section>
     <title>How Binaries are Implemented</title>
@@ -138,10 +127,7 @@ my_binary_to_list(<<>>) -> [].]]></code>
       pointer to the binary data. For each field that is matched out of
       a binary, the position in the match context is incremented.</p>
 
-    <p>In R11B, a match context was only used during a binary matching
-    operation.</p>
-
-    <p>In R12B, the compiler tries to avoid generating code that
+    <p>The compiler tries to avoid generating code that
     creates a sub binary, only to shortly afterwards create a new match
     context and discard the sub binary. Instead of creating a sub binary,
     the match context is kept.</p>
@@ -155,7 +141,7 @@ my_binary_to_list(<<>>) -> [].]]></code>
 
   <section>
     <title>Constructing Binaries</title>
-    <p>In R12B, appending to a binary or bitstring
+    <p>Appending to a binary or bitstring
     is specially optimized by the <em>runtime system</em>:</p>
 
   <code type="erl"><![CDATA[
@@ -292,7 +278,7 @@ Bin = <<Bin1,...>>  %% Bin1 will be COPIED
 
     <p>Let us revisit the example in the beginning of the previous section:</p>
 
-  <p><em>DO</em> (in R12B)</p>
+  <p><em>DO</em></p>
   <code type="erl"><![CDATA[
 my_binary_to_list(<<H,T/binary>>) ->
     [H|my_binary_to_list(T)];
@@ -304,15 +290,14 @@ my_binary_to_list(<<>>) -> [].]]></code>
   byte of the binary. 1 byte is matched out and the match context
   is updated to point to the second byte in the binary.</p>
 
-  <p>In R11B, at this point a
-  <seealso marker="#sub_binary">sub binary</seealso>
-  would be created. In R12B,
-  the compiler sees that there is no point in creating a sub binary,
-  because there will soon be a call to a function (in this case,
+  <p>At this point it would make sense to create a
+  <seealso marker="#sub_binary">sub binary</seealso>,
+  but in this particular example the compiler sees that
+  there will soon be a call to a function (in this case,
   to <c>my_binary_to_list/1</c> itself) that immediately will
   create a new match context and discard the sub binary.</p>
 
-  <p>Therefore, in R12B, <c>my_binary_to_list/1</c> calls itself
+  <p>Therefore <c>my_binary_to_list/1</c> calls itself
   with the match context instead of with a sub binary. The instruction
   that initializes the matching operation basically does nothing
   when it sees that it was passed a match context instead of a binary.</p>
@@ -321,34 +306,10 @@ my_binary_to_list(<<>>) -> [].]]></code>
   the match context will simply be discarded (removed in the next
   garbage collection, as there is no longer any reference to it).</p>
 
-  <p>To summarize, <c>my_binary_to_list/1</c> in R12B only needs to create
-  <em>one</em> match context and no sub binaries. In R11B, if the binary
-  contains <em>N</em> bytes, <em>N+1</em> match contexts and <em>N</em>
-  sub binaries are created.</p>
-
-  <p>In R11B, the fastest way to match binaries is as follows:</p>
+  <p>To summarize, <c>my_binary_to_list/1</c> only needs to create
+  <em>one</em> match context and no sub binaries.</p>
 
-  <p><em>DO NOT</em> (in R12B)</p>
-  <code type="erl"><![CDATA[
-my_complicated_binary_to_list(Bin) ->
-    my_complicated_binary_to_list(Bin, 0).
-
-my_complicated_binary_to_list(Bin, Skip) ->
-    case Bin of
-	<<_:Skip/binary,Byte,_/binary>> ->
-	    [Byte|my_complicated_binary_to_list(Bin, Skip+1)];
-	<<_:Skip/binary>> ->
-	    []
-    end.]]></code>
-
-  <p>This function cleverly avoids building sub binaries, but it cannot
-  avoid building a match context in each recursion step.
-  Therefore, in both R11B and R12B,
-  <c>my_complicated_binary_to_list/1</c> builds <em>N+1</em> match
-  contexts. (In a future Erlang/OTP release, the compiler might be able
-  to generate code that reuses the match context.)</p>
-
-  <p>Returning to <c>my_binary_to_list/1</c>, notice that the match context
+  <p>Notice that the match context in <c>my_binary_to_list/1</c>
   was discarded when the entire binary had been traversed. What happens if
   the iteration stops before it has reached the end of the binary? Will
   the optimization still work?</p>
@@ -544,5 +505,15 @@ count3(<<>>, Count) -> Count.]]></code>
   not matched out.</p>
   </section>
   </section>
+
+  <section>
+    <title>Historical Note</title>
+
+    <p>Binary handling was significantly improved in R12B. Because
+    code that was efficient in R11B might not be efficient in R12B,
+    and vice versa, earlier revisions of this Efficiency Guide contained
+    some information about binary handling in R11B.</p>
+  </section>
+
 </chapter>
 
diff --git a/system/doc/efficiency_guide/commoncaveats.xml b/system/doc/efficiency_guide/commoncaveats.xml
index ecfeff0349..b41ffc3902 100644
--- a/system/doc/efficiency_guide/commoncaveats.xml
+++ b/system/doc/efficiency_guide/commoncaveats.xml
@@ -4,7 +4,7 @@
 <chapter>
   <header>
     <copyright>
-      <year>2001</year><year>2016</year>
+      <year>2001</year><year>2017</year>
       <holder>Ericsson AB. All Rights Reserved.</holder>
     </copyright>
     <legalnotice>
@@ -148,10 +148,10 @@ multiple_setelement(T0) ->
 
     <p><c>size/1</c> returns the size for both tuples and binaries.</p>
 
-    <p>Using the new BIFs <c>tuple_size/1</c> and <c>byte_size/1</c>, introduced
-    in R12B, gives the compiler and the runtime system more opportunities for
-    optimization. Another advantage is that the new BIFs can help Dialyzer to
-    find more bugs in your program.</p>
+    <p>Using the BIFs <c>tuple_size/1</c> and <c>byte_size/1</c>
+    gives the compiler and the runtime system more opportunities for
+    optimization. Another advantage is that the BIFs give Dialyzer more
+    type information.</p>
   </section>
 
   <section>
diff --git a/system/doc/efficiency_guide/functions.xml b/system/doc/efficiency_guide/functions.xml
index 4a8248e65c..0a8ee7eb34 100644
--- a/system/doc/efficiency_guide/functions.xml
+++ b/system/doc/efficiency_guide/functions.xml
@@ -4,7 +4,7 @@
 <chapter>
   <header>
     <copyright>
-      <year>2001</year><year>2016</year>
+      <year>2001</year><year>2017</year>
       <holder>Ericsson AB. All Rights Reserved.</holder>
     </copyright>
     <legalnotice>
@@ -65,7 +65,7 @@ atom_map1(six) -> 6.</code>
      thus, quite efficient even if there are many values) to select which
      one of the first three clauses to execute (if any).</item>
 
-     <item>>If none of the first three clauses match, the fourth clause
+     <item>If none of the first three clauses match, the fourth clause
      match as a variable always matches.</item>
 
      <item>If the guard test <c>is_integer(Int)</c> succeeds, the fourth
@@ -183,15 +183,6 @@ explicit_map_pairs(Map, Xs0, Ys0) ->
        A fun contains an (indirect) pointer to the function that implements
        the fun.</p>
 
-       <warning><p><em>Tuples are not fun(s)</em>.
-       A "tuple fun", <c>{Module,Function}</c>, is not a fun.
-       The cost for calling a "tuple fun" is similar to that
-       of <c>apply/3</c> or worse.
-       Using "tuple funs" is <em>strongly discouraged</em>,
-       as they might not be supported in a future Erlang/OTP release,
-       and because there exists a superior alternative from R10B,
-       namely the <c>fun Module:Function/Arity</c> syntax.</p></warning>
-
        <p><c>apply/3</c> must look up the code for the function to execute
        in a hash table. It is therefore always slower than a
        direct call or a fun call.</p>
diff --git a/system/doc/efficiency_guide/introduction.xml b/system/doc/efficiency_guide/introduction.xml
index ca4a41c798..dca2dec95e 100644
--- a/system/doc/efficiency_guide/introduction.xml
+++ b/system/doc/efficiency_guide/introduction.xml
@@ -4,7 +4,7 @@
 <chapter>
   <header>
     <copyright>
-      <year>2001</year><year>2016</year>
+      <year>2001</year><year>2017</year>
       <holder>Ericsson AB. All Rights Reserved.</holder>
     </copyright>
     <legalnotice>
@@ -46,14 +46,6 @@
     to find out where the performance bottlenecks are and optimize only the
     bottlenecks. Let other code stay as clean as possible.</p>
 
-    <p>Fortunately, compiler and runtime optimizations introduced in
-    Erlang/OTP R12B makes it easier to write code that is both clean and
-    efficient. For example, the ugly workarounds needed in R11B and earlier
-    releases to get the most speed out of binary pattern matching are
-    no longer necessary. In fact, the ugly code is slower
-    than the clean code (because the clean code has become faster, not
-    because the uglier code has become slower).</p>
-    
     <p>This Efficiency Guide cannot really teach you how to write efficient
     code. It can give you a few pointers about what to avoid and what to use,
     and some understanding of how certain language features are implemented.
diff --git a/system/doc/efficiency_guide/listhandling.xml b/system/doc/efficiency_guide/listhandling.xml
index 2ebc877820..4f2497359d 100644
--- a/system/doc/efficiency_guide/listhandling.xml
+++ b/system/doc/efficiency_guide/listhandling.xml
@@ -4,7 +4,7 @@
 <chapter>
   <header>
     <copyright>
-      <year>2001</year><year>2016</year>
+      <year>2001</year><year>2017</year>
       <holder>Ericsson AB. All Rights Reserved.</holder>
     </copyright>
     <legalnotice>
@@ -90,7 +90,7 @@ tail_recursive_fib(N, Current, Next, Fibs) ->
     <p>Lists comprehensions still have a reputation for being slow.
     They used to be implemented using funs, which used to be slow.</p>
 
-    <p>In recent Erlang/OTP releases (including R12B), a list comprehension:</p>
+    <p>A list comprehension:</p>
 
     <code type="erl"><![CDATA[
 [Expr(E) || E <- List]]]></code>
@@ -102,7 +102,7 @@ tail_recursive_fib(N, Current, Next, Fibs) ->
     [Expr(E)|'lc^0'(Tail, Expr)];
 'lc^0'([], _Expr) -> [].</code>
 
-    <p>In R12B, if the result of the list comprehension will <em>obviously</em>
+    <p>If the result of the list comprehension will <em>obviously</em>
     not be used, a list will not be constructed. For example, in this code:</p>
     
     <code type="erl"><![CDATA[
@@ -131,6 +131,14 @@ some_function(...),
     'lc^0'(Tail, Expr);
 'lc^0'([], _Expr) -> [].</code>
 
+    <p>The compiler also understands that assigning to '_' means that
+    the value will not be used. Therefore, the code in the following example
+    will also be optimized:</p>
+
+    <code type="erl"><![CDATA[
+_ = [io:put_chars(E) || E <- List],
+ok.]]></code>
+
   </section>
 
   <section>
@@ -209,11 +217,11 @@ some_function(...),
   <section>
     <title>Recursive List Functions</title>
 
-    <p>In Section 7.2, the following myth was exposed:
+    <p>In the section about myths, the following myth was exposed:
     <seealso marker="myths#tail_recursive">Tail-Recursive Functions
     are Much Faster Than Recursive Functions</seealso>.</p>
 
-    <p>To summarize, in R12B there is usually not much difference between
+    <p>There is usually not much difference between
     a body-recursive list function and tail-recursive function that reverses
     the list at the end. Therefore, concentrate on writing beautiful code
     and forget about the performance of your list functions. In the
diff --git a/system/doc/efficiency_guide/myths.xml b/system/doc/efficiency_guide/myths.xml
index 5d3ad78b23..778cd06c09 100644
--- a/system/doc/efficiency_guide/myths.xml
+++ b/system/doc/efficiency_guide/myths.xml
@@ -24,7 +24,7 @@
   The Initial Developer of the Original Code is Ericsson AB.
     </legalnotice>
 
-    <title>The Eight Myths of Erlang Performance</title>
+    <title>The Seven Myths of Erlang Performance</title>
     <prepared>Bjorn Gustavsson</prepared>
     <docno></docno>
     <date>2007-11-10</date>
@@ -35,80 +35,33 @@
   <marker id="myths"></marker>
   <p>Some truths seem to live on well beyond their best-before date,
   perhaps because "information" spreads faster from person-to-person
-  than a single release note that says, for example, that funs
-  have become faster.</p>
+  than a single release note that says, for example, that body-recursive
+  calls have become faster.</p>
 
   <p>This section tries to kill the old truths (or semi-truths) that have
   become myths.</p>
 
   <section>
-    <title>Myth: Funs are Slow</title>
-    <p>Funs used to be very slow, slower than <c>apply/3</c>.
-    Originally, funs were implemented using nothing more than
-    compiler trickery, ordinary tuples, <c>apply/3</c>, and a great
-    deal of ingenuity.</p>
-
-    <p>But that is history. Funs was given its own data type
-    in R6B and was further optimized in R7B.
-    Now the cost for a fun call falls roughly between the cost for a call
-    to a local function and <c>apply/3</c>.</p>
-  </section>
-
-  <section>
-    <title>Myth: List Comprehensions are Slow</title>
-
-    <p>List comprehensions used to be implemented using funs, and in the
-    old days funs were indeed slow.</p>
-
-    <p>Nowadays, the compiler rewrites list comprehensions into an ordinary
-    recursive function. Using a tail-recursive function with
-    a reverse at the end would be still faster. Or would it?
-    That leads us to the next myth.</p>
-  </section>
-
-  <section>
     <title>Myth: Tail-Recursive Functions are Much Faster
     Than Recursive Functions</title>
 
     <p><marker id="tail_recursive"></marker>According to the myth,
-    recursive functions leave references
-    to dead terms on the stack and the garbage collector has to copy
-    all those dead terms, while tail-recursive functions immediately
-    discard those terms.</p>
-
-    <p>That used to be true before R7B. In R7B, the compiler started
-    to generate code that overwrites references to terms that will never
-    be used with an empty list, so that the garbage collector would not
-    keep dead values any longer than necessary.</p>
-
-    <p>Even after that optimization, a tail-recursive function is
-    still most of the times faster than a body-recursive function. Why?</p>
-
-    <p>It has to do with how many words of stack that are used in each
-    recursive call. In most cases, a recursive function uses more words
-    on the stack for each recursion than the number of words a tail-recursive
-    would allocate on the heap. As more memory is used, the garbage
-    collector is invoked more frequently, and it has more work traversing
-    the stack.</p>
-
-    <p>In R12B and later releases, there is an optimization that
-    in many cases reduces the number of words used on the stack in
-    body-recursive calls. A body-recursive list function and a
-    tail-recursive function that calls <seealso
-    marker="stdlib:lists#reverse/1">lists:reverse/1</seealso> at
-    the end will use the same amount of memory.
-    <c>lists:map/2</c>, <c>lists:filter/2</c>, list comprehensions,
-    and many other recursive functions now use the same amount of space
-    as their tail-recursive equivalents.</p>
-
-    <p>So, which is faster?
-    It depends. On Solaris/Sparc, the body-recursive function seems to
-    be slightly faster, even for lists with a lot of elements. On the x86
-    architecture, tail-recursion was up to about 30% faster.</p>
-
-    <p>So, the choice is now mostly a matter of taste. If you really do need
-    the utmost speed, you must <em>measure</em>. You can no longer be
-    sure that the tail-recursive list function always is the fastest.</p>
+    using a tail-recursive function that builds a list in reverse
+    followed by a call to <c>lists:reverse/1</c> is faster than
+    a body-recursive function that builds the list in correct order;
+    the reason being that body-recursive functions use more memory than
+    tail-recursive functions.</p>
+
+    <p>That was true to some extent before R12B. It was even more true
+    before R7B. Today, not so much. A body-recursive function
+    generally uses the same amount of memory as a tail-recursive
+    function. It is generally not possible to predict whether the
+    tail-recursive or the body-recursive version will be
+    faster. Therefore, use the version that makes your code cleaner
+    (hint: it is usually the body-recursive version).</p>
+
+    <p>For a more thorough discussion about tail and body recursion,
+    see <url href="http://ferd.ca/erlang-s-tail-recursion-is-not-a-silver-bullet.html">Erlang's Tail Recursion is Not a Silver Bullet</url>.</p>
 
     <note><p>A tail-recursive function that does not need to reverse the
     list at the end is faster than a body-recursive function,
@@ -199,6 +152,29 @@ vanilla_reverse([], Acc) ->
 
     <p>That was once true, but from R6B the BEAM compiler can see
     that a variable is not used.</p>
+
+    <p>Similarly, trivial transformations on the source-code level
+    such as converting a <c>case</c> statement to clauses at the
+    top-level of the function seldom make any difference to the
+    generated code.</p>
+  </section>
+
+  <section>
+    <title>Myth: A NIF Always Speeds Up Your Program</title>
+
+    <p>Rewriting Erlang code to a NIF to make it faster should be
+    seen as a last resort. It is only guaranteed to be dangerous,
+    but not guaranteed to speed up the program.</p>
+
+    <p>Doing too much work in each NIF call will
+    <seealso marker="erts:erl_nif#WARNING">degrade responsiveness
+    of the VM</seealso>. Doing too little work may mean that
+    the gain of the faster processing in the NIF is eaten up by
+    the overhead of calling the NIF and checking the arguments.</p>
+
+    <p>Be sure to read about
+    <seealso marker="erts:erl_nif#lengthy_work">Long-running NIFs</seealso>
+    before writing a NIF.</p>
   </section>
 </chapter>
 
diff --git a/system/doc/efficiency_guide/part.xml b/system/doc/efficiency_guide/part.xml
index 6e10a0c031..5673ddd320 100644
--- a/system/doc/efficiency_guide/part.xml
+++ b/system/doc/efficiency_guide/part.xml
@@ -39,5 +39,6 @@
   <xi:include href="drivers.xml"/>
   <xi:include href="advanced.xml"/>
   <xi:include href="profiling.xml"/>
+  <xi:include href="retired_myths.xml"/>
 </part>
 
diff --git a/system/doc/efficiency_guide/processes.xml b/system/doc/efficiency_guide/processes.xml
index f2d9712f51..3b64c863ff 100644
--- a/system/doc/efficiency_guide/processes.xml
+++ b/system/doc/efficiency_guide/processes.xml
@@ -4,7 +4,7 @@
 <chapter>
   <header>
     <copyright>
-      <year>2001</year><year>2016</year>
+      <year>2001</year><year>2017</year>
       <holder>Ericsson AB. All Rights Reserved.</holder>
     </copyright>
     <legalnotice>
@@ -146,14 +146,14 @@ loop() ->
     <section>
       <title>Constant Pool</title>
 
-      <p>Constant Erlang terms (also called <em>literals</em>) are now
+      <p>Constant Erlang terms (also called <em>literals</em>) are
       kept in constant pools; each loaded module has its own pool.
-      The following function does no longer build the tuple every time
+      The following function does not build the tuple every time
       it is called (only to have it discarded the next time the garbage
       collector was run), but the tuple is located in the module's
       constant pool:</p>
 
-    <p><em>DO</em> (in R12B and later)</p>
+    <p><em>DO</em></p>
       <code type="erl">
 days_in_month(M) ->
     element(M, {31,28,31,30,31,30,31,31,30,31,30,31}).</code>
@@ -222,7 +222,7 @@ kilo_byte(N, Acc) ->
 
       <pre>
 4> <input>T = ets:new(tab, []).</input>
-17
+#Ref&lt;0.1662103692.2407923716.214181>
 5> <input>ets:insert(T, {key,efficiency_guide:kilo_byte()}).</input>
 true
 6> <input>erts_debug:size(element(2, hd(ets:lookup(T, key)))).</input>
@@ -235,9 +235,7 @@ true
       return the same value. Sharing has been lost.</p>
 
       <p>In a future Erlang/OTP release, it might be implemented a
-      way to (optionally) preserve sharing. There are no plans to make
-      preserving of sharing the default behaviour, as that would
-      penalize the vast majority of Erlang applications.</p>
+      way to (optionally) preserve sharing.</p>
     </section>
   </section>
 
@@ -261,10 +259,6 @@ true
     The estone benchmark, for example, is entirely sequential. So is
     the most common implementation of the "ring benchmark"; usually one process
     is active, while the others wait in a <c>receive</c> statement.</p>
-
-    <p>The <seealso marker="percept:percept">percept</seealso> application
-    can be used to profile your application to see how much potential (or lack
-    thereof) it has for concurrency.</p>
   </section>
 </chapter>
 
diff --git a/system/doc/efficiency_guide/profiling.xml b/system/doc/efficiency_guide/profiling.xml
index bf50a03fa6..f185456158 100644
--- a/system/doc/efficiency_guide/profiling.xml
+++ b/system/doc/efficiency_guide/profiling.xml
@@ -41,30 +41,87 @@
     <p>Erlang/OTP contains several tools to help finding bottlenecks:</p>
 
     <list type="bulleted">
-      <item><c>fprof</c> provides the most detailed information about
-      where the program time is spent, but it significantly slows down the
-      program it profiles.</item>
-
-      <item><p><c>eprof</c> provides time information of each function
-      used in the program. No call graph is produced, but <c>eprof</c> has
-      considerable less impact on the program it profiles.</p>
-      <p>If the program is too large to be profiled by <c>fprof</c> or
-      <c>eprof</c>, the <c>cover</c> and <c>cprof</c> tools can be used
-      to locate code parts that are to be more thoroughly profiled using
-      <c>fprof</c> or <c>eprof</c>.</p></item>
-
-      <item><c>cover</c> provides execution counts per line per
-      process, with less overhead than <c>fprof</c>. Execution counts
-      can, with some caution, be used to locate potential performance
-      bottlenecks.</item>
-
-      <item><c>cprof</c> is the most lightweight tool, but it only
-      provides execution counts on a function basis (for all processes,
-      not per process).</item>
+      <item><p><seealso marker="tools:fprof"><c>fprof</c></seealso> provides
+          the most detailed information about where the program time is spent,
+          but it significantly slows down the program it profiles.</p></item>
+
+      <item><p><seealso marker="tools:eprof"><c>eprof</c></seealso> provides
+          time information of each function used in the program. No call graph is
+          produced, but <c>eprof</c> has considerably less impact on the program it
+          profiles.</p>
+        <p>If the program is too large to be profiled by <c>fprof</c> or
+          <c>eprof</c>, <c>cprof</c> can be used to locate code parts that
+          are to be more thoroughly profiled using <c>fprof</c> or <c>eprof</c>.</p></item>
+
+      <item><p><seealso marker="tools:cprof"><c>cprof</c></seealso> is the
+          most lightweight tool, but it only provides execution counts on a
+          function basis (for all processes, not per process).</p></item>
+
+      <item><p><seealso marker="runtime_tools:dbg"><c>dbg</c></seealso> is the
+          generic Erlang tracing frontend. By using the <c>timestamp</c> or
+          <c>cpu_timestamp</c> options it can be used to time how long function
+          calls in a live system take.</p></item>
+
+      <item><p><seealso marker="tools:lcnt"><c>lcnt</c></seealso> is used
+          to find contention points in the Erlang Run-Time System's internal
+          locking mechanisms. It is useful when looking for bottlenecks in
+          interaction between processes, ports, ETS tables, and other entities
+          that can be run in parallel.</p></item>
+
     </list>
 
     <p>The tools are further described in
     <seealso marker="#profiling_tools">Tools</seealso>.</p>
+
+    <p>There are also several open source tools outside of Erlang/OTP
+    that can be used to help profiling. Some of them are:</p>
+
+    <list type="bulleted">
+      <item><url href="https://github.com/isacssouza/erlgrind">erlgrind</url>
+      can be used to visualize fprof data in kcachegrind.</item>
+      <item><url href="https://github.com/proger/eflame">eflame</url>
+      is an alternative to fprof that displays the profiling output as a flamegraph.</item>
+      <item><url href="https://ferd.github.io/recon/index.html">recon</url>
+      is a collection of Erlang profiling and debugging tools.
+      This tool comes with an accompanying E-book called
+      <url href="https://www.erlang-in-anger.com/">Erlang in Anger</url>.</item>
+    </list>
+  </section>
+
+  <section>
+    <title>Memory profiling</title>
+    <pre>eheap_alloc: Cannot allocate 1234567890 bytes of memory (of type "heap").</pre>
+    <p>The above slogan is one of the more common reasons for Erlang to terminate.
+      For unknown reasons the Erlang Run-Time System failed to allocate memory to
+      use. When this happens a crash dump is generated that contains information
+      about the state of the system as it ran out of memory. Use the
+      <seealso marker="observer:cdv"><c>crashdump_viewer</c></seealso> to get a
+      view of how the memory is being used. Look for processes with large heaps or
+      many messages, large ets tables, etc.</p>
+    <p>When looking at memory usage in a running system the most basic function
+      to get information from is <seealso marker="erts:erlang#memory/0"><c>
+      erlang:memory()</c></seealso>. It returns the current memory usage
+      of the system. <seealso marker="tools:instrument"><c>instrument(3)</c></seealso>
+      can be used to get a more detailed breakdown of where memory is used.</p>
+    <p>Processes, ports and ets tables can then be inspected using their
+      respective info functions, i.e.
+      <seealso marker="erts:erlang#process_info_memory"><c>erlang:process_info/2
+      </c></seealso>,
+      <seealso marker="erts:erlang#port_info_memory"><c>erlang:port_info/2
+      </c></seealso> and
+      <seealso marker="stdlib:ets#info/1"><c>ets:info/1</c></seealso>.
+    </p>
+    <p>Sometimes the system can enter a state where the reported memory
+      from <c>erlang:memory(total)</c> is very different from the
+      memory reported by the OS. This can be because of internal
+      fragmentation within the Erlang Run-Time System. Data about
+      how memory is allocated can be retrieved using
+      <seealso marker="erts:erlang#system_info_allocator">
+        <c>erlang:system_info(allocator)</c></seealso>.
+      The data you get from that function is very raw and not very pleasant to read.
+      <url href="http://ferd.github.io/recon/recon_alloc.html">recon_alloc</url>
+      can be used to extract useful information from system_info
+      statistics counters.</p>
   </section>
 
   <section>
@@ -80,6 +137,22 @@
       tools on the whole system. Instead you want to concentrate on
       central processes and modules, which contribute for a big part
       of the execution.</p>
+
+    <p>There are also some tools that can be used to get a view of the
+      whole system with more or less overhead.</p>
+    <list type="bulleted">
+      <item><seealso marker="observer:observer"><c>observer</c></seealso>
+      is a GUI tool that can connect to remote nodes and display a
+      variety of information about the running system.</item>
+      <item><seealso marker="observer:etop"><c>etop</c></seealso>
+      is a command line tool that can connect to remote nodes and
+      display information similar to what the UNIX tool top shows.</item>
+      <item><seealso marker="runtime_tools:msacc"><c>msacc</c></seealso>
+      allows the user to get a view of what the Erlang Run-Time System
+      is spending its time doing. It has very low overhead, which makes it
+      useful to run in heavily loaded systems to get some idea of where
+      to start doing more granular profiling.</item>
+    </list>
   </section>
 
   <section>
@@ -128,7 +201,7 @@
       performance impact. Using <c>fprof</c> is just a matter of
       calling a few library functions, see the
       <seealso marker="tools:fprof">fprof</seealso> manual page in
-      Tools .<c>fprof</c> was introduced in R8.</p>
+      Tools.</p>
     </section>
 
     <section>
@@ -142,20 +215,6 @@
     </section>
 
     <section>
-      <title>cover</title>
-      <p>The primary use of <c>cover</c> is coverage analysis to verify
-      test cases, making sure that all relevant code is covered.
-      <c>cover</c> counts how many times each executable line of code
-      is executed when a program is run, on a per module basis.</p>
-      <p>Clearly, this information can be used to determine what
-      code is run very frequently and can therefore be subject for
-      optimization. Using <c>cover</c> is just a matter of calling a
-      few library functions, see the
-      <seealso marker="tools:cover">cover</seealso> manual page in
-      Tools.</p>
-    </section>
-
-    <section>
       <title>cprof</title>
       <p><c>cprof</c> is something in between <c>fprof</c> and
       <c>cover</c> regarding features. It counts how many times each
@@ -202,16 +261,6 @@
           <cell>No</cell>
         </row>
         <row>
-          <cell><c>cover</c></cell>
-          <cell>Per module to screen/file</cell>
-          <cell>Small</cell>
-          <cell>Moderate slowdown</cell>
-          <cell>Yes, per line</cell>
-          <cell>No</cell>
-          <cell>No</cell>
-          <cell>No</cell>
-        </row>
-        <row>
           <cell><c>cprof</c></cell>
           <cell>Per module to caller</cell>
           <cell>Small</cell>
@@ -224,6 +273,37 @@
         <tcaption>Tool Summary</tcaption>
       </table>
     </section>
+
+    <section>
+      <title>dbg</title>
+      <p><c>dbg</c> is a generic Erlang trace tool. By using the
+      <c>timestamp</c> or <c>cpu_timestamp</c> options it can be used
+      as a precision instrument to profile how long a function
+      call takes for a specific process. This can be very useful when
+      trying to understand where time is spent in a heavily loaded
+      system as it is possible to limit the scope of what is profiled
+      to be very small.
+      For more information, see the
+      <seealso marker="runtime_tools:dbg">dbg</seealso> manual page in
+      Runtime Tools.</p>
+    </section>
+
+    <section>
+      <title>lcnt</title>
+      <p><c>lcnt</c> is used to profile interactions between
+        entities that run in parallel. For example, if you have
+        a process that all other processes in the system need
+        to interact with (maybe it has some global configuration),
+        then <c>lcnt</c> can be used to figure out if the interaction
+        with that process is a problem.</p>
+      <p>In the Erlang Run-Time System, entities are only run in parallel
+        when there are multiple schedulers. Therefore <c>lcnt</c> will
+        show more contention points (and thus be more useful) on systems
+        using many schedulers on many cores.</p>
+      <p>For more information, see the
+        <seealso marker="tools:lcnt">lcnt</seealso> manual page in Tools.</p>
+    </section>
+
   </section>
 
   <section>
@@ -282,4 +362,3 @@
     </list>
   </section>
 </chapter>
-
diff --git a/system/doc/efficiency_guide/retired_myths.xml b/system/doc/efficiency_guide/retired_myths.xml
new file mode 100644
index 0000000000..9b914a3b6e
--- /dev/null
+++ b/system/doc/efficiency_guide/retired_myths.xml
@@ -0,0 +1,63 @@
+<?xml version="1.0" encoding="utf-8" ?>
+<!DOCTYPE chapter SYSTEM "chapter.dtd">
+
+<chapter>
+  <header>
+    <copyright>
+      <year>2016</year>
+      <year>2017</year>
+      <holder>Ericsson AB, All Rights Reserved</holder>
+    </copyright>
+    <legalnotice>
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+
+  The Initial Developer of the Original Code is Ericsson AB.
+    </legalnotice>
+    <title>Retired Myths</title>
+    <prepared>Bjorn Gustavsson</prepared>
+    <docno></docno>
+    <date>2016-06-07</date>
+    <rev></rev>
+    <file>retired_myths.xml</file>
+  </header>
+
+  <p>We believe that the truth has finally caught up with the following
+  retired myths.</p>
+
+  <section>
+    <marker id="retired_myths"/>
+    <title>Myth: Funs are Slow</title>
+    <p>Funs used to be very slow, slower than <c>apply/3</c>.
+    Originally, funs were implemented using nothing more than
+    compiler trickery, ordinary tuples, <c>apply/3</c>, and a great
+    deal of ingenuity.</p>
+
+    <p>But that is history. Funs were given their own data type
+    in R6B and were further optimized in R7B.
+    Now the cost for a fun call falls roughly between the cost for a call
+    to a local function and <c>apply/3</c>.</p>
+  </section>
+
+  <section>
+    <title>Myth: List Comprehensions are Slow</title>
+
+    <p>List comprehensions used to be implemented using funs, and in the
+    old days funs were indeed slow.</p>
+
+    <p>Nowadays, the compiler rewrites list comprehensions into an ordinary
+    recursive function. Using a tail-recursive function with
+    a reverse at the end would still be faster. Or would it?
+    That leads us to the myth that tail-recursive functions are faster
+    than body-recursive functions.</p>
+  </section>
+</chapter>
diff --git a/system/doc/efficiency_guide/xmlfiles.mk b/system/doc/efficiency_guide/xmlfiles.mk
index 88df9417f5..23c0d991b4 100644
--- a/system/doc/efficiency_guide/xmlfiles.mk
+++ b/system/doc/efficiency_guide/xmlfiles.mk
@@ -29,5 +29,5 @@ EFF_GUIDE_CHAPTER_FILES = \
 	processes.xml \
 	profiling.xml \
 	tablesDatabases.xml \
-	drivers.xml
-
+	drivers.xml \
+	retired_myths.xml