Optimize tail-recursive calls of BIFs

BEAM currently does not call BIFs at the end of a function in a
tail-recursive way. That is, when calling a BIF at the end of a
function, the BIF is first called, and then the stack frame is
deallocated, and then control is transferred to the caller.

If there is no stack frame when a BIF is called in the tail position,
the loader will emit a sequence of three instructions: first an
instruction that allocates a stack frame and saves the continuation
pointer (`allocate`), then an instruction that calls the BIF
(`call_bif`), and lastly an instruction that deallocates the stack
frame and returns to the caller (`deallocate_return`).

The old compiler would essentially allocate a stack frame for each
clause in a function, so it was not that common for a BIF to be called
in the tail position when there was no stack frame, and the
three-instruction sequence was deemed acceptable. The new compiler
only allocates stack frames when truly needed, so the
three-instruction BIF call sequence has become much more common.

This commit introduces a new `call_bif_only` instruction so that only
one instruction will be needed when calling a BIF in the tail position
when there is no stack frame. This instruction is also used when there
is a stack frame to make it possible to deallocate the stack frame
**before** calling the BIF, which may make a subsequent garbage
collection at the end of the BIF call cheaper (copying less garbage).

The one downside of this change is that the function that called the
BIF will not be included in the stack backtrace (similar to how a
tail-recursive call to an Erlang function will not be included in the
backtrace).

That was the quick summary of the commit. Here comes a detailed look
at how BIF calls are translated by the loader.

The first example is a function that calls `setelement/3` in the tail
position:

    update_no_stackframe(X) ->
        setelement(5, X, new_value).

Here is the BEAM code:

    {function, update_no_stackframe, 1, 12}.
      {label,11}.
        {line,[...]}.
        {func_info,{atom,t},{atom,update_no_stackframe},1}.
      {label,12}.
        {move,{x,0},{x,1}}.
        {move,{atom,new_value},{x,2}}.
        {move,{integer,5},{x,0}}.
        {line,[...]}.
        {call_ext_only,3,{extfunc,erlang,setelement,3}}.

Because there is no stack frame, the `call_ext_only` instruction will
be used to call `setelement/3`:

    {call_ext_only,3,{extfunc,erlang,setelement,3}}.

The loader will transform this instruction to a three-instruction
sequence:

    0000000020BD8130: allocate_tt 0 3
    0000000020BD8138: call_bif_e erlang:setelement/3
    0000000020BD8148: deallocate_return_Q 0

Using the `call_bif_only` instruction introduced in this commit, only
one instruction is needed:

    000000005DC377F0: call_bif_only_e erlang:setelement/3

`call_bif_only` calls the BIF and returns to the caller.

Now let's look at a function that already has a stack frame when
`setelement/3` is called:

    update_with_stackframe(X) ->
        foobar(X),
        setelement(5, X, new_value).

Here is the BEAM code:

    {function, update_with_stackframe, 1, 14}.
      {label,13}.
        {line,[...]}.
        {func_info,{atom,t},{atom,update_with_stackframe},1}.
      {label,14}.
        {allocate,1,1}.
        {move,{x,0},{y,0}}.
        {line,[...]}.
        {call,1,{f,16}}.
        {move,{y,0},{x,1}}.
        {move,{atom,new_value},{x,2}}.
        {move,{integer,5},{x,0}}.
        {line,[...]}.
        {call_ext_last,3,{extfunc,erlang,setelement,3},1}.

Since there is a stack frame, the `call_ext_last` instruction will be
used to deallocate the stack frame and call the function:

    {call_ext_last,3,{extfunc,erlang,setelement,3},1}.

Before this commit, the loader would translate this instruction to:

    0000000020BD81B8: call_bif_e erlang:setelement/3
    0000000020BD81C8: deallocate_return_Q 1

That is, the BIF is called before deallocating the stack frame and
returning to the calling function.

After this commit, the loader will translate the `call_ext_last` like
this:

    000000005DC37868: deallocate_Q 1
    000000005DC37870: call_bif_only_e erlang:setelement/3

There are still two instructions, but now the stack frame will be
deallocated before calling the BIF, which could make the potential
garbage collection after the BIF call slightly more efficient (copying
less garbage).

We could have introduced a `call_bif_last` instruction, but the code
for calling a BIF is relatively large and there does not seem to be a
practical way to share the code between `call_bif` and `call_bif_only`
(since the difference is at the end, after the BIF call). Therefore,
we did not want to clone the BIF calling code yet another time to make
a `call_bif_last` instruction.

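The backtrace downside mentioned in the summary can be observed with a small module. The following is only a sketch: the module name `tail_bif_demo` and the function `run/0` are hypothetical and not part of this commit, and the exact contents of the stack trace depend on the emulator and compiler version in use.

```erlang
-module(tail_bif_demo).
-export([run/0, update_no_stackframe/1]).

%% Same shape as the first example in the commit message:
%% the BIF call is the last thing the function does.
update_no_stackframe(X) ->
    setelement(5, X, new_value).

%% Hypothetical helper: force a badarg (the tuple has fewer than
%% five elements) and return the stack trace for inspection.
%% The call below is inside a try, so run/0 itself keeps its frame.
run() ->
    try update_no_stackframe({a, b}) of
        _ -> no_error
    catch
        error:badarg:Stacktrace ->
            Stacktrace
    end.
```

Calling `tail_bif_demo:run()` in a shell returns the stack trace as a list. On an emulator without this change one would expect an entry for `update_no_stackframe/1` in that list; with `call_bif_only` that entry should be absent, just as it would be after a tail-recursive call to an ordinary Erlang function.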
author: Björn Gustavsson <[email protected]> 2019-03-07 10:00:39 +0100
committer: Björn Gustavsson <[email protected]> 2019-03-09 09:31:08 +0100
commit: 2d2e78ad6e667655560a67e848153dbb218914f7 (patch)
tree: f14140ca46e4eeaacf2895d59a12772f6f1dac33
parent: 3066a5bf51467d1f8f0b05cf7b7bab0ef6a17578 (diff)
5 files changed, 139 insertions, 28 deletions
diff --git a/erts/emulator/beam/bif.c b/erts/emulator/beam/bif.c
index 7faba35e1c..c102ddbee6 100644
--- a/erts/emulator/beam/bif.c
+++ b/erts/emulator/beam/bif.c
@@ -5165,6 +5165,12 @@ erts_schedule_bif(Process *proc,
 	    pc = i;
 	    mfa = &exp->info.mfa;
 	}
+	else if (BeamIsOpCode(*i, op_call_bif_only_e)) {
+	    /* Pointer to bif export in i+1 */
+	    exp = (Export *) i[1];
+	    pc = i;
+	    mfa = &exp->info.mfa;
+	}
 	else if (BeamIsOpCode(*i, op_apply_bif)) {
 	    /* Pointer to bif in i+1, and mfa in i-3 */	    
 	    pc = c_p->cp;
diff --git a/erts/emulator/beam/bif_instrs.tab b/erts/emulator/beam/bif_instrs.tab
index 8499f61114..8e0caa38a3 100644
--- a/erts/emulator/beam/bif_instrs.tab
+++ b/erts/emulator/beam/bif_instrs.tab
@@ -209,8 +209,8 @@ i_length.execute(Fail, Live, Dst) {
 }
 
 //
-// The most general BIF call.  The BIF may build any amount of data
-// on the heap.  The result is always returned in r(0).
+// Call a BIF, store the result in x(0) and transfer control to the
+// next instruction.
 //
 call_bif(Exp) {
     ErtsBifFunc bf;
@@ -219,8 +219,10 @@ call_bif(Exp) {
     Export *export = (Export*) $Exp;
 
     if (!((FCALLS - 1) > 0 || (FCALLS-1) > neg_o_reds)) {
-        /* If we have run out of reductions, we do a context
-           switch before calling the bif */
+        /*
+         * If we have run out of reductions, do a context
+         * switch before calling the BIF.
+         */
         c_p->arity = GET_BIF_ARITY(export);
         c_p->current = &export->info.mfa;
         goto context_switch3;
@@ -257,9 +259,12 @@ call_bif(Exp) {
     HTOP = HEAP_TOP(c_p);
     FCALLS = c_p->fcalls;
     ERTS_DBG_CHK_REDS(c_p, FCALLS);
-    /* We have to update the cache if we are enabled in order
-       to make sure no book keeping is done after we disabled
-       msacc. We don't always do this as it is quite expensive. */
+
+    /*
+     * We have to update the cache if we are enabled in order
+     * to make sure no bookkeeping is done after we disabled
+     * msacc. We don't always do this as it is quite expensive.
+     */
     if (ERTS_MSACC_IS_ENABLED_CACHED_X()) {
         ERTS_MSACC_UPDATE_CACHE_X();
     }
@@ -269,6 +274,12 @@ call_bif(Exp) {
         CHECK_TERM(r(0));
         $NEXT0();
     } else if (c_p->freason == TRAP) {
+        /*
+         * Set the continuation pointer to return to next
+         * instruction after the trap (either by a return from
+         * erlang code or by nif_bif.epilogue() when the BIF
+         * is done).
+         */
         SET_CP(c_p, $NEXT_INSTRUCTION);
         SET_I(c_p->i);
         SWAPIN;
@@ -281,6 +292,95 @@ call_bif(Exp) {
     ASSERT(c_p->stop == E);
     I = handle_error(c_p, I, reg, &export->info.mfa);
     goto post_error_handling;
+    //| -no_next
+}
+
+//
+// Call a BIF tail-recursively, storing the result in x(0) and doing
+// a return to the continuation pointer (c_p->cp).
+//
+
+call_bif_only(Exp) {
+    ErtsBifFunc bf;
+    Eterm result;
+    ErlHeapFragment *live_hf_end;
+    Export *export = (Export*) $Exp;
+
+    if (!((FCALLS - 1) > 0 || (FCALLS-1) > neg_o_reds)) {
+        /*
+         * If we have run out of reductions, do a context
+         * switch before calling the BIF.
+         */
+        c_p->arity = GET_BIF_ARITY(export);
+        c_p->current = &export->info.mfa;
+        goto context_switch3;
+    }
+
+    ERTS_MSACC_SET_BIF_STATE_CACHED_X(GET_BIF_MODULE(export),
+                                      GET_BIF_ADDRESS(export));
+
+    bf = GET_BIF_ADDRESS(export);
+
+    PRE_BIF_SWAPOUT(c_p);
+    ERTS_DBG_CHK_REDS(c_p, FCALLS);
+    c_p->fcalls = FCALLS - 1;
+    if (FCALLS <= 0) {
+        save_calls(c_p, export);
+    }
+    ASSERT(!ERTS_PROC_IS_EXITING(c_p));
+    ERTS_VERIFY_UNUSED_TEMP_ALLOC(c_p);
+    live_hf_end = c_p->mbuf;
+    ERTS_CHK_MBUF_SZ(c_p);
+    result = (*bf)(c_p, reg, I);
+    ERTS_CHK_MBUF_SZ(c_p);
+    ASSERT(!ERTS_PROC_IS_EXITING(c_p) || is_non_value(result));
+    ERTS_VERIFY_UNUSED_TEMP_ALLOC(c_p);
+    ERTS_HOLE_CHECK(c_p);
+    ERTS_REQ_PROC_MAIN_LOCK(c_p);
+    if (ERTS_IS_GC_DESIRED(c_p)) {
+        Uint arity = GET_BIF_ARITY(export);
+        result = erts_gc_after_bif_call_lhf(c_p, live_hf_end, result,
+                                            reg, arity);
+        E = c_p->stop;
+    }
+    PROCESS_MAIN_CHK_LOCKS(c_p);
+    HTOP = HEAP_TOP(c_p);
+    FCALLS = c_p->fcalls;
+    ERTS_DBG_CHK_REDS(c_p, FCALLS);
+
+    /*
+     * We have to update the cache if we are enabled in order
+     * to make sure no bookkeeping is done after we disabled
+     * msacc. We don't always do this as it is quite expensive.
+     */
+    if (ERTS_MSACC_IS_ENABLED_CACHED_X()) {
+        ERTS_MSACC_UPDATE_CACHE_X();
+    }
+    ERTS_MSACC_SET_STATE_CACHED_M_X(ERTS_MSACC_STATE_EMULATOR);
+    if (ERTS_LIKELY(is_value(result))) {
+        /*
+         * Success. Store the result and return to the caller.
+         */
+        r(0) = result;
+        CHECK_TERM(r(0));
+        $return();
+    } else if (c_p->freason == TRAP) {
+        /*
+         * Dispatch to a trap. When the trap is done, a jump
+         * to the continuation pointer (c_p->cp) will be done.
+         */
+        SET_I(c_p->i);
+        SWAPIN;
+        Dispatch();
+    }
+
+    /*
+     * Error handling.  SWAPOUT is not needed because it was done above.
+     */
+    ASSERT(c_p->stop == E);
+    I = handle_error(c_p, I, reg, &export->info.mfa);
+    goto post_error_handling;
+    //| -no_next
 }
 
 //
diff --git a/erts/emulator/beam/ops.tab b/erts/emulator/beam/ops.tab
index e688c6996b..da5364183c 100644
--- a/erts/emulator/beam/ops.tab
+++ b/erts/emulator/beam/ops.tab
@@ -74,23 +74,19 @@ trace_jump W
 
 return
 
+# To ensure that a "move Src x(0)" instruction can be combined with
+# the following call instruction, we need to make sure that there is
+# no line/1 instruction between the move and the call.
 #
-# To ensure that a "move Src x(0)" instruction can be combined
-# with the following call instruction, we need to make sure that
-# there is no line/1 instruction between the move and the call.
-#
-# A tail-recursive call to an external function (non-BIF) will
-# never be saved on the stack, so there is no reason to keep
-# the line instruction. (The compiler did not remove the line
-# instruction because it cannot tell the difference between
-# BIFs and ordinary Erlang functions.)
-#
+# A tail-recursive call to an external function (BIF or non-BIF) will
+# never be saved on the stack, so there is no reason to keep the line
+# instruction.
 
 move S X0=x==0 | line Loc | call_ext Ar Func => \
      line Loc | move S X0 | call_ext Ar Func
-move S X0=x==0 | line Loc | call_ext_last Ar Func=u$is_not_bif D => \
+move S X0=x==0 | line Loc | call_ext_last Ar Func D => \
      move S X0 | call_ext_last Ar Func D
-move S X0=x==0 | line Loc | call_ext_only Ar Func=u$is_not_bif => \
+move S X0=x==0 | line Loc | call_ext_only Ar Func => \
      move S X0 | call_ext_only Ar Func
 move S X0=x==0 | line Loc | call Ar Func => \
      line Loc | move S X0 | call Ar Func
@@ -102,9 +98,9 @@ line I
 allocate t t?
 allocate_heap t I t?
 
-%cold
+# This instruction is used when a BIF is called tail-recursively when
+# there is a stack frame.
 deallocate Q
-%hot
 
 init y
 allocate_zero t t?
@@ -985,10 +981,9 @@ call_ext_only u==0 u$func:os:perf_counter/0 => \
 
 call_ext u Bif=u$is_bif => call_bif Bif
 
-call_ext_last u Bif=u$is_bif D => call_bif Bif | deallocate_return D
+call_ext_last u Bif=u$is_bif D => deallocate D | call_bif_only Bif
 
-call_ext_only Ar=u Bif=u$is_bif => \
-  allocate u Ar | call_bif Bif | deallocate_return u
+call_ext_only Ar=u Bif=u$is_bif => call_bif_only Bif
 
 #
 # Any remaining calls are calls to Erlang functions, not BIFs.
@@ -1020,6 +1015,7 @@ i_perf_counter
 %hot
 
 call_bif e
+call_bif_only e
 
 #
 # Calls to non-building and guard BIFs.
diff --git a/erts/emulator/test/trace_local_SUITE.erl b/erts/emulator/test/trace_local_SUITE.erl
index 253d5fed23..ad802352b9 100644
--- a/erts/emulator/test/trace_local_SUITE.erl
+++ b/erts/emulator/test/trace_local_SUITE.erl
@@ -1181,7 +1181,9 @@ undef(X) ->
     ?MODULE:undef(X, X). % undef
 
 lists_reverse(A, B) ->
-    lists:reverse(A, B).
+    Res = lists:reverse(A, B),
+    _ = (catch abs(A)),
+    Res.
 
 
 
diff --git a/lib/compiler/test/beam_except_SUITE.erl b/lib/compiler/test/beam_except_SUITE.erl
index 9380fe06c8..8e3b373d29 100644
--- a/lib/compiler/test/beam_except_SUITE.erl
+++ b/lib/compiler/test/beam_except_SUITE.erl
@@ -84,9 +84,16 @@ coverage(_) ->
     {'EXIT',{function_clause,
 	     [{?MODULE,fc,[y],[File,{line,2}]}|_]}} =
 	(catch fc(y)),
-    {'EXIT',{function_clause,
-	     [{?MODULE,fc,[[a,b,c]],[File,{line,6}]}|_]}} =
-	(catch fc([a,b,c])),
+    case ?MODULE of
+        beam_except_no_opt_SUITE ->
+            %% There will be a different stack frame in
+            %% unoptimized code.
+            ok;
+        _ ->
+            {'EXIT',{function_clause,
+                     [{?MODULE,fc,[[a,b,c]],[File,{line,6}]}|_]}} =
+                (catch fc([a,b,c]))
+    end,
 
     {'EXIT',{undef,[{erlang,error,[a,b,c],_}|_]}} =
 	(catch erlang:error(a, b, c)),