Age | Commit message | Author |
|
Certain complex receive statements would result in an internal
compiler failure. That happened when the compiler failed to find
the common exit block following a receive. See the added test case
for an example.
https://bugs.erlang.org/browse/ERL-950
|
|
|
|
The expansion of record field updates, when more than one field is
updated, but not a majority of the fields, will create a sequence of
calls to `erlang:setelement(Index, Tuple, Value)` where Tuple in the
first call is the original record tuple, and in the subsequent calls
Tuple is the result of the previous call. Furthermore, all Index
values are constant positive integers, and the first call to
`setelement` will have the greatest index. Thus none of the following
calls actually need to test at run time whether Tuple is a tuple,
nor whether the index is within the tuple bounds.
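For illustration, consider a hypothetical record where two out of
five fields are updated:
-record(r, {a, b, c, d, e}).
update(R) ->
    R#r{d=4, e=5}.
%% Roughly what the expansion looks like; the first call has the
%% greatest index, so the later calls can never fail:
update_expanded(R) ->
    erlang:setelement(5, erlang:setelement(6, R, 5), 4).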
Since OTP R7, the `sys_core_dsetel` pass, run as the very last Core
Erlang pass, has optimized this sequence of `setelement` calls to use
a special destructive version of `setelement` (called
`set_tuple_element`) for all but the very first `setelement` in the
sequence.
It turns out that the presence of the `set_tuple_element` in SSA code
is awkward and can prevent or complicate type analysis and aggressive
optimizations.
Therefore, this commit removes the `sys_core_dsetel` pass and
reimplements it for SSA code. The optimization will be done in the
`beam_ssa_pre_codegen` pass (that is, just before code generation and
after running all other SSA code optimization passes).
In most cases, the resulting BEAM code is identical to previous
code. For a few modules, the BEAM code is actually slightly better,
with smaller stack frames.
|
|
Optimize ssa_opt_sink for huge functions
|
|
The ssa_opt_sink optimization of beam_ssa_opt could get very slow
for certain huge functions. 9a190cae9bd7 partly addressed this issue
by terminating the optimization early if there happened to be no
get_tuple_element instructions at all in the function.
This commit addresses the issue more directly by making the dominator
calculation in beam_ssa:dominators/1 more efficient. The same
algorithm as before is used, but it is implemented in a more efficient
way based on the ideas in "A Simple, Fast Dominance Algorithm"
(http://www.hipersoft.rice.edu/grads/publications/dom14.pdf).
As well as being more efficient, the new implementation also gives
an explicit representation of the dominator tree, which makes it
possible to simplify and optimize the ssa_opt_sink optimization.
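The core of that algorithm is an intersection walk up the dominator
tree. A minimal sketch in Erlang, assuming Doms maps each block to
its current immediate dominator and Rpo maps each block to its
reverse postorder number (lowest for the entry block):
intersect(B, B, _Doms, _Rpo) ->
    B;
intersect(B1, B2, Doms, Rpo) ->
    case maps:get(B1, Rpo) > maps:get(B2, Rpo) of
        true -> intersect(maps:get(B1, Doms), B2, Doms, Rpo);
        false -> intersect(B1, maps:get(B2, Doms), Doms, Rpo)
    end.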
|
|
Prior to 294d66a295f6c2101fe3c2da630979ad4e736c08 there wasn't much
point to keeping track of tuple element types; they were only known
when we had inserted or extracted values from a tuple, and in
neither case was it likely that we'd extract the same values again.
It makes a lot more sense to do so now that type optimizations are
applied across functions; if we return a tuple it's very likely
that its elements will be extracted soon after, and knowing their
types lets us eliminate more type checks.
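A hypothetical illustration of the kind of code that benefits: since
new_counter/0 is known to return a two-tuple whose second element is
an integer, the extraction and increment in bump/0 can avoid
redundant type tests:
new_counter() -> {counter, 0}.
bump() ->
    {counter, N} = new_counter(),
    {counter, N + 1}.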
Co-authored-by: Björn Gustavsson <[email protected]>
|
|
Avoiding calls usually reduces the size of the stack frame and reduces
register shuffling.
|
|
Reduce redundant moves and register shuffling
|
|
Apply type optimizations across local function calls
|
|
This commit lets the type optimization pass work across functions,
tracking return and argument types to eliminate redundant tests.
|
|
Consider this function and its corresponding BEAM code:
foo(Map, Key) ->
Val = case Map of
#{Key:=Val0} -> Val0;
_ -> default
end,
bar(1, 2, Val).
{label,2}.
{test,is_map,{f,3},[{x,0}]}.
{get_map_elements,{f,3},{x,0},{list,[{x,1},{x,0}]}}.
^^^^^
{jump,{f,4}}.
{label,3}.
{move,{atom,default},{x,0}}.
^^^^^
{label,4}.
{move,{integer,2},{x,1}}.
{move,{x,0},{x,2}}.
^^^^^
{move,{integer,1},{x,0}}.
{call_only,3,{f,6}}.
Note that the value of the variable `Val` will first be
placed in `{x,0}` and then moved to `{x,2}` where it needs
to be when calling the `bar/3` function.
The reason for the extra `move` instruction is that the
register allocator picks the lowest numbered available register
when choosing a register to put a variable in. In this case,
`{x,0}` will be chosen.
If we only could give a hint to the register allocator that
it would be better to put `Val` in `{x,2}`, the extra `move`
would disappear:
{label,2}.
{test,is_map,{f,3},[{x,0}]}.
{get_map_elements,{f,3},{x,0},{list,[{x,1},{x,2}]}}.
{jump,{f,4}}.
{label,3}.
{move,{atom,default},{x,2}}.
{label,4}.
{move,{integer,2},{x,1}}.
{move,{integer,1},{x,0}}.
{call_only,3,{f,6}}.
There is an existing sub pass (`reserve_regs`) in
`beam_ssa_pre_codegen` that, among other things, tries to give the
register allocator hints that some variables should be placed in
specific registers, if possible.
registers, if possible. However, the existing hinting mechanism is
limited, essentially only working within a single SSA block.
This commit extends the hinting mechanism, allowing hints to be passed
across SSA blocks, eliminating `move` instructions and register
shuffling in many places. (494 modules out of a sample of 1236 modules
were changed by this commit.)
|
|
Use lists:splitwith/2 instead of lists:partition/2 for splitting out
phi nodes. Since phi nodes are always the first instructions in a
block, the result will be the same, but splitwith/2 is faster.
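For illustration with plain data (not compiler code): when every
element satisfying the predicate comes first, both functions return
the same result, but splitwith/2 stops scanning at the first element
that fails the predicate, while partition/2 traverses the whole list:
1> lists:splitwith(fun(X) -> X < 0 end, [-2,-1,3,4]).
{[-2,-1],[3,4]}
2> lists:partition(fun(X) -> X < 0 end, [-2,-1,3,4]).
{[-2,-1],[3,4]}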
|
|
There is no easy way to convert xor or is_record/2 to test operations.
|
|
|
|
The following function:
is_two_tuple(Arg) ->
case is_tuple(Arg) of
false -> false;
true -> tuple_size(Arg) == 2
end.
would cause an internal consistency failure:
Internal consistency check failed - please report this bug.
Instruction: {bif,tuple_size,{f,0},[{x,0}],{z,0}}
Error: {invalid_store,{z,0},{integer,[]}}:
|
|
Enhance the copy_retval/1 optimization to allow Y registers to be
reused in more circumstances. Reusing a Y register can often reduce
the size of the stack frame.
|
|
The linear scan algorithm keeps unhandled intervals in two separate
lists: one for intervals with reserved registers and one for intervals
without reserved registers. When collecting the available free
registers, all unhandled intervals with reserved registers must be
checked for overlap.
Unhandled intervals that had a preferred register were put into the
list of intervals with reserved registers, leading to a lot of
unnecessary overlap checking if there were many intervals with
preferred registers. Changing the partition code to put intervals with
preferred registers into the general list of unhandled intervals
reduces the compilation time when there are many preferred registers.
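A minimal sketch of the changed partitioning, using a hypothetical
interval representation (not the actual compiler data structures):
only intervals whose register is truly reserved end up in the list
that must be scanned for overlap.
partition_unhandled(Intervals) ->
    %% Intervals with a merely preferred register fall into the
    %% general unhandled list (the second element of the result).
    lists:partition(fun({_Var, _Range, {reserved, _Reg}}) -> true;
                       ({_Var, _Range, _Hint}) -> false
                    end, Intervals).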
On my computer, this change reduced the time of the linear scan pass
from about 20 seconds down to about 0.5 seconds for this module:
https://github.com/aggelgian/cuter/blob/master/src/cuter_binlib.erl
Noticed-by: Kostis Sagonas
|
|
Fixes a crash during code generation of the following code:
call_atom() ->
fun({send = Send}) ->
Send()
end.
|
|
* bjorn/compiler/fix-r21-option:
Fix code generation of binary instructions with the r21 option
|
|
Minor cleanups and bug fixes of the compiler
|
|
OTP 22 extends the binary instructions to support a Y register
destination. When an option to compile for an earlier release is
given, make sure that binary instructions don't use a Y register
destination by rewriting them to use an X register destination and
adding a `move` instruction that moves the value to the Y register.
|
|
Rewrite BSM optimizations in the new SSA-based intermediate format
|
|
Remove the variable aliasing support that was needed for the
old beam_bsm pass.
|
|
This commit improves the bit-syntax match optimization pass,
leveraging the new SSA intermediate format to perform much more
aggressive optimizations. Some highlights:
* Match contexts can be reused even after being passed to a
function or being used in a try block.
* Sub-binaries are no longer eagerly extracted, making it far
easier to keep "happy paths" free from binary creation.
* Trivial wrapper functions no longer disable context reuse.
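A sketch (with hypothetical functions) of the kind of code that can
now keep its happy path free from binary creation: the match context
for the binary can be reused after being passed to skip_ws/1, so no
sub-binary needs to be built for the remainder:
skip_ws(<<$\s, Rest/binary>>) -> skip_ws(Rest);
skip_ws(Bin) -> Bin.
next_char(Bin0) ->
    case skip_ws(Bin0) of
        <<C, Rest/binary>> -> {C, Rest};
        <<>> -> done
    end.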
|
|
The source map for `get_map_element` is not supposed to be a literal,
and the sanitization code is supposed to ensure that.
The sanitization incorrectly translated this code:
Map = get_tuple_element literal {ok,#{key=>value}}, literal 1
Var = get_map_element Map, literal {a,key}
To this code:
Var = get_map_element literal #{key=>value}, literal {a,key}
Make sure to substitute the arguments for `get_map_element` before
looking for a literal source map.
|
|
|
|
We chose to refer to variables through their var_name() because we
anticipated the need to annotate them, but it turned out we didn't
really need that, and many things become a lot cleaner if the
entire #b_var{} is used to represent variables.
|
|
When creating a phi node for the common exit block of a receive,
the code failed to take into account that there could be more
than one predecessor to the exit block for each remove_message.
Rename exit_predessor/3 to exit_predessors/3 and make it return a
list of the predecessors.
|
|
Sometimes when building a tuple, there is no way to avoid an
extra `move` instruction. Consider this code:
make_tuple(A) -> {ok,A}.
The corresponding BEAM code looks like this:
{test_heap,3,1}.
{put_tuple,2,{x,1}}.
{put,{atom,ok}}.
{put,{x,0}}.
{move,{x,1},{x,0}}.
return.
To avoid overwriting the source register `{x,0}`, a `move`
instruction is necessary.
The problem doesn't exist when building a list:
%% build_list(A) -> [A].
{test_heap,2,1}.
{put_list,{x,0},nil,{x,0}}.
return.
Introduce a new `put_tuple2` instruction that builds a tuple in a
single instruction, so that the `move` instruction can be eliminated:
%% make_tuple(A) -> {ok,A}.
{test_heap,3,1}.
{put_tuple2,{x,0},{list,[{atom,ok},{x,0}]}}.
return.
Note that the BEAM loader already combines `put_tuple` and `put`
instructions into an internal instruction similar to `put_tuple2`.
Therefore the introduction of the new instruction will not speed up
execution of tuple building itself, but it will be less work for
the loader to load the new instruction.
|
|
v3_codegen is replaced by three new passes:
* beam_kernel_to_ssa which translates the Kernel Erlang format
to a new SSA-based intermediate format.
* beam_ssa_pre_codegen which prepares the SSA-based format
for code generation, including register allocation. Registers
are allocated using the linear scan algorithm.
* beam_ssa_codegen which generates BEAM assembly code from the
SSA-based format.
It is easier and more effective to optimize the SSA-based format before
X and Y registers have been assigned. The current optimization passes
constantly have to make sure that no "holes" in the X register
assignments are created (that is, that no X register that an
allocation instruction depends on becomes undefined).
This commit also introduces the following optimizations:
* Replacement of tuple matching of records with the is_tagged_tuple
instruction; see the sketch after this list. (Replacing beam_record.)
* Sinking of get_tuple_element instructions to just before the first
use of the extracted values. As well as potentially avoiding
extraction of tuple elements that are not actually used on all
execution paths, this optimization can also reduce the number of
values that need to be stored in Y registers. (Similar to
beam_reorder, but more effective.)
* Live optimizations, removing the definition of a variable that is
not subsequently used (provided that the operation has no side
effects), as well as strength reduction of binary matching by
replacing the extraction of a value from a binary with a skip
instruction. (Used to be done by beam_block, beam_utils, and
v3_codegen.)
* Removal of redundant bs_restore2 instructions. (Formerly done
by beam_bs.)
* Type-based optimizations across branches. More effective than
the old beam_type pass that only did type-based optimizations in
basic blocks.
* Optimization of floating point instructions. (Formerly done
by beam_type.)
* Optimization of receive statements to introduce recv_mark and
recv_set instructions. More effective with far fewer restrictions
on what instructions are allowed between creating the reference
and entering the receive statement.
* Common subexpression elimination. (Formerly done by beam_block.)
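A sketch of the record-matching case mentioned in the first bullet
above, using a hypothetical record: matching `#rec{}` can now be done
with a single is_tagged_tuple test instead of separate tuple, arity,
and tag checks:
-record(rec, {a, b}).
get_a(#rec{a=A}) -> A.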
|