Age | Commit message | Author |
|
Sort sequences of `move` instructions on the Y register.
When moving from X registers to Y registers, having the instructions
sorted on Y registers gives the loader more opportunities to use
`move_window{3,4,5}` instructions. For example, the following five
instructions:
move_xy x(2) y(0)
move_xy x(1) y(1)
move_xy x(0) y(2)
move_xy x(5) y(3)
move_xy x(4) y(4)
can be replaced with:
move_window5_xxxxxy x(2) x(1) x(0) x(5) x(4) y(0)
When the Y registers are not ordered so that `move_window5` can be
used, the loader would typically combine the first three moves to a
`move3_xyxyxy` instruction and the last two moves to a
`move2_par_xyxy` instruction.
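For illustration, a function in which several values must survive a call
will give rise to such a sequence of X-to-Y moves when the values are
saved to the stack (hypothetical example, not taken from actual compiler
output):
save_all(A, B, C, D, E) ->
    ok = io:format("~p~n", [A]),
    {A, B, C, D, E}.
Here all five arguments are live across the call to io:format/2, so each
of them is moved from an X register to a Y register before the call.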
When moving from Y registers to X registers, sorting on the Y
registers could potentially be more cache-friendly. It could also
be worthwhile investigating a new `move_window` instruction in
the BEAM interpreter that could move values from contiguous Y registers
to X registers.
Note that `scripts/diffable` can generate diffable disassembly files for
the loaded BEAM code:
$ scripts/diffable --dis 0
$ scripts/diffable --dis 1
$ diff -u 0 1
|
|
Moving this optimization out of beam_block makes beam_block do one thing
and one thing only -- creating blocks.
|
|
Most of the optimizations in beam_dead have been superseded
by the optimizations in beam_ssa_dead.
The forward/1 pass of beam_dead has been moved to beam_jump.
The beam_split pass splits blocks that contain instructions with
non-zero labels. Because there are no optimizations left that optimize
instructions within blocks, beam_block never needs to put such
instructions into blocks in the first place. beam_split also moved
'move' instructions out of blocks to help beam_dead. That is no longer
necessary since beam_dead no longer exists.
|
|
Sometimes when building a tuple, there is no way to avoid an
extra `move` instruction. Consider this code:
make_tuple(A) -> {ok,A}.
The corresponding BEAM code looks like this:
{test_heap,3,1}.
{put_tuple,2,{x,1}}.
{put,{atom,ok}}.
{put,{x,0}}.
{move,{x,1},{x,0}}.
return.
To avoid overwriting the source register `{x,0}`, a `move`
instruction is necessary.
The problem doesn't exist when building a list:
%% build_list(A) -> [A].
{test_heap,2,1}.
{put_list,{x,0},nil,{x,0}}.
return.
Introduce a new `put_tuple2` instruction that builds a tuple in a
single instruction, so that the `move` instruction can be eliminated:
%% make_tuple(A) -> {ok,A}.
{test_heap,3,1}.
{put_tuple2,{x,0},{list,[{atom,ok},{x,0}]}}.
return.
Note that the BEAM loader already combines `put_tuple` and `put`
instructions into an internal instruction similar to `put_tuple2`.
Therefore the introduction of the new instruction will not speed up
execution of tuple building itself, but it will be less work for
the loader to load the new instruction.
|
|
v3_codegen is replaced by three new passes:
* beam_kernel_to_ssa which translates the Kernel Erlang format
to a new SSA-based intermediate format.
* beam_ssa_pre_codegen which prepares the SSA-based format
for code generation, including register allocation. Registers
are allocated using the linear scan algorithm.
* beam_ssa_codegen which generates BEAM assembly code from the
SSA-based format.
It is easier and more effective to optimize the SSA-based format before X
and Y registers have been assigned. The current optimization passes
constantly have to make sure no "holes" in the X register assignments
are created (that is, that no X register becomes undefined that an
allocation instruction depends on).
This commit also introduces the following optimizations:
* Replacement of tuple matching of records with the is_tagged_tuple
instruction. (Replacing beam_record.) See the schematic example
after this list.
* Sinking of get_tuple_element instructions to just before the first
use of the extracted values. As well as potentially avoiding
extracting tuple elements when they are not actually used on all
execution paths, this optimization could also reduce the number of
values that will need to be stored in Y registers. (Similar to
beam_reorder, but more effective.)
* Live optimizations, removing the definition of a variable that is
not subsequently used (provided that the operation has no side
effects), as well as strength reduction of binary matching by replacing
the extraction of a value from a binary with a skip instruction. (Used
to be done by beam_block, beam_utils, and v3_codegen.)
* Removal of redundant bs_restore2 instructions. (Formerly done
by beam_bs.)
* Type-based optimizations across branches. More effective than
the old beam_type pass that only did type-based optimizations in
basic blocks.
* Optimization of floating point instructions. (Formerly done
by beam_type.)
* Optimization of receive statements to introduce recv_mark and
recv_set instructions. More effective with far fewer restrictions
on what instructions are allowed between creating the reference
and entering the receive statement.
* Common subexpression elimination. (Formerly done by beam_block.)
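As a schematic illustration of the is_tagged_tuple optimization mentioned
above (hand-written code, not actual compiler output):
%% -record(point, {x,y}).
%% is_point(#point{}) -> true;
%% is_point(_) -> false.
Instead of emitting separate is_tuple and test_arity tests followed by a
comparison of the record tag, the matching of #point{} can be done with a
single test instruction:
{test,is_tagged_tuple,{f,3},[{x,0},3,{atom,point}]}.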
|
|
As a preparation for replacing v3_codegen with a new code generator,
remove unsafe optimization passes. Especially the older compiler
passes have implicit assumptions about how the code is generated.
Remove the optimizations in beam_block (keep the code that creates
blocks) because they are unsafe. beam_block also calls
beam_utils:live_opt/1, which is unsafe.
Remove beam_type because it calls beam_utils:live_opt/1, and also
because it recalculates the number of heap words and the number of live
registers in allocation instructions, thus potentially hiding bugs in
other passes.
Remove beam_receive because it is unsafe.
Remove beam_record because it is the only remaining user
of beam_utils:anno_defs/1.
Remove beam_reorder because it makes much more sense to run it
as an early SSA-based optimization pass.
Remove the now unused functions in beam_utils:
anno_def/1
delete_annos/1
is_killed_block/2
live_opt/1
usage/3
Note that the following test cases will fail because of the
removed optimizations:
compile_SUITE:optimized_guards/1
compile_SUITE:bc_options/1
receive_SUITE:ref_opt/1
|
|
|
|
In the following code:
{get_tuple_element,{x,0},0,{x,1}}.
{put_tuple,2,{x,1}}.
{put,{atom,badmap}}.
{put,{x,0}}.
{move,{x,1},{x,0}}.
beam_block would move the get_tuple_element/3 instruction and eliminate the
move/2 instruction:
{put_tuple,2,{x,1}}.
{put,{atom,badmap}}.
{put,{x,0}}.
{get_tuple_element,{x,0},0,{x,0}}.
That is not correct, since the result of the tuple building in {x,1} is
now ignored.
|
|
1a029efd1ad47f started to run the beam_block pass a second time,
but it did not attempt to combine adjacent blocks.
Combining adjacent blocks leads to many more opportunities for
optimizations.
After doing some diffing of the generated code, it turns out that
there is no benefit for beam_split to split out line instructions
from blocks. It seems that the only reason it was done was to
slightly simplify the implementation of the no_line_info option
in beam_clean.
|
|
As a preparation for combining blocks before running beam_block
for the second time, disable CSE for floating point operations
because it will generate invalid code.
|
|
The more aggressive optimizations of 'allocate_zero' introduced
in cb6fc15c35c7e could produce unsafe code such as the following:
{allocate,0,1}.
{bif,element,{f,0},[{integer,1},{x,0}],{x,0}}.
The code is not safe because if element/2 fails, the runtime
system may scan the stack and find garbage that looks like a
catch tag, and would most probably crash.
Fix the problem by making beam_utils:is_killed/3 be more conservative
when asked whether a Y register will be killed.
Also fix an unsafe move upwards of an allocation instruction
in beam_block.
|
|
Eliminate get_list/3 internally in the compiler
|
|
1a029efd1ad47f started to run the beam_block pass a second time.
Since it is run after introduction of the optimized floating point
instructions, it must handle those instructions correctly.
In particular, it must be careful when hoisting allocation
instructions. For example, the following code:
{test_heap,{alloc,[{words,0},{floats,1}]},5}.
.
.
.
{fmove,{fr,2},{x,0}}.
{allocate_zero,1,4}.
must not be rewritten to:
{test_heap,{alloc,[{words,0},{floats,1}]},5}.
.
.
.
{allocate_zero,1,4}.
{fmove,{fr,2},{x,0}}.
because beam_validator will not consider it safe. (The code may
actually be safe depending on what the code between the two allocation
instructions does.)
https://bugs.erlang.org/browse/ERL-555
|
|
Instructions that produce more than one result complicate
optimizations. get_list/3 is one of two instructions that
produce multiple results (get_map_elements/3 is the other).
Introduce the get_hd/2 and get_tl/2 instructions
that return the head and tail of a cons cell, respectively,
and use them internally in all optimization passes.
For efficiency, we still want to use get_list/3 if both
head and tail are used, so we will translate matching pairs
of get_hd and get_tl back to get_list instructions.
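For example (schematic, hand-written BEAM code): when only the head of a
list is used, a single get_hd instruction suffices:
%% only_head([H|_]) -> H.
{test,is_nonempty_list,{f,3},[{x,0}]}.
{get_hd,{x,0},{x,0}}.
return.
If the tail were also used, the matching get_hd and get_tl instructions
would be translated back to a single get_list instruction.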
|
|
Eliminate repeated evaluation of guard BIFs and building of cons cells
in blocks. This optimization is applicable in more places than might be
expected, because code generation for binaries and records can generate
common subexpressions not visible in the original source code.
For example, consider this function:
make_binary(Term) ->
Bin = term_to_binary(Term),
Size = byte_size(Bin),
<<Size:32,Bin/binary>>.
The compiler inserts a call to byte_size/1 to calculate the size of
the binary being built:
{function, make_binary, 1, 2}.
{label,1}.
{line,...}.
{func_info,{atom,t},{atom,make_binary},1}.
{label,2}.
{allocate,0,1}.
{line,...}.
{call_ext,1,{extfunc,erlang,term_to_binary,1}}.
{line,...}.
{gc_bif,byte_size,{f,0},1,[{x,0}],{x,1}}. %Present in original code.
{line,...}.
{gc_bif,byte_size,{f,0},2,[{x,0}],{x,2}}. %Inserted by compiler.
{bs_add,{f,0},[{x,2},{integer,4},1],{x,2}}.
{bs_init2,{f,0},{x,2},0,2,{field_flags,[]},{x,2}}.
{bs_put_integer,{f,0},{integer,32},1,{field_flags,[unsigned,big]},{x,1}}.
{bs_put_binary,{f,0},{atom,all},8,{field_flags,[unsigned,big]},{x,0}}.
{move,{x,2},{x,0}}.
{deallocate,0}.
return.
Common subexpression elimination (CSE) eliminates the second call to
byte_size/1:
{function, make_binary, 1, 2}.
{label,1}.
{line,...}.
{func_info,{atom,t},{atom,make_binary},1}.
{label,2}.
{allocate,0,1}.
{line,...}.
{call_ext,1,{extfunc,erlang,term_to_binary,1}}.
{line,...}.
{gc_bif,byte_size,{f,0},1,[{x,0}],{x,1}}.
{move,{x,1},{x,2}}.
{bs_add,{f,0},[{x,2},{integer,4},1],{x,2}}.
{bs_init2,{f,0},{x,2},0,2,{field_flags,[]},{x,2}}.
{bs_put_integer,{f,0},{integer,32},1,{field_flags,[unsigned,big]},{x,1}}.
{bs_put_binary,{f,0},{atom,all},8,{field_flags,[unsigned,big]},{x,0}}.
{move,{x,2},{x,0}}.
{deallocate,0}.
return.
Note: A possible future optimization would be to include binary
construction instructions in blocks. If that is done, the
{move,{x,1},{x,2}} instruction could also be eliminated.
|
|
The following sequence in a block:
{move,{x,1},{x,2}}.
{move,{x,2},{x,2}}.
would be incorrectly rewritten to:
{move,{x,2},{x,2}}.
(Which in turn would be optimized away a little bit later.)
|
|
When attempting to eliminate the move/2 instruction in the following
code:
{bif,self,{f,0},[],{x,0}}.
{move,{x,0},{x,1}}.
.
.
.
{put_tuple,2,{x,1}}.
{put,{atom,ok}}.
{put,{x,0}}.
beam_block would produce the following unsafe code:
{bif,self,{f,0},[],{x,1}}.
.
.
.
{put_tuple,2,{x,1}}.
{put,{atom,ok}}.
{put,{x,1}}.
It is unsafe because the tuple is self-referential.
The following code:
{put_list,{y,6},nil,{x,4}}.
{move,{x,4},{x,5}}.
{put_list,{y,1},{x,5},{x,5}}.
.
.
.
{put_tuple,2,{x,6}}.
{put,{x,4}}.
{put,{x,5}}.
would be incorrectly transformed to:
{put_list,{y,6},nil,{x,5}}.
{put_list,{y,1},{x,5},{x,5}}.
.
.
.
{put_tuple,2,{x,6}}.
{put,{x,5}}.
{put,{x,5}}.
(Both elements in the built tuple get the same value.)
|
|
Run beam_block a second time
|
|
Running beam_block again after the other optimizations have run will
give it more opportunities for optimizations. In particular, more
allocate_zero/2 instructions can be turned into allocate/2
instructions, and more get_tuple_element/3 instructions can store the
retrieved value into the correct register at once.
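Schematically (hand-written example): an instruction such as
{allocate_zero,1,2}.
allocates a stack frame with one slot and initializes {y,0} to NIL, while
{allocate,1,2}.
leaves {y,0} uninitialized. The cheaper allocate variant can be used
whenever {y,0} is certain to be initialized before anything can trigger a
garbage collection or examine the stack.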
Out of a sample of about 700 modules in OTP, 64 modules were improved
by this commit.
|
|
In a guard, reorder two consecutive calls to the element/2 BIF that
access the same tuple and have the same failure label, so that the
highest index is fetched first. That will allow the second element/2
call to be replaced with the slightly cheaper get_tuple_element/3
instruction.
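For example (schematic, source-level view): in a guard such as
f(T) when element(1, T) =:= tag, element(3, T) > 0 -> ok.
fetching element(3, T) first means that the later access of element 1 is
already known to be safe, so it can be done with the checkless
get_tuple_element/3 instruction.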
|
|
The annotations in the optimizing passes currently look like this:
{'%live',NumRegistersUsed,RegistersUsedBitmap}
{'%def',RegistersDefinedBitmap}
(NumRegistersUsed is no longer used.)
When I attempted to extend some optimizations, I found that I had to
add additional clauses to tolerate/handle both types of
annotations. That problem would only get worse if any more annotations
are added in the future.
To simplify annotation handling, this commit wraps both types of
annotations in a {'%anno',_} tuple:
{'%anno',{used,RegistersUsedBitmap}}
{'%anno',{def,RegistersDefinedBitmap}}
The '%live' annotation has been renamed to 'used' to make it somewhat
clearer what it means, and the unused NumRegistersUsed part of the
old annotation has been removed.
Alternatives considered: My first attempt was to wrap the annotation
in a 'set' tuple so that there would only be 'set' tuples in a block.
For example:
{set,[],[],{anno,{live,RegistersUsedBitmap}}}
It was not as convenient as expected. Annotations often need to be
handled specially from other instructions in a block. When they are
wrapped in a 'set' tuple, they can very easily be handled incorrectly
or passed on to the next pass. That causes subtle errors or worse
code, and it can be difficult to debug.
Therefore, my conclusion is that annotations should be distinct from
other instructions, to make it obvious when one has missed handling
an annotation.
|
|
Turn more allocate_zero instructions into allocate instructions.
|
|
Delay creation of stack frames
|
|
Use annotations added by beam_utils:anno_defs/1 to move more
allocations upwards in the instruction stream. That in turn
allows us to optimize away more 'move' instructions.
|
|
|
|
The loader has a lot of fused instructions that include move S x0.
Placing such move instructions at the end of blocks makes it possible
to take advantage of these fused instructions more frequently.
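Schematic example (the fused instruction names are internal to the loader
and are an assumption here): a block that ends with
move y(0) => x(0)
immediately followed by a return instruction can be fused by the loader
into a single move_return instruction. Such fusion is only possible when
the move is the last instruction in its block.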
|
|
|
|
c2035ebb8b restricted the get_map_elements instruction so that it
could only occur at the beginning of a block. It turns out that
including it anywhere in a block is unsafe.
Therefore, never put get_map_elements instructions in blocks.
(Also remove the beam_utils:join_even/2 function since it is no
longer used.)
ERL-266
|
|
beam_block has an optimization that only is safe when it is applied
immediately after code generation. That is pointed out in a comment:
NOTE: Moving allocation instructions is only safe because it is done
immediately after code generation so that we KNOW that if {x,X} is
initialized, all x registers with lower numbers are also initialized.
That assumption may not be true after other optimizations, such as
the beam_utils:live_opt/1 optimization.
The new beam_reorder pass added in OTP 19 runs before beam_block.
Therefore, the optimization is potentially unsafe. The optimization
is also unsafe if compilation is started from assembly code in a
.S file.
Rewrite the optimization to make it safe. See the newly added comment
for details.
ERL-202
|
|
Somewhat simplified, beam_block would rewrite the target for
the first instruction in this code sequence:
move x(0) => y(1)
gc_bif '+' 1 x(0) => y(0)
move y(1) => x(1)
move nil => x(0)
call 2 local_function/2
The resulting code would be:
move x(0) => x(1) %% Changed target.
gc_bif '+' 1 x(0) => y(0)
move x(1) => y(1) %% Operands swapped (see 02d6135813).
move nil => x(0)
call 2 local_function/2
The resulting code is not safe because x(1) will be killed
by the gc_bif instruction.
7a47b20c3a cleaned up move optimizations and would reject the
optimization if the target was an X register and an allocating
instruction was found. To avoid this bug, the optimization must be
rejected even if the target is a Y register.
|
|
* henrik/update-copyrightyear:
update copyright-year
|
|
Remove the unreachable instructions after a 'raise' instruction
(e.g. a 'jump', 'deallocate', or 'return') to decrease code size.
|
|
|
|
Consider this code:
%% Start of block
get_tuple_element Tuple 0 Element
get_map_elements Fail Map [Key => Dest]
.
.
.
move Element UltimateDest
%% End of block
Fail:
%% Code that uses Element.
beam_block (more precisely, opt_tuple_element/1) would
incorrectly transform the code to this:
%% Start of block
get_map_elements Fail Map [Key => Dest]
.
.
.
get_tuple_element Tuple 0 UltimateDest
%% End of block
Fail:
%% Code that uses Element.
That is, the code at label Fail would use register Element,
which is either uninitialized or contains the wrong value.
We could fix this problem by always keeping label information
at hand when optimizing blocks so that we could check the code
at the failure label for get_map_elements. That would require
changes to beam_block and beam_utils. We might consider doing
that in the future if it turns out to be worth it.
For now, I have decided that I want to keep the simplicity of blocks
(allowing them to be optimized without keeping label information).
That could be achieved by not including get_map_elements in
blocks. Another way, which I have chosen, is to only allow
get_map_elements as the first instruction in the block.
For background on the bug: c288ab8 introduced the beam_reorder pass
and 5f431276 introduced opt_tuple_element() in beam_block.
|
|
There is an optimization in beam_block to simplify a select_val
on a known boolean value. We can implement this optimization
in a cleaner way in beam_type and it will also be applicable
in more situations. (When I added the optimization to beam_type
without removing the optimization from beam_block, the optimization
was applied 66 times.)
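Schematically (hand-written example): if {x,0} is known to contain a
boolean, an instruction such as
{select_val,{x,0},{f,1},{list,[{atom,true},{f,2},{atom,false},{f,3}]}}.
can be replaced with a two-way test:
{test,is_eq_exact,{f,3},[{x,0},{atom,true}]}.
{jump,{f,2}}.
since the failure label {f,1} can never be reached.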
|
|
In the future we might want to add more bit syntax optimizations,
but beam_block is already sufficiently complicated. Therefore, move
the bit syntax optimizations out of beam_block into a separate
compiler pass called beam_bs.
|
|
d0784035ab fixed a problem with register corruption. Because of
that, opt_moves/2 will never be asked to optimize instructions with
more than two destination registers. Therefore, to regain full
coverage of beam_block, remove the final clause in opt_moves/2.
|
|
|
|
The get_map_elements instruction might destroy target registers when the fail label is taken.
Only seen for patterns with two, and only two, target registers.
Specifically: we copy one register, and then jump.
foo(A,#{a := V1, b := V2}) -> ...
foo(A,#{b := V}) -> ...
call foo(value, #{a=>whops, c=>42}).
corresponding assembler:
{test,is_map,{f,5},[{x,1}]}.
{get_map_elements,{f,7},{x,1},{list,[{atom,a},{x,1},{atom,b},{x,2}]}}.
%% if 'a' exists but not 'b' {x,1} is overwritten, jump {f,7}
{move,{integer,1},{x,0}}.
{call_only,3,{f,10}}.
{label,7}.
{get_map_elements,{f,8},{x,1},{list,[{atom,b},{x,2}]}}.
%% {x,1} (src) is read with a corrupt value
{move,{x,0},{x,1}}.
{move,{integer,2},{x,0}}.
{call_only,3,{f,10}}.
The fix is to not run the 'opt_moves' pass on the get_map_elements
instruction when there are two or more destinations.
Reported-by: Valery Tikhonov
|
|
In 45f469ca0890, the BEAM loader started to use x(1023) as scratch
register for some instructions. Therefore we should not allow
x(1023) to be used in code emitted by the compiler.
|
|
Put 'try' instructions inside blocks to improve the optimization
of allocation instructions. Currently, the compiler only looks
at initialization of y registers inside blocks when determining
which y registers will be "naturally" initialized.
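Schematic example (hand-written): the instruction
{'try',{y,0},{f,3}}.
stores a catch tag in {y,0} and thereby initializes it. With the 'try'
instruction inside a block, the allocation optimization can see that
{y,0} does not have to be cleared by an allocate_zero instruction.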
|
|
|
|
Here is an example of a move instruction that could not be optimized
away because the {x,2} register was not killed:
get_tuple_element Reg Pos {x,2}
.
.
.
move {x,2} {y,0}
put_list {x,2} nil Dst
We can do the optimization if we replace all occurrences of the {x,2}
register as a source with {y,0}:
get_tuple_element Reg Pos {y,0}
.
.
.
put_list {y,0} nil Dst
|
|
The 'move' optimization was relatively clean until GC BIFs
were introduced. Instead of re-thinking the implementation,
the existing code was fixed and patched.
The current code unsuccessfully attempts to eliminate 'move'
instructions across GC BIF and allocation instructions. We can
simplify the code if we give up as soon as we encounter any
instruction that allocates.
|
|
opt_alloc/1 makes a redundant call to opt/1. It is redundant because
the opt/1 function has already been applied to the instruction
sequence prior to calling opt_alloc/1.
|
|
|
|
Commit b76588fb5a introduced an optimization of the compile time of
huge functions with many bs_match_string instructions. The
optimization is done in two passes. The first pass coalesces adjacent
bs_match_string instructions. To avoid copying bitstrings multiple
times, the bitstrings in the instructions are combined into a (deep)
list. The second pass goes through all instructions in the function
and combines the list of bitstrings to a single bitstring in all
bs_match_string instructions.
The second pass (fix_bs_match_string) is run on all instructions in
each function, even if there are no bs_match_string instructions in the
function. While fix_bs_match_string is not a bottleneck (it is a
linear pass), its execution time is noticeable when profiling some
modules.
Move the execution of the second pass to the select_binary()
function so that it will only be executed for instructions that
do binary matching. Also take the opportunity to optimize away
uses of bs_restore2 that occur directly after a bs_save2. That
optimization is currently done in beam_block, but it can be
done essentially for free in the same pass that fixes up
bs_match_string instructions.
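Schematically (hand-written example), in a sequence such as
{bs_save2,{x,0},1}.
{bs_restore2,{x,0},1}.
the bs_restore2 instruction is redundant and can be removed, since the
match position it restores was saved by the immediately preceding
instruction.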
|
|
Optimize away 'not' in sys_core_fold instead of in beam_block
and beam_dead, as we can do a better job in sys_core_fold.
I modified the test suite temporarily to never turn off Core Erlang
modifications and looked at the coverage. With the new optimizations
active in sys_core_fold, the code in beam_block and beam_dead did not
find a single 'not' that it could optimize. That proves that the new
optimization is at least as good as the old one. Manually, I could
also verify that the new optimization would optimize some variations
of 'not' that the old one would not handle.
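Schematically (source-level view; the actual rewrite is done on Core
Erlang), an expression such as
case not Bool of
    true -> a;
    false -> b
end
is rewritten to
case Bool of
    false -> a;
    true -> b
end
so that the call to 'not' disappears altogether.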
|
|
As a preparation for fixing a bug, introduce a complete register
map in the '%live' annotations.
|
|
* beam_utils:joineven/1 -> beam_utils:join_even/1
* beam_utils:spliteven/1 -> beam_utils:split_even/1
|