Age | Commit message | Author |
|
a3407eaa2104d6 eliminated the -gen_dest flag for macros in ops.tab.
It turns out that the new implementation (taking the address of the
X or Y destination register) is unsafe if the destination is a Y
register and there can be a GC. The problem is that the address to
the Y register will change if there is a GC.
Fortunately, the few instructions in OTP 20 that have a general
destination are safe. The put_list_ssd instruction never does a GC.
The bit syntax instructions that may do a GC will always store the
result to an X register.
To be completely sure, rewrite the destination register from 'd' to
'x' for the bit syntax instructions. That means that a bit syntax
instruction with a Y register destination will abort the loading
if it is encountered.
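As an illustration only (standalone C, not ERTS code; realloc stands in
for a garbage collection that moves the stack), the hazard is the classic
stale-pointer problem:
  #include <stdlib.h>

  typedef unsigned long Eterm;

  int main(void)
  {
      /* Taking the address of a "Y register" slot up front... */
      Eterm *stack = malloc(8 * sizeof(Eterm));
      Eterm *dst = &stack[3];

      /* ...is unsafe if a GC may move the area; realloc models the move.
       * (Error handling omitted for brevity.) */
      stack = realloc(stack, 64 * sizeof(Eterm));

      /* *dst = 42;           -- may now write into freed memory */
      stack[3] = 42;          /* re-deriving the address is safe */

      free(stack);
      return 0;
  }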
|
|
|
|
Introduce syntactic sugar so that we can write:
get_list xy xy xy
instead of:
get_list x x x
get_list x x y
get_list x y x
get_list x y y
get_list y x x
get_list y x y
get_list y y x
get_list y y y
|
|
Instructions that take a 'd' argument need a -gen_dest flag in their
macros. For example:
%macro:put_list PutList -pack -gen_dest
put_list s s d
-gen_dest was needed when x(0) was stored in a register, since it is
not possible to take the address of a register. Now that x(0) is stored
in memory and we can take the address, we can eliminate gen_dest.
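A minimal standalone C sketch (not the emulator source) of the
address-taking point:
  typedef unsigned long Eterm;

  Eterm x_regs[1024];            /* x(0) stored in memory, as it is now */

  Eterm *general_destination(void)
  {
      return &x_regs[0];         /* fine: an array element has an address */
  }

  /* When x(0) instead lived in a 'register' variable, C forbade taking
   * its address, so a destination that might be x(0) needed the special
   * -gen_dest handling:
   *
   *     register Eterm x0;
   *     Eterm *p = &x0;         -- rejected by the compiler
   */

  int main(void)
  {
      *general_destination() = 42;
      return x_regs[0] == 42 ? 0 : 1;
  }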
|
|
Rewrite the instruction stream for tagged tuple tests.
A tagged tuple is a tuple of any arity with an atom as its first
element, typically a record, an ok-tuple, or an error-tuple.
from:
...
{test,is_tuple,Fail,[Src]}.
{test,test_arity,Fail,[Src,Sz]}.
...
{get_tuple_element,Src,0,Dst}.
...
{test,is_eq_exact,Fail,[Dst,Atom]}.
...
to:
...
{test,is_tagged_tuple,Fail,[Src,Sz,Atom]}.
...
|
|
|
|
* kvakvs/erts/gc_minor_option/OTP-11695:
erts: Fix req_system_task gc typespec
Fix process_SUITE system_task_blast and no_priority_inversion2
Option to erlang:garbage_collect to request minor (generational) GC
Conflicts:
erts/emulator/beam/erl_process.c
erts/preloaded/src/erts_internal.erl
|
|
|
|
Note: Minor GC option is a hint, and GC may still decide to run fullsweep.
Test case for major and minor gc on self
Test case for major and minor gc on some other process + async gc test check
docs fix
|
|
* bjorn/compiler/misc-opt:
v3_kernel: Construct literal lists properly
Use the register map in %live in beam_utils:is_killed_block/2
Teach beam_utils to check liveness for put_map instructions
beam_peep: Help out beam_jump
|
|
* bjorn/erts/beam_load:
Optimize get_tuple_element instructions that target Y registers
Mend beam_SUITE:packed_registers/1
Correct unpacking of 3 operands on 32-bit architectures
Eliminate misleading #ifdef ARCH_64 in beam_opcodes.h
beam_debug: Correct masking when unpacking packed operands
|
|
Use cerl:make_list/1 instead of a home-made make_list/1 to ensure that
literal lists are constructed as literals. In a future release, we
would like to forbid the loader from constructing literal lists using
instructions like:
put_list {atom,a} [] Dst
The proper way is:
move {literal,[a]} {x,0}
Also update the comment about "put_list Const [] Dst" in ops.tab.
|
|
Several improvements in the compiler (e.g. c288ab87fd6) have
led to a Y register being the target for get_tuple_element
instructions. Therefore, introduce i_get_tuple_element2y
that combines two consecutive get_tuple_element instructions
that target Y registers.
|
|
* henrik/update-copyrightyear:
update copyright-year
|
|
The raise/2 instruction is almost always used like this:
raise x(2) x(1)
Therefore, we can translate it to an internal i_raise/0
instruction that uses x(2) x(1) as its implicit operands.
We will also remove the backward compatibility with R10-0. It is
unlikely that anyone still is using BEAM files compiled with the R10-0
compiler, especially since most of those modules cannot be loaded. The
loader will refuse to load any module that uses the old non-GCing
arithmetic instructions or the non-GCing versions of length/1 or
size/1.
Doing these changes will reduce both the size of the loaded BEAM
code and the size of the code in process_main().
|
|
There is no reason to rename bs_put_utf16/3.
(We rename instructions if we'll need to change the operands or
if we will need to avoid an endless transformation loop. Neither
of these reasons apply to bs_put_utf16/3.)
|
|
Optimizations that are possible to do by the compiler should be
done by the compiler and not by the loader.
If the compiler has done its job correctly, attempting to do the two
transformations only wastes time.
|
|
The transformation on the following line will do the job.
|
|
62473daf introduced an unsafe optimization in the loader.
See the comments in the test case for an explanation of
the problem.
|
|
|
|
The perf_counter is a very cheap and high resolution timer
that can be used to timestamp system events. It does not have
monotonicity guarantees, but should on most OSes expose a
monotonic time.
A special instruction has been created for this counter to further
speed up fetching it.
OTP-12908
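As a rough standalone illustration only (x86 with GCC or clang, not the
ERTS implementation), the kind of counter meant here behaves like the CPU
time-stamp counter: a single cheap read with high resolution, but no
monotonicity guarantee of its own:
  #include <stdint.h>
  #include <stdio.h>
  #include <x86intrin.h>                  /* __rdtsc() */

  int main(void)
  {
      uint64_t t0 = __rdtsc();            /* one cheap instruction */
      uint64_t t1 = __rdtsc();
      printf("delta: %llu cycles\n", (unsigned long long) (t1 - t0));
      return 0;
  }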
|
|
* egil/pd-opt-get/OTP-13167:
erts: Add i_get_hash instruction
erts: Use internal hash for process dictionaries
|
|
Calculate the hash value at load time for constant process dictionary gets.
|
|
The combination is_nonempty_list followed by get_list is extremely
common (but not in estone_SUITE, which is why it has not been noticed
before). Therefore it is worthwhile to introduce a combined
instruction.
|
|
Not pre-fetching in conditional instructions (instructions that use
-fail_action) seems to improve performance slightly.
The reason for that is that conditional instructions may jump to the
failure label, wasting the pre-fetched instruction. Another reason
is that many conditional instructions do a function call, and
the register pressure is therefore high. Avoiding the pre-fetch
may reduce the register pressure and potentially result in more
efficient code.
|
|
It is currently only possible to pack up to 4 operands. However,
the move_window4 instruction has 5 operands and move_window5 and
move3 instructions have 6 operands.
Teach beam_makeops to pack instructions with 5 or 6 operands.
Also rewrite the move_window instructions in beam_emu.c to macros
to allow their operands to get packed.
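A standalone sketch of the packing idea (not the code generated by
beam_makeops; the field width and layout are made up for the example):
  #include <stdint.h>
  #include <stdio.h>

  #define OPERAND_BITS 10
  #define OPERAND_MASK ((UINT64_C(1) << OPERAND_BITS) - 1)

  /* Pack up to six small operands (e.g. register numbers) into one
   * 64-bit instruction word; operand 0 ends up in the lowest bits. */
  static uint64_t pack(const unsigned *ops, int n)
  {
      uint64_t word = 0;
      for (int i = n - 1; i >= 0; i--)
          word = (word << OPERAND_BITS) | (ops[i] & OPERAND_MASK);
      return word;
  }

  static unsigned unpack(uint64_t word, int i)
  {
      return (unsigned) ((word >> (i * OPERAND_BITS)) & OPERAND_MASK);
  }

  int main(void)
  {
      unsigned ops[6] = { 1, 2, 3, 4, 5, 6 };    /* six operands, 60 bits */
      uint64_t word = pack(ops, 6);
      for (int i = 0; i < 6; i++)
          printf("operand %d = %u\n", i, unpack(word, i));
      return 0;
  }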
|
|
Update transformations to ensure that the move_call_ext_last
and move_call_ext_last are used.
|
|
Sequences of three move instructions that effectively swap the
contents of two registers are fairly common. We can replace them
with a swap_temp/3 instruction. The third operand is the temporary
register to be used for swapping, since the temporary register
may actually be used afterwards.
If a swap_temp/3 instruction is followed by a call, the temporary
register will often (but not always) be killed by the call. If
it is killed, we can replace the swap_temp/3 instruction with a
slightly cheaper swap/2 instruction.
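The difference between the two instructions, spelled out as a standalone
C sketch (hypothetical signatures, not the emulator code):
  typedef unsigned long Eterm;

  /* swap_temp/3: the temporary register is a real operand and must be
   * written, because its new value may be read later. */
  void swap_temp(Eterm *a, Eterm *b, Eterm *tmp)
  {
      *tmp = *a;
      *a = *b;
      *b = *tmp;
  }

  /* swap/2: the temporary is known to be dead afterwards (e.g. killed
   * by a following call), so it can live only inside the instruction. */
  void swap(Eterm *a, Eterm *b)
  {
      Eterm t = *a;
      *a = *b;
      *b = t;
  }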
|
|
Currently, move2/2 does the two moves sequentially to ensure
that the instruction will always work correctly.
We can do better than that. If the two move instructions have
any registers in common, we can introduce simpler and slightly
more efficient instructions to handle those cases:
move_shift/3
move_dup/3
For the remaining cases when the move instructions
have no common registers, the move2/4 instruction can perform
the moves in parallel, which is probably slightly more efficient.
For clarity's sake, we will rename the instruction to move2_par/4.
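Spelled out as a standalone C sketch (hypothetical signatures; which
concrete pairs of moves map to each instruction is decided by the
ops.tab transformation rules, not shown here):
  typedef unsigned long Eterm;

  /* "Shift": the second move overwrites the first move's source,
   * e.g.  move b c | move a b  -- the old value of b must be copied
   * to c before b is overwritten with a. */
  void move_shift(Eterm a, Eterm *b, Eterm *c)
  {
      *c = *b;
      *b = a;
  }

  /* "Dup": both destinations end up holding the same value,
   * e.g.  move a b | move b c  (or  move a b | move a c). */
  void move_dup(Eterm a, Eterm *b, Eterm *c)
  {
      *b = *c = a;
  }

  /* No common register: read both sources before writing either
   * destination, i.e. perform the two moves "in parallel". */
  void move2_par(const Eterm *s1, Eterm *d1, const Eterm *s2, Eterm *d2)
  {
      Eterm v1 = *s1, v2 = *s2;
      *d1 = v1;
      *d2 = v2;
  }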
|
|
The i_fetch instruction fetches two operands and places them
in the tmp_arg1 and tmp_arg2 variables. The next instruction
(such as i_plus) does not have to handle different types of
operands, but can simply get them from the tmp_arg*
variables. Thus, i_fetch was introduced as a way to temper
a potential combinatorial explosion.
Unfortunately, clang will generate terrible code because of
the tmp_arg1 and tmp_arg2 variables being live across multiple
instructions. Note that Clang has no way to predict the control
flow from one instruction to another. Clang must assume that
any instruction can jump to any other instruction. Somehow GCC
manages to cope with this situation much better.
Therefore, to improve the quality of the code generated by clang, we
must eliminate all uses of the tmp_arg1 and tmp_arg2 variables. This
commit eliminates the use of i_fetch in combination with the
arithmetic and logical instructions.
While we are touching the code for the bsr and bsl instructions,
also move the tmp_big[] array from the top scope of process_main() into
the block that encloses the bsr and bsl instructions.
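A small standalone sketch of the difference (GNU C computed goto, which
BEAM's threaded code also relies on, but this is not the real beam_emu.c):
with i_fetch, tmp_arg1 and tmp_arg2 must stay live across the indirect
jump between instructions; with a combined i_plus taking its operands
directly, nothing has to survive the dispatch.
  #include <stdio.h>

  typedef long Eterm;

  static Eterm run(const long *ip, Eterm *reg)
  {
      void *labels[] = { &&i_fetch, &&i_plus_fetched, &&i_plus_ssd, &&halt };
      Eterm tmp_arg1 = 0, tmp_arg2 = 0;   /* live across instructions */

      goto *labels[*ip];

  i_fetch:                                /* i_fetch s s */
      tmp_arg1 = reg[ip[1]];
      tmp_arg2 = reg[ip[2]];
      ip += 3;
      goto *labels[*ip];

  i_plus_fetched:                         /* old style: operands in temps */
      reg[ip[1]] = tmp_arg1 + tmp_arg2;
      ip += 2;
      goto *labels[*ip];

  i_plus_ssd:                             /* new style: i_plus s s d */
      reg[ip[3]] = reg[ip[1]] + reg[ip[2]];
      ip += 4;
      goto *labels[*ip];

  halt:
      return reg[0];
  }

  int main(void)
  {
      Eterm reg[3] = { 0, 17, 25 };
      long with_fetch[] = { 0, 1, 2,  1, 0,  3 };  /* fetch x1 x2 | plus -> x0 */
      long combined[]   = { 2, 1, 2, 0,  3 };      /* plus x1 x2 -> x0 */
      printf("%ld %ld\n", run(with_fetch, reg), run(combined, reg));
      return 0;
  }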
|
|
The 'r' type is currently mandatory. That means in order to handle
both of the following instructions:
move x(0) y(7)
move x(1) y(7)
we would need to define two specific operations in ops.tab:
move r y
move x y
We want to make 'r' operands optional. That is, if we have
only this specific instruction:
move x y
it will match both of the following instructions:
move x(0) y(7)
move x(1) y(7)
Making 'r' optional allows us to save code space when we don't
want to make handling of x(0) a special case, but we can still
use 'r' to optimize commonly used instructions.
|
|
Consider the try_case_end instruction:
try_case_end s
The 's' operand type means that the operand can either be a
literal of one of the types atom, integer, or empty list, or
a register. That worked well before R12. In R12 additional
types of literals were introduced. Because of the way the
overloading was done, an 's' operand cannot handle the
new types of literals. Therefore, code such as the following
is necessary in ops.tab to avoid giving an 's' operand a
literal:
try_case_end Literal=q => move Literal x | try_case_end x
While this works, it is error-prone in that it is easy to
forget to add that kind of rule. It would also be complicated
in case we wanted to introduce a new kind of addition operator
such as:
i_plus jssd
Since there are two 's' operands, two scratch registers and
two 'move' instructions would be needed.
Therefore, we'll need to find a smarter way to tag
register operands. We will overload the pid and port tags
for X and Y registers, respectively. That works because pids
and ports are immediate values (they fit in one word), and there
are no literals for pids and ports.
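A standalone sketch of the overloading idea (the tag values below are
made up for the example; the real BEAM tag scheme is different):
  #include <stdint.h>
  #include <stdio.h>

  typedef uint64_t Eterm;

  enum {
      TAG_MASK  = 0xF,
      TAG_SMALL = 0x1,     /* an ordinary immediate, e.g. a small integer */
      TAG_PID   = 0x3,     /* reused by the loader: "X register N" */
      TAG_PORT  = 0x4      /* reused by the loader: "Y register N" */
  };

  static Eterm make_xreg(unsigned n) { return ((Eterm) n << 4) | TAG_PID; }
  static Eterm make_yreg(unsigned n) { return ((Eterm) n << 4) | TAG_PORT; }

  /* Resolve an 's' operand: either a register reference or an immediate. */
  static Eterm resolve(Eterm op, const Eterm *x, const Eterm *y)
  {
      switch (op & TAG_MASK) {
      case TAG_PID:  return x[op >> 4];   /* X register */
      case TAG_PORT: return y[op >> 4];   /* Y register */
      default:       return op;           /* any other immediate */
      }
  }

  int main(void)
  {
      Eterm x[1] = { (7 << 4) | TAG_SMALL };
      Eterm y[1] = { (9 << 4) | TAG_SMALL };
      printf("%u %u\n",
             (unsigned) (resolve(make_xreg(0), x, y) >> 4),
             (unsigned) (resolve(make_yreg(0), x, y) >> 4));
      return 0;
  }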
|
|
|
|
Seen in the SSL application, where subtraction with x registers was prevalent:
* i_minus specialization on x registers
* i_plus specialization on x registers
|
|
Common pattern seen in SSL:
move y x | move r x -> move2
move r x | move y x -> move2
Common pattern seen in SSL and Compiler:
move x r | move x x -> move2
|
|
* i_rem specialization on x registers
|
|
* i_band specialization on x registers and constants
|
|
Move an entire region of x registers to the stack.
This reduces the dispatch pressure of move instructions.
Also introduce a move2 specialization for some common move patterns:
move r y | move x y -> move2 : As above, moving regions to the stack
move x r | move x y -> move2 : A seemingly common pattern
|
|
|
|
* i_is_lt for r, x registers and constants
* i_is_ge for x registers and constants
* i_is_exact_eq for r and x registers
|
|
See the previous commit for justification and use cases.
|
|
Let the loader pre-compute the hash value when a single, literal key
is matched as in:
#{<<"some_key">>:=V} = Map
In my measurements, this optimization resulted in a 30 percent
speedup for short binary keys.
Unfortunately, this optimization makes no difference for small
maps with less than 32 keys, since the hash value is not used.
Still, there are the following use cases:
* A map used instead of a record with more than 32 entries. I have
seen some applications with huge records.
* Lookup in JSON dictionaries represented as maps.
The hash value will only be used when the map is a hash map
(currently, that means at least 32 entries).
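The load-time/run-time split, as a standalone sketch (not the ERTS map
implementation; the hash function and instruction layout are stand-ins):
  #include <stdint.h>
  #include <stdio.h>

  /* Stand-in hash; ERTS uses its own internal hash. */
  static uint32_t hash_key(const char *key)
  {
      uint32_t h = 5381;
      while (*key != '\0')
          h = h * 33 + (unsigned char) *key++;
      return h;
  }

  struct i_get_map_element {    /* hypothetical loaded instruction */
      const char *key;          /* the literal key */
      uint32_t    hash;         /* filled in by the loader */
  };

  int main(void)
  {
      struct i_get_map_element ins = { "some_key", 0 };

      ins.hash = hash_key(ins.key);              /* load time, once */

      /* Run time: pick a hash-map bucket without rehashing the key. */
      printf("key %s -> bucket %u\n", ins.key, (unsigned) (ins.hash % 64));
      return 0;
  }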
|
|
In the i_get_map_element/4 instruction, for literal keys other than
atoms, the key would be put into x[0] instead of being used directly
in the instruction. The reason is that the original implementation of maps
only supported atom keys.
|
|
The map instructions require that the keys in the instructions
are sorted (for flatmaps). But that is an implementation detail
that should not be exposed outside of the BEAM virtual machine.
Therefore, make the sorting of the keys the responsibility of
the loader and not the compiler.
Also note that the sort order for maps with numeric keys or keys
with numeric components has changed in OTP 18. That means that
code compiled for OTP 17 that operated on maps with such keys
might not work in OTP 18 without the sorting in the loader
(although it is unlikely to be an issue in practice).
|