Age | Commit message | Author |
|
* jv/erts/optimize-cmp:
Unify comparison macros in erl_utils.h
Avoid erts_cmp jump in atom, int and float comparisons
|
|
Given the function definition below:
check(X) when X >= 0, X <= 20 -> true.
@nox originally noticed that the lt and ge
guard tests were slower than they should be.
Further investigation revealed that most of the cost
was in jumping to the erts_cmp function. This patch
brings the operations already inlined in erts_cmp
into the emulator, removing the jump cost.
After applying these changes, invoking the check/1
function defined above 30000 times with different
values from 0 to 20 has fallen from 367us to 213us
(measured as the average of 3 runs). This is a
considerable improvement over Erlang 18, which takes
556us on average.
Floats have also dropped their time from 1126us
(on Erlang 18) to 613us.
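The idea of handling common operand types inline and only calling the general comparison function for everything else can be sketched in C as follows. This is an illustration under assumptions: the tagging scheme and the names term_t, MAKE_SMALL, IS_SMALL, slow_cmp and CMP_LT are hypothetical and are not the actual definitions in erl_utils.h.

```c
#include <stdint.h>

/* Hypothetical tagged term: a set low bit marks a small integer. */
typedef intptr_t term_t;

#define MAKE_SMALL(n) ((term_t)(((uintptr_t)(intptr_t)(n) << 1) | 1))
#define IS_SMALL(t)   (((t) & 1) != 0)

/* Counts how often the out-of-line path is taken (illustration only). */
static int slow_calls = 0;

/* Stand-in for the general, out-of-line comparison function. */
static int slow_cmp(term_t a, term_t b)
{
    slow_calls++;
    return (a > b) - (a < b);
}

/* Inline fast path: two small integers are compared as tagged words,
 * which preserves their order; anything else takes the function call.
 * Note that the macro evaluates its arguments more than once. */
#define CMP_LT(a, b) \
    ((IS_SMALL(a) && IS_SMALL(b)) ? ((a) < (b)) : (slow_cmp((a), (b)) < 0))
```

With this shape, a guard such as X >= 0, X =< 20 on small integers never reaches slow_cmp.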
|
|
* lukas/erts/msacc:
Update preloaded modules
erts: Make msacc allocator type thread safe
Silence compiler
erts: Fix msacc testcase on some windowses
erts: Add power saving cpu feature tests and use them
erts: Refactor perf counter internal interface
erts: Add rdtscp instruction check
erts: Fix hrtime for windows
erts: use correct function for perf counter on non-x86
erts: Fix msacc win32 debug compile error
erts: Add microstate accounting
erts, kernel: Add os:perf_counter function
erts: Add ERTS_WRITE_UNLIKELY
|
|
Conflicts:
erts/emulator/beam/beam_emu.c
|
|
perf counter is now part of the function pointer interface
and also the function returns the value instead of writing
to a memory buffer.
|
|
|
|
Microstate accounting is a way to track which state the
different threads within ERTS are in. The main usage area
is to pinpoint performance bottlenecks by checking which
states the threads are in and then from there figuring out
why and where to optimize.
Since checking whether microstate accounting is on or off is
relatively expensive if done in a short loop, only a few of the
states are enabled by default; more states can be enabled
through configure.
I've done some benchmarking: with the default states, the overhead
is not noticeable when it is turned off and a fraction of a percent
when it is turned on. If you enable the extra states, depending on
the benchmark, the overhead when turned off is about 1% and when
turned on somewhere between 5% and 15%.
OTP-12345
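A minimal sketch of the accounting idea, assuming a per-thread record that charges elapsed time to whichever state the thread was in. The state names, struct msacc and both functions are hypothetical, and timestamps are passed in explicitly to keep the example deterministic; the real implementation may differ in all of these respects.

```c
#include <stdint.h>

/* Hypothetical set of states a thread can be in. */
enum ms_state { MS_SLEEP, MS_EMULATOR, MS_GC, MS_NUM_STATES };

struct msacc {
    int enabled;                      /* accounting on or off */
    enum ms_state current;            /* state the thread is in now */
    uint64_t last_ts;                 /* timestamp of the last switch */
    uint64_t time_in[MS_NUM_STATES];  /* accumulated time per state */
};

static void msacc_init(struct msacc *m, uint64_t now)
{
    int i;
    m->enabled = 1;
    m->current = MS_SLEEP;
    m->last_ts = now;
    for (i = 0; i < MS_NUM_STATES; i++)
        m->time_in[i] = 0;
}

/* Charge the time since the last switch to the state being left.
 * The enabled check is the cost that is paid even when accounting
 * is turned off. */
static void msacc_set_state(struct msacc *m, enum ms_state next, uint64_t now)
{
    if (!m->enabled)
        return;
    m->time_in[m->current] += now - m->last_ts;
    m->current = next;
    m->last_ts = now;
}
```

Reading time_in for each thread then shows where the threads spend their time.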
|
|
The perf_counter is a very cheap, high-resolution timer
that can be used to timestamp system events. It does not have
monotonicity guarantees, but should on most OSs expose a monotonic
time.
A special instruction has been created for this counter to further
speed up fetching it.
OTP-12908
|
|
* egil/pd-opt-get/OTP-13167:
erts: Add i_get_hash instruction
erts: Use internal hash for process dictionaries
|
|
* rickard/ohmq-fixup/OTP-13047:
Replace off_heap_message_queue option with message_queue_data option
Always use literal_alloc
Distinguish between GC disabled by BIFs and other disabled GC
Fix process_info(_, off_heap_message_queue)
Off heap message queue test suite
Remove unused variable
Fix memory leaks
|
|
Processes remember, in the live_hf_end field, heap fragments that
are known to be fully live because they were created in a
just-called BIF that yields. This field must not be used if we have
not disabled GC in a BIF. F_DELAY_GC has been introduced in order
to distinguish between the two different scenarios.
- F_DISABLE_GC should *only* be used by BIFs. This is when
the BIF needs to yield while preventing a GC.
- F_DELAY_GC should only be used when GC is temporarily
disabled while the process is scheduled. A process must
not be scheduled out while F_DELAY_GC is set.
|
|
Calculate the hash value at load time for constant process dictionary gets.
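As a rough illustration of the load-time idea: when the key of a get is a constant, its hash can be computed once and stored with the instruction, so that each lookup starts from a ready-made hash. The table layout, the FNV-1a hash and all names below are assumptions made for this sketch; they are not the internal hash or the data structures used by ERTS.

```c
#include <stdint.h>
#include <string.h>

#define PD_SIZE 8   /* fixed-size table; this sketch never grows it */

struct pd_entry { const char *key; int value; int used; };

static int hash_calls = 0;   /* counts hash computations (illustration) */

/* FNV-1a, used here only as a stand-in hash function. */
static uint32_t hash_key(const char *s)
{
    uint32_t h = 2166136261u;
    hash_calls++;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Insert with linear probing; assumes the table never fills up. */
static void pd_put(struct pd_entry *pd, const char *key, int value)
{
    uint32_t i = hash_key(key) % PD_SIZE;
    while (pd[i].used && strcmp(pd[i].key, key) != 0)
        i = (i + 1) % PD_SIZE;
    pd[i].key = key;
    pd[i].value = value;
    pd[i].used = 1;
}

/* Lookup that receives an already computed hash, so no hashing
 * happens at run time. Returns 1 and sets *out if the key is found. */
static int pd_get_hash(const struct pd_entry *pd, const char *key,
                       uint32_t hash, int *out)
{
    uint32_t i = hash % PD_SIZE;
    int n;
    for (n = 0; n < PD_SIZE && pd[i].used; n++) {
        if (strcmp(pd[i].key, key) == 0) {
            *out = pd[i].value;
            return 1;
        }
        i = (i + 1) % PD_SIZE;
    }
    return 0;
}
```

A loader would call hash_key once per constant key and keep the result as an instruction operand; pd_get_hash then never calls hash_key itself.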
|
|
The test whether the result would fit in a smallnum could overflow into
a negative number that would fit a smallnum. A test that reproduces the
issue was added to bs_construct_SUITE.
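The class of bug described here can be sketched as follows. This is a generic illustration, not the code from the emulator: it assumes a 32-bit word, a hypothetical 28-bit small integer range, and a multiplication as the operation that overflows.

```c
#include <stdint.h>

/* Hypothetical small integer range of 28 bits (assumption). */
#define SMALL_BITS 28
#define MAX_SMALL  ((((int64_t)1) << (SMALL_BITS - 1)) - 1)
#define MIN_SMALL  (-(((int64_t)1) << (SMALL_BITS - 1)))

/* Flawed check: the product wraps around in 32 bits first, so a huge
 * result can end up as a small negative number that passes the test.
 * Unsigned arithmetic is used to make the wrap-around well defined;
 * the conversion back to int32_t wraps on mainstream platforms. */
static int fits_small_naive(int32_t a, int32_t b)
{
    int32_t r = (int32_t)((uint32_t)a * (uint32_t)b);
    return r >= MIN_SMALL && r <= MAX_SMALL;
}

/* Checked version: compute in a wider type that cannot overflow for
 * 32-bit inputs, then test the range. */
static int fits_small_checked(int32_t a, int32_t b)
{
    int64_t r = (int64_t)a * (int64_t)b;
    return r >= MIN_SMALL && r <= MAX_SMALL;
}
```

For example, 1073741823 * 4 wraps to -4 in 32 bits, so the naive check accepts it while the checked version rejects it.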
|
|
|
|
|
|
* rickard/gc-bump-reds/OTP-13097:
Bump reductions on GC
|
|
* rickard/gc-after-bif-cond/OTP-13098:
Use the same conditions when triggering GC after BIF
|
|
* rickard/ohmq/OTP-13047:
Fragmented young heap generation and off_heap_message_queue option
Refactor GC
Introduce literal tag
Conflicts:
erts/doc/src/erlang.xml
erts/emulator/beam/erl_gc.c
|
|
* sverk/literal-memory-range:
erts: Refactor line table in loaded beam code
erts: Refactor header of loaded beam code
fix check_process_code for separate literal area
erts: Add support for fast erts_is_literal()
erts: Refactor erl_mmap to allow several mapper instances
erts: Add new allocator LITERAL
erts: Fix strangeness in treatment of MSEG_ALIGN_BITS
erts: Cleanup main carrier creation
erts: Remove unused erts_have_erts_mmap
erts: Refactor config test for posix_memalign
|
|
|
|
|
|
* The youngest generation of the heap can now consist of multiple
blocks. Heap fragments and message fragments are added to the
youngest generation when needed without triggering a GC. After
a GC the youngest generation is contained in one single block.
* The off_heap_message_queue process flag has been added. When
enabled all message data in the queue is kept off heap. When
a message is selected from the queue, the message fragment (or
heap fragment) containing the actual message is attached to the
youngest generation. Messages stored off heap are not part of GC.
|
|
to use a real C struct instead of array.
|
|
|
|
erlang:is_builtin(erlang, apply, 3) returns 'false'. That seems to be
an oversight in the implementation of erlang:is_builtin/3 rather than
a conscious design decision. Part of apply/3 is implemented in C (as a
special instruction), and part of it in Erlang (only used if apply/3
is used recursively). That makes apply/3 special compared to all other
BIFs.
From the viewpoint of the user, apply/3 is a built-in function,
since it cannot possibly be implemented in pure Erlang.
Noticed-by: Stavros Aronis
|
|
|
|
Conflicts:
OTP_VERSION
erts/doc/src/notes.xml
erts/vsn.mk
lib/runtime_tools/doc/src/notes.xml
lib/runtime_tools/vsn.mk
otp_versions.table
|
|
|
|
Fetch the head and tail parts to temporary variables before
writing them to their destinations. That should allow the CPU to
perform the moves in parallel, which might improve performance.
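A sketch of the two shapes, assuming a cons cell laid out as two consecutive words; term_t and the function names are hypothetical. In the sequential form the compiler has to assume that the first store might modify the cell, so the second load cannot be moved ahead of it. Loading both parts into temporaries first removes that dependency.

```c
#include <stdint.h>

typedef intptr_t term_t;   /* hypothetical term type */

/* Sequential form: each store directly follows its load. */
static void get_list_seq(const term_t *cell, term_t *hd_dst, term_t *tl_dst)
{
    *hd_dst = cell[0];
    *tl_dst = cell[1];
}

/* Both loads are done before either store, so the loads do not
 * depend on the stores. */
static void get_list_tmp(const term_t *cell, term_t *hd_dst, term_t *tl_dst)
{
    term_t hd = cell[0];
    term_t tl = cell[1];
    *hd_dst = hd;
    *tl_dst = tl;
}
```

Whether this actually runs faster depends on the compiler and the CPU.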
|
|
The combination is_non_empty_list followed by get_list is extremely
common (but not in estone_SUITE, which is why it has not been noticed
before). Therefore it is worthwhile to introduce a combined
instruction.
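What a combined test-and-fetch step might look like, sketched in C. The tag values, term_t and the function name are assumptions made for this example and do not reflect the real term encoding or the name of the instruction.

```c
#include <stdint.h>

typedef uintptr_t term_t;          /* hypothetical tagged term */

enum { TAG_MASK = 3, TAG_LIST = 1 };   /* assumed 2-bit primary tag */

/* One step instead of two: test for a non-empty list and, if the test
 * succeeds, fetch head and tail. A return value of 0 corresponds to
 * jumping to the fail label of the type test. */
static int is_nonempty_list_get_list(term_t src, term_t *hd, term_t *tl)
{
    const term_t *cell;

    if ((src & TAG_MASK) != TAG_LIST)
        return 0;
    cell = (const term_t *)(src - TAG_LIST);
    *hd = cell[0];
    *tl = cell[1];
    return 1;
}
```

A single instruction saves one dispatch per list element when walking a list.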
|
|
|
|
It is currently only possible to pack up to 4 operands. However,
the move_window4 instruction has 5 operands and move_window5 and
move3 instructions have 6 operands.
Teach beam_makeops to pack instructions with 5 or 6 operands.
Also rewrite the move_window instructions in beam_emu.c to macros
to allow their operands to get packed.
|
|
Since 'd' operands can only be either an X register or a Y register,
we only need a single bit to distinguish them. Furthermore, we can
pre-multiply the register number with the word size to speed up
address calculation.
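One possible encoding along these lines, as a sketch: the register number is stored already multiplied by the word size, which leaves the lowest bit free to select between the X and Y register arrays. The exact bit layout used by the loader is not given here, so the layout and the names pack_d and resolve_d are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

typedef intptr_t term_t;   /* hypothetical term type, 4 or 8 bytes */

/* Bit 0: 0 = X register, 1 = Y register.
 * Remaining bits: byte offset of the register (number * word size).
 * The offset is a multiple of the word size, so bit 0 is always free. */
static uintptr_t pack_d(int is_y, unsigned reg)
{
    return ((uintptr_t)reg * sizeof(term_t)) | (is_y ? 1u : 0u);
}

/* Resolving the operand needs one bit test, one mask and one add;
 * no multiplication by the word size is needed at run time. */
static term_t *resolve_d(uintptr_t op, term_t *x_regs, term_t *y_regs)
{
    char *base = (char *)((op & 1) ? y_regs : x_regs);
    return (term_t *)(base + (op & ~(uintptr_t)1));
}
```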
|
|
Sequences of three move instructions that effectively swap the
contents of two registers are fairly common. We can replace them
with a swap_temp/3 instruction. The third operand is the temporary
register to be used for swapping; it must be named explicitly since
its value may actually be used afterwards.
If the swap_temp/3 instruction is followed by a call, the temporary
register will often (but not always) be killed by the call. If
it is killed, we can replace the swap_temp/3 instruction with a
slightly cheaper swap/2 instruction.
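The difference between the two instructions can be sketched in C. The function names mirror the instruction names, but the bodies are an assumed reading of the description above, with registers modelled as plain pointers.

```c
typedef long term_t;   /* hypothetical term type */

/* Equivalent of the three-move sequence
 *     move A Tmp; move B A; move Tmp B
 * The temporary register is written and keeps the old value of A,
 * because later code may read it. */
static void swap_temp(term_t *a, term_t *b, term_t *tmp)
{
    *tmp = *a;
    *a = *b;
    *b = *tmp;
}

/* Cheaper variant for when the temporary register is dead afterwards:
 * the value only passes through a local variable. */
static void swap(term_t *a, term_t *b)
{
    term_t t = *a;
    *a = *b;
    *b = t;
}
```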
|
|
Currently, move2/2 does the two moves sequentially to ensure
that the instruction will always work correctly.
We can do better than that. If the two move instructions have
any registers in common, we can introduce simpler and slightly
more efficient instructions to handle those cases:
move_shift/3
move_dup/3
For the remaining cases when the move instructions
have no common registers, the move2/4 instruction can perform
the moves in parallel, which is probably slightly more efficient.
For clarity's sake, we will rename the instruction to move2_par/4.
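A sketch of what the three instructions might do. The operand order and exact semantics are not spelled out in the message, so the bodies below are assumptions inferred from the instruction names, with registers modelled as pointers.

```c
typedef long term_t;   /* hypothetical term type */

/* Assumed meaning: "move SD D; move Src SD". The old value of SD is
 * shifted into D and SD receives Src. */
static void move_shift(const term_t *src, term_t *sd, term_t *d)
{
    term_t v = *src;
    *d = *sd;
    *sd = v;
}

/* Assumed meaning: "move Src D1; move Src D2". One source value is
 * duplicated into two destinations. */
static void move_dup(const term_t *src, term_t *d1, term_t *d2)
{
    term_t v = *src;
    *d1 = v;
    *d2 = v;
}

/* No registers in common: both sources are read before either
 * destination is written, so the moves are independent. */
static void move2_par(const term_t *s1, term_t *d1,
                      const term_t *s2, term_t *d2)
{
    term_t v1 = *s1;
    term_t v2 = *s2;
    *d1 = v1;
    *d2 = v2;
}
```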
|
|
|
|
The 'cmd' variable that was shared by several hipe_mode_switch
instructions would cause clang to produce sub-optimal code,
probably because it considered the instructions as part of a
loop that needed to be optimized.
What would happen was that 'cmd' would be assigned to the ESI register
(lower 32 bits of the RSI register). It would use ESI for other
purposes in instructions, but at the end of every instruction
it would set ESI to 1 just in case the next instruction happened
to be hipe_trap_return. This can be seen clearly if this commit
is omitted and the define HIPE_MODE_SWITCH_CMD_RETURN in
hipe/hipe_mode_switch.h is changed from 1 to some other number
such as 42. You will see that 42 is assigned to ESI at the end
of every instruction.
Eliminate this problem by eliminating the shared 'cmd' variable.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The i_fetch instruction fetches two operands and places them
in the tmp_arg1 and tmp_arg2 variables. The next instruction
(such as i_plus) does not have to handle different types of
operands, but can get them simply from the tmp_arg*
variables. Thus, i_fetch was introduced as a way to temper
a potential combinatorial explosion.
Unfortunately, clang will generate terrible code because of
the tmp_arg1 and tmp_arg2 variables being live across multiple
instructions. Note that Clang has no way to predict the control
flow from one instruction to another. Clang must assume that
any instruction can jump to any other instruction. Somehow GCC
manages to cope with this situation much better.
Therefore, to improve the quality of the code generated by clang, we
must eliminate all uses of the tmp_arg1 and tmp_arg2 variables. This
commit eliminates the use of i_fetch in combination with the
arithmetic and logical instructions.
While we are touching the code for the bsr and bsl instructions,
also move the tmp_big[] array from top scope of process main into
the block that encloses the bsr and bsl instructions.
|
|
The 'r' type is now mandatory. That means in order to handle
both of the following instructions:
move x(0) y(7)
move x(1) y(7)
we would need to define two specific operations in ops.tab:
move r y
move x y
We want to make 'r' operands optional. That is, if we have
only this specific instruction:
move x y
it will match both of the following instructions:
move x(0) y(7)
move x(1) y(7)
Making 'r' optional allows us to save code space when we don't
want to make handling of x(0) a special case, but we can still
use 'r' to optimize commonly used instructions.
|
|
Consider the try_case_end instruction:
try_case_end s
The 's' operand type means that the operand can either be a
literal of one of the types atom, integer, or empty list, or
a register. That worked well before R12. In R12 additional
types of literals were introduced. Because of the way the
overloading was done, an 's' operand cannot handle the
new types of literals. Therefore, code such as the following
is necessary in ops.tab to avoid giving an 's' operand a
literal:
try_case_end Literal=q => move Literal x | try_case_end x
While this works, it is error-prone in that it is easy to
forget to add that kind of rule. It would also be complicated
in case we wanted to introduce a new kind of addition operator
such as:
i_plus jssd
Since there are two 's' operands, two scratch registers and
two 'move' instructions would be needed.
Therefore, we'll need to find a smarter way to tag
register operands. We will overload the pid and port tags
for X and Y registers, respectively. That works because pids
and ports are immediate values (fit in one word), and there
are no literals for pids and ports.
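The overloading can be sketched as follows. The tag values and helper names are made up for this example and do not match the real term encoding. The point is only that an operand word carrying the pid tag is read as an X register, one carrying the port tag as a Y register, and any other word is passed through unchanged as a constant.

```c
#include <stdint.h>

typedef uintptr_t term_t;   /* hypothetical tagged term */

/* Assumed 4-bit immediate tags (values are illustrative only). */
enum { TAG_BITS = 4, TAG_MASK = 0xF,
       TAG_PID = 0x3, TAG_PORT = 0x7, TAG_SMALL = 0xF };

static term_t make_xreg(unsigned n)  { return ((term_t)n << TAG_BITS) | TAG_PID; }
static term_t make_yreg(unsigned n)  { return ((term_t)n << TAG_BITS) | TAG_PORT; }
static term_t make_small(unsigned n) { return ((term_t)n << TAG_BITS) | TAG_SMALL; }

/* Fetch an 's' operand: pid and port tags can never occur as literals
 * in loaded code, so they are free to mean "X register" and
 * "Y register". */
static term_t fetch_s(term_t op, const term_t *x, const term_t *y)
{
    switch (op & TAG_MASK) {
    case TAG_PID:
        return x[op >> TAG_BITS];
    case TAG_PORT:
        return y[op >> TAG_BITS];
    default:
        return op;   /* atom, integer, empty list, other literal */
    }
}
```

Because every other tag falls through to the default case, new kinds of literals need no extra rules in ops.tab.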
|
|
|
|
As part of improving code generation for clang, we want to
eliminate the special variable that stores the content of X
register zero most of the time. In the future, that will allow us
to eliminate the special case of handling r(0) for most
instructions, thus reducing the code size and allowing other
simplifications.
Therefore, in this commit, eliminate the variable that is used
to store r(0) and make r(0) a synonym for x(0). I have chosen
to keep the r(0) define to keep the size of the diff manageable.
|
|
|
|
|