|
This commit only affects the implementation of ETS `ordered_set`
tables with the `write_concurrency` option enabled. Such tables are
implemented with a data structure that is called the contention
adapting search tree (CA tree). This commit introduces the following
changes:
* This commit causes a join to be triggered in one randomly selected
base node in about one in 1000 read unlock calls for base node
locks (see the sketch after this list). No such joins happened
before this commit: operations that only acquired locks in read mode
never triggered any contention adaptation. Therefore, the CA tree
could get stuck in a sub-optimal state in certain scenarios. This
could happen, for example, when a CA tree is first populated with
parallel inserts (which cause splits of base nodes) and then only
read-only operations are applied to the data structure. Benchmark
results from the
`ets_SUITE:lookup_catree_par_vs_seq_init_benchmark/0` benchmark
function (which is included in this commit) show that this change
can improve the throughput of the CA tree in the scenario described
above.
* Read-only operations now also increase the values of the statistics
counters when they detect that they need to wait for other
operations. Before this commit, only write operations changed the
statistics counters. This improves the statistics that the
adaptation heuristics are based on.
* Additionally, this commit adds an upper and lower limit to the
contention statistics variables in the base nodes. Such limits did
not exist before this commit. This should, for example, make the CA
tree more responsive to contention after long periods of low
contention.
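A rough Erlang sketch of the idea behind the first and third points
above (the real logic lives in the C implementation of the CA tree;
the 1-in-1000 trigger rate is from this commit, while the limit
values and all names are made up for illustration):

    -module(catree_adapt_sketch).
    -export([read_unlock/1]).

    -define(JOIN_TRIGGER_RATE, 1000).    % ~1 in 1000 read unlocks
    -define(MAX_CONTENTION_STAT, 1000).  % hypothetical upper limit
    -define(MIN_CONTENTION_STAT, -1000). % hypothetical lower limit

    %% Conceptually called when a base node lock is unlocked after a
    %% read-only operation: clamp the contention statistics to the
    %% limits and decide whether to attempt a join this time.
    read_unlock(Stat0) ->
        Stat = min(?MAX_CONTENTION_STAT,
                   max(?MIN_CONTENTION_STAT, Stat0)),
        MaybeJoin = rand:uniform(?JOIN_TRIGGER_RATE) =:= 1,
        {Stat, MaybeJoin}.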
|
|
This commit fixes an ETS test case that tests the decentralized memory
counter in tables of type ordered_set with the write_concurrency
option turned on. The test case assumed that the memory consumption of
the table would grow monotonically as terms are inserted. However,
this was not the case when the emulator was compiled in debug mode, as
random splits and joins of CA tree nodes could happen. This commit
fixes the test case by disabling random splits and joins in the tested
table.
|
|
* sverker/test-cuddle:
stdlib: Remove ets_SUITE:time_lookup
|
|
Previously, all ETS tables used centralized counter variables to keep
track of the number of items stored and the amount of memory
consumed. These counters can cause scalability problems (especially on
big NUMA systems). This commit adds an implementation of a
decentralized counter and modifies the implementation of ETS so that
ETS tables of type ordered_set with write_concurrency enabled use the
decentralized counter. [Experiments][1] indicate that this change
substantially improves the scalability of ETS ordered_set tables with
write_concurrency enabled in scenarios with frequent `ets:insert/2`
and `ets:delete/2` calls.
The new counter is implemented in the module erts_flxctr
(`erts_flxctr.h` and `erts_flxctr.c`). The module has the suffix
flxctr as it contains the implementation of a flexible counter (i.e.,
counter instances can be configured to be either centralized or
decentralized). Counters that are configured to be centralized are
implemented with a single counter variable which is modified with
atomic operations. Decentralized counters are spread over several
cache lines (how many can be configured with the parameter
`+dcg`). The scheduler threads are mapped to cache lines so that there
is no single point of contention when decentralized counters are
updated. The thread progress functionality of the Erlang VM is
utilized to implement support for linearizable snapshots of
decentralized counters. The snapshot functionality is used by the
`ets:info/1` and `ets:info/2` functions.
[1]: http://winsh.me/ets_catree_benchmark/flxctr_res.html
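For illustration only: the `counters` module exposes a similar choice
between centralized and decentralized counters to Erlang code. This is
a conceptual parallel, not the `erts_flxctr` API:

    CRef = counters:new(1, [write_concurrency]),
    ok = counters:add(CRef, 1, 1),  % cheap update, no single contention point
    Sum = counters:get(CRef, 1).    % sums the distributed sub-counters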
|
|
Fails sometimes on Windows due to poor timer precision
leading to division by zero.
This is more of a (bad) benchmark than a regression test.
|
|
* Refactor the code to make it easier to configure the benchmark
* Add a test case for long benchmark runs. The new test case is run by
the OTP-team's benchmark infrastructure and can help in keeping
track of how the performance of ETS is affected by code changes.
|
|
Symptom:
ETS table remains fixed after a finished ets:select* call.
Problem:
The decision to unfix the table after a yielding ets:select*
is based on table ownership, but ownership might have changed
while ets:select* was yielding.
Solution:
Remember and pass along whether the table was fixed
when the traversal started.
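The symptom can be observed from Erlang code: after a select call has
completed, the table must not remain fixed. A minimal check with a
throwaway table (the actual bug additionally needed the owner to
change while the select was yielding):

    T = ets:new(fix_check, [ordered_set, public]),
    true = ets:insert(T, [{I, I} || I <- lists:seq(1, 10000)]),
    _ = ets:select(T, [{'_', [], ['$_']}], 100),
    false = ets:info(T, safe_fixed).  % table must not remain fixed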
|
|
into sverker/master/ets-no-mbuf-trapping/OTP-15660
|
|
into sverker/maint/ets-no-mbuf-trapping/OTP-15660
|
|
Many heap fragments no longer make the GC slow.
Even worse, we are not guaranteed that a yield will provoke a GC
removing the fragments, which might lead to a one-yield-per-bucket
scenario if the heap fragment(s) still remain after each yield.
|
|
Did fail on really slow unlucky machines.
|
|
Also fix erts_debug:get_internal_status(node_and_dist_references)
for catree so that it also searches route node keys for off-heap data.
|
|
with keys containing off-heap terms.
The passed key may actually be the one already saved
(if nodes have been joined), in which case we do nothing.
Calling destroy_route_key() may destroy off-heap data.
|
|
to provoke iteration over a moving ordered_set with write_concurrency
and make sure we hit all "stable" keys.
|
|
to generate a routing tree with keys that fit each test case.
|
|
The original implementation did not do this due to fear of bad
performance, but we think the negative effect of "leaking" empty
base nodes is more important to fix.
To get the bad performance, a special kind of access pattern is
needed where base nodes are frequently emptied and then soon
repopulated. ets_SUITE:throughput_benchmark, for example, did not
show any negative effect from this commit at all.
|
|
|
|
Once an iteration key has been found, never fall back to the first/last
key in the next/prev tree, as trees may split or join under our feet.
I.e., we must always use the previous key when searching for the next key.
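At the API level, this is the guarantee that makes such traversals
work: ets:next/2 finds the next key even if the given key is no longer
in the table, so an iteration only ever needs to remember the previous
key. A minimal sketch:

    -module(traverse_sketch).
    -export([traverse/1]).

    traverse(T) -> traverse(T, ets:first(T)).

    traverse(_T, '$end_of_table') -> ok;
    traverse(T, Key) ->
        %% continue from the previous key; never restart from
        %% ets:first/1 if the tree has split or joined meanwhile
        traverse(T, ets:next(T, Key)).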
|
|
with ets_force_split
|
|
{RouteNodes, BaseNodes, MaxRouteTreeDepth}
|
|
# Conflicts:
# erts/emulator/beam/erl_db_tree.c
|
|
* sverker/erts/ets-select_replace-bug/OTP-15346:
erts: Fix bug in ets:select_replace for bound key
|
|
which may cause subsequent calls to ets:next or ets:prev to fail.
|
|
meta_wb
smp_insert
smp_fixed_delete
smp_select_delete
|
|
to reduce test times
|
|
Avoid repeating the same tests for [] and [set].
Test case 'default' verifies that 'set' is the default type.
|
|
by populating the table with the help of a random number generator,
creating a series of unique integers.
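A hypothetical helper in that spirit (not necessarily the one used by
the suite): a random permutation of a range gives unique integers in
random insertion order:

    -module(unique_ints_sketch).
    -export([unique_ints/1]).

    %% N unique integers in random order: tag each integer with a
    %% random float, sort on the tag, strip the tag
    unique_ints(N) ->
        Tagged = [{rand:uniform(), I} || I <- lists:seq(1, N)],
        [I || {_, I} <- lists:sort(Tagged)].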
|
|
of repeated table opts
and waiting for workers
|
|
The current ETS ordered_set implementation can quickly become a
scalability bottleneck on multicore machines when an application updates
an ordered_set table from concurrent processes [1][2]. The current
implementation is based on an AVL tree protected from concurrent writes
by a single readers-writer lock. Furthermore, the current implementation
has an optimization, called the stack optimization [3], that can improve
the performance when only a single process accesses a table but can
cause bad scalability even in read-only scenarios. It is possible to
pass the option {write_concurrency, true} to ets:new/2 when creating an
ETS table of type ordered_set, but before this commit the option had no
effect for such tables. The new ETS ordered_set implementation, added
by this commit, is only activated when one passes both ordered_set and
{write_concurrency, true} to the ets:new/2 function. Thus, the previous
ordered_set implementation (from here on called the default
implementation) can still be used in applications that do not benefit
from the new implementation. The benchmark results
on the following web page show that the new implementation is many times
faster than the old implementation in some scenarios and that the old
implementation is still better than the new implementation in some
scenarios.
http://winsh.me/ets_catree_benchmark/ets_ca_tree_benchmark_results.html
The new implementation is expected to scale better than the default
implementation when concurrent processes use the following ETS
operations to operate on a table:
delete/2, delete_object/2, first/1, insert/2 (single object),
insert_new/2 (single object), lookup/2, lookup_element/2, member/2,
next/2, take/2 and update_element/3 (single object).
Currently, the new implementation does not have scalable support for the
other operations (e.g., select/2). However, when these operations are
used infrequently, the new implementation may still scale better than
the default implementation, as the benchmark results at the URL above
show.
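For example, a table using the new implementation is created as
follows; leaving out either option selects the default implementation:

    T = ets:new(my_table, [ordered_set, {write_concurrency, true}, public]),
    true = ets:insert(T, {some_key, some_value}).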
Description of the New Implementation
-------------------------------------
The new implementation is based on a data structure which is called the
contention adapting search tree (CA tree for short). The following
publication contains a detailed description of the CA tree:
A Contention Adapting Approach to Concurrent Ordered Sets
Journal of Parallel and Distributed Computing, 2018
Kjell Winblad and Konstantinos Sagonas
https://doi.org/10.1016/j.jpdc.2017.11.007
http://www.it.uu.se/research/group/languages/software/ca_tree/catree_proofs.pdf
A discussion of how the CA tree can be used as an ETS back-end can be
found in another publication [1]. The CA tree is a data structure that
dynamically changes its synchronization granularity based on detected
contention. Internally, the CA tree uses instances of a sequential data
structure to store items. The CA tree implementation contained in this
commit uses the same AVL tree implementation as the default ordered_set
implementation. It is reused so that much of the existing code
implementing the ETS operations can be shared.
Tests
-----
The ETS tests in `lib/stdlib/test/ets_SUITE.erl` have been extended to
also test the new ordered_set implementation. The function
ets_SUITE:throughput_benchmark/0 has also been added to this file. This
function can be used to measure and compare the performance of the
different ETS table types and options. This function writes benchmark
data to standard output that can be visualized by the HTML page
`lib/stdlib/test/ets_SUITE_data/visualize_throughput.html`.
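Assuming the compiled test suite is on the code path, the benchmark can
be run directly from an Erlang shell; it prints data that can be pasted
into the visualization page:

    %% writes benchmark data to standard output
    ets_SUITE:throughput_benchmark().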
[1]
More Scalable Ordered Set for ETS Using Adaptation.
In Thirteenth ACM SIGPLAN workshop on Erlang (2014).
Kjell Winblad and Konstantinos Sagonas.
https://doi.org/10.1145/2633448.2633455
http://www.it.uu.se/research/group/languages/software/ca_tree/erlang_paper.pdf
[2]
On the Scalability of the Erlang Term Storage
In Twelfth ACM SIGPLAN workshop on Erlang (2013)
Kjell Winblad, David Klaftenegger and Konstantinos Sagonas
https://doi.org/10.1145/2505305.2505308
http://winsh.me/papers/erlang_workshop_2013.pdf
[3]
The stack optimization works by keeping one preallocated stack instance
in every ordered_set table. This stack is updated so that it contains
the search path in some read operations (e.g., ets:next/2). This makes
it possible for a subsequent ets:next/2 to avoid traversing some nodes
in some cases. Unfortunately, the preallocated stack needs to be flagged
so that it is not updated concurrently by several threads, which causes
bad scalability.
|
|
|
|
causing erlang:memory to report too much ets memory.
|
|
|
|
|
|
|
|
by using a cooperative strategy that will make
any process accessing the table execute delete_all_objects_continue
until the table is empty.
This is not an optimal solution, as concurrent threads will still
block on the table lock, but at least thread progress is made.
|
|
and not the name. For more sane named table semantics.
Applies to both select/1 continuation and trap context.
|
|
Communication between Erlang processes has conceptually always been
performed through asynchronous signaling. The runtime system
implementation has, however, previously performed most operations
synchronously. In a system with only one true thread of execution, this
is not problematic (often the opposite). In a system with multiple
threads of execution (as in the current runtime system implementation
with SMP support) it becomes problematic, since it often involves
locking of structures when updating them, which in turn causes resource
contention. Utilizing true asynchronous communication often avoids
these resource contention issues.
The case that triggered this change was contention on the link lock due
to frequent updates of the monitor trees during communication with a
frequently used server. The signal order delivery guarantees of the
language make it hard to change the implementation of only some signals
to use true asynchronous signaling. Therefore, the implementations
of (almost) all signals have been changed.
Currently the following signals have been implemented as true
asynchronous signals:
- Message signals
- Exit signals
- Monitor signals
- Demonitor signals
- Monitor triggered signals (DOWN, CHANGE, etc)
- Link signals
- Unlink signals
- Group leader signals
All of the above are already defined as asynchronous signals in the
language. The implementation of message signals was quite asynchronous
to begin with, but had quite strict delivery constraints due to the
ordering guarantees of signals between a pair of processes.
The previously used message queue, partitioned into two halves, has
been replaced by a more general signal queue, partitioned into three
parts, that services all kinds of signals. More details regarding the
signal queue can be found in comments in the erl_proc_sig_queue.h file.
The monitor and link implementations have also been completely replaced
in order to fit the new asynchronous signaling implementation as well
as possible. More details regarding the new monitor and link
implementations can be found in the erl_monitor_link.h file.
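At the language level the semantics are unchanged. For example,
requesting a monitor returns immediately, and the resulting 'DOWN'
notification later arrives as a signal in the caller's queue:

    Pid = spawn(fun () -> ok end),
    Ref = monitor(process, Pid),
    %% monitor/2 does not block waiting for the other process;
    %% the 'DOWN' signal is delivered asynchronously
    receive
        {'DOWN', Ref, process, Pid, normal} -> ok
    end.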
|
|
|
|
into 'sverker/master/alloc-n-migration/ERIERL-88'
|
|
into 'sverker/maint-20/alloc-n-migration/ERIERL-88'
OTP-14915
OTP-14916
OTP-14917
OTP-14918
|
|
as memory stats do not guarantee consistency.
A typical ETS test case ends with a lot of deallocation
that may now trigger homecoming carrier migration,
which in turn can cause quite large inconsistencies
in memory stats when the same carrier is accounted for twice
or not at all.
That is my theory for why I now sometimes see transient discrepancies
between before and after memory stats.
|
|
The efile driver will soon be reimplemented as a BIF.
Instead of opening a port based on efile, use hd(erlang:ports()). It
is a reasonably safe assumption that the runtime will continue to use
at least some ports.
|
|
|
|
|
|
* sverker/ets-select-replace-const:
stdlib: Add examples for ets:select_replace docs
erts: Fix ets:select_replace with {const, NewTuple}
|
|
Enable ets:select_replace to do a generic single object
compare-and-swap operation of any ets-tuple using
a matchspec like this:
[{Old, [], [{const, New}]}]
The only exception where this does not work is when the key
contains maps or atoms that look like variables (such as '$1').
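A minimal example of such a compare-and-swap (table and values are
illustrative):

    T = ets:new(cas_demo, [set, public]),
    true = ets:insert(T, {key, 1}),
    %% succeeds: the stored object still matches {key, 1}
    1 = ets:select_replace(T, [{{key, 1}, [], [{const, {key, 2}}]}]),
    %% fails: the object is now {key, 2}, so nothing is replaced
    0 = ets:select_replace(T, [{{key, 1}, [], [{const, {key, 2}}]}]).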
|
|
|
|
to run 3*3 seconds to avoid timeouts on slow machines.
|
|
|
|
run_workers/* -> run_smp_workers/*
run_workers_do/4 -> run_sched_workers/4
|