diff options
author | Kjell Winblad <[email protected]> | 2018-09-05 21:45:57 +0200 |
---|---|---|
committer | Kjell Winblad <[email protected]> | 2018-09-05 21:46:40 +0200 |
commit | 2a8e00ad72f5a0a9c73d558f247c23d27d6ffd5b (patch) | |
tree | c5fa25cd6618c3aab76ea0676ee9ada87316c8c3 /erts/emulator/beam/erl_db_tree_util.h | |
parent | 26e03d10c6c51868640869da8b091efdeab28bb0 (diff) | |
download | otp-2a8e00ad72f5a0a9c73d558f247c23d27d6ffd5b.tar.gz otp-2a8e00ad72f5a0a9c73d558f247c23d27d6ffd5b.tar.bz2 otp-2a8e00ad72f5a0a9c73d558f247c23d27d6ffd5b.zip |
Add a more scalable ETS ordered_set implementation
The current ETS ordered_set implementation can quickly become a
scalability bottleneck on multicore machines when an application updates
an ordered_set table from concurrent processes [1][2]. The current
implementation is based on an AVL tree protected from concurrent writes
by a single readers-writer lock. Furthermore, the current implementation
has an optimization, called the stack optimization [3], that can improve
the performance when only a single process accesses a table but can
cause bad scalability even in read-only scenarios. It is possible to
pass the option {write_concurrency, true} to ets:new/2 when creating an
ETS table of type ordered_set but this option has no effect for tables
of type ordered_set without this commit. The new ETS ordered_set
implementation, added by this commit, is only activated when one passes
the options ordered_set and {write_concurrency, true} to the ets:new/2
function. Thus, the previous ordered_set implementation (from here on
called the default implementation) can still be used in applications
that do not benefit from the new implementation. The benchmark results
on the following web page show that the new implementation is many times
faster than the old implementation in some scenarios and that the old
implementation is still better than the new implementation in some
scenarios.
http://winsh.me/ets_catree_benchmark/ets_ca_tree_benchmark_results.html
The new implementation is expected to scale better than the default
implementation when concurrent processes use the following ETS
operations to operate on a table:
delete/2, delete_object/2, first/1, insert/2 (single object),
insert_new/2 (single object), lookup/2, lookup_element/2, member/2,
next/2, take/2 and update_element/3 (single object).
Currently, the new implementation does not have scalable support for the
other operations (e.g., select/2). However, when these operations are
used infrequently, the new implantation may still scale better than the
default implementation as the benchmark results at the URL above shows.
Description of the New Implementation
----------------------------------
The new implementation is based on a data structure which is called the
contention adapting search tree (CA tree for short). The following
publication contains a detailed description of the CA tree:
A Contention Adapting Approach to Concurrent Ordered Sets
Journal of Parallel and Distributed Computing, 2018
Kjell Winblad and Konstantinos Sagonas
https://doi.org/10.1016/j.jpdc.2017.11.007
http://www.it.uu.se/research/group/languages/software/ca_tree/catree_proofs.pdf
A discussion of how the CA tree can be used as an ETS back-end can be
found in another publication [1]. The CA tree is a data structure that
dynamically changes its synchronization granularity based on detected
contention. Internally, the CA tree uses instances of a sequential data
structure to store items. The CA tree implementation contained in this
commit uses the same AVL tree implementation as is used for the default
ordered set implementation. This AVL tree implementation is reused so
that much of the existing code to implement the ETS operations can be
reused.
Tests
-----
The ETS tests in `lib/stdlib/test/ets_SUITE.erl` have been extended to
also test the new ordered_set implementation. The function
ets_SUITE:throughput_benchmark/0 has also been added to this file. This
function can be used to measure and compare the performance of the
different ETS table types and options. This function writes benchmark
data to standard output that can be visualized by the HTML page
`lib/stdlib/test/ets_SUITE_data/visualize_throughput.html`.
[1]
More Scalable Ordered Set for ETS Using Adaptation.
In Thirteenth ACM SIGPLAN workshop on Erlang (2014).
Kjell Winblad and Konstantinos Sagonas.
https://doi.org/10.1145/2633448.2633455
http://www.it.uu.se/research/group/languages/software/ca_tree/erlang_paper.pdf
[2]
On the Scalability of the Erlang Term Storage
In Twelfth ACM SIGPLAN workshop on Erlang (2013)
Kjell Winblad, David Klaftenegger and Konstantinos Sagonas
https://doi.org/10.1145/2505305.2505308
http://winsh.me/papers/erlang_workshop_2013.pdf
[3]
The stack optimization works by keeping one preallocated stack instance
in every ordered_set table. This stack is updated so that it contains
the search path in some read operations (e.g., ets:next/2). This makes
it possible for a subsequent ets:next/2 to avoid traversing some nodes
in some cases. Unfortunately, the preallocated stack needs to be flagged
so that it is not updated concurrently by several threads which cause
bad scalability.
Diffstat (limited to 'erts/emulator/beam/erl_db_tree_util.h')
-rw-r--r-- | erts/emulator/beam/erl_db_tree_util.h | 151 |
1 files changed, 151 insertions, 0 deletions
diff --git a/erts/emulator/beam/erl_db_tree_util.h b/erts/emulator/beam/erl_db_tree_util.h new file mode 100644 index 0000000000..fbd9a9124a --- /dev/null +++ b/erts/emulator/beam/erl_db_tree_util.h @@ -0,0 +1,151 @@ +/* + * %CopyrightBegin% + * + * Copyright Ericsson AB 1998-2016. All Rights Reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + * %CopyrightEnd% + */ + +#ifndef _DB_TREE_UTIL_H +#define _DB_TREE_UTIL_H + +/* +** Internal functions and macros used by both the CA tree and the AVL tree +*/ + +/* +** A stack of this size is enough for an AVL tree with more than +** 0xFFFFFFFF elements. May be subject to change if +** the datatype of the element counter is changed to a 64 bit integer. +** The Maximal height of an AVL tree is calculated as: +** h(n) <= 1.4404 * log(n + 2) - 0.328 +** Where n denotes the number of nodes, h(n) the height of the tree +** with n nodes and log is the binary logarithm. +*/ + +#define STACK_NEED 50 + +#define PUSH_NODE(Dtt, Tdt) \ + ((Dtt)->array[(Dtt)->pos++] = Tdt) + +#define POP_NODE(Dtt) \ + (((Dtt)->pos) ? \ + (Dtt)->array[--((Dtt)->pos)] : NULL) + +#define TOP_NODE(Dtt) \ + ((Dtt->pos) ? \ + (Dtt)->array[(Dtt)->pos - 1] : NULL) + +#define EMPTY_NODE(Dtt) (TOP_NODE(Dtt) == NULL) + +static ERTS_INLINE void free_term(DbTable *tb, TreeDbTerm* p) +{ + db_free_term(tb, p, offsetof(TreeDbTerm, dbterm)); +} + +/* +** Some macros for "direction stacks" +*/ +#define DIR_LEFT 0 +#define DIR_RIGHT 1 +#define DIR_END 2 + +static ERTS_INLINE Sint cmp_key(DbTableCommon* tb, Eterm key, TreeDbTerm* obj) { + return CMP(key, GETKEY(tb,obj->dbterm.tpl)); +} + +int tree_balance_left(TreeDbTerm **this); +int tree_balance_right(TreeDbTerm **this); + +int db_first_tree_common(Process *p, DbTable *tbl, TreeDbTerm *root, + Eterm *ret, DbTableTree *stack_container); +int db_next_tree_common(Process *p, DbTable *tbl, + TreeDbTerm *root, Eterm key, + Eterm *ret, DbTreeStack* stack); +int db_last_tree_common(Process *p, DbTable *tbl, TreeDbTerm *root, + Eterm *ret, DbTableTree *stack_container); +int db_prev_tree_common(Process *p, DbTable *tbl, TreeDbTerm *root, Eterm key, + Eterm *ret, DbTreeStack* stack); +int db_put_tree_common(DbTableCommon *tb, TreeDbTerm **root, Eterm obj, + int key_clash_fail, DbTableTree *stack_container); +int db_get_tree_common(Process *p, DbTableCommon *tb, TreeDbTerm *root, Eterm key, + Eterm *ret, DbTableTree *stack_container); +int db_get_element_tree_common(Process *p, DbTableCommon *tb, TreeDbTerm *root, Eterm key, + int ndex, Eterm *ret, DbTableTree *stack_container); +int db_member_tree_common(DbTableCommon *tb, TreeDbTerm *root, Eterm key, Eterm *ret, + DbTableTree *stack_container); +int db_erase_tree_common(DbTable *tbl, TreeDbTerm **root, Eterm key, Eterm *ret, + DbTreeStack *stack /* NULL if no static stack */); +int db_erase_object_tree_common(DbTable *tbl, TreeDbTerm **root, Eterm object, + Eterm *ret, DbTableTree *stack_container); +int db_slot_tree_common(Process *p, DbTable *tbl, TreeDbTerm *root, + Eterm slot_term, Eterm *ret, + DbTableTree *stack_container); +int db_select_chunk_tree_common(Process *p, DbTableCommon *tb, TreeDbTerm **root, + Eterm tid, Eterm pattern, Sint chunk_size, + int reverse, Eterm *ret, + DbTableTree *stack_container); +int db_select_tree_common(Process *p, DbTableCommon *tb, TreeDbTerm **root, + Eterm tid, Eterm pattern, int reverse, Eterm *ret, + DbTableTree *stack_container); +int db_select_delete_tree_common(Process *p, DbTable *tbl, + TreeDbTerm **root, + Eterm tid, Eterm pattern, + Eterm *ret, + DbTreeStack* stack); +int db_select_continue_tree_common(Process *p, + DbTableCommon *tb, + TreeDbTerm **root, + Eterm continuation, + Eterm *ret, + DbTableTree *stack_container); +int db_select_delete_continue_tree_common(Process *p, + DbTable *tbl, + TreeDbTerm **root, + Eterm continuation, + Eterm *ret, + DbTreeStack* stack); +int db_select_count_tree_common(Process *p, DbTableCommon *tb, TreeDbTerm **root, + Eterm tid, Eterm pattern, Eterm *ret, + DbTableTree *stack_container); +int db_select_count_continue_tree_common(Process *p, + DbTableCommon *tb, + TreeDbTerm **root, + Eterm continuation, + Eterm *ret, + DbTableTree *stack_container); +int db_select_replace_tree_common(Process *p, DbTableCommon *tb, TreeDbTerm **root, + Eterm tid, Eterm pattern, Eterm *ret, + DbTableTree *stack_container); +int db_select_replace_continue_tree_common(Process *p, + DbTableCommon *tb, + TreeDbTerm **root, + Eterm continuation, + Eterm *ret, + DbTableTree *stack_container); +int db_take_tree_common(Process *p, DbTable *tbl, TreeDbTerm **root, + Eterm key, Eterm *ret, + DbTreeStack *stack /* NULL if no static stack */); +void db_print_tree_common(fmtfn_t to, void *to_arg, + int show, TreeDbTerm *root, DbTable *tbl); +void db_foreach_offheap_tree_common(TreeDbTerm *root, + void (*func)(ErlOffHeap *, void *), + void * arg); +int db_lookup_dbterm_tree_common(Process *p, DbTable *tbl, TreeDbTerm **root, + Eterm key, Eterm obj, DbUpdateHandle* handle, + DbTableTree *stack_container); +void db_finalize_dbterm_tree_common(int cret, DbUpdateHandle *handle, + DbTableTree *stack_container); +#endif /* _DB_TREE_UTIL_H */ |