diff options
Diffstat (limited to 'erts/emulator/internal_doc')
-rw-r--r-- | erts/emulator/internal_doc/CarrierMigration.md | 285 | ||||
-rw-r--r-- | erts/emulator/internal_doc/CodeLoading.md | 186 | ||||
-rw-r--r-- | erts/emulator/internal_doc/DelayedDealloc.md | 175 | ||||
-rw-r--r-- | erts/emulator/internal_doc/PTables.md | 356 | ||||
-rw-r--r-- | erts/emulator/internal_doc/PortSignals.md | 267 | ||||
-rw-r--r-- | erts/emulator/internal_doc/ProcessManagementOptimizations.md | 172 | ||||
-rw-r--r-- | erts/emulator/internal_doc/SuperCarrier.md | 191 | ||||
-rw-r--r-- | erts/emulator/internal_doc/ThreadProgress.md | 308 | ||||
-rw-r--r-- | erts/emulator/internal_doc/Tracing.md | 220 | ||||
-rw-r--r-- | erts/emulator/internal_doc/dec.erl | 43 | ||||
-rw-r--r-- | erts/emulator/internal_doc/erl_ext_dist.txt | 21 |
11 files changed, 2192 insertions, 32 deletions
diff --git a/erts/emulator/internal_doc/CarrierMigration.md b/erts/emulator/internal_doc/CarrierMigration.md new file mode 100644 index 0000000000..2a9594db25 --- /dev/null +++ b/erts/emulator/internal_doc/CarrierMigration.md @@ -0,0 +1,285 @@ +Carrier Migration +================= + +The ERTS memory allocators manage memory blocks in two types of raw +memory chunks. We call these chunks of raw memory +*carriers*. Singleblock carriers which only contain one large block, +and multiblock carriers which contain multiple blocks. A carrier is +typically created using `mmap()` on unix systems. However, how a +carrier is created is of minor importance. An allocator instance +typically manages a mixture of single- and multiblock carriers. + +Problem +------- + +When a carrier is empty, i.e. contains only one large free block, it +is deallocated. Since multiblock carriers can contain both allocated +blocks and free blocks at the same time, an allocator instance might +be stuck with a large amount of poorly utilized carriers if the memory +load decreases. After a peak in memory usage it is expected that not +all memory can be returned since the blocks still allocated are likely +to be dispersed over multiple carriers. Such poorly utilized carriers +can usually be reused if the memory load increases again. However, +since each scheduler thread manages its own set of allocator +instances, and memory load is not necessarily correlated to CPU load, we +might get into a situation where there are lots of poorly utilized +multiblock carriers on some allocator instances while we need to +allocate new multiblock carriers on other allocator instances. In +scenarios like this, the demand for multiblock carriers in the system +might increase at the same time as the actual memory demand in the +system has decreased which is both unwanted and quite unexpected for +the end user. + +Solution +-------- + +In order to prevent scenarios like this we've implemented support for +migration of multiblock carriers between allocator instances of the +same type. + +### Management of Free Blocks ### + +In order to be able to remove a carrier from one allocator instance +and add it to another we need to be able to move references to the +free blocks of the carrier between the allocator instances. The +allocator instance specific data structure referring to the free +blocks it manages often refers to the same carrier from multiple +places. For example, when the address order bestfit strategy is used +this data structure is a binary search tree spanning all carriers that +the allocator instance manages. Free blocks in one specific carrier +can be referred to from potentially every other carrier that is +managed, and the amount of such references can be huge. That is, the +work of removing the free blocks of such a carrier from the search +tree will be huge. One way of solving this could be not to migrate +carriers that contain lots of free blocks, but this would prevent us +from migrating carriers that potentially need to be migrated in order +to solve the problem we set out to solve. + +By using one data structure of free blocks in each carrier and an +allocator instance-wide data structure of carriers managed by the +allocator instance, the work needed in order to remove and add +carriers can be kept to a minimum. When migration of carriers is +enabled on a specific allocator type, we require that an allocation +strategy with such an implementation is used. Currently we've +implemented this for three different allocation strategies. All of +these strategies use a search tree of carriers sorted so that we can +find the carrier with the lowest address that can satisfy the +request. Internally in carriers we use yet another search tree that +either implement address order first fit, address order best fit, +or best fit. The abbreviations used for these different allocation +strategies are `aoff`, and `aoffcaobf`, `aoffcbf`. + +### Carrier Pool ### + +In order to migrate carriers between allocator instances we move them +through a pool of carriers. In order for a carrier migration to +complete, one scheduler needs to move the carrier into the pool, and +another scheduler needs to take the carrier out of the pool. + +The pool is implemented as a lock-free, circular, double linked, +list. The list contains a sentinel which is used as the starting point +when inserting to, or fetching from, the pool. Carriers in the pool are +elements in this list. + +The list can be modified by all scheduler threads +simultaneously. During modifications the double linked list is allowed +to get a bit "out of shape". For example, following the `next` pointer +to the next element and then following the `prev` pointer does not +always take you back to were you started. The following is however +always true: + +* Repeatedly following `next` pointers will eventually take you to the + sentinel. +* Repeatedly following `prev` pointers will eventually take you to the + sentinel. +* Following a `next` or a `prev` pointer will take you to either an + element in the pool, or an element that used to be in the pool. + +When inserting a new element we search for a place to insert the +element by only following `next` pointers, and we always begin by +skipping the first element encountered. When trying to fetch an +element we do the same thing, but instead only follow `prev` pointers. + +By going different directions when inserting and fetching, we avoid +contention between threads inserting and threads fetching as much as +possible. By skipping one element when we begin searching, we preserve +the sentinel unmodified as much as possible. This is beneficial since +all search operations need to read the content of the sentinel. If we +were to modify the sentinel, the cache line containing the sentinel +would unnecessarily be bounced between processors. + +The `prev` and `next` fields in the elements of the list contain the +value of the pointer, a modification marker, and a deleted +marker. Memory operations on these fields are done using atomic memory +operations. When a thread has set the modification marker in a field, +no-one except the thread that set the marker is allowed to modify the +field. If multiple modification markers need to be set, we always +begin with `next` fields followed by `prev` fields in the order +following the actual pointers. This guarantees that no deadlocks will +occur. + +When a carrier is being removed from a pool, we mark it with a thread +progress value that needs to be reached before we are allowed to +modify the `next` and `prev` fields. That is, until we reach this +thread progress we are not allowed to insert the carrier into the pool +again, and we are not allowed to deallocate the carrier. This ensures +that threads inspecting the pool always will be able to traverse the +pool and reach valid elements. Once we have reached the thread +progress value that the carrier was tagged with, we know that no +threads may have references to it via the pool. + +### Migration ### + +There exists one pool for each allocator type enabling migration of +carriers between scheduler specific allocator instances of the same +allocator type. + +Each allocator instance keeps track of the current utilization of its +multiblock carriers. When the total utilization falls below the "abandon +carrier utilization limit" it starts to inspect the utilization of the +current carrier when deallocations are made. If also the utilization +of the carrier falls below the "abandon carrier utilization limit" it +unlinks the carrier from its data structure of available free blocks +and inserts the carrier into the pool. + +Since the carrier has been unlinked from the data structure of +available free blocks, no more allocations will be made in the +carrier. The allocator instance putting the carrier into the pool, +however, still has the responsibility of performing deallocations in +it while it remains in the pool. The allocator instance with this +deallocation responsibility is here called the **employer**. + +Each carrier has a flag field containing information about the +employing allocator instance, a flag indicating if the carrier is in +the pool or not, and a flag indicating if it is busy or not. When the +carrier is in the pool, the employing allocator instance needs to mark it +as busy while operating on it. If another thread inspects it in order +to try to fetch it from the pool, it will skip it if it is busy. When +fetching the carrier from the pool, employment will change and further +deallocations in the carrier will be redirected to the new +employer using the delayed dealloc functionality. + +If a carrier in the pool becomes empty, it will be withdrawn from the +pool. All carriers that become empty are also always passed to its +**owning** allocator instance for deallocation using the delayed +dealloc functionality. Since carriers this way always will be +deallocated by the owner that allocated the carrier, the +underlying functionality of allocating and deallocating carriers can +remain simple and doesn't have to bother about multiple threads. In a +NUMA system we will also not mix carriers originating from multiple +NUMA nodes. + +In short: + +* The allocator instance that created a carrier **owns** it. +* An empty carrier is always deallocated by its **owner**. +* **Ownership** never changes. +* The allocator instance that uses a carrier **employs** it. +* An **employer** can abandon a carrier into the pool. +* Pooled carriers are not allocated from. +* Deallocation in a pooled carrier is still performed by its **employer**. +* **Employment** can only change when a carrier is fetched from the pool. + +### Searching the pool ### + +To harbor real time characteristics, searching the pool is +limited. We only inspect a limited number of carriers. If none of +those carriers had a free block large enough to satisfy the allocation +request, the search will fail. A carrier in the pool can also be busy +if another thread is currently doing block deallocation work on the +carrier. A busy carrier will also be skipped by the search as it can +not satisfy the request. The pool is lock-free and we do not want to +block, waiting for the other thread to finish. + +#### Before OTP 17.4 #### + +When an allocator instance needs more carrier space, it always begins +by inspecting its own carriers that are waiting for thread progress +before they can be deallocated. If no such carrier could be found, it +then inspects the pool. If no carrier could be fetched from the pool, +it will allocate a new carrier. Regardless of where the allocator +instance gets the carrier from it the just links in the carrier into +its data structure of free blocks. + +#### After OTP 17.4 #### + +The old search algorithm had a problem as the search always started at +the same position in the pool, the sentinel. This could lead to +contention from concurrent searching processes. But even worse, it +could lead to a "bad" state when searches fail with a high rate +leading to new carriers instead being allocated. These new carriers +may later be inserted into the pool due to bad utilization. If the +frequency of insertions into the pool is higher than successful +fetching from the pool, memory will eventually get exhausted. + +This "bad" state consists of a cluster of small and/or highly +fragmented carriers located at the sentinel in the pool. The largest free +block in such a "bad" carrier is rather small, making it unable to satisfy +most allocation requests. As the search always started at the +sentinel, any such "bad" carriers that had been left in the pool would +eventually cluster together at the sentinel. All searches first +have to skip past this cluster of "bad" carriers to reach a "good" +carrier. When the cluster gets to the same size as the search limit, +all searches will essentially fail. + +To counter the "bad cluster" problem and also ease the contention, the +search will now always start by first looking at the allocators **own** +carriers. That is, carriers that were initially created by the +allocator itself and later had been abandoned to the pool. If none of +our own abandoned carrier would do, then the search continues into the +pool, as before, to look for carriers created by other +allocators. However, if we have at least one abandoned carrier of our +own that could not satisfy the request, we can use that as entry point +into the pool. + +The result is that we prefer carriers created by the thread itself, +which is good for NUMA performance. And we get more entry points when +searching the pool, which will ease contention and clustering. + +To do the first search among own carriers, every allocator instance +has two new lists: `pooled_list` and `traitor_list`. These lists are only +accessed by the allocator itself and they only contain the allocator's +own carriers. When an owned carrier is abandoned and put in the +pool, it is also linked into `pooled_list`. When we search our +`pooled_list` and find a carrier that is no longer in the pool, we +move that carrier from `pooled_list` to `traitor_list` as it is now +employed by another allocator. If searching `pooled_list` fails, we +also do a limited search of `traitor_list`. When finding an abandoned +carrier in `traitor_list` it is either employed or moved back to +`pooled_list` if it could not satisfy the allocation request. + +When searching `pooled_list` and `traitor_list` we always start at the +point where the last search ended. This to avoid clustering +problems and increase the probability to find a "good" carrier. As +`pooled_list` and `traitor_list` are only accessed by the owning +allocator instance, they need no thread synchronization at all. + +Furthermore, the search for own carriers that are scheduled +for deallocation is now done as the last search option. The idea is +that it is better to reuse a poorly utilized carrier than to +resurrect an empty carrier that was just about to be released back to +the OS. + +### Result ### + +The use of this strategy of abandoning carriers with poor utilization +and reusing them in allocator instances with an increased carrier +demand is extremely effective and completely eliminates the problems +that otherwise sometimes occurred when CPU load dropped while memory +load did not. + +When using the `aoffcaobf` or `aoff` strategies compared to `gf` or +`bf`, we loose some performance since we get more modifications in the +data structure of free blocks. This performance penalty is however +reduced using the `aoffcbf` strategy. A tradeoff between memory +consumption and performance is however inevitable, and it is up to +the user to decide what is most important. + +Further work +------------ + +It would be quite easy to extend this to allow migration of multiblock +carriers between all allocator types. More or less the only obstacle +is maintenance of the statistics information. + + diff --git a/erts/emulator/internal_doc/CodeLoading.md b/erts/emulator/internal_doc/CodeLoading.md new file mode 100644 index 0000000000..151b9cd57c --- /dev/null +++ b/erts/emulator/internal_doc/CodeLoading.md @@ -0,0 +1,186 @@ +Non-Blocking Code Loading +========================= + +Introduction +------------ + +Before OTP R16 when an Erlang code module was loaded, all other +execution in the VM were halted while the load operation was carried +out in single threaded mode. This might not be a big problem for +initial loading of modules during VM boot, but it can be a severe +problem for availability when upgrading modules or adding new code on +a VM with running payload. This problem grows with the number of cores +as both the time it takes to wait for all schedulers to stop increases +as well as the potential amount of halted ongoing work. + +In OTP R16, modules are loaded without blocking the VM. +Erlang processes may continue executing undisturbed in parallel during +the entire load operation. The code loading is carried out by a normal +Erlang process that is scheduled like all the others. The load +operation is completed by making the loaded code visible to all +processes in a consistent way with one single atomic +instruction. Non-blocking code loading will improve real-time +characteristics when modules are loaded/upgraded on a running SMP +system. + + +The Load Phases +--------------- + +The loading of a module is divided into two phases; a *prepare phase* +and a *finishing phase*. The prepare phase contains reading the BEAM +file format and all the preparations of the loaded code that can +easily be done without interference with the running code. The +finishing phase will make the loaded (and prepared) code accessible +from the running code. Old module versions (replaced or deleted) will +also be made inaccessible by the finishing phase. + +The prepare phase is designed to allow several "loader" processes to +prepare separate modules in parallel while the finishing phase can +only be done by one loader process at a time. A second loader process +trying to enter finishing phase will be suspended until the first +loader is done. This will only block the process, the scheduler is +free to schedule other work while the second loader is waiting. (See +`erts_try_seize_code_write_permission` and +`erts_release_code_write_permission`). + +The ability to prepare several modules in parallel is not currently +used as almost all code loading is serialized by the code_server +process. The BIF interface is however prepared for this. + + erlang:prepare_loading(Module, Code) -> LoaderState + erlang:finish_loading([LoaderState]) + +The idea is that `prepare_loading` could be called in parallel for +different modules and returns a "magic binary" containing the internal +state of each prepared module. Function `finish_loading` could take a +list of such states and do the finishing of all of them in one go. + +Currenlty we use the legacy BIF `erlang:load_module` which is now +implemented in Erlang by calling the above two functions in +sequence. Function `finish_loading` is limited to only accepts a list +with one module state as we do not yet use the multi module loading +feature. + + +The Finishing Sequence +---------------------- + +During VM execution, code is accessed through a number of data +structures. These *code access structures* are + +* Export table. One entry for every exported function. +* Module table. One entry for each loaded module. +* "beam_catches". Identifies jump destinations for catch instructions. +* "beam_ranges". Map code address to function and line in source file. + +The most frequently used of these structures is the export table that +is accessed in run time for every executed external function call to +get the address of the callee. For performance reasons, we want to +access all these structures without any overhead from thread +synchronization. Earlier this was solved with an emergency break. Stop +the entire VM to mutate these code access structures, otherwise treat +them as read-only. + +The solution in R16 is instead to *replicate* the code access +structures. We have one set of active structures read by the running +code. When new code is loaded the active structures are copied, the +copy is updated to include the newly loaded module and then a switch +is made to make the updated copy the new active set. The active set is +identified by a single global atomic variable +`the_active_code_index`. The switch can thus be made by a single +atomic write operation. The running code have to read this atomic +variable when using the active access structures, which means one +atomic read operation per external function call for example. The +performance penalty from this extra atomic read is however very small +as it can be done without any memory barriers at all (as described +below). With this solution we also preserve the transactional feature +of a load operation. Running code will never see the intermediate +result of a half loaded module. + +The finishing phase is carried out in the following sequence by the +BIF `erlang:finish_loading`: + +1. Seize exclusive code write permission (suspend process if needed + until we get it). + +2. Make a full copy of all the active access structures. This copy is + called the staging area and is identified by the global atomic + variable `the_staging_code_index`. + +3. Update all access structures in the staging area to include the + newly prepared module. + +4. Schedule a thread progress event. That is a time in the future when + all schedulers have yielded and executed a full memory barrier. + +5. Suspend the loader process. + +6. After thread progress, commit the staging area by assigning + `the_staging_code_index` to `the_active_code_index`. + +7. Release the code write permission allowing other processes to stage + new code. + +8. Resume the loader process allowing it to return from + `erlang:finish_loading`. + + +### Thread Progress + +The waiting for thread progress in 4-6 is necessary in order for +processes to read `the_active_code_index` atomic during normal +execution without any expensive memory barriers. When we write a new +value into `the_active_code_index` in step 6, we know that all +schedulers will see an updated and consistent view of all the new +active access structures once they become reachable through +`the_active_code_index`. + +The total lack of memory barrier when reading `the_active_code_index` +has one interesting consequence however. Different processes may see +the new code at different point in time depending on when different +cores happen to refresh their hardware caches. This may sound unsafe +but it actually does not matter. The only property we must guarantee +is that the ability to see the new code must spread with process +communication. After receiving a message that was triggered by new +code, the receiver must be guaranteed to also see the new code. This +will be guaranteed as all types of process communication involves +memory barriers in order for the receiver to be sure to read what the +sender has written. This implicit memory barrier will then also make +sure that the receiver reads the new value of `the_active_code_index` +and thereby also sees the new code. This is true for all kinds of +inter process communication (TCP, ETS, process name registering, +tracing, drivers, NIFs, etc) not just Erlang messages. + +### Code Index Reuse + +To optimize the copy operation in step 2, code access structures are +reused. In current solution we have three sets of code access +structures, identified by a code index of 0, 1 and 2. These indexes +are used in a round robin fashion. Instead of having to initialize a +completely new copy of all access structures for every load operation +we just have to update with the changes that have happened since the +last two code load operations. We could get by with only two code +indexes (0 and 1), but that would require yet another round of waiting +for thread progress before step 2 in the `finish_loading` sequence. We +cannot start reusing a code index as staging area until we know that +no lingering scheduler thread is still using it as the active code +index. With three generations of code indexes, the waiting for thread +progress in step 4-6 will give this guarantee for us. Thread progress +will wait for all running schedulers to reschedule at least one +time. No ongoing execution reading code access structures reached from +an old value of `the_active_code_index` can exist after a second round +of thread progress. + +The design choice between two or three generations of code access +structures is a trade-off between memory consumption and code loading +latency. + +### A Consistent Code View + +Some native BIFs may need to get a consistent snapshot view of the +active code. To do this it is important to only read +`the_active_code_index` one time and then use that index value for all +code accessing during the BIF. If a load operation is executed in +parallel, reading `the_active_code_index` a second time might result +in a different value, and thereby a different view of the code. diff --git a/erts/emulator/internal_doc/DelayedDealloc.md b/erts/emulator/internal_doc/DelayedDealloc.md new file mode 100644 index 0000000000..b7d87b839f --- /dev/null +++ b/erts/emulator/internal_doc/DelayedDealloc.md @@ -0,0 +1,175 @@ +Delayed Dealloc +=============== + +Problem +------- + +An easy way to handle memory allocation in a multi-threaded +environment is to protect the memory allocator with a global lock +which threads performing memory allocations or deallocations have to +have locked during the whole operation. This solution of course scales +very poorly, due to heavy lock contention. An improved solution of +this scheme is to use multiple thread specific instances of such an +allocator. That is, each thread allocates in its own allocator +instance which is protected by a lock. In the general case references +to memory need to be passed between threads. In the case where a +thread that needs to deallocate memory that originates from another +threads allocator instance a lock conflict is possible. In a system as +the Erlang VM where memory allocation/deallocation is frequent and +references to memory also are passed around between threads this +solution will also scale poorly due to lock contention. + +Functionality Used to Adress This problem +----------------------------------------- + +In order to reduce contention due to locking of allocator instances we +introduced completely lock free instances tied to each scheduler +thread, and an extra locked instance for other threads. The scheduler +threads in the system is expected to do the major part of the +work. Other threads may still be needed but should not perform any +major and/or time critical work. The limited amount of contention that +appears on the locked allocator instance can more or less be +disregarded. + +Since we still need to be able to pass references to memory between +scheduler threads we need some way to manage this. An allocator +instance belonging to one scheduler thread is only allowed to be +manipulated by that scheduler thread. When other threads need to +deallocate memory originating from a foreign allocator instance, they +only pass the memory block to a "message box" containing deallocation +jobs attached to the originating allocator instance. When a scheduler +thread detects such deallocation job it performs the actual +deallocation. + +The "message box" is implemented using a lock free single linked list +through the memory blocks to deallocate. The order of the elements in +this list is not important. Insertion of new free blocks will be made +somewhere near the end of this list. Requirering that the new blocks +need to be inserted at the end would cause unnecessary contention when +large amount of memory blocks are inserted simultaneous by multiple +threads. + +The data structure refering to this single linked list cover two cache +lines. One cache line containing information about the head of the +list, and one cache line containing information about the tail of the +list. This in order to reduce cache line ping ponging of this data +structure. The head of the list will only be manipulated by the thread +owning the allocator instance, and the tail will be manipulated by +other threads inserting deallocation jobs. + +### Tail ### + +In the tail part of the data structure we find a pointer to the last +element of the list, or at least something that is near the end of the +list. In the uncontended case it will point to the end of the list, +but when simultaneous insert operations are performed it will point to +something near the end of the list. + +When insterting an element one will try to write a pointer to the new +element in the next pointer of the element pointed to by the last +pointer. This is done using an atomic compare and swap that expects +the next pointer to be `NULL`. If this succeds the thread performing +this operation moves the last pointer to point to the newly inserted +element. + +If the atomic compare and swap described above failed, the last +pointer didn't point to the last element. In this case we need to +insert the new element somewhere inbetween the element that the last +pointer pointed to and the actual last element. If we do it this way +the last pointer will eventually end up at the last element when +threads stop adding new elements. When trying to insert somewhere near +the end and failing to do so, the inserting thread sometimes moves to +the next element and somtimes tries with the same element again. This +in order to spread the inserted elements during heavy contention. That +is, we try to spread the modifications of memory to different +locations instead of letting all threads continue to try to modify the +same location in memory. + +### Head ### + +The head contains pointers to begining of the list (`head.first`), and +to the first block which other threads may refer to +(`head.unref_end`). Blocks between these pointers are only refered to +by the head part of the data structure which is only used by the +thread owning the allocator instance. When these two pointers are not +equal the thread owning the allocator instance deallocate block after +block until `head.first` reach `head.unref_end`. + +We of course periodically need to move the `head.unref_end` closer to +the end in order to be able to continue deallocating memory +blocks. Since all threads inserting new elements in the linked list +will enter the list using the last pointer we can use this +knowledge. If we call `erts_thr_progress_later()` and wait until we +have reached that thread progress we know that no managed threads can +refer the elements up to the element pointed to by the last pointer at +the time when we called `erts_thr_progress_later()`. This since, all +managed threads must have left the code implementing this at least +once, and they always enters into the list via the last pointer. The +`tail.next` field contains information about next `head.unref_end` +pointer and thread progress that needs to be reached before we can +move `head.unref_end`. + +Unfortunately not only threads managed by the thread progress +functionality may insert memory blocks. Other threads also needs to be +taken care of. Other threads will not be as frequent users of this +functionality as managed threads, so using a less efficient scheme for +them is not that big of a problem. In order to handle unmanaged +threads we use two reference counters. When an unmanaged thread enters +this implementation it increments the reference counter currently +used, and when it leaves this implementation it decrements the same +reference counter. When the consumer thread calls +`erts_thr_progress_later()` in order to determine when it is safe to +move `head.unref_end`, it also swaps reference counters for unmanaged +threads. The previous current represents outstanding references from +the time up to this point. The new current represents future reference +following this point. When the consumer thread detects that we have +both reached the desired thread progress and when the previous current +reference counter reach zero it is safe to move the `head.unref_end`. + +The reason for using two reference counters is that we need to know +that the reference counter eventually will reach zero. If we only used +one reference counter it would potentially be held above zero for ever +by different unmanaged threads. + +### Empty List ### + +If no new memory blocks are inserted into the list, it should +eventually be emptied. All pointers to the list however expect to +always point to something. This is solved by inserting an empty +"marker" element, which only has to purpose of being there in the +absense of other elements. That is when the list is empty it only +contains this "marker" element. + +### Contention ### + +When elements are continously inserted by threads not owning the +allocator instance, the thread owning the allocator instance will be +able to work more or less undisturbed by other threads at the head end +of the list. At the tail end large amounts of simultaneous inserts may +cause contention, but we reduce such contention by spreading inserts +of new elements near the end instead of requiring all new elements to +be inserted at the end. + +### Schedulers and The Locked Allocator Instance ### + +Also the locked allocator instance for use by non-scheduler threads +have a message box for deallocation jobs just as all the other +allocator instances. The reason for this is that other threads may +allocate memory pass it to a scheduler that then needs to deallocate +it. We do not want the scheduler to have to wait for the lock on this +locked instance. Since also locked instances has message boxes for +deallocation jobs, the scheduler can just insert the job and avoid the +locking. + + +### A Benchmark Result ### + +When running the ehb benchmark, large amount of messages are passed +around between schedulers. All message passing will in some way or the +other cause memory allocation and deallocation. Since messages are +passed between different schedulers we will get contention on the +allocator instances where messages were allocated. By the introduction +of the delayed dealloc feature, we got a speedup of between 25-45%, +depending on configuration of the benchmark, when running on a +relatively new machine with an Intel i7 quad core processor with +hyper-threading using 8 schedulers.
\ No newline at end of file diff --git a/erts/emulator/internal_doc/PTables.md b/erts/emulator/internal_doc/PTables.md new file mode 100644 index 0000000000..6fe0e7665d --- /dev/null +++ b/erts/emulator/internal_doc/PTables.md @@ -0,0 +1,356 @@ +Process and Port Tables +======================= + +Problems +-------- + +The process table is a mapping from process identifiers to process +structure pointers. The process structure contains miscellaneous +information about a process, as for example pointers to its heap, +message queue, etc. When the runtime system needs to operate on a +process, it looks up the process structure in the process table using +the process identifier. An example of this is when passing a message +to a process. + +The process table has for a very long time just been an array of +pointers to process structures. Since process identifiers internally +in the runtime system are 28-bit integers it is quite easy to map a +process identifier to index into the array. The 28-bits were divided +into two sets. The least significant set of bits was used as index +into the array. The most significant set of bits was only used to be +able to distinguish between a number of identifiers with which map to +the same index in the array. As long as process table sizes of a power +of two was used we had 2^28 unique process identifiers. + +When the first SMP support was implemented, the table still was kept +more or less the same way, but protected by two types of locks. One +lock that protected the whole table against modifications and an array +of locks protecting different parts of the table. The exact locking +strategy previously used isn't interesting. What is interesting is +that it suffered from heavy lock contention especially when lots of +modifications was being made, but also when only performing lookups. + +In order to be able to detect when it is safe to deallocate a +previously used process structure, reference counting of the structure +was used. Also this was problematic, since simultaneous lookups needed +to modify the reference counter which also caused contention on the +cache line where the reference counter was located. This since all +modifications needs to be communicated between all involved +processors. + +The port table is very similar to the process table. The major +difference, at least in concept, is that it is a mapping from port +identifiers to port structures. It had a similar implementation, but +with some differences. Instead of being an array of pointers it was an +array of structures, and instead of being protected by two types of +locks it was only protected by one global lock. This table also +suffered from lock contention in various situations. + +Solution +-------- + +The process table was the major problem to address since processes are +much more frequently used than ports. The first implementation only +implemented this for processes, but since the port table is very +similar and very similar problems occur on the port table, the process +table implementation was later generalized so that it could also be +used implementing the port table. For simplicity I will only talk +about the process table in the following text, but the same will apply +to the port table unless otherwise stated. + +If we disregard the locking issues, the original solution is very +appealing. The mapping from process identifier to index into the array +is very fast, and this property is something we would like to +keep. The vast majority of operations on these tables are lookups so +optimizing for lookups is what we want to do. + +### Lookup ### + +Using a set of bits in the process identifier as index into an array +seems hard to beat. By replacing the array of pointers with an array +of our pointer sized atomic data type, a lookup will consist of the +following: + +1. Mapping the 28-bit integer to an index into the array. + + More about this mapping later. + +2. Read the pointer using an atomic memory operation at determined + index in array. + + On all platforms that we provide atomic memory operations, this is + just a `volatile` read, preventing the compiler to use values in + registers, forcing the a read from memory. + +3. Depending on use, issue appropriate memory barrier. + + A common barrier used is a barrier with acquire semantics. On + x86/x86_64 this maps to a compiler barrier preventing the compiler + to reorder instructions, but on other hardware often some kind of + light weight hardware memory barrier is also needed. + + When comparing with a locked approach, at least one heavy weight + memory barrier will be issued when locking the lock on most, if + not all, hardware architectures (including x86/x86_64), and often + some kind of light weight memory barrier will be issued when + unlocking the lock. + +When looking at this very simple solution with very little overhead +you might wonder why we didn't implement it this way from the +beginning. It all boils down to the read operation of the pointer. We +need some way to know that it is safe to access the memory pointed +to. One way of doing this is to place a reference counter in the +process structure. Increment of the reference counter at lookup needs +to be done atomically with the lookup. A lock can typically provide +this service for us, which was the approach we previously +used. Another approach could be to co-locate the reference counter +with the pointer in the table. The major problem with this approach is +the modifications of the reference counter. This since these +modification would have to be communicated between all involved +processor cause contention on the cache line containing the reference +counter. The new lookup approach above is possible since we can use +the "thread progress" functionality in order to determine when it is +safe to deallocate the process structure. We'll get back to this when +describing deletion in the table. + +Using this new lookup approach we wont modify any memory at all which +is important. A lookup conceptually only read memory, now this is true +in the implementation also which is important from a scalability +perspective. The previous implementation modified the cache line +containing the reference counter two times, and the cache line +containing the corresponding lock two times at each lookup. + +### Modifications of the Table ### + +A lightweight lookup in the table was the most important feature, but +we also wanted to improve modifications of the table. The process +table is modified when a new process is spawned, i.e. a new pointer is +inserted into the table, and when a process terminates, i.e. a pointer +is deleted in the table. + +Assuming that we spawn fewer processes than the maximum amount of +unique process identifiers in the system, one has always been able to +determine the order of process creation just by comparing process +identifiers. If PidX is larger than PidY, then PidX was created after +PidY assuming both identifiers originates from the same node. However, +since we have a quite limited amount of unique identifiers today +(2^28), this property cannot be relied upon if we create large amount +of processes. But never the less, this is a property the system always +have had. + +If we would have had a huge amount of unique identifiers available, it +would have tempting to drop or modify this ordering property as +described above. The ordering property could for example be based on +the scheduler performing the spawn operation. It would have been +possible to reserve large ranges of identifiers exclusive for each +scheduler thread which could be used minimizing the need for +communication when allocating identifiers. The amount of identifiers +we got to work with today is, however, not even close to be enough for +such an approach. + +Since we have a limited amount of unique identifiers, we need to be +careful not to waste them. If previously used identifiers are reused +too quick, identifiers originating from terminated processes will +refer to newly created processes, and mixups will occur. The +previously used approach was quite good at not wasting +identifiers. Using a modified version of the same approach also lets +us keep the ordering property that we have always had. + +#### Insert #### + +The original approach is more or less to search for next free index or +slot in the array. The search starts from the last slot allocated. If +we reach the end of the array we increase a "wrapped counter" and then +continue the search. The process identifier is constructed by writing +the index to the least significant set of bits, and the "wrapped +counter" to the most significant set of bits. The amount of bits in +each set of bits is decided at boot time, so that maximum index will +just fit into the least significant set of bits. + +In the modified lock free version of this approach we more or less do +it the same way, but with some important modifications trying to avoid +unnecessary contention when multiple schedulers create processes +simultaneously. Since multiple threads might be trying to search for +the next free slot at the same time from the same starting point we +want subsequent slots to be located in different cache lines. Multiple +schedulers simultaneously writing new pointers into the table are +therefore very likely to write into adjacent slots. If adjacent slots +are located in the same cache line all modification of this cache line +needs to be communicated between all involved processors which will be +very expensive and scale very poor. By locating adjacent slots in +different cache lines only true conflicts will trigger communication +between involved processors, i.e., avoiding false sharing. + +A cache line is larger than a pointer, typically 8 or 16 times larger, +so using one cache line for each slot only containing one pointer +would be a waste of space. Each cache line will be able to hold a +fixed amount of slots. The first slot of the table will be the first +slot of the first cache line, the second slot of the table will be the +first slot of the second cache line until we reach the end of the +array. The next slot after that will be the second slot of the first +cache line, etc, moving forward one cache line internal slot each time +we wrap. This way we will be able to fit the same amount of pointers +into an array of the same size while always keeping adjacent slots in +different cache lines. + +The mapping from identifier to slot or index into the array gets a bit +more complicated than before. Instead of a `shift` and a bitwise +`and`, we get two `shift`s, two bitwise `and`s, and an `add` (see +implementation of `erts_ptab_data2pix()` in `erl_ptab.h`). However, by +storing this information optimized for lookup we only need a `shift` +and a bitwise `and` on 32-bit platforms. On 64-bit platforms we got +enough room for the 28-bit identifier in the least significant +halfword, and the index in the most significant halfword, in other +words, we just need to read the most significant halfword to get the +index. That is, this operation is as fast, or faster than before. The +downside is that on 32-bit platforms we need to convert this +information into the 28-bit identifier number when printing, or when +ordering identifiers from the same node. These operations are, +however, extremely infrequent compared to lookups. + +When we insert a new element in the table we do the following: + +1. We begin by reserving space in the table by atomically + incrementing a counter of processes in the table. If our increment + brings the counter above the maximum size of the table, the + operation fail and a `system_limit` exception is raised. + +2. The table contains a 64-bit atomic variable of the last identifier + used. Only the least significant bits will be used when actually + creating the identifier. This identifier is where the search + begin. + +3. We increment last identifier value used. In order determine the + slot that corresponds to this identifier we call + `erts_ptab_data2pix()` that maps identifier to slot. We read the + content of the slot. If the slot is free we try to write a + reservation marker using an atomic compare and swap. If this fails + we repeat this step until it succeeds. + +4. Change the table variable of last identifier used. Since multiple + writes might occur at the same time this value may already have + been changed by to an identifier larger that the one we got. In + this case we can continue; otherwise, we need to change it to the + identifier we got. + +5. We now do some initializations of the process structure that + cannot be done before we know the process identifier, and have to + be done before we publish the structure in the table. This, for + example, includes storing the identifier in the process structure. + +6. Now we can publish the structure in the table by writing the the + pointer to the process structure in the slot previously reserved + in 3. + +Using this approach we keep the properties like identifier ordering, +and identifier reuse while improving performance and scalability. It +has one flaw, though. There is no guarantee that the operation will +terminate. This can quite easily be fixed though, and will be fixed in +the next release. We will get back to this below. + +#### Delete #### + +When a process terminates, we mark the process as terminated in the +process structure, the counter of number of processes in the table is +decreased, and the reference to the process structure is removed by +writing a `NULL` pointer into the corresponding slot. The scheduler +thread performing this then schedule a thread progress later job which +will do the final cleanup and deallocate the process structure. The +thread progress functionality will make sure that this job will not +execute until it is certain that all managed threads have dropped all +references to the process structure. + +### BIF Iterating Over the Table ### + +The `erlang:processes/1` and `erlang:port/1` BIFs iterate over the +tables and return corresponding identifiers. These BIF should return a +consistent snapshot of the table content during some time when the BIF +is executing. In order to implement this we use locking in a strange +way. We use an "inverted rwlock". + +When performing lookups in the table we do not need to bother about +the locking at all, but when modifying the table we read lock the +rwlock protecting the table which allows for multiple writers during +normal operation. When the BIF that iterates over the table need +access to the table it write locks the rwlock and reads content of the +table. The BIF do not read the whole table in one go but instead read +small chunks at time only write locking while reading. The actual +implementation of the BIFs is out of the scope of this document. + +An out of the box rwlock will typically suffer from contention on the +single cache line containing the state of the rwlock even in the case +we are only read locking. Instead of using such an rwlock, we have our +own implementation of reader optimized rwlocks which keeps track of +reader threads in separate thread specific cache lines. This in order +to avoid contention on a singe cache line. As long as we only do read +lock operations, threads only need to read a global cache line and +modify its own cache line, and by this minimize communication between +involved processors. The iterating BIFs are normally very infrequently +used, so in the normal case we will only do read lock operations on +the table global rwlock. + +### Future Improvements ### + +The first improvement is to fix the guarantee so that insert +operations will be guaranteed to terminate. When the operation starts +we verify that there actually exist a free slot that we can use. The +problem is that we might not find it since it may move when multiple +threads modify the table at the same time as we are trying to find the +slot. The easy fix is to abort the operation if an empty slot could +not be found in a finite number operation, and then restart the +operation under a write lock. This will be implemented in next +release, but furter work should be made trying to find a better +solution. + +This and also previous implementation do not work well when the table +is nearly full. We will both get long search times for free slots, and +we will reuse identifiers more frequently since we more frequently +wrap during the search. These tables works best when the table is much +larger than the amount of simultaneous existing processes. One easy +improvement is to always have room for more processes than we allow in +the table. This will also be implemented in the next release, but this +should probably also be worked more on trying to find an even better +solution. + +It would also be nice to get rid of the rwlock all together. The use +of a reader optimized rwlock makes sure we do not any contention on +the lock, but unnecessary memory barriers will be issued due to the +lock. The main issue here is to modify iterating BIFs so that they do +not require exclusive access to the table while reading a sequence of +slots. In principle this should be rather easy, the code can handle +sequences of variable sizes, so shrinking the sequence size of slots +to one would solv the problem. This will, however, need some tweeks +and modifications of not trival code, but is something that should be +looked at in the future. + +By increasing the size of identifiers, at least on 64-bit machines +(which isn't as easy as it first might seem) we get further room for +improvement. Besides the obvious improvement of not reusing +identifiers as fast as we currently do, it makes it possible to +further avoid contention when inserting elements in the table. At +least if we drop this ordering property, which isn't that useful +anyway. + +### Some Benchmark Results ### + +In order to test modifications of the process table we ran a couple of +benchmarks where lots of processes are spawned and terminated +simultaneously, and got a speedup of between 150-200%. Running a +similar benchmark but with ports we got a speedup of about 130%. + +The BIF `erlang:is_process_alive/1` is the closest you can get to a +process table lookup only. The BIF looks up the process corresponding +to the process identifier passed as argument, and then checks if it is +alive. By running multiple processes looping over this BIF checking +the same process, we get a speedup between 20000-23000%. Conceptually +this operation only involve read operations. In the implementation +used in R16B also only read operation are performed, while the +previous implementation need to lock structures in order to read the +data, suffering from both lock contention and contention due to +modifications of cache lines used by lock internal data structures and +the reference counter on the process being looked up. + +The benchmarks were run on a relatively new machine with an Intel i7 +quad core processor with hyper-threading using 8 schedulers. On a +machine with more communication overhead and/or larger amount of +logical processors the speedups are expected to be even larger. diff --git a/erts/emulator/internal_doc/PortSignals.md b/erts/emulator/internal_doc/PortSignals.md new file mode 100644 index 0000000000..b1afb7c5cb --- /dev/null +++ b/erts/emulator/internal_doc/PortSignals.md @@ -0,0 +1,267 @@ +Port Signals +============ + +Problems +-------- + +Erlang ports conceptually are very similar to Erlang processes. Erlang +processes execute Erlang code in the virtual machine, while an Erlang +port execute native code typically used for communication with the +outside world. For example, when an Erlang process wants to +communicate using TCP over the network, it communicates via an Erlang +port implementing the TCP socket interface in native code. Both Erlang +Processes and Ports communicate using asynchronous signaling. The +native code executed by an Erlang port is a collection of callback +functions, called a driver. Each callback more or less implements the +code of a signal to, or from the port. + +Even though processes and ports conceptually always have been very +similar, the implementations have been very different. Originally, +more or less all port signals were handled synchronously at the time +they occurred. Very early in the development of the SMP support for +the runtime system we recognized that this was a huge problem for +signals between ports and the outside world. That is, I/O events to +and from the outside world, or I/O signals. This was one of the first +things that had to be rewritten in order to be able to do I/O in +parallel at all. The solution was to implement scheduling of these +signals. I/O signals corresponding to different ports could then be +executed in parallel on different scheduler threads. Signals from +processes to ports was not as big of a problem as the I/O signals, and +the implementation of those was left as they were. + +Each port is protected by its own lock to protect against simultaneous +execution in multiple threads. Previously when a process, executing on +a scheduler thread, sent a port a signal, it locked the port lock and +synchronously executed the code corresponding to the signal. If the +lock was busy, the scheduler thread blocked waiting until it could +lock the lock. If multiple processes executing simultaneously on +different scheduler threads, sent signals to the same port, schedulers +suffered from heavy lock contention. Such contention could also occur +between I/O signals for the port executing on one scheduler thread, +and a signal from a process to the port executing on another scheduler +thread. Beside the contention issues, we also loose potential work to +execute in parallel on different scheduler threads. This since the +process sending the *asynchronous* signal is blocked while the code +implementing the signal is executed synchronously. + +Solution +-------- + +In order to prevent multiple schedulers from trying to execute signals +to/from the same port simultaneously, we need to be able to ensure +that all signals to/from a port are executed in sequence on one +scheduler. More or less, the only way to do this is to schedule all +types of signals. Signals corresponding to a port can then be executed +in sequence by one single scheduler thread. If only one thread tries +to execute the port, no contention will appear on the port +lock. Besides getting rid of the contention, processes sending signals +to the port can also continue execution of their own Erlang code on +other schedulers at the same time as the signaling code is executing +on another scheduler. + +When implementing this there are a couple of important properties that +we either need, or want to preserve: + +* Signal ordering guarantee. Signals from process `X` to port `Y`, + *must* be delivered to `Y` in the same order as sent from `X`. + +* Signal latency. Due to the previous synchronous implementation, + latency of signals sent from processes to ports have usually been + very low. During contention the latency has of course + increased. Users expect latency of these signals to be low, a + sudden increase in latency would not be appreciated by our users. + +* Compatible flow control. Ports have for a very long time had the + possibility to use the busy port functionality when implementing + flow control. One may argue that this functionality fits very bad + with the conceptually completely asynchronous signaling, but the + functionality has been there for ages and is expected to be + there. When a port sets itself into a busy state, `command` + signals should not be delivered, and senders of such signals + should suspend until the port sets itself in a not busy state. + +### Scheduling of Port Signals ### + +A run queue has four queues for processes of different priority and +one queue for ports. The scheduler thread associated with the run +queue switch evenly between execution of processes and execution of +ports while both processes and ports exist in the queue. This is not +completely true, but not important for this discussion. A port that is +in a run queue also has a queue of tasks to execute. Each task +corresponds to an in- or outgoing signal. When the port is selected +for execution each task will be executed in sequence. The run queue +locks not only protected the queues of ports, but also the queues of +port tasks. + +Since we go from a state where I/O signals are the only port related +signals scheduled, to a state where potentially all port related +signals may be scheduled we may drastically increase the load on the +run queue lock. The amount of scheduled port tasks very much depend on +the Erlang application executing, which we do not control, and we do +not want to get increased contention on the run queue locks. We +therefore need another approach of protecting the port task queue. + +#### Task Queue #### + +We chose a "semi locked" approach, with one public locked task queue, +and a private, lock free, queue like, task data structure. This "semi +locked" approach is similar to how the message boxes of processes are +managed. The lock is port specific and only used for protection of +port tasks, so the run queue lock is now needed in more or less the +same way for ports as for processes. This ensures that we wont see an +increased lock contention on run queue locks due to this rewrite of +the port functionality. + +When an executing port runs out of work to execute in the private task +data structure, it moves the public task queue into the private task +data structure while holding the lock. Once tasks has been moved to +the private data structure no lock protects them. This way the port +can continue working on tasks in the private data structure without +having to fight for the lock. + +I/O signals may however be aborted. This could be solved by letting +the port specific scheduling lock also protect the private task data +structure, but then the port very frequently would have to fight with +others enqueueing new tasks. In order to handle this while keeping the +private task data structure lock free, we use a similar "non +aggressive" approach as we use when handling processes that gets +suspended while in the run queue. Instead of removing the aborted port +task, we just mark it as aborted using an atomic memory +operation. When a task is selected for execution, we first verify that +it has not been aborted. If aborted we, just drop the task. + +A task that can be aborted is referred via another data structure from +other parts of the system, so that a thread that needs to abort the +task can reach it. In order to be sure to safely deallocate a task +that is no longer used, we first clear this reference and then use the +thread progress functionality in order to make sure no references can +exist to the task. Unfortunately, also unmanaged threads might abort +tasks. This is very infrequent, but might occur. This could be handled +locally for each port, but would require extra information in each +port structure which very infrequently would be used. Instead of +implementing this in each port, we implemented general functionality +that can be used from unmanaged threads to delay thread progress. + +The private "queue like" task data structure could have been an +ordinary queue if it wasn't for the busy port functionality. When the +port has flagged itself as busy, `command` signals are not allowed to +be delivered and need to be blocked. Other signals sent from the same +sender following a `command` signal that has been blocked also have to +be blocked; otherwise, we would violate the ordering guarantee. At the +same time, other signals that have no dependencies to blocked +`command` signals are expected to be delivered. + +The above requirements makes the private task data structure a rather +complex data structure. It has a queue of unprocessed tasks, and a +busy queue. The busy queue contains blocked tasks corresponding to +`command` signals, and tasks with dependencies to such tasks. The busy +queue is accompanied by a table over blocked tasks based on sender +with a references into last task in the busy queue from a specific +sender. This since we need check for dependencies when new tasks are +processed in the queue of unprocessed tasks. When a new task is +processed that needs to be blocked it isn't enqueued at the end of the +busy queue, but instead directly after the last task with the same +sender. This in order to easily be able to detect when we have tasks +that no longer have any dependencies to tasks corresponding to +`command` signals which should be moved out of the busy queue. When +the port executes, it switches between processing tasks from the busy +queue, and processing directly from the unprocessed queue based on its +busy state. When processing directly from the unprocessed queue it +might, of course, have to move a task into the busy queue instead of +executing it. + +#### Busy Port Queue #### + +Since it is the port itself which decides when it is time to enter a +busy state, it needs to be executing in order to enter the busy +state. As a result of `command` signals being scheduled, we may get +into a situation where the port gets flooded by a huge amount of +`command` signals before it even gets a chance to set itself into a +busy state. This since it has not been scheduled for execution +yet. That is, under these circumstances the busy port functionality +loose the flow control properties it was intended to provide. + +In order to solve this, we introduced a new busy feature, namely "busy +port queue". The port has a limit of `command` data that is allowed to +be enqueued in the task queue. When this limit is reached, the port +will automatically enter a busy port queue state. When in this state, +senders of `command` signals will be suspended, but `command` signals +will still be delivered to the port unless it is also in a busy port +state. This limit is known as the high limit. + +There is also a low limit. When the amount of queued `command` data +falls below this limit and the port is in a busy port queue state, the +busy port queue state is automatically disabled. The low limit should +typically be significantly lower than the high limit in order to +prevent frequent oscillation around the busy port queue state. + +By introduction of this new busy state we still can provide the flow +control. Old driver do not even have to be changed. The limits can, +however, be configured and even disabled by the port. By default the +high limit is 8 KB and the low limit is 4 KB. + +### Preparation of Signal Send ### + +Previously all operations sending signals to ports began by acquiring +the port lock, then performed preparations for sending the signal, and +then finaly sent the signal. The preparations typically included +inspecting the state of the port, and preparing the data to pass along +with the signal. The preparation of data is frequently quite time +consuming, and did not really depend on the port. That is we would +like to do this without having the port lock locked. + +In order to improve this, state information was re-organized in the +port structer, so that we can access it using atomic memory +operations. This together with the new port table implementation, +enabled us to lookup the port and inspect the state before acquiring +the port lock, which in turn made it possible to perform preparations +of signal data before acquiring the port lock. + +### Preserving Low Latency ### + +If we disregard the contended cases, we will inevitably get a higher +latency when scheduling signals for execution at a later time than by +executing the signal immediately. In order to preserve the low latency +we now first check if this is a contended case or not. If it is, we +schedule the signal for later execution; otherwise, we execute the +signal immediately. It is a contended case if other signals already +are scheduled on the port, or if we fail to acquire the port +lock. That is we will not block waiting for the lock. + +Doing it this way we will preserve the low latency at the expense of +lost potential parallel execution of the signal and other code in the +process sending the signal. This default behaviour can however be +changed on port basis or system wide, forcing scheduling of all +signals from processes to ports that are not part of a synchronous +communication. That is, an unconditional request/response pair of +asynchronous signals. In this case it is no potential for parallelism, +and by that no point forcing scheduling of the request signal. + +The immediate execution of signals may also cause a scheduler that is +about to execute scheduled tasks to block waiting for the port +lock. This is however more or less the only scenario where a scheduler +needs to wait for the port lock. The maximum time it has to wait is +the time it takes to execute one signal, since we always schedule +signals when contention occurs. + +### Signal Operations ### + +Besides implementing the functionality enabling the scheduling, +preparation of signal data without port lock, etc, each operation +sending signals to ports had to be quite extensively re-written. This +in order to move all sub-operations that can be done without the lock +to a place before we have acquired the lock, and also since signals +now sometimes are executed immediately and sometimes scheduled for +execution at a later time which put different requirements on the data +to pass along with the signal. + +### Some Benchmark Results ### + +When running some simple benchmarks where contention only occur due to +I/O signals contending with signals from one single process we got a +speedup of 5-15%. When multiple processes send signals to one single +port the improvements can be much larger, but the scenario with one +process contending with I/O is the most common one. + +The benchmarks were run on a relatively new machine with an Intel i7 +quad core processor with hyper-threading using 8 schedulers.
\ No newline at end of file diff --git a/erts/emulator/internal_doc/ProcessManagementOptimizations.md b/erts/emulator/internal_doc/ProcessManagementOptimizations.md new file mode 100644 index 0000000000..9e83633bef --- /dev/null +++ b/erts/emulator/internal_doc/ProcessManagementOptimizations.md @@ -0,0 +1,172 @@ +Process Management Optimizations +================================ + +Problems +-------- + +Early versions of the SMP support for the runtime system completely +relied on locking in order to protect data accesses from multiple +threads. In some cases this isn't that problematic, but in some cases +it really is. It complicates the code, ensuring all locks needed are +actually held, and ensuring that all locks are acquired in such an +order that no deadlock occur. Acquiring locks in the right order often +also involve releasing locks held, forcing threads to reread data +already read. A good recipe for creation of bugs. Trying to use more +fine-grained locking in order to increase possible parallelism in the +system makes the complexity situation even worse. Having to acquire a +bunch of locks when doing operations also often cause heavy lock +contention which cause poor scalability. + +Management of processes internally in the runtime system suffered from +these problems. When changing state on a process, for example from +`waiting` to `runnable`, a lock on the process needed to be +locked. When inserting a process into a run queue also a lock +protecting the run queue had to be locked. When migrating a process +from one run queue to another run queue, locks on both run queues and +on the process had to be locked. + +This last example is a quite common case in during normal +operation. For example, when a scheduler thread runs out of work it +tries to steal work from another scheduler threads run queue. When +searching for a victim to steal from there was a lot of juggling of +run queue locks involved, and during the actual theft finalized by +having to lock both run queues and the process. When one scheduler +runs out of work, often others also do, causing lots of lock +contention. + +Solution +-------- + +### Process ### + +In order to avoid these situations we wanted to be able to do most of +the fundamental operations on a process without having to acquire a +lock on the process. Some examples of such fundamental operations are, +moving a process between run queues, detecting if we need to insert it +into a run queue or not, detecting if it is alive or not. + +All of this information in the process structure that was needed by +these operations was protected by the process `status` lock, but the +information was spread across a number of fields. The fields used was +typically state fields that could contain a small number of different +states. By reordering this information a bit we could *easily* fit +this information into a 32-bit wide field of bit flags (only 12-flags +were needed). By moving this information we could remove five 32-bit +wide fields and one pointer field from the process structure! The move +also enabled us to easily read and change the state using atomic +memory operations. + +### Run Queue ### + +As with processes we wanted to be able to do the most fundamental +operations without having to acquire a lock on it. The most important +being able to determine if we should enqueue a process in a specific +run queue or not. This involves being able to read actual load, and +load balancing information. + +The load balancing functionality is triggered at repeated fixed +intervals. The load balancing more or less strives to even out run +queue lengths over the system. When balancing is triggered, +information about every run queue is gathered, migrations paths and +run queue length limits are set up. Migration paths and limits are +fixed until the next balancing has been done. The most important +information about each run queue is the maximum run queue length since +last balancing. All of this information were previously stored in the +run queues themselves. + +When a process has become runnable, for example due to reception of a +message, we need to determine which run queue to enqueue it +in. Previously this at least involved locking the run queue that the +process currently was assigned to while holding the status lock on the +process. Depending on load we sometimes also had to acquire a lock on +another run queue in order to be able to determine if it should be +migrated to that run queue or not. + +In order to be able to decide which run queue to use without having to +lock any run queues, we moved all fixed balancing information out of +the run queues into a global memory block. That is, migration paths +and run queue limits. Information that need to be frequently updated, +like for example maximum run queue length, were kept in the run queue, +but instead of operating on this information under locks we now use +atomic memory operations when accessing this information. This made it +possible to first determine which run queue to use, without locking +any run queues, and when decided, lock the chosen run queue and insert +the process. + +#### Fixed Balancing Information #### + +When determining which run queue to choose we need to read the fixed +balancing information that we moved out of the run queues. This +information is global, read only between load balancing operations, +but will be changed during a load balancing. We do not want to +introduce a global lock that needs to be acquired when accessing this +information. A reader optimized rwlock could avoid some of the +overhead since the data is most frequently read, but it would +unavoidably cause disruption during load balancing, since this +information is very frequently read. The likelihood of a large +disruption due to this also increase as number of schedulers grows. + +Instead of using a global lock protecting modifications of this +information, we write a completely new version of it at each load +balancing. The new version is written in another memory block than the +previous one, and published by issuing a write memory barrier and then +storing a pointer to the new memory block in a global variable using +an atomic write operation. + +When schedulers need to read this information, they read the pointer +to currently used information using an atomic read operation, and then +issue a data dependency read barrier, which on most architectures is a +no-op. That is, it is very little overhead getting access to this +information. + +Instead of allocating and deallocating memory blocks for the different +versions of the balancing information we keep old memory blocks and +reuse them when it is safe to do so. In order to be able to determine +when it is safe to reuse a block we use the thread progress +functionality, ensuring that no threads have any references to the +memory block when we reuse it. + +#### Be Less Aggressive #### + +We implemented a test version using lock free run queues. This +implementation did however not perform as good as the version using +one lock per run queue. The reason for this was not investigated +enough to say why this was. Since the locked version performed better +we kept it, at least for now. The lock free version, however, forced +us to use other solutions, some of them we kept. + +Previously when a process that was in a run queue got suspended, we +removed it from the queue straight away. This involved locking the +process, locking the run queue, and then unlinking it from the double +linked list implementing the queue. Removing a process from a lock +free queue gets really complicated. Instead, of removing it from the +queue, we just leave it in the queue and mark it as suspended. When +later selected for execution we check if the process is suspended, if +so just dropped it. During its time in the queue, it might also get +resumed again, if so execute it when it get selected for execution. + +By keeping this part when reverting back to a locked implementation, +we could remove a pointer field in each process structure, and avoid +unnecessary operations on the process and the queue which might cause +contention. + +### Combined Modifications ### + +By combining the modifications of the process state management and the +run queue management, we can do large parts of the work involved when +managing processes with regards to scheduling and migration without +having any locks locked at all. In these situations we previously had +to have multiple locks locked. This of course caused a lot of rewrites +across large parts of the runtime system, but the rewrite both +simplified code and eliminated locking at a number of places. The +major benefit is, of course, reduced contention. + +### A Benchmark Result ### + +When running the chameneosredux benchmark, schedulers frequently run +out of work trying to steal work from each other. That is, either +succeeding in migrating, or trying to migrate processes which is a +scenario which we wanted to optimize. By the introduction of these +improvements, we got a speedup of 25-35% when running this benchmark +on a relatively new machine with an Intel i7 quad core processor with +hyper-threading using 8 schedulers.
\ No newline at end of file diff --git a/erts/emulator/internal_doc/SuperCarrier.md b/erts/emulator/internal_doc/SuperCarrier.md new file mode 100644 index 0000000000..0ad6af41de --- /dev/null +++ b/erts/emulator/internal_doc/SuperCarrier.md @@ -0,0 +1,191 @@ +Super Carrier +============= + +A super carrier is large memory area, allocated at VM start, which can +be used during runtime to allocate normal carriers from. + +The super carrier feature was introduced in OTP R16B03. It is +enabled with command line option +MMscs <size in Mb> +and can be configured with other options. + +Problem +------- + +The initial motivation for this feature was customers asking for a way +to pre-allocate physcial memory at VM start for it to use. + +Other problems were different experienced limitations of the OS +implementation of mmap: + +* Increasingly bad performance of mmap/munmap as the number of mmap'ed areas grow. +* Fragmentation problem between mmap'ed areas. + +A third problem was management of low memory in the halfword +emulator. The implementation used a naive linear search structure to +hold free segments which would lead to poor performance when +fragmentation increased. + + +Solution +-------- + +Allocate one large continious area of address space at VM start and +then use that area to satisfy our dynamic memory need during +runtime. In other words: implement our own mmap. + +### Use cases ### + +If command line option +MMscrpm (Reserve Physical Memory) is set to +false, only virtual space is allocated for the super carrier from +start. The super carrier then acts as an "alternative mmap" implementation +without changing the consumption of physical memory pages. Physical +pages will be reserved on demand when an allocation is done from the super +carrier and be unreserved when the memory is released back to the +super carrier. + +If +MMscrpm is set to true, which is default, the initial allocation +will reserve physical memory for the entire super carrier. This can be +used by users that want to ensure a certain *minimum* amount of +physical memory for the VM. + +However, what reservation of physical memory actually means highly +depends on the operating system, and how it is configured. For +example, different memory overcommit settings on Linux drastically +change the behaviour. + +A third feature is to have the super carrier limit the *maximum* +amount of memory used by the VM. If +MMsco (Super Carrier Only) is set +to true, which is default, allocations will only be done from the +super carrier. When the super carrier gets full, the VM will fail due +to out of memory. +If +MMsco is false, allocations will use mmap directly if the super +carrier is full. + + + +### Implementation ### + +The entire super carrier implementation is kept in erl_mmap.c. The +name suggest that it can be viewed as our own mmap implementation. + +A super carrier needs to satisfy two slightly different kinds of +allocation requests; multi block carriers (MBC) and single block +carriers (SBC). They are both rather large blocks of continious +memory, but MBCs and SBCs have different demands on alignment and +size. + +SBCs can have arbitrary size and do only need minimum 8-byte +alignment. + +MBCs are more restricted. They can only have a number of fixed +sizes that are powers of 2. The start address need to have a very +large aligment (currently 256 kb, called "super alignment"). This is a +design choice that allows very low overhead per allocated block in the +MBC. + +To reduce fragmentation within the super carrier, it is good to keep SBCs +and MBCs apart. MBCs with their uniform alignment and sizes can be +packed very efficiently together. SBCs without demand for aligment can +also be allocated quite efficiently together. But mixing them can lead +to a lot of memory wasted when we need to create large holes of +padding to the next alignment limit. + +The super carrier thus contains two areas. One area for MBCs growing from +the bottom and up. And one area for SBCs growing from the top and +down. Like a process with a heap and a stack growing towards each +other. + + +### Data structures ### + +The MBC area is called **sa** as in super aligned and the SBC area is +called **sua** as in super un-aligned. + +Note that the "super" in super alignment and the "super" in super +carrier has nothing to do with each other. We could have choosen +another naming to avoid confusion, such as "meta" carrier or "giant" +aligment. + + +-------+ <---- sua.top + | sua | + | | + |-------| <---- sua.bot + | | + | | + | | + |-------| <---- sa.top + | | + | sa | + | | + +-------+ <---- sa.bot + + +When a carrier is deallocated a free memory segment will be created +inside the corresponding area, unless the carrier was at the very top +(in `sa`) or bottom (in `sua`) in which case the area will just shrink +down or up. + +We need to keep track of all the free segments in order to reuse them +for new carrier allocations. One initial idea was to use the same +mechanism that is used to keep track of free blocks within MBCs +(alloc_util and the different strategies). However, that would not be +as straight forward as one can think and can also waste quite a lot of +memory as it uses prepended block headers. The granularity of the +super carrier is one memory page (usually 4kb). We want to allocate +and free entire pages and we don't want to waste an entire page just +to hold the block header of the following pages. + +Instead we store the meta information about all the free segments in a +dedicated area apart from the `sa` and `sua` areas. Every free segment is +represented by a descriptor struct (`ErtsFreeSegDesc`). + + typedef struct { + RBTNode snode; /* node in 'stree' */ + RBTNode anode; /* node in 'atree' */ + char* start; + char* end; + }ErtsFreeSegDesc; + +To find the smallest free segment that will satisfy a carrier allocation +(best fit), the free segments are organized in a tree sorted by +size (`stree`). We search in this tree at allocation. If no free segment of +sufficient size was found, the area (`sa` or `sua`) is instead expanded. +If two or more free segments with equal size exist, the one at lowest +address is choosen for `sa` and highest address for `sua`. + +At carrier deallocation, we want to coalesce with any adjacent free +segments, to form one large free segment. To do that, all free +segments are also organized in a tree sorted in address order (`atree`). + +So, in total we keep four trees of free descriptors for the super +carrier; two for `sa` and two for `sua`. They all use the same +red-black-tree implementation that support the different sorting +orders used. + +When allocating a new MBC we first search after a free segment in `sa`, +then try to raise `sa.top`, and then as a fallback try to search after a +free segment in `sua`. When an MBC is allocated in `sua`, a larger segment +is allocated which is then trimmed to obtain the right +alignment. Allocation search for an SBC is done in reverse order. When +an SBC is allocated in `sa`, the size is aligned up to super aligned +size. + +### The free descriptor area ### + +As mentioned above, the descriptors for the free segments are +allocated in a separate area. This area has a constant configurable +size (+MMscrfsd) that defaults to 65536 descriptors. This should be +more than enough in most cases. If the descriptors area should fill up, +new descriptor areas will be allocated first directly from the OS, and +then from `sua` and `sa` in the super carrier, and lastly from the memory +segment itself which is being deallocated. Allocating free descriptor +areas from the super carrier is only a last resort, and should be +avoided, as it creates fragmentation. + +### Halfword emulator ### + +The halfword emulator uses the super carrier implementation to manage +its low memory mappings thar are needed for all term storage. The +super carrier can here not be configured by command line options. One +could imagine a second configurable instance of the super carrier used +by high memory allocation, but that has not been implemented. diff --git a/erts/emulator/internal_doc/ThreadProgress.md b/erts/emulator/internal_doc/ThreadProgress.md new file mode 100644 index 0000000000..6118bcf0f6 --- /dev/null +++ b/erts/emulator/internal_doc/ThreadProgress.md @@ -0,0 +1,308 @@ +Thread Progress +=============== + +Problems +-------- + +### Knowing When Threads Have Completed Accesses to a Data Structure ### + +When multiple threads access the same data structure you often need to +know when all threads have completed their accesses. For example, in +order to know when it is safe to deallocate the data structure. One +simple way to accomplish this is to reference count all accesses to +the data structure. The problem with this approach is that the cache +line where the reference counter is located needs to be communicated +between all involved processors. Such communication can become +extremely expensive and will scale poorly if the reference counter is +frequently accessed. That is, we want to use some other approach of +keeping track of threads than reference counting. + +### Knowing That Modifications of Memory is Consistently Observed ### + +Different hardware architectures have different memory models. Some +architectures allows very aggressive reordering of memory accesses +while other architectures only reorder a few specific cases. Common to +all modern hardware is, however, that some type of reordering will +occur. When using locks to protect all memory accesses made from +multiple threads such reorderings will not be visible. The locking +primitives will ensure that the memory accesses will be ordered. When +using lock free algorithms one do however have to take this reordering +made by the hardware into account. + +Hardware memory barriers or memory fences are instructions that can be +used to enforce order between memory accesses. Different hardware +architectures provide different memory barriers. Lock free algorithms +need to use memory barriers in order to ensure that memory accesses +are not reordered in such ways that the algorithm breaks down. Memory +barriers are also expensive instructions, so you typically want to +minimize the use of these instructions. + +Functionality Used to Address These Problems +------------------------------------------- + +The "thread progress" functionality in the Erlang VM is used to +address these problems. The name "thread progress" was chosen since we +want to use it to determine when all threads in a set of threads have +made such progress so that two specific events have taken place for +all them. + +The set of threads that we are interested in we call managed +threads. The managed threads are the only threads that we get any +information about. These threads *have* to frequently report +progress. Not all threads in the system are able to frequently report +progress. Such threads cannot be allowed in the set of managed threads +and are called unmanaged threads. An example of unmanaged threads are +threads in the async thread pool. Async threads can be blocked for +very long times and by this be prevented from frequently reporting +progress. Currently only scheduler threads and a couple of other +threads are managed threads. + +### Thread Progress Events ### + +Any thread in the system may use the thread progress functionality in +order to determine when the following events have occured at least +once in all managed threads: + +1. The thread has returned from other code to a known state in the + thread progress functionality, which is independent of any other + code. +2. The thread has executed a full memory barrier. + +These events, of course, need to occur ordered to other memory +operations. The operation of determining this begins by initiating the +thread progress operation. The thread that initiated the thread +progress operation after this poll for the completion of the +operation. Both of these events must occur at least once *after* the +thread progress operation has been initiated, and at least once +*before* the operation has completed in each managed thread. This is +ordered using communication via memory which makes it possible to draw +conclusion about the memory state after the thread progress operation +has completed. Lets call the progress made from initiation to +comletion for "thread progress". + +Assuming that the thread progress functionality is efficient, a lot of +algorithms can both be simplified and made more efficient than using +the first approach that comes to mind. A couple of examples follows. + +By being able to determine when the first event above has occurred we +can easily know when all managed threads have completed accesses to a +data structure. This can be determined the following way. We have an +implementation of some functionality `F` using a data structure +`D`. The reference to `D` is always looked up before `D` is being +accessed, and the references to `D` is always dropped before we leave +the code implementing `F`. If we remove the possibility to look up `D` +and then wait until the first event has occurred in all managed +threads, no managed threads can have any references to the data +structure `D`. This could for example have been achieved by using +reference counting, but the cache line containing the reference +counter would in this case be ping ponged between all processors +accessing `D` at every access. + +By being able to determine when the second event has occurred it is +quite easy to do complex modifications of memory that needs to be seen +consistently by other threads without having to resort to locking. By +doing the modifications, then issuing a full memory barrier, then wait +until the second event has occurred in all managed threads, and then +publish the modifications, we know that all managed threads reading +this memory will get a consistent view of the modifications. Managed +threads reading this will not have to issue any extra memory barriers +at all. + +Implementation of the Thread Progress Functionality +--------------------------------------------------- + +### Requirement on the Implementation ### + +In order to be able to determine when all managed threads have reached +the states that we are interested in we need to communicate between +all involved threads. We of course want to minimize this +communication. + +We also want threads to be able to determine when thread progress has +been made relatively fast. That is we need to have some balance +between comunication overhead and time to complete the operation. + +### API ### + +I will only present the most important functions in the API here. + +* `ErtsThrPrgrVal erts_thr_progress_later(void)` - Initiation of the + operation. The thread progress value returned can be used testing + for completion of the operation. +* `int erts_thr_progress_has_reached(ErtsThrPrgrVal val)` - Returns + a non zero value when we have reached the thread progress value + passed as argument. That is, when a non zero value is returned the + operation has completed. + +When a thread calls `my_val = erts_thr_progress_later()` and waits for +`erts_thr_progress_has_reached(my_val)` to return a non zero value it +knows that thread progress has been made. + +While waiting for `erts_thr_progress_has_reached()` to return a non +zero value we typically do not want to block waiting, but instead want +to continue working with other stuff. If we run out of other stuff to +work on we typically do want to block waiting until we have reached +the thread progress value that we are waiting for. In order to be able +to do this we provide functionality for waking up a thread when a +certain thread progress value has been reached: + +* `void erts_thr_progress_wakeup(ErtsSchedulerData *esdp, + ErtsThrPrgrVal val)` - Request wake up. The calling thread will be + woken when thread progress has reached val. + +Managed threads frequently need to update their thread progress by +calling the following functions: + +* `int erts_thr_progress_update(ErtsSchedulerData *esdp)` - Update + thread progress. If a non zero value is returned + `erts_thr_progress_leader_update()` has to be called without any + locks held. +* `int erts_thr_progress_leader_update(ErtsSchedulerData *esdp)` - + Leader update thread progress. + +Unmanaged threads can delay thread progress beeing made: + +* `ErtsThrPrgrDelayHandle erts_thr_progress_unmanaged_delay(void)` - + Delay thread progress. +* `void erts_thr_progress_unmanaged_continue(ErtsThrPrgrDelayHandle + handle)` - Let thread progress continue. + +Scheduler threads can schedule an operation to be executed by the +scheduler itself when thread progress has been made: + +* `void erts_schedule_thr_prgr_later_op(void (*funcp)(void *), void + *argp, ErtsThrPrgrLaterOp *memp)` - Schedule a call to `funcp`. The + call `(*funcp)(argp)` will be executed when thread progress has been + made since the call to `erts_schedule_thr_prgr_later_op()` was + made. + +### Implementation ### + +In order to determine when the events has happened we use a global +counter that is incremented when all managed threads have called +`erts_thr_progress_update()` (or `erts_thr_progress_leader_update()`). +This could naively be implemented using a "thread confirmed" counter. +This would however cause an explosion of communication where all +involved processors would need to communicate with each other at each +update. + +Instead of confirming at a global location each thread confirms that +it accepts in increment of the global counter in its own cache +line. These confirmation cache lines are located in sequence in an +array, and each confirmation cache line will only be written by one +and only one thread. One of the managed threads always have the leader +responsibility. This responsibility may jump between threads, but as +long as there are some activity in the system always one of them will +have the leader responsibility. The thread with the leader +responsibility will call `erts_thr_progress_leader_update()` which +will check that all other threads have confirmed an increment of the +global counter before doing the increment of the global counter. The +leader thread is the only thread reading the confirmation cache +lines. + +Doing it this way we will get a communication pattern of information +going from the leader thread out to all other managed threads and then +back from the other threads to the leader thread. This since only the +leader thread will write to the global counter and all other threads +will only read it, and since each confirmation cache lines will only +be written by one specific thread and only read by the leader +thread. When each managed thread is distributed over different +processors, the communication between processors will be a reflection +of this communication pattern between threads. + +The value returned from `erts_thr_progress_later()` equals the, by +this thread, latest confirmed value plus two. The global value may be +latest confirmed value or latest confirmed value minus one. In order +to be certain that all other managed threads actually will call +`erts_thr_progress_update()` at least once before we reach the value +returned from `erts_thr_progress_later()`, the global counter plus one +is not enough. This since all other threads may already have confirmed +current global value plus one at the time when we call +`erts_thr_progress_later()`. They are however guaranteed not to have +confirmed global value plus two at this time. + +The above described implementation more or less minimizes the +comunication needed before we can increment the global counter. The +amount of communication in the system due to the thread progress +functionality however also depend on the frequency with which managed +threads call `erts_thr_progress_update()`. Today each scheduler thread +calls `erts_thr_progress_update()` more or less each time an Erlang +process is scheduled out. One way of further reducing communication +due to the thread progress functionality is to only call +`erts_thr_progress_update()` every second, or third time an Erlang +process is scheduled out, or even less frequently than that. However, +by doing updates of thread progress less frequently all operations +depending on the thread progress functionality will also take a longer +time. + +#### Delay of Thread Progress by Unmanaged Threads #### + +In order to implement delay of thread progress from unmanaged threads +we use two reference counters. One being `current` and one being +`waiting`. When an unmanaged thread wants to delay thread progress it +increments `current` and gets a handle back to the reference counter +it incremented. When it later wants to enable continuation of thread +progress it uses the handle to decrement the reference counter it +previously incremented. + +When the leader threads is about to increment the global thread +progress counter it verifies that the `waiting` counter is zero before +doing so. If not zero, the leader isn't allowed to increment the +global counter, and needs to wait before it can do this. When it is +zero, it swaps the `waiting` and `current` counters before increasing +the global counter. From now on the new `waiting` counter will +decrease, so that it eventualy will reach zero, making it possible to +increment the global counter the next time. If we only used one +reference counter it would potentially be held above zero for ever by +different unmanaged threads. + +When an unmanaged thread increment the `current` counter it will not +prevent the next increment of the global counter, but instead the +increment after that. This is sufficient since the global counter +needs to be incremented two times before thread progress has been +made. It is also desirable not to prevent the first increment, since +the likelyhood increases that the delay is withdrawn before any +increment of the global counter is delayed. That is, the operation +will cause as little disruption as possible. + +However, this feature of delaying thread progress from unmanaged +threads should preferably be used as little as possible, since heavy +use of it will cause contention on the reference counter cache +lines. The functionality is however very useful in code which normally +only executes in managed threads, but which may under some infrequent +circumstances be executed in other threads. + +#### Overhead #### + +The overhead caused by the thread progress functionality is more or +less fixed using the same amount of schedulers regardless of the +number of uses of the functionality. Already today quite a lot of +functionality use it, and we plan to use it even more. When rewriting +old implementations of ERTS internal functionality to use the thread +progress functionality, this implies removing communication in the old +implementation. Otherwise it is simply no point rewriting the old +implementation to use the thread progress functionality. Since the +thread progress overhead is more or less fixed, the rewrite will cause +a reduction of the total communication in the system. + +##### An Example ##### + +The main structure of an ETS table was originally managed using +reference counting. Already a long time ago we replaced this strategy +since the reference counter caused contention on each access of the +table. The solution used was to schedule "confirm deletion" jobs on +each scheduler in order to know when it was safe to deallocate the +table structure of a removed table. These confirm deletion jobs needed +to be allocated. That is, we had to allocate and deallocate as many +blocks as schedulers in order to deallocate one block. This of course +was a quite an expensive operation, but we only needed to do this once +when removing a table. It was more important to get rid of the +contention on the reference counter which was present on every +operation on the table. + +When the thread progress functionality had been introduced, we could +remove the code implementing the "confirm deletion" jobs, and then +just schedule a thread progress later operation which deallocates the +structure. Besides simplifying the code a lot, we got an increase of +more than 10% of the number of transactions per second handled on a +mnesia tpcb benchmark executing on a quad core machine. diff --git a/erts/emulator/internal_doc/Tracing.md b/erts/emulator/internal_doc/Tracing.md new file mode 100644 index 0000000000..30bc5327a7 --- /dev/null +++ b/erts/emulator/internal_doc/Tracing.md @@ -0,0 +1,220 @@ +Non-blocking trace setting +========================== + +Introduction +------------ + +Before OTP R16 when trace settings were changed by `erlang:trace_pattern`, +all other execution in the VM were halted while the trace operation +was carried out in single threaded mode. Similar to code loading, this +can impose a severe problem for availability that grows with the +number of cores. + +In OTP R16, trace breakpoints are set in the code without blocking the +VM. Erlang processes may continue executing undisturbed in parallel +during the entire operation. The same base technique is used as for +code loading. A staging area of breakpoints is prepared and then made +active with a single atomic operation. + + +Redesign of Breakpoint Wheel +---------------------------- + +To make it easier to manage breakpoints without single threaded mode a +redesign of the breakpoint mechanism has been made. The old +"breakpoint wheel" data structure was a circular double-linked list of +breakpoints for each instrumented function. It was invented before the +SMP emulator. To support it in the SMP emulator, is was essentially +expanded to one breakpoint wheel per scheduler. As more breakpoint +types have been added, the implementation have become messy and hard +to understand and maintain. + +In the new design the old wheel was dropped and instead replaced by +one struct (`GenericBp`) to hold the data for all types of breakpoints +for each instrumented function. A bit-flag field is used to indicate +what different type of break actions that are enabled. + + +Same Same but Different +----------------------- +Even though `trace_pattern` use the same technique as the non-blocking +code loading with replicated generations of data structures and an +atomic switch, the implementations are quite separate from each +other. One initial idea was to use the existing mechanism of code +loading to do a dummy load operation that would make a copy of the +affected modules. That copy could then be instrumented with +breakpoints before making it reachable with the same atomic switch as +done for code loading. This approach seems straight forward but has a +number of shortcomings, one being the large memory footprint when many +modules are instrumented. Another problem is how execution will reach +the new instrumented code. Normally loaded code can only be reached +through external functions calls. Trace settings must be activated +instantaneously without the need of external function calls. + +The choosen solution is instead for tracing to use the technique of +replication applied on the data structures for breakpoints. Two +generations of breakpoints are kept and indentified by index of 0 and +1. The global atomic variables `erts_active_bp_index` will determine +which generation of breakpoints running code will use. + +### Atomicy Without Atomic Operations + +Not using the code loading generations (or any other code duplication) +means that `trace_pattern` must at some point write to the active beam +code in order for running processes to reach the staged breakpoints +structures. This can be done with one single atomic write operation +per instrumented function. The beam instruction words are however read +with normal memory loads and not through the atomic API. The only +guarantee we need is that the written instruction word is seen as +atomic. Either fully written or not at all. This is true for word +aligned write operation on all hardware architectures we use. + + +Adding a new Breakpoint +----------------------- +This is a simplified sequence describing what `trace_pattern` goes +through when adding a new breakpoint. + +1. Seize exclusive code write permission (suspend process until we get it). + +2. Allocate breakpoint structure `GenericBp` including both generations. + Set the active part as disabled with a zeroed flagfield. Save the original + instruction word in the breakpoint. + +3. Write a pointer to the breakpoint at offset -4 from the first + instruction "func_info" header. + +4. Set the staging part of the breakpoint as enabled with specified + breakpoint data. + +5. Wait for thread progress. + +6. Write a `op_i_generic_breakpoint` as the first instruction for the function. + This instruction will execute the breakpoint that it finds at offset -4. + +7. Wait for thread progress. + +8. Commit the breadpoint by switching `erts_active_bp_index`. + +9. Wait for thread progress. + +10. Prepare for next call to `trace_pattern` by updating the new staging part + (the old active) of the breakpoint to be identic to the the new active part. + +11. Release code write permission and return from `trace_pattern`. + + +The code write permission "lock" seized in step 1 is the same as used +by code loading. This will ensure that only one process at a time can +stage new trace settings but it will also prevent concurrent code +loading and make sure we see a consistent view of the beam code during +the entire sequence. + +Between step 6 and 8, runninng processes might execute the written +`op_i_generic_breakpoint` instruction. They will get the breakpoint +structure written in step 3, read `erts_active_bp_index` and execute +the corresponding part of the breakpoint. Before the switch in step 8 +becomes visible they will however execute the disabled part of the +breakpoint structure and do nothing other than executing the saved +original instruction. + + +To Updating and Remove Breakpoints +---------------------------------- + +The above sequence did only describe adding a new breakpoint. We do +basically the same sequence to update the settings of an existing +breakpoint except step 2,3 and 6 can be skipped as it has already been +done. + +To remove a breakpoint some more steps are needed. The idea is to +first stage the breakpoint as disabled, do the switch, wait for thread +progress and then remove the disabled breakpoint by restoring the +original beam instruction. + +Here is a more complete sequence that contains both adding, updating +and removing breakpoints. + +1. Seize exclusive code write permission (suspend process until we get it). + +2. Allocate new breakpoint structures with a disabled active part and + the original beam instruction. Write a pointer to the breakpoint in + "func_info" header at offset -4. + +3. Update the staging part of all affected breakpoints. Disable + breakpoints that are to be removed. + +4. Wait for thread progress. + +5. Write a `op_i_generic_breakpoint` as the first instruction for all + functions with new breakpoints. + +6. Wait for thread progress. + +7. Commit all staged breadpoints by switching `erts_active_bp_index`. + +8. Wait for thread progress. + + +9. Restore original beam instruction for disabled breakpoints. + +10. Wait for thread progress. + +11. Prepare for next call to `trace_pattern` by updating the new + staging area (the old active) for all enabled breakpoints. + +12. Deallocate disabled breakpoint structures. + +13. Release code write permission and return from `trace_pattern`. + + +### All that Waiting for Thread Progress + +There are four rounds of waiting for thread progress in the above +sequence. In the code loading sequence we sacrificed memory overhead +of three generations to avoid a second round of thread progress. The +latency of `trace_pattern` should not be such a big problem for +however, as it is normally not called in a rapid sequence. + +The waiting in step 4 is to make sure all threads will see an updated +view of the breakpoint structures once they become reachable through +the `op_i_generic_breakpoint` instruction written in step 5. + +The waiting in step 6 is to make the activation of the new trace +settings "as atomic as possible". Different cores might see the new +value of `erts_active_bp_index` at different times as it is read +without any memory barrier. But this is the best we can do without +more expensive thread synchronization. + +The waiting in step 8 is to make sure we dont't restore the original +bream instructions for disabled breakpoints until we know that no +thread is still accessing the old enabled part of a disabled +breakpoint. + +The waiting in step 10 is to make sure no lingering thread is still +accessing disabled breakpoint structures to be deallocated in step +12. + + +Global Tracing +-------------- + +Call tracing with `global` option only affects external function +calls. This was earlier handled by inserting a special trace +instruction in export entries without the use of breakpoints. With the +new non-blocking tracing we want to avoid special handling for global +tracing and make use of the staging and atomic switching within the +breakpoint mechanism. The solution was to create the same type of +breakpoint structure for a global call trace. The difference to local +tracing is that we insert the `op_i_generic_breakpoint` instruction +(with its pointer at offset -4) in the export entry rather than in the +code. + + +Future work +----------- + +We still go to single threaded mode when new code is loaded for a +module that is traced, or when loading code when there is a default +trace pattern set. That is not impossible to fix, but that requires +much closer cooperation between tracing BIFs and the loader BIFs. diff --git a/erts/emulator/internal_doc/dec.erl b/erts/emulator/internal_doc/dec.erl index bb69e6e81b..8ce83fa402 100644 --- a/erts/emulator/internal_doc/dec.erl +++ b/erts/emulator/internal_doc/dec.erl @@ -1,19 +1,19 @@ -%% -*- coding: utf-8 -*- %% %% %CopyrightBegin% %% -%% Copyright Ericsson AB 2000-2013. All Rights Reserved. +%% Copyright Ericsson AB 2000-2016. All Rights Reserved. %% -%% The contents of this file are subject to the Erlang Public License, -%% Version 1.1, (the "License"); you may not use this file except in -%% compliance with the License. You should have received a copy of the -%% Erlang Public License along with this software. If not, it can be -%% retrieved online at http://www.erlang.org/. +%% Licensed under the Apache License, Version 2.0 (the "License"); +%% you may not use this file except in compliance with the License. +%% You may obtain a copy of the License at %% -%% Software distributed under the License is distributed on an "AS IS" -%% basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See -%% the License for the specific language governing rights and limitations -%% under the License. +%% http://www.apache.org/licenses/LICENSE-2.0 +%% +%% Unless required by applicable law or agreed to in writing, software +%% distributed under the License is distributed on an "AS IS" BASIS, +%% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +%% See the License for the specific language governing permissions and +%% limitations under the License. %% %% %CopyrightEnd% %% @@ -65,18 +65,17 @@ dec() -> "*~n" "* Copyright Ericsson AB 1999-2010. All Rights Reserved.~n" "*~n" - "* The contents of this file are subject to the Erlang Public License,~n" - "* Version 1.1, (the \"License\"); you may not use this file except in~n" - "* compliance with the License. You should have received a copy of the~n" - "* Erlang Public License along with this software. If not, it can be~n" - "* retrieved online at http://www.erlang.org/.~n" + "* Licensed under the Apache License, Version 2.0 (the \"License\");~n" + "* you may not use this file except in compliance with the License.~n" + "* You may obtain a copy of the License at~n" + "*~n" + "* http://www.apache.org/licenses/LICENSE-2.0~n" "*~n" - "* Software distributed under the License is distributed on an " - "\"AS IS\"~n" - "* basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See~n" - "* the License for the specific language governing rights and " - "limitations~n" - "* under the License.~n" + "* Unless required by applicable law or agreed to in writing, software~n" + "* distributed under the License is distributed on an \"AS IS\" BASIS,~n" + "* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.~n" + "* See the License for the specific language governing permissions and~n" + "* limitations under the License.~n" "*~n" "* %CopyrightEnd%~n" "*/~n" diff --git a/erts/emulator/internal_doc/erl_ext_dist.txt b/erts/emulator/internal_doc/erl_ext_dist.txt index 1cbbbcc7b8..0b9a783069 100644 --- a/erts/emulator/internal_doc/erl_ext_dist.txt +++ b/erts/emulator/internal_doc/erl_ext_dist.txt @@ -1,18 +1,19 @@ %CopyrightBegin% - Copyright Ericsson AB 1997-2009. All Rights Reserved. + Copyright Ericsson AB 1997-2016. All Rights Reserved. - The contents of this file are subject to the Erlang Public License, - Version 1.1, (the "License"); you may not use this file except in - compliance with the License. You should have received a copy of the - Erlang Public License along with this software. If not, it can be - retrieved online at http://www.erlang.org/. + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at - Software distributed under the License is distributed on an "AS IS" - basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See - the License for the specific language governing rights and limitations - under the License. + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. %CopyrightEnd% |