From e086297a8638a9f57983fd707f6c75e3d4b6edf3 Mon Sep 17 00:00:00 2001
From: Sverker Eriksson
Date: Wed, 8 Jan 2014 15:56:29 +0100
Subject: erts: Add internal docs for code loading and tracing

---
 erts/emulator/internal_doc/CodeLoading.md | 186 +++++++++++++++++++++++
 erts/emulator/internal_doc/Tracing.md     | 220 ++++++++++++++++++++++++
 2 files changed, 406 insertions(+)
 create mode 100644 erts/emulator/internal_doc/CodeLoading.md
 create mode 100644 erts/emulator/internal_doc/Tracing.md

diff --git a/erts/emulator/internal_doc/CodeLoading.md b/erts/emulator/internal_doc/CodeLoading.md
new file mode 100644
index 0000000000..151b9cd57c
--- /dev/null
+++ b/erts/emulator/internal_doc/CodeLoading.md
@@ -0,0 +1,186 @@
Non-Blocking Code Loading
=========================

Introduction
------------

Before OTP R16, when an Erlang code module was loaded, all other
execution in the VM was halted while the load operation was carried
out in single-threaded mode. This might not be a big problem for the
initial loading of modules during VM boot, but it can be a severe
availability problem when upgrading modules or adding new code on a
VM with a running payload. The problem grows with the number of cores,
as both the time it takes for all schedulers to stop and the potential
amount of halted ongoing work increase.

In OTP R16, modules are loaded without blocking the VM.
Erlang processes may continue executing undisturbed in parallel during
the entire load operation. The code loading is carried out by a normal
Erlang process that is scheduled like all the others. The load
operation is completed by making the loaded code visible to all
processes in a consistent way with one single atomic
instruction. Non-blocking code loading improves the real-time
characteristics of the system when modules are loaded or upgraded on a
running SMP system.


The Load Phases
---------------

The loading of a module is divided into two phases: a *prepare phase*
and a *finishing phase*. The prepare phase consists of reading the BEAM
file format and all the preparations of the loaded code that can
easily be done without interfering with the running code. The
finishing phase makes the loaded (and prepared) code accessible
from the running code. Old module versions (replaced or deleted) are
also made inaccessible by the finishing phase.

The prepare phase is designed to allow several "loader" processes to
prepare separate modules in parallel, while the finishing phase can
only be done by one loader process at a time. A second loader process
trying to enter the finishing phase will be suspended until the first
loader is done. Only the process is blocked; the scheduler is free to
schedule other work while the second loader is waiting. (See
`erts_try_seize_code_write_permission` and
`erts_release_code_write_permission`.)

The ability to prepare several modules in parallel is not currently
used, as almost all code loading is serialized by the code_server
process. The BIF interface is however prepared for it:

    erlang:prepare_loading(Module, Code) -> LoaderState
    erlang:finish_loading([LoaderState])

The idea is that `prepare_loading` could be called in parallel for
different modules, returning a "magic binary" containing the internal
state of each prepared module. `finish_loading` could then take a
list of such states and do the finishing of all of them in one go.
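
The division of labour between the two phases can be illustrated with a
small standalone C sketch. This is not the actual ERTS code; all names
below are invented for the example, a plain mutex stands in for the code
write permission, and the waiting thread blocks where ERTS would instead
suspend the waiting Erlang process and let the scheduler run other work:

    /* Conceptual sketch: the prepare phase runs concurrently, the
     * finishing phase is serialized by one "code write permission" lock. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t code_write_permission = PTHREAD_MUTEX_INITIALIZER;

    typedef struct { const char *module; int prepared; } prepared_code;

    /* Prepare phase: touches only private data, safe to run in parallel. */
    static void prepare_module(prepared_code *p)
    {
        p->prepared = 1;   /* read BEAM file, build and fix up the code... */
    }

    /* Finishing phase: only one loader at a time may publish new code. */
    static void finish_loading(prepared_code *p)
    {
        pthread_mutex_lock(&code_write_permission);
        printf("publishing %s\n", p->module);  /* update staging area, commit */
        pthread_mutex_unlock(&code_write_permission);
    }

    static void *loader(void *arg)
    {
        prepared_code *p = arg;
        prepare_module(p);   /* concurrent with other loaders */
        finish_loading(p);   /* serialized */
        return NULL;
    }

    int main(void)
    {
        prepared_code mods[2] = { { "lists", 0 }, { "maps", 0 } };
        pthread_t t[2];
        for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, loader, &mods[i]);
        for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
        return 0;
    }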

Currently we use the legacy BIF `erlang:load_module`, which is now
implemented in Erlang by calling `erlang:prepare_loading` and
`erlang:finish_loading` in sequence. `finish_loading` is limited to
accepting a list with exactly one module state, as we do not yet use the
multi-module loading feature.


The Finishing Sequence
----------------------

During VM execution, code is accessed through a number of data
structures. These *code access structures* are

* Export table. One entry for every exported function.
* Module table. One entry for each loaded module.
* "beam_catches". Identifies jump destinations for catch instructions.
* "beam_ranges". Maps code addresses to functions and source lines.

The most frequently used of these structures is the export table, which
is accessed at run time for every executed external function call to
get the address of the callee. For performance reasons, we want to
access all these structures without any overhead from thread
synchronization. Earlier this was solved with an emergency brake: stop
the entire VM to mutate these code access structures, and otherwise
treat them as read-only.

The solution in R16 is instead to *replicate* the code access
structures. We have one set of active structures read by the running
code. When new code is loaded, the active structures are copied, the
copy is updated to include the newly loaded module, and then a switch
is made to make the updated copy the new active set. The active set is
identified by a single global atomic variable,
`the_active_code_index`. The switch can thus be made by a single
atomic write operation. The running code has to read this atomic
variable when using the active access structures, which means one
atomic read operation per external function call, for example. The
performance penalty from this extra atomic read is however very small,
as it can be done without any memory barriers at all (as described
below). With this solution we also preserve the transactional nature
of a load operation: running code will never see the intermediate
result of a half-loaded module.

The finishing phase is carried out in the following sequence by the
BIF `erlang:finish_loading`:

1. Seize exclusive code write permission (suspend the process, if
   needed, until we get it).

2. Make a full copy of all the active access structures. This copy is
   called the staging area and is identified by the global atomic
   variable `the_staging_code_index`.

3. Update all access structures in the staging area to include the
   newly prepared module.

4. Schedule a thread progress event, that is, a point in the future when
   all schedulers have yielded and executed a full memory barrier.

5. Suspend the loader process.

6. After thread progress, commit the staging area by assigning the value
   of `the_staging_code_index` to `the_active_code_index`.

7. Release the code write permission, allowing other processes to stage
   new code.

8. Resume the loader process, allowing it to return from
   `erlang:finish_loading`.


### Thread Progress

The waiting for thread progress in steps 4-6 is necessary in order for
processes to read the `the_active_code_index` atomic during normal
execution without any expensive memory barriers. When we write a new
value into `the_active_code_index` in step 6, we know that all
schedulers will see an updated and consistent view of all the new
active access structures once they become reachable through
`the_active_code_index`.
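
The reader fast path and the commit step can be sketched in standalone
C11 as follows. Only the two index variables are taken from the text
above; everything else (types, function names, the number of
generations) is invented for the sketch, and the real ordering guarantee
comes from the wait for thread progress rather than from the memory
orderings spelled out here:

    #include <stdatomic.h>

    #define NUM_CODE_IX 3          /* three generations, see below */

    struct code_access_structs { int dummy; /* export table, module table... */ };

    static struct code_access_structs the_code[NUM_CODE_IX];
    static atomic_uint the_active_code_index;
    static atomic_uint the_staging_code_index;

    /* Fast path, executed for every external function call: no barrier. */
    struct code_access_structs *active_code(void)
    {
        unsigned ix = atomic_load_explicit(&the_active_code_index,
                                           memory_order_relaxed);
        return &the_code[ix];
    }

    /* Step 6 of the finishing sequence, run only after thread progress. */
    void commit_staging_area(void)
    {
        unsigned staging = atomic_load_explicit(&the_staging_code_index,
                                                memory_order_relaxed);
        /* In ERTS, the preceding wait for thread progress has already made
         * the writes to the_code[staging] visible to every scheduler. */
        atomic_store_explicit(&the_active_code_index, staging,
                              memory_order_relaxed);
    }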

The total lack of memory barriers when reading `the_active_code_index`
has one interesting consequence, however. Different processes may see
the new code at different points in time, depending on when the
different cores happen to refresh their hardware caches. This may sound
unsafe, but it actually does not matter. The only property we must
guarantee is that the ability to see the new code spreads with process
communication: after receiving a message that was triggered by new
code, the receiver must be guaranteed to also see the new code. This is
guaranteed because all types of process communication involve memory
barriers in order for the receiver to be sure to read what the sender
has written. This implicit memory barrier also makes sure that the
receiver reads the new value of `the_active_code_index` and thereby
sees the new code. This holds for all kinds of inter-process
communication (TCP, ETS, process name registering, tracing, drivers,
NIFs, etc.), not just Erlang messages.

### Code Index Reuse

To optimize the copy operation in step 2, code access structures are
reused. In the current solution we have three sets of code access
structures, identified by a code index of 0, 1 and 2. These indexes are
used in a round-robin fashion. Instead of having to initialize a
completely new copy of all access structures for every load operation,
we only have to update it with the changes from the last two load
operations. We could get by with only two code indexes (0 and 1), but
that would require yet another round of waiting for thread progress
before step 2 in the `finish_loading` sequence. We cannot start reusing
a code index as the staging area until we know that no lingering
scheduler thread is still using it as the active code index. With three
generations of code indexes, the waiting for thread progress in steps
4-6 gives us this guarantee: thread progress waits for all running
schedulers to reschedule at least one time, so no ongoing execution
reading code access structures reached from an old value of
`the_active_code_index` can exist after a second round of thread
progress.

The design choice between two or three generations of code access
structures is a trade-off between memory consumption and code loading
latency.

### A Consistent Code View

Some native BIFs may need a consistent snapshot view of the active
code. To get one, it is important to read `the_active_code_index` only
once and then use that index value for all code access during the
BIF. If a load operation is executed in parallel, reading
`the_active_code_index` a second time might yield a different value,
and thereby a different view of the code.
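
As a minimal sketch of that rule (again with invented, simplified names
rather than the real ERTS structures), a BIF takes the snapshot once and
uses the same index for every subsequent code access:

    #include <stdatomic.h>

    atomic_uint the_active_code_index;   /* stand-in for the real atomic  */
    void *export_tables[3];              /* one export table per code ix  */

    /* A BIF-like function that needs a consistent view of the code. */
    void *code_snapshot_for_this_bif(void)
    {
        /* Read the index exactly once ... */
        unsigned code_ix = atomic_load_explicit(&the_active_code_index,
                                                memory_order_relaxed);
        /* ... and use that value for every code access until the BIF
         * returns. Re-reading the_active_code_index later might give a
         * different index if a loader committed new code in the meantime. */
        return export_tables[code_ix];
    }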

diff --git a/erts/emulator/internal_doc/Tracing.md b/erts/emulator/internal_doc/Tracing.md
new file mode 100644
index 0000000000..30bc5327a7
--- /dev/null
+++ b/erts/emulator/internal_doc/Tracing.md
@@ -0,0 +1,220 @@
Non-Blocking Trace Setting
==========================

Introduction
------------

Before OTP R16, when trace settings were changed by `erlang:trace_pattern`,
all other execution in the VM was halted while the trace operation was
carried out in single-threaded mode. As with code loading, this can
impose a severe availability problem that grows with the number of
cores.

In OTP R16, trace breakpoints are set in the code without blocking the
VM. Erlang processes may continue executing undisturbed in parallel
during the entire operation. The same base technique is used as for
code loading: a staging area of breakpoints is prepared and then made
active with a single atomic operation.


Redesign of the Breakpoint Wheel
--------------------------------

To make it easier to manage breakpoints without single-threaded mode,
the breakpoint mechanism has been redesigned. The old "breakpoint
wheel" data structure was a circular, doubly linked list of breakpoints
for each instrumented function. It was invented before the SMP
emulator. To support it in the SMP emulator, it was essentially
expanded to one breakpoint wheel per scheduler. As more breakpoint
types were added, the implementation became messy and hard to
understand and maintain.

In the new design the old wheel was dropped and replaced by one struct
(`GenericBp`) holding the data for all types of breakpoints for each
instrumented function. A bit-flag field indicates which types of break
actions are enabled.


Same Same but Different
-----------------------
Even though `trace_pattern` uses the same technique as non-blocking
code loading, with replicated generations of data structures and an
atomic switch, the implementations are quite separate from each
other. One initial idea was to use the existing code loading mechanism
to do a dummy load operation that would make a copy of the affected
modules. That copy could then be instrumented with breakpoints before
being made reachable with the same atomic switch as done for code
loading. This approach seems straightforward but has a number of
shortcomings, one being the large memory footprint when many modules
are instrumented. Another problem is how execution would reach the new
instrumented code. Normally, newly loaded code can only be reached
through external function calls. Trace settings must be activated
instantaneously, without the need for external function calls.

The chosen solution is instead for tracing to apply the replication
technique to the breakpoint data structures. Two generations of
breakpoints are kept, identified by the indexes 0 and 1. The global
atomic variable `erts_active_bp_index` determines which generation of
breakpoints the running code will use.

### Atomicity Without Atomic Operations

Not using the code loading generations (or any other code duplication)
means that `trace_pattern` must at some point write to the active beam
code in order for running processes to reach the staged breakpoint
structures. This can be done with one single atomic write operation per
instrumented function. The beam instruction words are however read with
normal memory loads and not through the atomic API. The only guarantee
we need is that the written instruction word is seen as atomic: either
fully written or not at all. This is true for word-aligned write
operations on all hardware architectures we use.
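
The following standalone sketch shows the idea of publishing the
breakpoint with one word-aligned store. The opcode name comes from the
text; the numeric value, the types and the function names are invented.
ERTS itself uses a plain word store and plain loads, relying on the
hardware guarantee; in portable C11 the same "no torn word" property is
expressed with relaxed atomics, which is what the sketch does:

    #include <stdatomic.h>
    #include <stdint.h>

    typedef uintptr_t BeamInstr;

    #define OP_I_GENERIC_BREAKPOINT ((BeamInstr)0xbeef)  /* stand-in opcode */

    /* Activate a breakpoint: one store into the live code. */
    void activate_breakpoint(_Atomic BeamInstr *first_instr_word)
    {
        atomic_store_explicit(first_instr_word, OP_I_GENERIC_BREAKPOINT,
                              memory_order_relaxed);
    }

    /* The emulator loop just fetches the word; it sees either the original
     * instruction or the breakpoint instruction, never a mix of the two. */
    BeamInstr fetch_instr(_Atomic BeamInstr *pc)
    {
        return atomic_load_explicit(pc, memory_order_relaxed);
    }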

Adding a New Breakpoint
-----------------------
This is a simplified sequence describing what `trace_pattern` goes
through when adding a new breakpoint.

1. Seize exclusive code write permission (suspend the process until we
   get it).

2. Allocate a breakpoint structure (`GenericBp`) including both
   generations. Set the active part as disabled with a zeroed flag
   field. Save the original instruction word in the breakpoint.

3. Write a pointer to the breakpoint at offset -4 from the first
   instruction, in the "func_info" header.

4. Set the staging part of the breakpoint as enabled with the specified
   breakpoint data.

5. Wait for thread progress.

6. Write `op_i_generic_breakpoint` as the first instruction of the
   function. This instruction will execute the breakpoint that it finds
   at offset -4.

7. Wait for thread progress.

8. Commit the breakpoint by switching `erts_active_bp_index`.

9. Wait for thread progress.

10. Prepare for the next call to `trace_pattern` by updating the new
    staging part (the old active part) of the breakpoint to be identical
    to the new active part.

11. Release the code write permission and return from `trace_pattern`.


The code write permission "lock" seized in step 1 is the same as the
one used by code loading. This ensures that only one process at a time
can stage new trace settings, but it also prevents concurrent code
loading and makes sure we see a consistent view of the beam code during
the entire sequence.

Between steps 6 and 8, running processes might execute the written
`op_i_generic_breakpoint` instruction. They will find the breakpoint
structure written in step 3, read `erts_active_bp_index` and execute
the corresponding part of the breakpoint. Before the switch in step 8
becomes visible they will however execute the disabled part of the
breakpoint structure and do nothing other than executing the saved
original instruction.
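
Conceptually, the dispatch performed by `op_i_generic_breakpoint` looks
like the sketch below. The `GenericBp` name is from the text; the flag
constants, field layout and function name are simplified stand-ins, not
the actual emulator source:

    #include <stdatomic.h>
    #include <stdint.h>

    typedef uintptr_t BeamInstr;

    enum { BPF_LOCAL_TRACE = 1 << 0, BPF_COUNT = 1 << 1 };

    typedef struct {
        uint32_t  flags;        /* which break actions are enabled */
        BeamInstr orig_instr;   /* instruction replaced by the breakpoint */
    } GenericBpData;

    typedef struct {
        GenericBpData data[2];  /* two generations: active and staging */
    } GenericBp;

    atomic_uint erts_active_bp_index;

    /* Executed when the emulator hits op_i_generic_breakpoint. */
    BeamInstr hit_generic_breakpoint(GenericBp *bp)
    {
        unsigned ix = atomic_load_explicit(&erts_active_bp_index,
                                           memory_order_relaxed);
        GenericBpData *g = &bp->data[ix];

        if (g->flags & BPF_LOCAL_TRACE) {
            /* ... send trace message, bump counters, etc. ... */
        }
        /* Enabled or disabled, continue with the original instruction. */
        return g->orig_instr;
    }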

Updating and Removing Breakpoints
---------------------------------

The above sequence only describes adding a new breakpoint. We go
through basically the same sequence to update the settings of an
existing breakpoint, except that steps 2, 3 and 6 can be skipped, as
they have already been done.

To remove a breakpoint some more steps are needed. The idea is to first
stage the breakpoint as disabled, do the switch, wait for thread
progress and then remove the disabled breakpoint by restoring the
original beam instruction.

Here is a more complete sequence that covers adding, updating and
removing breakpoints.

1. Seize exclusive code write permission (suspend the process until we
   get it).

2. Allocate new breakpoint structures with a disabled active part and
   the original beam instruction. Write a pointer to each breakpoint in
   the "func_info" header at offset -4.

3. Update the staging part of all affected breakpoints. Disable
   breakpoints that are to be removed.

4. Wait for thread progress.

5. Write `op_i_generic_breakpoint` as the first instruction for all
   functions with new breakpoints.

6. Wait for thread progress.

7. Commit all staged breakpoints by switching `erts_active_bp_index`.

8. Wait for thread progress.

9. Restore the original beam instruction for disabled breakpoints.

10. Wait for thread progress.

11. Prepare for the next call to `trace_pattern` by updating the new
    staging area (the old active one) for all enabled breakpoints.

12. Deallocate the disabled breakpoint structures.

13. Release the code write permission and return from `trace_pattern`.


### All that Waiting for Thread Progress

There are four rounds of waiting for thread progress in the above
sequence. In the code loading sequence we accepted the memory overhead
of three generations to avoid a second round of thread progress. The
latency of `trace_pattern` should not be such a big problem, however,
as it is normally not called in a rapid sequence.

The waiting in step 4 is to make sure all threads will see an updated
view of the breakpoint structures once they become reachable through
the `op_i_generic_breakpoint` instruction written in step 5.

The waiting in step 6 is to make the activation of the new trace
settings "as atomic as possible". Different cores might see the new
value of `erts_active_bp_index` at different times, as it is read
without any memory barrier. But this is the best we can do without more
expensive thread synchronization.

The waiting in step 8 is to make sure we don't restore the original
beam instructions for disabled breakpoints until we know that no thread
is still accessing the old enabled part of a disabled breakpoint.

The waiting in step 10 is to make sure no lingering thread is still
accessing disabled breakpoint structures that are deallocated in step
12.


Global Tracing
--------------

Call tracing with the `global` option only affects external function
calls. This was earlier handled by inserting a special trace
instruction in export entries, without the use of breakpoints. With the
new non-blocking tracing we want to avoid special handling for global
tracing and instead make use of the staging and atomic switching of the
breakpoint mechanism. The solution was to create the same type of
breakpoint structure for a global call trace. The difference from local
tracing is that we insert the `op_i_generic_breakpoint` instruction
(with its pointer at offset -4) in the export entry rather than in the
code.


Future work
-----------

We still go to single-threaded mode when new code is loaded for a
module that is traced, or when code is loaded while a default trace
pattern is set. That is not impossible to fix, but it would require
much closer cooperation between the tracing BIFs and the loader BIFs.
--
cgit v1.2.3