Carrier Migration
=================

The ERTS memory allocators manage memory blocks in two types of raw
memory chunks. We call these chunks of raw memory *carriers*:
singleblock carriers, which contain only one large block, and
multiblock carriers, which contain multiple blocks. A carrier is
typically created using `mmap()` on unix systems. However, how a
carrier is created is of minor importance. An allocator instance
typically manages a mixture of single- and multiblock carriers.

Problem
-------

When a carrier is empty, i.e. contains only one large free block, it
is deallocated. Since multiblock carriers can contain both allocated
blocks and free blocks at the same time, an allocator instance might
be stuck with a large amount of poorly utilized carriers if the
memory load decreases. After a peak in memory usage it is expected
that not all memory can be returned, since the blocks still allocated
are likely to be dispersed over multiple carriers. Such poorly
utilized carriers can usually be reused if the memory load increases
again. However, since each scheduler thread manages its own set of
allocator instances, and memory load is not necessarily connected to
CPU load, we might get into a situation where there are lots of
poorly utilized multiblock carriers on some allocator instances while
we need to allocate new multiblock carriers on other allocator
instances. In scenarios like this, the demand for multiblock carriers
in the system might increase at the same time as the actual memory
demand in the system has decreased, which is both unwanted and quite
unexpected for the end user.

Solution
--------

In order to prevent scenarios like this we've implemented support for
migration of multiblock carriers between allocator instances of the
same type.

### Management of Free Blocks ###

In order to be able to remove a carrier from one allocator instance
and add it to another, we need to be able to move references to the
free blocks of the carrier between the allocator instances. The
allocator-instance-specific data structure referring to the free
blocks it manages often refers to the same carrier from multiple
places. For example, when the address order best fit strategy is
used, this data structure is a binary search tree spanning all
carriers that the allocator instance manages. Free blocks in one
specific carrier can be referred to from potentially every other
carrier that is managed, and the number of such references can be
huge. That is, the work of removing the free blocks of such a carrier
from the search tree would be huge. One way of solving this could be
to not migrate carriers that contain lots of free blocks, but this
would prevent us from migrating carriers that potentially need to be
migrated in order to solve the problem we set out to solve.

By using one data structure of free blocks in each carrier, and an
allocator-instance-wide data structure of the carriers managed by the
allocator instance, the work needed in order to remove and add
carriers can be kept to a minimum. When migration of carriers is
enabled on a specific allocator type, we require that an allocation
strategy with such an implementation is used. Currently we've
implemented this for three different allocation strategies. All of
these strategies use a search tree of carriers sorted so that we can
find the carrier with the lowest address that can satisfy the
request. Internally in carriers we use yet another search tree that
implements either address order first fit, address order best fit, or
best fit. The abbreviations used for these allocation strategies are
`aoff`, `aoffcaobf`, and `aoffcbf`, respectively.
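Below is a minimal C sketch of this two-level arrangement. All names
are hypothetical, and for brevity carriers are kept in an
address-ordered linked list where the real strategies use balanced
search trees. The point it illustrates is that unlinking a carrier
removes all of its free blocks from the allocator instance's view in
a single step, regardless of how many free blocks the carrier
contains.

```c
#include <stddef.h>

/* Hypothetical, simplified types -- not the actual ERTS structures.
 * Each carrier owns its own free-block structure; the allocator
 * instance only tracks whole carriers. */

typedef struct free_block {
    struct free_block *next;      /* this carrier's free blocks */
    size_t size;
} FreeBlock;

typedef struct carrier {
    struct carrier *next;         /* allocator-wide carrier list */
    char *start;                  /* carrier base address */
    FreeBlock *free_blocks;       /* travels with the carrier */
} Carrier;

typedef struct allocator {
    Carrier *carriers;            /* kept in address order */
} Allocator;

/* Unlinking a carrier removes every one of its free blocks from the
 * allocator instance's view in one step; the free-block structure
 * stays inside the carrier and migrates with it. */
void remove_carrier(Allocator *a, Carrier *c)
{
    Carrier **pp = &a->carriers;
    while (*pp && *pp != c)
        pp = &(*pp)->next;
    if (*pp)
        *pp = c->next;            /* c->free_blocks left untouched */
}

/* Adding a migrated carrier is equally cheap: one link-in. */
void insert_carrier(Allocator *a, Carrier *c)
{
    Carrier **pp = &a->carriers;
    while (*pp && (*pp)->start < c->start)
        pp = &(*pp)->next;
    c->next = *pp;
    *pp = c;
}
```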
### Carrier Pool ###

In order to migrate carriers between allocator instances, we move
them through a pool of carriers. In order for a carrier migration to
complete, one scheduler needs to move the carrier into the pool, and
another scheduler needs to take the carrier out of the pool.

The pool is implemented as a lock-free, circular, doubly linked
list. The list contains a sentinel which is used as the starting
point when inserting to, or fetching from, the pool. Carriers in the
pool are elements in this list.

The list can be modified by all scheduler threads
simultaneously. During modifications the doubly linked list is
allowed to get a bit "out of shape". For example, following the
`next` pointer to the next element and then following the `prev`
pointer does not always take you back to where you started. The
following is however always true:

* Repeatedly following `next` pointers will eventually take you to the
  sentinel.
* Repeatedly following `prev` pointers will eventually take you to the
  sentinel.
* Following a `next` or a `prev` pointer will take you to either an
  element in the pool, or an element that used to be in the pool.

When inserting a new element we search for a place to insert the
element by only following `next` pointers, and we always begin by
skipping the first element encountered. When trying to fetch an
element we do the same thing, but instead only follow `prev`
pointers.

By going in different directions when inserting and fetching, we
avoid contention between threads inserting and threads fetching as
much as possible. By skipping one element when we begin searching, we
preserve the sentinel unmodified as much as possible. This is
beneficial since all search operations need to read the content of
the sentinel. If we were to modify the sentinel, the cache line
containing the sentinel would unnecessarily be bounced between
processors.

The `prev` and `next` fields in the elements of the list contain the
value of the pointer, a modification marker, and a deleted
marker. Memory operations on these fields are done using atomic
memory operations. When a thread has set the modification marker in a
field, no one except the thread that set the marker is allowed to
modify the field. If multiple modification markers need to be set, we
always begin with `next` fields followed by `prev` fields, in the
order following the actual pointers. This guarantees that no
deadlocks will occur.

When a carrier is being removed from a pool, we mark it with a thread
progress value that needs to be reached before we are allowed to
modify the `next` and `prev` fields. That is, until we reach this
thread progress we are not allowed to insert the carrier into the
pool again, and we are not allowed to deallocate the carrier. This
ensures that threads inspecting the pool will always be able to
traverse the pool and reach valid elements. Once we have reached the
thread progress value that the carrier was tagged with, we know that
no threads may have references to it via the pool.
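The sketch below shows one way such marker-carrying fields can be
represented using C11 atomics. The flag layout and helper names are
assumptions made for illustration, not the actual ERTS encoding; the
idea is simply that aligned pointers leave their low bits free for
the markers.

```c
#include <stdatomic.h>
#include <stdint.h>

/* One atomic word per field holds the pointer value, a modification
 * marker, and a deleted marker (hypothetical layout). */
#define FLG_MOD      ((uintptr_t)1)   /* field is being modified */
#define FLG_DELETED  ((uintptr_t)2)   /* element has left the pool */
#define PTR_MASK     (~(FLG_MOD | FLG_DELETED))

typedef struct pool_elem {
    _Atomic uintptr_t next;           /* pointer | flags */
    _Atomic uintptr_t prev;           /* pointer | flags */
} PoolElem;

/* Decode the pointer part of a field value. */
PoolElem *ptr_of(uintptr_t field_val)
{
    return (PoolElem *)(field_val & PTR_MASK);
}

/* Try to take exclusive ownership of a field. Succeeds only if no
 * other thread holds the modification marker; after success, only
 * this thread may write the field. Acquiring markers on `next`
 * fields before `prev` fields, in the order the pointers run, is
 * what keeps the scheme deadlock free. */
int try_mark(_Atomic uintptr_t *field, uintptr_t *seen)
{
    uintptr_t old = atomic_load(field);
    if (old & FLG_MOD)
        return 0;                     /* someone else is modifying */
    *seen = old;
    return atomic_compare_exchange_strong(field, &old, old | FLG_MOD);
}

/* Publish a new pointer value and release the modification marker,
 * preserving a possible deleted marker. */
void write_and_unmark(_Atomic uintptr_t *field, PoolElem *new_ptr,
                      uintptr_t old_flags)
{
    atomic_store(field, (uintptr_t)new_ptr | (old_flags & FLG_DELETED));
}
```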
### Migration ###

There exists one pool for each allocator type, enabling migration of
carriers between scheduler-specific allocator instances of the same
allocator type.

Each allocator instance keeps track of the current utilization of its
multiblock carriers. When the utilization falls below the "abandon
carrier utilization limit", it starts to inspect the utilization of
the current carrier when deallocations are made. If the utilization
of the carrier also falls below the "abandon carrier utilization
limit", it unlinks the carrier from its data structure of available
free blocks and inserts the carrier into the pool.

Since the carrier has been unlinked from the data structure of
available free blocks, no more allocations will be made in the
carrier. The allocator instance that put the carrier into the pool,
however, still has the responsibility of performing deallocations in
it while it remains in the pool.

Each carrier has a flag field containing information about the
allocator instance owning the carrier, a flag indicating whether the
carrier is in the pool or not, and a flag indicating whether it is
busy or not. When the carrier is in the pool, the owning allocator
instance needs to mark it as busy while operating on it. If another
thread inspects it in order to try to fetch it from the pool, it will
abort the fetch if the carrier is busy. When fetching the carrier
from the pool, ownership will change and further deallocations in the
carrier will be redirected to the new owner using the delayed dealloc
functionality.

If a carrier in the pool becomes empty, it will be withdrawn from the
pool. All carriers that become empty are also always passed to their
originating allocator instance for deallocation, using the delayed
dealloc functionality. Since carriers will this way always be
deallocated by the allocator instance that allocated them, the
underlying functionality of allocating and deallocating carriers can
remain simple and doesn't have to bother about multiple threads. In a
NUMA system we will also not mix carriers originating from multiple
NUMA nodes.

When an allocator instance needs more carrier space, it always begins
by inspecting its own carriers that are waiting for thread progress
before they can be deallocated. If no such carrier can be found, it
then inspects the pool. If no carrier can be fetched from the pool,
it will allocate a new carrier. Regardless of where the allocator
instance got the carrier from, it just links the carrier into its
data structure of free blocks.
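The following sketch illustrates the abandon and fetch decisions
under simplified, assumed bookkeeping. Names, the flag layout, and
the utilization computation are hypothetical, not the ERTS
implementation; it only mirrors the flag field behavior described
above.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical carrier header: one atomic flag word carrying the
 * in-pool and busy flags (owner identity omitted for brevity). */
#define CFLG_IN_POOL  1u
#define CFLG_BUSY     2u

typedef struct carrier {
    _Atomic unsigned flags;
    size_t used_bytes;                /* currently allocated bytes */
    size_t total_bytes;               /* carrier size */
} Carrier;

/* Called on deallocation once the allocator instance as a whole has
 * fallen below the abandon carrier utilization limit. If this
 * carrier's own utilization is also below the limit, it is unlinked
 * from the instance's free-block structures (not shown) and
 * published in the pool. */
bool maybe_abandon(Carrier *c, double abandon_limit)
{
    double util = (double)c->used_bytes / (double)c->total_bytes;
    if (util >= abandon_limit)
        return false;
    atomic_fetch_or(&c->flags, CFLG_IN_POOL);
    return true;
}

/* A fetching thread aborts if the carrier is busy, i.e. the owner is
 * currently operating on it. On success the carrier leaves the pool;
 * transferring ownership and redirecting further deallocations via
 * delayed dealloc is omitted. */
bool try_fetch(Carrier *c)
{
    unsigned f = atomic_load(&c->flags);
    if (!(f & CFLG_IN_POOL) || (f & CFLG_BUSY))
        return false;
    return atomic_compare_exchange_strong(&c->flags, &f,
                                          f & ~CFLG_IN_POOL);
}
```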
### Result ###

The use of this strategy of abandoning carriers with poor utilization
and reusing them in allocator instances with an increased carrier
demand is extremely effective and completely eliminates the problems
that otherwise sometimes occurred when CPU load dropped while memory
load did not.

When using the `aoffcaobf` or `aoff` strategies compared to `gf` or
`bf`, we lose some performance since we get more modifications in the
data structure of free blocks. This performance penalty is however
reduced using the `aoffcbf` strategy. A tradeoff between memory
consumption and performance is however inevitable, and it is up to
the user to decide what is most important.

Further work
------------

It would be quite easy to extend this to allow migration of
multiblock carriers between all allocator types. More or less the
only obstacle is maintenance of the statistics information.