From c1c6fbcb1ef741801edeef3b17bb38e52fcaea2e Mon Sep 17 00:00:00 2001
From: Rickard Green
Date: Tue, 5 Mar 2013 22:02:17 +0100
Subject: Add misc internal documentation

---
 erts/emulator/internal_doc/PTables.md | 356 ++++++++++++++++++++++++++++++++++
 1 file changed, 356 insertions(+)
 create mode 100644 erts/emulator/internal_doc/PTables.md

diff --git a/erts/emulator/internal_doc/PTables.md b/erts/emulator/internal_doc/PTables.md
new file mode 100644
index 0000000000..6fe0e7665d
--- /dev/null
+++ b/erts/emulator/internal_doc/PTables.md
@@ -0,0 +1,356 @@

Process and Port Tables
=======================

Problems
--------

The process table is a mapping from process identifiers to process structure pointers. The process structure contains miscellaneous information about a process, such as pointers to its heap, message queue, etc. When the runtime system needs to operate on a process, it looks up the process structure in the process table using the process identifier. An example of this is when passing a message to a process.

The process table has for a very long time just been an array of pointers to process structures. Since process identifiers internally in the runtime system are 28-bit integers, it is quite easy to map a process identifier to an index into the array. The 28 bits were divided into two sets. The least significant set of bits was used as an index into the array. The most significant set of bits was only used to distinguish between a number of identifiers which map to the same index in the array. As long as process table sizes that are powers of two were used, we had 2^28 unique process identifiers.

When the first SMP support was implemented, the table was still kept more or less the same way, but protected by two types of locks: one lock that protected the whole table against modifications, and an array of locks protecting different parts of the table. The exact locking strategy previously used isn't interesting. What is interesting is that it suffered from heavy lock contention, especially when lots of modifications were being made, but also when only performing lookups.

In order to be able to detect when it is safe to deallocate a previously used process structure, reference counting of the structure was used. This was also problematic, since simultaneous lookups needed to modify the reference counter, which caused contention on the cache line where the reference counter was located. This is because all modifications need to be communicated between all involved processors.

The port table is very similar to the process table. The major difference, at least in concept, is that it is a mapping from port identifiers to port structures. It had a similar implementation, but with some differences. Instead of being an array of pointers it was an array of structures, and instead of being protected by two types of locks it was only protected by one global lock. This table also suffered from lock contention in various situations.
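To make the identifier layout described above concrete, here is a minimal sketch in C. The names and the concrete bit split are made up for illustration; in reality the split between index bits and the distinguishing bits is derived from the configured table size at boot time.

```c
#include <stdint.h>

/* Hypothetical sketch of the classic identifier layout, assuming a
 * table size that is a power of two (here 2^18 slots). */
#define PTAB_SIZE_BITS  18u
#define PTAB_INDEX_MASK ((1u << PTAB_SIZE_BITS) - 1u)

static inline uint32_t pid_to_index(uint32_t pid_data)  /* 28-bit identifier */
{
    /* the least significant bits select the slot in the array */
    return pid_data & PTAB_INDEX_MASK;
}

static inline uint32_t pid_serial(uint32_t pid_data)
{
    /* the most significant bits only distinguish identifiers
     * that map to the same slot */
    return pid_data >> PTAB_SIZE_BITS;
}
```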
Solution
--------

The process table was the major problem to address since processes are used much more frequently than ports. The first implementation only addressed processes, but since the port table is very similar and suffers from very similar problems, the process table implementation was later generalized so that it could also be used to implement the port table. For simplicity, the following text only talks about the process table, but the same applies to the port table unless otherwise stated.

If we disregard the locking issues, the original solution is very appealing. The mapping from process identifier to index into the array is very fast, and this is a property we would like to keep. The vast majority of operations on these tables are lookups, so optimizing for lookups is what we want to do.

### Lookup ###

Using a set of bits in the process identifier as an index into an array seems hard to beat. By replacing the array of pointers with an array of our pointer-sized atomic data type, a lookup will consist of the following:

1.  Mapping the 28-bit integer to an index into the array.

    More about this mapping later.

2.  Reading the pointer at the determined index in the array using an atomic memory operation.

    On all platforms where we provide atomic memory operations, this is just a `volatile` read, preventing the compiler from using values in registers and forcing a read from memory.

3.  Depending on use, issuing an appropriate memory barrier.

    A commonly used barrier is one with acquire semantics. On x86/x86_64 this maps to a compiler barrier preventing the compiler from reordering instructions, but on other hardware some kind of lightweight hardware memory barrier is often also needed.

    When comparing with a locked approach, at least one heavyweight memory barrier will be issued when locking the lock on most, if not all, hardware architectures (including x86/x86_64), and often some kind of lightweight memory barrier will be issued when unlocking the lock.

When looking at this very simple solution with very little overhead, you might wonder why we didn't implement it this way from the beginning. It all boils down to the read operation of the pointer. We need some way to know that it is safe to access the memory pointed to. One way of doing this is to place a reference counter in the process structure. The increment of the reference counter at lookup needs to be done atomically with the lookup. A lock can typically provide this service for us, which was the approach we previously used. Another approach could be to co-locate the reference counter with the pointer in the table. The major problem with this approach is the modification of the reference counter. Since these modifications would have to be communicated between all involved processors, they cause contention on the cache line containing the reference counter. The new lookup approach above is possible since we can use the "thread progress" functionality in order to determine when it is safe to deallocate the process structure. We'll get back to this when describing deletion in the table.

Using this new lookup approach we won't modify any memory at all, which is important. A lookup conceptually only reads memory, and now this is true in the implementation as well, which is important from a scalability perspective. The previous implementation modified the cache line containing the reference counter two times, and the cache line containing the corresponding lock two times, at each lookup.
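The sketch below illustrates the lookup steps above using C11 atomics. The names (`ptab_lookup`, `n_slots`, the table layout) and the slot mapping are hypothetical placeholders; the real implementation uses the ERTS atomic API and the identifier-to-slot mapping described later in this document.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical table layout: an array of pointer-sized atomic slots. */
struct ptab {
    size_t n_slots;
    _Atomic(void *) *slots;
};

/* Sketch of the lock-free lookup: map the identifier to a slot, read
 * the slot with a single atomic operation, and use an acquire barrier
 * so that fields of the structure read afterwards are seen as
 * initialized. Note that the lookup does not modify any memory. */
static void *ptab_lookup(struct ptab *tab, unsigned int id_data)
{
    size_t ix = (size_t) id_data % tab->n_slots;  /* placeholder mapping */
    /* memory_order_acquire covers steps 2 and 3 above */
    return atomic_load_explicit(&tab->slots[ix], memory_order_acquire);
}
```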
### Modifications of the Table ###

A lightweight lookup in the table was the most important feature, but we also wanted to improve modifications of the table. The process table is modified when a new process is spawned, i.e. a new pointer is inserted into the table, and when a process terminates, i.e. a pointer is deleted from the table.

Assuming that we spawn fewer processes than the maximum number of unique process identifiers in the system, one has always been able to determine the order of process creation just by comparing process identifiers. If PidX is larger than PidY, then PidX was created after PidY, assuming both identifiers originate from the same node. However, since we have a quite limited number of unique identifiers today (2^28), this property cannot be relied upon if we create a large number of processes. Nevertheless, this is a property the system has always had.

If we had had a huge number of unique identifiers available, it would have been tempting to drop or modify this ordering property, as described above. The ordering property could for example be based on the scheduler performing the spawn operation. It would have been possible to reserve large ranges of identifiers exclusively for each scheduler thread, which could be used to minimize the need for communication when allocating identifiers. The number of identifiers we have to work with today is, however, not even close to enough for such an approach.

Since we have a limited number of unique identifiers, we need to be careful not to waste them. If previously used identifiers are reused too quickly, identifiers originating from terminated processes will refer to newly created processes, and mixups will occur. The previously used approach was quite good at not wasting identifiers. Using a modified version of the same approach also lets us keep the ordering property that we have always had.

#### Insert ####

The original approach is more or less to search for the next free index, or slot, in the array. The search starts from the last slot allocated. If we reach the end of the array we increase a "wrapped counter" and then continue the search. The process identifier is constructed by writing the index to the least significant set of bits, and the "wrapped counter" to the most significant set of bits. The number of bits in each set is decided at boot time, so that the maximum index will just fit into the least significant set of bits.

In the modified lock-free version of this approach we do it more or less the same way, but with some important modifications that try to avoid unnecessary contention when multiple schedulers create processes simultaneously. Since multiple threads might be searching for the next free slot at the same time from the same starting point, we want subsequent slots to be located in different cache lines. Multiple schedulers simultaneously writing new pointers into the table are therefore very likely to write into adjacent slots. If adjacent slots are located in the same cache line, all modifications of this cache line need to be communicated between all involved processors, which is very expensive and scales very poorly. By locating adjacent slots in different cache lines, only true conflicts will trigger communication between involved processors, i.e., false sharing is avoided.

A cache line is larger than a pointer, typically 8 or 16 times larger, so using one cache line for each slot, only containing one pointer, would be a waste of space. Each cache line will be able to hold a fixed number of slots. The first slot of the table will be the first slot of the first cache line, the second slot of the table will be the first slot of the second cache line, and so on, until we reach the end of the array. The next slot after that will be the second slot of the first cache line, etc., moving forward one cache-line-internal slot each time we wrap. This way we are able to fit the same number of pointers into an array of the same size while always keeping adjacent slots in different cache lines.

The mapping from identifier to slot, or index into the array, gets a bit more complicated than before. Instead of a `shift` and a bitwise `and`, we get two `shift`s, two bitwise `and`s, and an `add` (see the implementation of `erts_ptab_data2pix()` in `erl_ptab.h`). However, by storing this information optimized for lookup, we only need a `shift` and a bitwise `and` on 32-bit platforms. On 64-bit platforms we have enough room for the 28-bit identifier in the least significant halfword and the index in the most significant halfword; in other words, we just need to read the most significant halfword to get the index. That is, this operation is as fast as, or faster than, before. The downside is that on 32-bit platforms we need to convert this information into the 28-bit identifier number when printing, or when ordering identifiers from the same node. These operations are, however, extremely infrequent compared to lookups.
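A hedged sketch of such a mapping is given below. The constants and names are invented for illustration only; the real `erts_ptab_data2pix()` derives its masks and shifts from the configured table size and cache line size at boot time.

```c
/* Assumed layout: 2^18 slots, and cache lines holding 8 pointer-sized
 * slots. All names and values are hypothetical. */
#define SLOTS           (1u << 18)
#define SLOTS_PER_LINE  8u
#define LINES           (SLOTS / SLOTS_PER_LINE)
#define LINE_MASK       (LINES - 1u)          /* low bits: which cache line */
#define LINE_SHIFT      15u                   /* log2(LINES) = 18 - 3       */
#define IN_LINE_MASK    (SLOTS_PER_LINE - 1u)
#define IN_LINE_SHIFT   3u                    /* log2(SLOTS_PER_LINE)       */

/* Map a sequentially allocated identifier number to a physical slot so
 * that consecutive numbers always land in different cache lines. */
static inline unsigned int data2pix(unsigned int data)
{
    unsigned int n = data & (SLOTS - 1u);
    return ((n & LINE_MASK) << IN_LINE_SHIFT)   /* pick the cache line   */
         + ((n >> LINE_SHIFT) & IN_LINE_MASK);  /* slot within that line */
}
```

For example, sequential numbers 0, 1, 2, ... map to physical slots 0, 8, 16, ..., so two schedulers allocating adjacent identifiers write into different cache lines.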
When we insert a new element in the table we do the following:

1.  We begin by reserving space in the table by atomically incrementing a counter of processes in the table. If our increment brings the counter above the maximum size of the table, the operation fails and a `system_limit` exception is raised.

2.  The table contains a 64-bit atomic variable holding the last identifier used. Only the least significant bits will be used when actually creating the identifier. This identifier is where the search begins.

3.  We increment the last identifier value used. In order to determine the slot that corresponds to this identifier we call `erts_ptab_data2pix()`, which maps the identifier to a slot. We read the content of the slot. If the slot is free we try to write a reservation marker using an atomic compare and swap. If this fails we repeat this step until it succeeds.

4.  Change the table variable holding the last identifier used. Since multiple writes might occur at the same time, this value may already have been changed to an identifier larger than the one we got. In that case we can continue; otherwise, we need to change it to the identifier we got.

5.  We now do some initializations of the process structure that cannot be done before we know the process identifier, and have to be done before we publish the structure in the table. This, for example, includes storing the identifier in the process structure.

6.  Now we can publish the structure in the table by writing the pointer to the process structure in the slot previously reserved in step 3 (see the sketch below).

Using this approach we keep properties like identifier ordering and identifier reuse while improving performance and scalability. It has one flaw, though: there is no guarantee that the operation will terminate. This can quite easily be fixed, though, and will be fixed in the next release. We will get back to this below.
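The following sketch condenses steps 2, 3, and 6 above using C11 atomics. All names, the reservation marker, and the error handling are hypothetical; steps 1, 4, and 5, as well as the retry limit discussed under Future Improvements, are left out, and the real code uses the ERTS atomic API.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RESERVED ((void *) 1)       /* hypothetical reservation marker */

unsigned int data2pix(unsigned int data);  /* the mapping sketched above */

struct ptab {
    _Atomic uint64_t last_data;     /* last identifier data used (step 2) */
    _Atomic(void *) *slots;
};

/* Reserve a free slot (step 3) and return the identifier data that maps
 * to it; the caller later publishes the structure (step 6). */
static uint64_t reserve_slot(struct ptab *tab, unsigned int *pixp)
{
    for (;;) {
        /* increment the last identifier used and map it to a slot */
        uint64_t data = atomic_fetch_add(&tab->last_data, 1) + 1;
        unsigned int pix = data2pix((unsigned int) data);
        void *expected = NULL;
        /* claim the slot only if it is still free */
        if (atomic_compare_exchange_strong(&tab->slots[pix],
                                           &expected, RESERVED)) {
            *pixp = pix;
            return data;
        }
        /* slot taken; retry with the next identifier */
    }
}

/* Publish the initialized process structure (step 6). */
static void publish(struct ptab *tab, unsigned int pix, void *proc)
{
    atomic_store_explicit(&tab->slots[pix], proc, memory_order_release);
}
```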
#### Delete ####

When a process terminates, we mark the process as terminated in the process structure, the counter of the number of processes in the table is decreased, and the reference to the process structure is removed by writing a `NULL` pointer into the corresponding slot. The scheduler thread performing this then schedules a thread progress later job which will do the final cleanup and deallocate the process structure. The thread progress functionality makes sure that this job will not execute until it is certain that all managed threads have dropped all references to the process structure.

### BIF Iterating Over the Table ###

The `erlang:processes/0` and `erlang:ports/0` BIFs iterate over the tables and return the corresponding identifiers. These BIFs should return a consistent snapshot of the table content as it looked at some point in time while the BIF was executing. In order to implement this we use locking in a strange way: we use an "inverted rwlock".

When performing lookups in the table we do not need to bother about the locking at all, but when modifying the table we read lock the rwlock protecting the table, which allows for multiple writers during normal operation. When the BIF that iterates over the table needs access to the table, it write locks the rwlock and reads the content of the table. The BIF does not read the whole table in one go, but instead reads small chunks at a time, only write locking while reading. The actual implementation of the BIFs is out of the scope of this document.

An out-of-the-box rwlock will typically suffer from contention on the single cache line containing the state of the rwlock, even in the case where we are only read locking. Instead of using such an rwlock, we have our own implementation of reader-optimized rwlocks which keeps track of reader threads in separate thread-specific cache lines, in order to avoid contention on a single cache line. As long as we only do read lock operations, threads only need to read a global cache line and modify their own cache line, thereby minimizing communication between involved processors. The iterating BIFs are normally used very infrequently, so in the normal case we will only do read lock operations on the table global rwlock.
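The "inverted" locking pattern above can be sketched as follows. For illustration only, the sketch uses a plain POSIX rwlock and hypothetical function names; the real implementation uses the ERTS reader-optimized rwlock, and the chunked iteration is more involved.

```c
#include <pthread.h>

/* The rwlock is used "inverted": threads that modify the table take the
 * READ lock, so many modifications may run in parallel, while the
 * iterating BIF takes the WRITE lock to read a consistent chunk.
 * Plain lookups do not touch the lock at all. */
static pthread_rwlock_t tab_rwlock = PTHREAD_RWLOCK_INITIALIZER;

void on_table_modification(void)
{
    pthread_rwlock_rdlock(&tab_rwlock);   /* multiple modifiers allowed */
    /* ... insert or delete a slot ... */
    pthread_rwlock_unlock(&tab_rwlock);
}

void on_bif_iteration_chunk(void)
{
    pthread_rwlock_wrlock(&tab_rwlock);   /* exclude all modifiers */
    /* ... copy a small chunk of identifiers out of the table ... */
    pthread_rwlock_unlock(&tab_rwlock);
}
```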
### Future Improvements ###

The first improvement is to fix the guarantee so that insert operations are guaranteed to terminate. When the operation starts we verify that there actually exists a free slot that we can use. The problem is that we might not find it, since it may move while multiple threads are modifying the table at the same time as we are trying to find the slot. The easy fix is to abort the operation if an empty slot could not be found in a finite number of operations, and then restart the operation under a write lock. This will be implemented in the next release, but further work should be done to try to find a better solution.

This implementation, as well as the previous one, does not work well when the table is nearly full. We both get long search times for free slots, and we reuse identifiers more frequently, since we wrap more frequently during the search. These tables work best when the table is much larger than the number of simultaneously existing processes. One easy improvement is to always have room for more processes than we allow in the table. This will also be implemented in the next release, but it should probably also be worked on more, trying to find an even better solution.

It would also be nice to get rid of the rwlock altogether. The use of a reader-optimized rwlock makes sure we do not get any contention on the lock, but unnecessary memory barriers will be issued due to the lock. The main issue here is to modify the iterating BIFs so that they do not require exclusive access to the table while reading a sequence of slots. In principle this should be rather easy; the code can handle sequences of variable sizes, so shrinking the sequence size of slots to one would solve the problem. This will, however, need some tweaks and modifications of non-trivial code, but it is something that should be looked at in the future.

By increasing the size of identifiers, at least on 64-bit machines (which isn't as easy as it might first seem), we get further room for improvement. Besides the obvious improvement of not reusing identifiers as quickly as we currently do, it makes it possible to further avoid contention when inserting elements in the table, at least if we drop the ordering property, which isn't that useful anyway.

### Some Benchmark Results ###

In order to test modifications of the process table we ran a couple of benchmarks where lots of processes are spawned and terminated simultaneously, and got a speedup of between 150-200%. Running a similar benchmark but with ports, we got a speedup of about 130%.

The BIF `erlang:is_process_alive/1` is the closest you can get to a process table lookup only. The BIF looks up the process corresponding to the process identifier passed as argument, and then checks if it is alive. By running multiple processes looping over this BIF, checking the same process, we get a speedup of between 20000-23000%. Conceptually this operation only involves read operations. In the implementation used in R16B, only read operations are performed as well, while the previous implementation needed to lock structures in order to read the data, suffering from both lock contention and contention due to modifications of the cache lines used by lock-internal data structures and the reference counter on the process being looked up.

The benchmarks were run on a relatively new machine with an Intel i7 quad core processor with hyper-threading, using 8 schedulers. On a machine with more communication overhead and/or a larger number of logical processors, the speedups are expected to be even larger.