© Alex Denford, 2017

The Event System Bug

August 23, 2017

Event System Overview:
The event system is a back-end data structure which allows something to connect / disconnect to it. Anything connected to it when it is called, will be notified and can do things based on what happened. An example is an event for a client joining the area, or leaving, or a mine becoming armed etc. In the mine example, something would connect to it (for example the detonate mine skill) and make the mine detonate once it has armed. It is used a lot throughout the game in various places and is a great alternative to adding a listener to a whole class which doesn't work in many situations. It allows the triggering of a function from an external source without them having to care who is actually connected to the event. This implements the Observer design pattern and is found in most games. There are different connection types such as method (allows you to connect to a function of another class etc.), function (standard function connection) or event (can connect an event to an event in order to chain events etc.).

 

Event System Tech:
The Event system stores a pointer to an array of unsigned chars (dynamic sink data) as well as an static unsigned char array of 120 bytes (128 bytes - 2 pointers for a null terminating pointer as well as a start pointer to store the amount of dynamic memory allocated. The sink_data is what is used to store pointers to the connected function, connected class or connected event etc.

 

        // Dynamic sink data
        mutable unsigned char* sink_data;
        
        // Local sink data
        mutable unsigned char local_sink_data[ SinkDataSize ];

 

The system uses the static array until it runs out of space and then allocates dynamic memory and copies the data over. After that it works just like most common dynamic data structures where it will reallocate memory once it is required, each time allocated more than previously.

The event system handles event disconnects during event execution by implementing a linked list of "IterationState" structures which point to the current sink data. This linked list gets added to each time the event is called (a new node is added to the end) which allows for recursive event calls. If an event disconnects during the event call it will fix all of the iteration state pointers by adjusting them by the amount of data lost from the removal:

 

        for( auto s = iteration_state; s; s = s->next )
        {
            if( reinterpret_cast< unsigned char* >( sink ) < s->iter )
                s->iter -= size;
        }

 

The check in there is because we only need to fix the iteration states that are "behind" us because they will be the ones to break once we go back up the call stack to where they were used. Linked list nodes after the one being removed are left intact. The iter here is the pointer to the sink_data.

 

The Bug:
The game would crash deep inside the event system event call with the sink_data pointer being corrupt. This was the most common crash on the live realm after the deploy of 3.0.0 (Fall of Oriath) and all of the call stacks where extremely unhelpful. We did have one crash on the weapon_swap event when switching from 1 weapon with summon raging spirits (a skill) equipped to two weapons with 3 summon raging spirits in each weapon. All of the skill gems had microtransactions (skill effects) applied to them. We narrowed down the event connection to be related to the microtransaction system (each gem with a microtransaction disconnects and connects to the weapon_swap event when you switch weapons). The bug was not reproducable in any other situation and changing any of the items or gems would stop it from crashing. Even more annoying was even with the correct setup it only crashed sometimes, we could not figure out why. One of the QA members even went to the extent of copying his .GGPK file (the game data file) for us to use to reproduce the bug! 

 

My co-worker implemented logging inside the event system to track the disconnects and connects. That in itself was a hassle because you have to understand, the event system is used A LOT in the game. He had to implement a global void pointer which would point to a specific event. It would be set to nullptr everywhere except for when we call our weapon_swap event connects / disconnects, and then after that we would switch it back to nullptr again. We immediately had issues with this due to the fact that events can be called recursively, so we had to ensure the global pointer would save its previous pointer so that it could switch back to the previous event after our logging. 

 

After this we did the repro and got it to crash. The latest pointer that was trying to call the event wasn't anywhere else in the logs. What that meant is we somehow had an event executing that had never been connected to. How was that possible? My first thought was memory corruption. We investigated the whole implemention of the event system and how it worked. He assuredly told me the event system handles connections / disconnections while the event is being called. This is true, it does using the iteration state and it works. If it didn't the game would be crashing everywhere. 

Finally we investigated the implementation for copying from the local_sink_data (static array) to dynamic sink_data. It does this when it tries to make a new connection and it doesn't have enough memory in the static array to fit it. 

 

        const size_t size_left = sink_data + GetDataSize() - ptr;
        const size_t new_size = sink.SizeOf();
        if( new_size + sizeof( void* ) > size_left )
            ..
            Expand()
            ..

 

Through this we realised the expand function doesn't handle updating the iterationstate linked list pointers. We quickly did some math to see if this was likely the culprit:

        Method Event contains:
            - a pointer to the servant (the class / function owner) - 4 bytes
            - a pointer to the method / function - 4 bytes
            - a pointer to a virtual table (each sink overrides a base sink struct with 3 pure virtual functions) - 4 bytes    
        This results in a total of 12 bytes per method event.
        
Coincidentally in our logs, we were connecting to 10 of these events in a row (12 * 10 = 120 bytes; the maximum size of the static memory array) and we were crashing on the next one.. We added logging to the Expand function and, sure enough, the crash occurred after it expanded the sink_data (copying the data to the dynamic system) and tried to continue processing the event calls. The fix was relatively simple, we just had to fix the pointers in each of the iteration state nodes after we called Expand. This involved calculating the difference between the new pointer (dynamic array location) and the old pointer location and adding this difference to the saved iteration state pointer. Boom, problem solved! 

 

Pretty strange to find a bug like this after the file / system was first implemented in 2009 and has only had a few commits to it since then. Even more surprising is we had never had a situation where the 120 byte limit for storing events had been surpassed inside an event execution. This also explained why it was so hard to reproduce; that 120 byte limit was getting passed very quickly from regular game-play, so to reproduce the bug, the weapon switch had to be one of your first actions when you first enter a new instance. None the less, it was an interesting and confusing bug to solve, especially when delving into the depths of the event system.

Please reload