The mystery hash collision bug
Last week we released our new content update and league for Path of Exile, called "Ultimatum League". Players are encouraged to start a new character in the Ultimatum league, which features a new NPC with a unique game mechanic: you perform a task for the NPC and try to succeed in order to earn the reward he offers. Alongside the league itself, the release added new player skills and many new items, changed a lot of balance, incorporated the last league into the main game and reworked a bunch of old systems.
When you add such a huge amount of new content and changes, you can expect issues. Even with a big QA team and a lot of testing and iteration, there are unfortunately bugs that slip through the cracks. Most of the time we fix most crashes (especially any bad ones) over the first launch weekend, with more minor ones fixed in the weeks afterwards. Unsurprisingly, bugs normally come from the new content: new player skills that have issues with one of the million possible other items or mechanics Path of Exile allows, problems with the league mechanic itself, mistakes in code refactors, or balance / design flaws in recent changes. What is unusual is a bug or crash in a system that wasn't touched in the content update, which was the scenario for the following bug.
Unfortunately the Ultimatum league launch suffered from server issues, which I won't dive into here: I don't understand the details well enough to talk about them, there are potential security reasons, and I am not a server admin so I do not feel it is my place to discuss them. While the servers were unstable we weren't getting many actual crashes from players, but there was one server crash that we saw (at very low numbers) that seemed strange. Once the server stability was fixed, it became evident that this crash was much more common than any other: most crashes had just a few cases, then there were 3-4 different crashes with a few hundred cases each, and then this one, which was reaching a few thousand after the first day. This is still very low compared to other issues we've had on past league launches, but it needed to be fixed.
The crash itself occurred when an object was cleaned up after the player left the area, while it was trying to disconnect from a variable in our scripting system. I assume that because the server would only crash after players had left, we never actually saw any player reports about this issue. You would only know if you were either in a party when it occurred or if you tried to go back to the old area (not very common, especially while levelling through the early game). Basically we have a scripting system which allows designers (and programmers) to write scripts for bosses / mechanics without having to write C++ and recompile the game. The system allows for simple variables, including the ability to store a game object (which could be an effect, a boss or anything else). This system is entirely self-sufficient and works by connecting to the "on deleted" event for an object, to ensure the variable is always in a valid state (never a dangling pointer). The callstack for this crash told us that there was somehow a garbage object in the variable system, and the server was trying to disconnect from it after the player left the area, which crashed because the memory wasn't valid.
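To make the "on deleted" hookup concrete, here is a minimal sketch in Python rather than the engine's C++. Every name in it (GameObject, ScriptVariables, connect_on_deleted and so on) is made up for illustration and is not the real scripting API; it only shows the general idea of a variable slot that clears itself when the object it points at is destroyed.

```python
class GameObject:
    """Hypothetical stand-in for an engine object that announces its own deletion."""

    def __init__(self, name):
        self.name = name
        self._on_deleted = []  # callbacks to run when the object is destroyed

    def connect_on_deleted(self, callback):
        self._on_deleted.append(callback)

    def disconnect_on_deleted(self, callback):
        self._on_deleted.remove(callback)

    def delete(self):
        # Iterate over a copy in case a callback changes the listener list.
        for callback in list(self._on_deleted):
            callback(self)
        self._on_deleted.clear()


class ScriptVariables:
    """Hypothetical variable store: a slot is cleared when its object is deleted."""

    def __init__(self):
        self._slots = {}

    def set_object(self, name, obj):
        old = self._slots.pop(name, None)
        if old is not None:
            # Stop watching the previous target before pointing anywhere else.
            old.disconnect_on_deleted(self._handle_deleted)
        if obj is not None:
            obj.connect_on_deleted(self._handle_deleted)
            self._slots[name] = obj

    def get_object(self, name):
        return self._slots.get(name)

    def _handle_deleted(self, obj):
        # Drop every slot that still points at the deleted object.
        for name in [n for n, o in self._slots.items() if o is obj]:
            del self._slots[name]


variables = ScriptVariables()
boss = GameObject("boss")
variables.set_object("killer", boss)
boss.delete()  # the "killer" slot is cleared instead of left dangling
```

The invariant is that every code path which replaces a slot must also disconnect from the old target. If a single path skips that step, the store ends up holding a reference to an object it no longer tracks, which is the kind of state the callstack pointed to.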
The first thing we did was add logging to find which object had the broken script variable. We also downloaded the core dump from the live realm onto our local servers so we could debug it with GDB. Of course production cores usually contain very little information and are hard to mine for anything useful, especially when whatever went wrong had already gone wrong and the crash only happened later during cleanup. We did find the hash of the variable, and used that to work out that the broken variable was the "killer" variable, which points to the object that killed something. We then discovered that the variable was on the player object itself, which meant the player was dying. So basically players were dying to something (the logging showed no real pattern in the areas or the things killing the players, just the usual dangerous bosses etc.), then that killer was deleted, and the server crashed when the player left because the variable pointing to the killer was garbage. From here we added several different sets of logging to try to narrow it down further, but could not find out anything else.
I spent several hours the day after launch reading through the scripting system and variables, trying to find a flaw. Luckily I found one: a piece of code that could result in what we were seeing. The problem was that it could only trigger in a very specific situation: the killer variable had to be set, then a new killer variable had to be set to null but with a flag of either "should serialise" or "locked", then the original killer monster had to be deleted. But the killer variable is never set to null, and it is also never set with flags. Nonetheless, there was a bug in this situation where the old variable wouldn't be cleaned up properly, so I was sure it must be the cause. It was at this point that it clicked and we realised the possibility: a hash collision! Every programmer has had to deal with hash collisions at some point, but when you're dealing with hashes barely totalling more than 50 across the whole game at a time, in a 32-bit hash space, it seems unfathomably unlikely that you could get one. Yet it was the only explanation for the bug. We quickly made a commit to address the flaw and immediately all cases of this crash stopped on production! Success, job well done, right? Well, unfortunately we were no closer to finding the offending hash collision (assuming that is actually what was causing the crash).
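To put a rough number on "unfathomably unlikely", here is a small birthday-problem calculation. It assumes an idealised hash that spreads names uniformly over the 32-bit range, which a real hash function and real variable names will not match exactly, and it takes the figure of about 50 live hashes mentioned above at face value.

```python
import math


def collision_probability(num_names, hash_bits=32):
    """Chance that at least two of num_names uniformly hashed names share a hash."""
    buckets = 2 ** hash_bits
    # Probability that every name lands in a distinct bucket, summed in log space
    # so the tiny per-name terms are not lost to rounding.
    log_no_collision = sum(math.log1p(-i / buckets) for i in range(num_names))
    return -math.expm1(log_no_collision)


# About 50 names in a 32-bit space: roughly 1 in 3.5 million.
print(f"{collision_probability(50):.2e}")
```

Under those assumptions the chance of any collision among 50 names is on the order of 3 in 10 million, so a collision would be a genuinely surprising explanation even though nothing else fit.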
This leads us to where we are now: we have fixed the crash and added new logging to try to find the hash collision. We have implemented a hash collision checker locally (which was trivial, because in development mode we already had a cache that stores variable names against their hashes so you can debug the variables in-game). So now we will either find the offending case locally (with QA's help) or, if we have no luck, do something we have done in the past: in the situation where the code would previously have failed to clean up the original killer variable, we fork the instance process and crash the fork in order to generate a callstack (without affecting the actual instance).
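A collision checker built on that kind of name cache can be very small. The sketch below is an illustration only: the class and method names are invented, and zlib.crc32 stands in for whatever 32-bit hash the engine really uses. The idea is simply that if a hash is already recorded under a different name, two variables are about to share a slot.

```python
import zlib


class HashCollisionError(Exception):
    """Raised when two different variable names produce the same hash."""


class VariableNameCache:
    """Hypothetical development-only cache from 32-bit hash back to variable name."""

    def __init__(self, hash_fn=zlib.crc32):
        self._hash_fn = hash_fn  # stand-in hash; the real one is not shown in this post
        self._name_by_hash = {}

    def register(self, name):
        h = self._hash_fn(name.encode("utf-8")) & 0xFFFFFFFF
        # setdefault returns the name already stored for this hash, if any.
        known = self._name_by_hash.setdefault(h, name)
        if known != name:
            raise HashCollisionError(
                f"{name!r} and {known!r} both hash to {h:#010x}"
            )
        return h


cache = VariableNameCache()
cache.register("killer")
cache.register("killer")  # registering the same name again is fine
```

Calling register everywhere a variable name is hashed means the first colliding pair raises immediately and names both variables, rather than showing up much later as a crash during cleanup.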