Alex Denford

Personal Blog

Development, gaming, programming, ideas, thoughts and more!

The mystery hash collision bug

Last week we released our new content update and league for Path of Exile, called "Ultimatum League". Players are encouraged to start a new character in the Ultimatum league, which features a new NPC with a unique and interesting game mechanic: you perform a task for the NPC and try to succeed in order to get the reward he offers you. Alongside the league itself, the release added new player skills and many new items, changed a lot of balance, incorporated the last league into the main game and reworked a bunch of old systems.

When you add such a huge amount of new content and changes, you can expect issues. Even with a big QA team and a lot of testing and iteration, there are unfortunately bugs that slip through the cracks. Most of the time we fix most crashes (especially any bad ones) on the first launch weekend, with more minor ones being fixed in the weeks afterwards. Normally, and unsurprisingly, bugs come from the new content: new player skills that have issues with one of the million possible other items or mechanics Path of Exile allows, problems with the league mechanic itself, mistakes in code refactors, or balance / design flaws in recent changes. What is unusual is when we get a bug or crash in a system that hasn't been touched in the recent content update, which was the scenario for the following bug.

Unfortunately the Ultimatum league launch suffered from server issues, which I won't dive into here: firstly because I don't understand the details well enough to talk about them, but also for potential security reasons, and because I am not a server admin, so I do not feel it is my place to discuss them. While the servers were unstable, we weren't getting many actual crashes from players, but there was one server crash that we saw (at very low numbers) that seemed strange. Once the server stability was fixed, it became evident this crash was much more common than any others (most crashes have just a few cases, then there were 3-4 different crashes with a few hundred cases each, and then this one, which was reaching a few thousand after the first day). This is still very low compared to other issues we've had in the past on league launches, but it needed to be fixed.

The crash itself occurred when an object was cleaned up after the player left the area, while trying to disconnect from a variable in our scripting system. I assume the reason we never actually saw any player reports about this issue is that the server would only crash after the players had left. You would only know if you were either in a party when it occurred or if you tried to go back to the old area (not very common, especially while levelling through the early game). Basically we have a scripting system which allows designers (and programmers) to write scripts for bosses / mechanics without having to write C++ and recompile the game. The system allows for simple variables, including the ability to store a game object (which could be an effect, or a boss, or anything). This system is entirely self-sufficient and works by connecting to the "on deleted" event for an object, to ensure the variable is always in a valid state (not a dangling pointer). The callstack for this crash told us that there was somehow an object in the variable system that was garbage, and that the system was trying to disconnect from it after the player left the area, which crashed because the memory wasn't valid.
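To make the "on deleted" idea concrete, here is a minimal sketch of a variable that clears itself when the object it points at is destroyed. It is not the actual engine code: GameObject, ObjectVariable, ConnectOnDeleted and DisconnectOnDeleted are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical game object that notifies listeners when it is destroyed.
class GameObject {
public:
    using Callback = std::function<void()>;

    ~GameObject() {
        // Tell every connected listener that this object is going away.
        for (auto& entry : on_deleted_) entry.second();
    }

    // Returns a connection id that can later be used to disconnect.
    int ConnectOnDeleted(Callback cb) {
        on_deleted_[next_id_] = std::move(cb);
        return next_id_++;
    }

    void DisconnectOnDeleted(int id) { on_deleted_.erase(id); }

private:
    std::unordered_map<int, Callback> on_deleted_;
    int next_id_ = 0;
};

// Hypothetical script variable that holds a game object and never dangles,
// as long as every Set() goes through the connect / disconnect bookkeeping.
class ObjectVariable {
public:
    ObjectVariable() = default;
    ObjectVariable(const ObjectVariable&) = delete;            // the callback captures `this`
    ObjectVariable& operator=(const ObjectVariable&) = delete;
    ~ObjectVariable() { Set(nullptr); }

    void Set(GameObject* object) {
        // Disconnect from the old target first. If target_ were garbage,
        // this is the call that would touch invalid memory.
        if (target_) target_->DisconnectOnDeleted(connection_);
        target_ = object;
        if (target_) {
            connection_ = target_->ConnectOnDeleted([this] { target_ = nullptr; });
        }
    }

    GameObject* Get() const { return target_; }

private:
    GameObject* target_ = nullptr;
    int connection_ = 0;
};
```

In this sketch the only way to end up with a dangling pointer is for a target to be replaced without the disconnect / reconnect bookkeeping running, which is the kind of state the callstack was pointing at.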

The first thing we did was add logging to find which object had the broken script variable. We also downloaded the core dump off the live realm onto our local servers so we could debug it with GDB. Of course production cores usually contain very little information and it can be hard to get anything useful out of them, especially when whatever went wrong had already gone wrong and the crash only happened later during cleanup. We did find the hash of the variable, and used that to work out that the broken variable was the "killer" variable, used to point towards the object that killed something. We then discovered that the variable was on the player object itself, which meant the player was dying. So basically players were dying to something (with logging we found no real pattern of areas or things killing the players, just the usual dangerous bosses etc.), then that killer was deleted, and this resulted in a crash when the player left and the variable pointing to the killer was garbage. From here we added several different sets of logging to try to narrow it down further, but could not find out anything else.

I spent several hours the day after launch reading through the scripting system and variables trying to find a flaw. Luckily I did: I found a piece of code that could result in what we were seeing. The problem was that triggering it required a very, very specific situation: the killer variable had to be set, then a new killer variable had to be set to null but with a flag of either "should serialise" or "locked", and then the original killer monster had to be deleted. The catch was that the killer variable is never set to null, and it is also never set with flags. Nonetheless, there was a bug in this situation where the old variable wouldn't have been cleaned up properly, so I was sure it must be the cause. It was at this point that it clicked and we realised the possibility - hash collision! Every programmer has had to deal with hash collisions at some point, but when you're dealing with hashes barely totalling more than 50 across the whole game at a time, in a 32-bit hash space, it seems unfathomably unlikely that you could get a hash collision. Yet it was the only explanation for the bug. We quickly made a commit to address the flaw and immediately all cases of this crash stopped on production! Success, job well done, right? Well, unfortunately, we were no closer to finding the offending hash collision (assuming that is actually what is causing the crash).
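For a sense of just how unlikely, here is a rough birthday-problem estimate. It assumes the hashes behave like uniformly random 32-bit values and that exactly 50 names are live at once; both are simplifying assumptions, and CollisionProbability is a hypothetical helper written only for this estimate.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Probability that at least two of `count` uniformly random hashes collide
// in a space of 2^bits values (the classic birthday problem).
// Works in log space so the tiny per-step probabilities are not lost to rounding.
double CollisionProbability(std::uint32_t count, unsigned bits) {
    const double space = std::ldexp(1.0, static_cast<int>(bits));  // 2^bits
    double log_no_collision = 0.0;
    for (std::uint32_t i = 1; i < count; ++i) {
        // The (i+1)-th hash must avoid the i values already taken.
        log_no_collision += std::log1p(-static_cast<double>(i) / space);
    }
    return -std::expm1(log_no_collision);  // 1 - P(no collision)
}
```

Under those assumptions CollisionProbability(50, 32) comes out at roughly 2.9e-7, or about 1 in 3.5 million. Real names pushed through a real hash function are not perfectly uniform, so treat this as an order-of-magnitude figure only.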

This leads us to where we are now: we have fixed the crash and added back new logging to try to find the hash collision. We have implemented a hash collision checker locally (which was trivial because we already had a cache in development mode that stores a variable name against its hash, so you can debug the variables in-game). So now we will either find the offending case locally (with QA help) or, if we have no luck, we can do something which we have done in the past to find it: in the situation where it would have failed to clean up the original killer variable, we fork the instance process and crash the fork in order to generate a callstack (without affecting the actual instance).
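A collision checker built on a name-to-hash cache can be very small. The sketch below is an illustration under assumptions, not the actual implementation: HashCollisionChecker and Register are hypothetical names, and FNV-1a is used only as a stand-in because the post does not say which hash function the scripting system uses.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical development-mode checker: remembers which name produced each
// hash and reports when a different name maps to a hash that is already taken.
class HashCollisionChecker {
public:
    using HashFn = std::function<std::uint32_t(const std::string&)>;

    explicit HashCollisionChecker(HashFn hash) : hash_(std::move(hash)) {}

    // Registers `name`. Returns a pointer to the previously registered name
    // if a *different* name already owns the same hash, otherwise nullptr.
    const std::string* Register(const std::string& name) {
        const std::uint32_t h = hash_(name);
        auto result = names_by_hash_.emplace(h, name);
        if (!result.second && result.first->second != name) {
            return &result.first->second;  // collision: same hash, different name
        }
        return nullptr;
    }

private:
    HashFn hash_;
    std::unordered_map<std::uint32_t, std::string> names_by_hash_;
};

// 32-bit FNV-1a, used here purely as an example hash function (an assumption).
inline std::uint32_t Fnv1a32(const std::string& text) {
    std::uint32_t h = 2166136261u;  // FNV offset basis
    for (unsigned char c : text) {
        h ^= c;
        h *= 16777619u;  // FNV prime
    }
    return h;
}
```

Passing the hash function in as a parameter means the checker can be exercised with a deliberately weak hash (for example, one that returns the string length) to confirm that it really does report collisions.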

Game Development

Building and creating video games (and software) is hard. How hard? It can be a difficult thing to describe and understand. The public perception, at least in my eyes, appears to fall into one of two camps: either software is complex and basically magic, or people are simply incognizant of the hidden workings of the software behind the interface. I often get asked, "So how do you actually make games? How do you make things appear on the screen?". The real answer requires thousands of hours of knowledge, study and understanding across multiple disciplines. In simple terms, though, the answer really is mathematics and logic combined with physics and electronics: many years of very smart people building technology, with layers of systems and mechanics combining to get us where we are today, able to make incredible 3D simulations with highly detailed graphics and interaction.

"Why does this crappy game keep crashing?", "Why is this software broken?".

These are common questions, and frustrating when you are an end user. Unfortunately, like vehicles, infrastructure and other complex technology, software is no different in that it can have faults, deteriorate over time and just flat out break.

I think it can be hard to comprehend just how complex software can be. Why is it so hard to make X or do Y?

Building a complex video game is like constructing a skyscraper: hundreds of thousands of moving parts, millions of interactions and connections, safety mechanisms, backup systems, hundreds of different feature requirements and aspects, you get the point. Software and games also have some other fun parts to consider. Video games are built and compiled from source code into machine code (at least in compiled languages such as C++) using software that itself has millions of lines of code and is also running on more software (your operating system), which itself is communicating and working on top of the hardware of your computer, which has its own low-level mechanisms and features. Programming does not exist in the physical space (at least in the general perspective, compared to traditional engineering), which means anything can interact with almost anything else. To compare, that is like every wall being able to touch and connect with every other wall in your skyscraper, as well as your wires, plumbing, flooring, air vents and everything else. Changing one small thing like the carpet on your 5th floor could accidentally break the door to the bathroom on the 50th floor. Of course traditional engineering has its own difficulties and I am not saying a game is more difficult to make than a skyscraper; not even close. But the idea is that software is in the same vein of complexity and likely has a lot more happening than you might realise.

You can create software from nothing, which can be a great benefit compared to traditional engineering. It also means more people are making software with a very low barrier to entry, whereas an engineer or architect needs tools, resources, money and usually a whole team. This results in a lot more software being made by people with a lot less knowledge, care, or planning.

Because of the lack of resource requirements (material-wise), and the ability for software to be adjusted and changed (relatively) easily, at least compared to traditional engineering, simple projects can scale up to massive ones. This means it is far more common for a project to end up far outside the scope and design that were originally intended, which contributes to having more bugs and more unforeseen outcomes / problems etc.

Bad software (usually) won't cause harm or death to users, whereas bad engineering absolutely can. Thus the restrictions, policies, safety checks and rigidity of traditional engineering are much stricter than those of software.

Another fun aspect of software is that it is basically equivalent to creating an entire skyscraper, with all of these layers of complexity, and then shipping that skyscraper to someone's location and placing it there, relying on their foundations, environment and supports to make sure it works. Every end user's PC is different: different hardware, different operating systems, different software, different versions, different internet connections, the list goes on. Traditional engineering would create a product designed specifically for a scenario (specific city, location, environment, setup, requirements etc.) but that just can't realistically be done with software. Of course there are exceptions to this, but the idea is there. It doesn't help that a lot of software is cheap (or free) and easily accessible (online), so many people will try to run software or games that their system can't even handle. Then they complain that the product is terrible, when it is like shipping a skyscraper into the desert with no support and wondering why it doesn't survive.

There are so many variables, so many layers of technology, so many moving parts and complexities that it shouldn't be too surprising when every now and then your program crashes, or lags, or displays some information incorrectly. Technology is complex and the layers we have built to allow for the creation of things are unbelievable. Nonetheless, we as humans clearly have the capability to create unbelievably complex systems with fail-safes and methods to stop problems occurring, and we should strive to make things that never break or, if they do, that fix themselves quickly without resulting in any damage or inconvenience. There are always two sides to the story. For every complex bug that only occurs on one guy's PC who hasn't got a graphics card nor updated his software in 5 years, there's a story of a simple typo in an obvious place from a negligent programmer that causes everyone to crash on startup. Software needs to get better and become more reliable, and people need to become more understanding and aware. After all, we as programmers don't want you to crash and have a bad experience!

Next time your software crashes or game bugs out, have a think and appreciate the skyscraper (both as the analogy and as a metaphor) of complex code that is running behind the scenes to make that thing tick!

The Template Insertion Pattern (C++)

Updated: Mar 21, 2021

How I created a strange design pattern and how it proved to be so incredibly useful.

Like many amazing finds or new things, this particular solution came from a tricky problem. To understand the issue best, I have created a small diagram below that shows a simplified hierarchy of the action classes:

*This diagram is not an actual representation of the classes, it just demonstrates the basic idea.

As you can see at the bottom of the hierarchy, a particular action has 3 classes (a shared implementation and then child classes for both the server and client specific code). This means the client and server implementations derive from a common base and it is easy to write common code for both as well as functionality for client / server using either separate calls or virtual functions that are overridden. Above these leaf classes we have many parent classes that handle various different functionalities.

The issue here, which is actually a relatively common issue, is this:

What happens when we need one of the parent class functionalities to have common server or client code? It would look like this:

Ah, the classic diamond inheritance: both the shared implementation and the server- or client-specific parent need to derive from the classes above, because their callbacks and variables rely on that information. I guess you could draw the diagram with the connection either how it is (with Fireball a child of Projectile Client) or with Fireball Client as a child of Projectile Client. Either way, the issue is the same (the client has ambiguous parent classes). What we need is for Fireball to derive from Projectile Client, but only on the actual client (because otherwise Fireball Server would be trying to derive from a class that it doesn't know about).

In the past, the solution has just been to put virtual functions in the shared implementation and have each child that needs the functionality implement it itself. This is especially annoying, though, when there are many classes and/or when the implementation is almost identical. Bonus annoyance if there is a lot of code for the implementation(s).

So we solved this by coming up with the idea of what we called 'template parent insertion'. I am sure there must be a formal name for this but I have not been able to find it (let me know if you know!). At first glance it appears very similar to The Curiously Recurring Template Pattern (CRTP), but in reality it is quite different, and is actually solving a different problem.

So how does it work? First of all, here is a diagram to help explain:

*I will attach some source code below to complement this diagram.

Basically the idea is that the shared class (Fireball in this case) is a template class which derives from its own template parameter. The child classes derive from this and pass in the class that they want Fireball to inherit from, via the template. Hence why, at the time, we sort of called it 'template insertion': it allows you to stick a class in-between. The server and client can insert different classes from each other, or one can insert a class while the other doesn't need to, which provides a great design for class hierarchies like this where we have separate client and server implementations with a shared base.
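As a minimal sketch of the shape described above (not the original source code): the class names follow the diagram, while the members speed, GetSpeed, PlayEffect, Damage, Render and Resolve are invented purely for illustration.

```cpp
#include <cassert>
#include <string>

// Common parent shared by the client and the server.
struct Projectile {
    virtual ~Projectile() = default;
    int GetSpeed() const { return speed; }
    int speed = 10;
};

// Client-only functionality that every client-side projectile wants.
struct ProjectileClient : Projectile {
    std::string PlayEffect() const { return "client effect"; }
};

// Shared implementation: a template that derives from its own parameter,
// so each side chooses which parent gets "inserted" above it.
template <typename Parent>
struct Fireball : Parent {
    // Parent is a dependent base, so its members are reached through this->.
    int Damage() const { return this->GetSpeed() * 2; }
};

// The client inserts ProjectileClient between Fireball and Projectile.
struct FireballClient : Fireball<ProjectileClient> {
    std::string Render() const { return PlayEffect() + " for fireball"; }
};

// The server has nothing to insert, so it uses the common parent directly.
struct FireballServer : Fireball<Projectile> {
    int Resolve() const { return Damage() + 1; }
};
```

Each leaf class has a single, unambiguous chain of parents (FireballClient, Fireball, ProjectileClient, Projectile on the client), so the diamond never forms. This shape is sometimes described as a mixin or parameterised base class.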

Pitfalls:

The main issues I faced with this design came from compiling with GCC on Linux. Firstly, in the base class (in this case Fireball), when referencing functions or members from the parent class (the template parameter), I always had to use this-> or it would be a compilation error. The other issue arose in the same scenario, but for template functions called on the result of members or functions from the parent class, where I always had to use the following syntax:

this->GetObject().template GetComponent< Health >();

Instead of:

GetObject().GetComponent< Health >();

Details of this issue can be read here.
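Both pitfalls come from the parent being a dependent base class: until the template is instantiated, the compiler cannot know what the parent contains. The sketch below reproduces them in a self-contained form. Object, Health, Mana and TargetHealth are hypothetical stand-ins, and the tuple-based GetComponent is just one simple way to make the example compile.

```cpp
#include <cassert>
#include <tuple>

struct Health { int value = 100; };
struct Mana { int value = 50; };

// Hypothetical game object with a template accessor for its components.
struct Object {
    template <typename T>
    T& GetComponent() { return std::get<T>(components); }

    std::tuple<Health, Mana> components;
};

struct Projectile {
    Object& GetObject() { return object; }
    Object object;
};

template <typename Parent>
struct Fireball : Parent {
    int TargetHealth() {
        // this-> is needed because GetObject lives in the dependent base Parent.
        // .template is needed because the type of this->GetObject() is not known
        // until instantiation, so without it the '<' would be parsed as less-than.
        return this->GetObject().template GetComponent<Health>().value;
    }
};
```

Some compilers have historically been more lenient about this, which may be why it only shows up when building with GCC; the this-> and .template forms are the portable spelling.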

So there it is, a relatively nice solution to a common problem. Although I'm sure in most situations you are better off rethinking your design such as trying to use free standing helper functions, or creating a separate implementation that doesn't derive from the Projectile, or even just using regular composition (storing a member that handles the specific implementation that can't be in shared code). For myself and the situation I had, this proved to be the easiest method to achieve what I needed, so maybe it will prove useful for others too. Programming has infinite possibilities and there is no perfect solution to everything. What is important though, is having as many patterns, tricks, designs and knowledge as possible to construct the best possible solution you can for a particular problem!

Example source code available here.
