Alex Denford

Personal Blog

Development, gaming, programming,

ideas, thoughts and more!

12 steps to making sure your dev team runs as inefficiently as possible

1. Never build up trust in your team or with your developers. You do not want a developer to go rogue and be trusted to work on what they want, or how they want to. This can cause devs to work on things you didn't expect. You also do not want to trust engineers with their problem solving and approaches. Every change should be criticized and validated by a host of other developers, including non-engineers.

2. Isolate engineers to pods and minimise collaboration across teams. You do not want developers to easily work together across teams. It is best if your senior engineers are limited in providing impact to the sub team they are part of and nothing else. Ensure developers cannot offer ideas, solutions or insight on anything outside of their team.

3. Forbid experienced engineers from working on the things they want to. Prevent any sort of rogue engineering or anyone who wants to improve or fix something out of pure choice. You should stick entirely to a rigid schedule that is driven by management and the release cycle, and not let engineers influence or manipulate this plan.

4. Hire as many low-skill junior engineers as possible, but provide little to no guidance, so as to create as many bugs as possible. Bonus points the more time they consume from experienced developers on the team. It also works really well when you hire engineers for a role and then have them work on a different part of the tech or a different project entirely.

5. Never value senior engineers and your 'old timers', the people who have all the knowledge. This is bad; the faster you can cycle new devs the better. We don't want anyone to build up too much knowledge of the code-base and become a gatekeeper (don't get me wrong, this actually can be a real issue, but honestly you should value your seniors and old timers at all costs).

6. Make sure code reviews cause a huge roadblock. Question every change and force developers to constantly try to keep up with the ever-changing guidelines, people's opinions and the goals of the project.

7. Ensure your build architecture is as slow as possible; you don't want people working too fast. If you have automated tests set up, make sure they are super flaky and break on random platforms. This keeps your devs on their toes and on the lookout for problems.

8. Never let your engineers become leads or managers. We want to make sure all leadership and project decisions are from non-tech people. This way we can control the engineers who have no clue how business or project management works. Extra points if you regularly cycle out new managers and leads who are new to the project AND have no tech experience.

9. Make sure that all work requires full design documents that have gone through approval from everyone on the team. This works best when the design is vague, forcing engineers to create the doc without knowing the full constraints, and then we can control them best by adjusting the design and showing that their solution will not work for the requirements. Bonus points if your design is driven by one or two people, who live in a different time zone and are hard to get a hold of. Even better if they aren't closely associated with the project and are not up to date with what the best design choices even should be.

10. Regularly reschedule your team structure, hierarchy and all the devs. You don't want anyone getting too comfortable. Moving teams, entirely changing teams, resizing teams and changing team priorities are all great ways to keep your engineers engaged and interested.

11. Limit all engineers to tiny PRs so that refactors and larger changes happen over many commits. Smaller PRs are easier to review and so it's better if chunks of work go into main bit by bit, and then we can do full test runs, QA checks and perf testing on every small change. This works well so that if engineers leave the project part way through a collection of work items, the partly done work is already there and ready for others to pick up and finish off (and definitely not get left unfinished in the project forever).

12. Regularly discuss and evaluate naming for features, projects and code files. This includes naming semantics/conventions. It's best if you regularly modify these conventions to keep up with changing times and new engineers' opinions. It is work time well spent if you do large naming refactors, or even better, do small commits that change the naming convention in parts of the code and then add a PR hook that breaks if you modify a file and don't update it to the new naming convention.

/s Tongue in cheek, if it wasn't blatantly obvious.

Disclaimer:

This is just a fun thought exercise and the points are just a collection of personal thoughts.

They are not reflective of my current employment nor any company I have previously worked for.

The mystery hash collision bug

Last week we released our new content update and league for Path of Exile called "Ultimatum League". With it players are encouraged to start a new character on the new Ultimatum league which has a new NPC with a unique and interesting new game mechanic revolving around doing a task for the NPC and trying to succeed to get the reward he offers you. Alongside the league itself, the release has added new player skills, many new items, changed a lot of balance, incorporated the last league into the main game and reworked a bunch of old systems.

With such a huge amount of new content and changes, you can expect issues. Even with a big QA team and a lot of testing and iteration, there are unfortunately bugs that slip through the cracks. Most of the time we fix most crashes (especially any bad ones) on the first launch weekend, with more minor ones being fixed in the following weeks. Normally though, unsurprisingly, bugs come from the new content, whether it be new player skills that have issues with one of the million possible other items or mechanics Path of Exile allows, problems with the league mechanic itself, mistakes in code refactors or balance / design flaws with recent changes. What is unusual though is when we get a bug or crash in a system that hasn't been touched in the recent content update, which was the scenario for the following bug.

Unfortunately the Ultimatum league launch suffered from server issues which I won't dive into here, firstly because I don't understand the details well enough to talk about them, but also for potential security reasons, and because I am not a server admin so I do not feel it is my place to discuss. While the servers were unstable, we weren't getting many actual crashes from players but there was one server crash that we saw (at very low numbers) that seemed strange. Once the server stability was fixed it became evident this crash was much more common than any others (most crashes have just a few cases, then there were 3-4 different crashes that had a few hundred cases, and then this one, which was reaching a few thousand after the first day). This is still very low compared to other issues we've had in the past on league launches but it needed to be fixed.

The crash itself occurred when an object was cleaned up after the player left the area, while trying to disconnect from a variable in our scripting system. I assume that the server only crashing after players had left is the reason why we never actually saw any player reports about this issue. You would only know if you were either in a party when it occurred or if you tried to go back to the old area (not very common, especially while levelling through the early game). Basically we have a scripting system which allows designers (and programmers) to write scripts for bosses / mechanics without having to write C++ and needing to recompile the game. The system allows for simple variables including the ability to store a game object (which could be an effect, or a boss or anything). This system is entirely self-sufficient and works by connecting to the "on deleted" event for an object, to ensure the variable is always in a valid state (not a dangling pointer). The callstack for this crash told us that there was somehow an object in the variable system that was garbage, and that the system was trying to disconnect from it after the player left the area, which crashed because the memory wasn't valid.
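To make the "on deleted" idea a bit more concrete, here is a minimal sketch of how a variable like this could keep itself valid. This is not our actual scripting code; `GameObject`, `ObjectVariable` and every method name below are made up purely for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical game object that exposes an "on deleted" event.
class GameObject {
public:
    using Callback = std::function<void()>;
    using ConnectionId = std::uint32_t;

    ~GameObject() {
        // Work on a copy so the list we iterate is not touched by callbacks.
        auto callbacks = on_deleted_;
        for (auto& entry : callbacks) entry.second();
    }

    ConnectionId connect_on_deleted(Callback cb) {
        const ConnectionId id = next_id_++;
        on_deleted_.emplace(id, std::move(cb));
        return id;
    }

    void disconnect_on_deleted(ConnectionId id) { on_deleted_.erase(id); }

    std::size_t connection_count() const { return on_deleted_.size(); }

private:
    std::unordered_map<ConnectionId, Callback> on_deleted_;
    ConnectionId next_id_ = 0;
};

// Hypothetical script variable that stores a game object and resets itself
// to null when that object is deleted, so it never dangles.
class ObjectVariable {
public:
    ObjectVariable() = default;
    ObjectVariable(const ObjectVariable&) = delete;
    ObjectVariable& operator=(const ObjectVariable&) = delete;
    ~ObjectVariable() { set(nullptr); }

    void set(GameObject* object) {
        // Always disconnect from the old target before switching.
        if (target_) target_->disconnect_on_deleted(connection_);
        target_ = object;
        if (target_) {
            connection_ = target_->connect_on_deleted([this] { target_ = nullptr; });
        }
    }

    GameObject* get() const { return target_; }

private:
    GameObject* target_ = nullptr;
    GameObject::ConnectionId connection_ = 0;
};
```

In a design like this, any code path that replaces the target without going through the disconnect step leaves a stale pointer behind, and the later disconnect then touches freed memory, which is the same shape of failure as the crash described above.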

First thing we did was add logging to find what object had the broken script variable. We also downloaded the core dump off of the live realm onto our local servers so we could debug it with GDB. Of course production cores usually have very little information and can be hard to debug to find useful information, especially when whatever was going wrong had already gone wrong and the crash only happened later when cleaning up. We did find out the hash of the variable and used that to find that the broken variable was the "killer" variable, used to point towards what object killed something. We then discovered that the variable was on the player object itself, which meant the player was dying. So basically players were dying to something (the logging showed no real pattern of areas or things killing the players, just the usual dangerous bosses etc.), then that killer was deleted, which resulted in a crash when the player left because the variable pointing to the killer was garbage. From here we added several different sets of logging trying to narrow it down further but could not find out anything else.

I spent several hours the day after launch reading through the scripting system and variables trying to find a flaw. Luckily I did: I found a piece of code that could result in what we were seeing. The problem was that it could only trigger in a very, very specific situation: the killer variable had to be set, then a new killer variable had to be set to null but with a flag of either "should serialise" or "locked", and then the original killer monster had to be deleted. The problem was that the killer variable is never set to null, and it is also never set with flags. Nonetheless, there was a bug in this situation where the old variable wouldn't have been cleaned up properly, so I was sure it must be the cause. It was at this point it clicked and we realised the possibility: hash collision! Every programmer has had to deal with hash collisions at some point, but when you're dealing with hashes barely totalling more than 50 across the whole game at a time, in a 32-bit hash space, it seems unfathomably unlikely that you could get a hash collision. Yet it was the only explanation for the bug. We quickly made a commit to address the flaw and immediately all cases of this crash stopped on production! Success, job well done, right? Well unfortunately, we were no closer to finding the offending hash collision (assuming that is actually what is causing the crash).
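To put a rough number on "unfathomably unlikely", here is a small back-of-the-envelope sketch of the classic birthday-problem calculation. It assumes the hash behaves like a uniformly random 32-bit value, which a real hash function only approximates, and `collision_probability` is a name made up for this example.

```cpp
#include <cmath>

// Probability that at least two of `count` uniformly random values drawn
// from `space` possible values are equal (the birthday problem).
double collision_probability(unsigned count, double space) {
    // Sum the logs of (1 - i / space) to keep precision for tiny terms.
    double log_no_collision = 0.0;
    for (unsigned i = 1; i < count; ++i) {
        log_no_collision += std::log1p(-static_cast<double>(i) / space);
    }
    // 1 - exp(x), computed accurately when x is very close to zero.
    return -std::expm1(log_no_collision);
}
```

Calling `collision_probability(50, 4294967296.0)` gives roughly 2.9e-7, or about 3 in 10 million, for 50 names in a 32-bit space. That is tiny for any one set of names, although it only describes an idealised hash and a fixed set of 50 names.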

This leads us to where we are now: we have fixed the crash and added new logging to try to find the hash collision. We have implemented a hash collision checker locally (which was trivial because we already had a cache in development mode that stores a variable name to a hash, so you can debug the variables in-game). So now we will either find the offending case locally (with QA help) or, if we have no luck, we can do something which we have done in the past to find it: in the situation where it would have failed to clean up the original killer variable, we fork the instance process and crash the fork in order to generate a callstack (without affecting the actual instance).
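A collision checker along these lines can be very small. The sketch below is not our actual code: `HashCollisionChecker` and `record` are hypothetical names, and 32-bit FNV-1a is only a stand-in, since the real hash function isn't something I'm describing here.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>

// Stand-in hash for illustration: 32-bit FNV-1a.
constexpr std::uint32_t fnv1a32(std::string_view text) {
    std::uint32_t hash = 2166136261u;      // FNV offset basis
    for (unsigned char c : text) {
        hash ^= c;
        hash *= 16777619u;                 // FNV prime
    }
    return hash;
}

// Development-only cache from hash to variable name that reports when two
// different names end up with the same hash.
class HashCollisionChecker {
public:
    // Returns the name already stored for `hash` if it differs from `name`,
    // otherwise an empty optional.
    std::optional<std::string> record(std::uint32_t hash, std::string_view name) {
        auto [it, inserted] = names_by_hash_.try_emplace(hash, name);
        if (inserted || std::string_view(it->second) == name) {
            return std::nullopt;
        }
        return it->second;
    }

private:
    std::unordered_map<std::uint32_t, std::string> names_by_hash_;
};
```

Usage would be something like `checker.record(fnv1a32(name), name)` wherever a variable name is hashed, logging or asserting whenever a value comes back. Taking the hash as a parameter keeps the checker independent of whichever hash function is actually in use.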

Game Development

Building and creating video games (and software) is hard. How hard? It can be a difficult thing to describe and understand. The public perception, at least in my eyes, appears to fall into one of two camps: that software is complex and basically magic, or being incognizant of the hidden workings of the software behind the interface. I often get asked, "So how do you actually make games? How do you make things that appear on the screen?". The real answer to this requires thousands of hours of knowledge, study and understanding across multiple disciplines. In simple terms though the answer really is mathematics and logic combined with physics and electronics. Many years of very smart people building technology, and layers of systems and mechanics combining, have brought us to where we are today, able to make incredible 3D simulations with highly detailed graphics and interaction.

"Why does this crappy game keep crashing?", "Why is this software broken?".

These are common questions, and frustrating when you are an end user. Unfortunately, like vehicles, infrastructure and other complex technology, software is no different in that it can have faults, deteriorate over time and just flat out break.

I think it can be hard to comprehend just how complex software can be. Why is it so hard to make X or do Y?

Building a complex video game is like constructing a skyscraper: hundreds of thousands of moving parts, millions of interactions and connections, safety mechanisms, backup systems, hundreds of different feature requirements and aspects, you get the point. Software and games also have some other fun parts to consider. Video games are built and compiled from source code into machine code (at least in compiled languages such as C++) using software that itself has millions of lines of code and is also running on more software (your operating system) which itself is communicating and working on top of the hardware of your computer which has its own low level mechanisms and features. Programming does not exist in the physical space (at least in the general perspective, compared to traditional engineering), which means anything can interact with almost anything else. To compare, that is like every wall being able to touch and connect with every other wall in your skyscraper, as well as your wires, plumbing, flooring, air vents and everything else. Changing one small thing like the carpet on your 5th floor could accidentally break the door to the bathroom on the 50th floor. Of course traditional engineering has its own difficulties and I am not saying a game is more difficult to make than a skyscraper; not even close. But the idea is that software is in the same vein of complexity and likely has a lot more happening than you might realise.

Unlike in traditional engineering, you can create software from nothing, which can be a great benefit. It also means the barrier to entry is very low and more people are making software, whereas an engineer or architect would need tools, resources, money and usually a whole team. This results in a lot more software being made by people with a lot less knowledge, care, or planning.

Because of the lack of resource requirements (material wise), and the ability for software to be adjusted and changed (relatively) easily, at least compared to traditional engineering, simple projects can scale up to massive ones. This means it is far more common to have a project that ends up far outside the scope and design that was originally intended, which contributes to having more bugs and more unforeseen outcomes / problems etc.

Bad software (usually) won't cause harm or death to users, whereas bad engineering absolutely can. Thus the restrictions, policies, safety checks and rigidity of traditional engineering are much greater than those of software.

Another fun aspect of software is that it is equivalent to basically creating an entire skyscraper, with all of these layers of complexities, and then shipping that skyscraper to someone's location and placing it there, relying on their foundations, environment and supports to make sure it works. Every end user's PC is different: different hardware, different operating systems, different software, different versions, different internet connections, the list goes on. Traditional engineering would create a product designed specifically for a scenario (specific city, location, environment, setup, requirements etc.) but that just can't be done realistically with software. Of course there are exceptions to this but the idea is there. It doesn't help that a lot of software is cheap (or free) and easily accessible (online) so many people will try to run software or games that their system can't even handle. Then they complain that the product is terrible, when it is like shipping a skyscraper into the desert with no support and wondering why it doesn't survive.

There are so many variables, so many layers of technology, so many moving parts and complexities that it shouldn't be too surprising when every now and then your program crashes, or lags, or displays some information incorrectly. Technology is complex and the layers we have built to allow for the creation of things are unbelievable. Nonetheless, we as humans clearly have the capability to create unbelievably complex systems with fail-safes and methods to stop problems occurring, and we should strive to make things that never break, or if they do, that fix themselves quickly without resulting in any damage or inconvenience. There are always two sides to the story. For every complex bug that only occurs on one guy's PC who hasn't got a graphics card nor updated his software in 5 years, there's a story of a simple typo in an obvious place from a negligent programmer that causes everyone to crash on startup. Software needs to get better and become more reliable, and people need to become more understanding and aware. After all, we as programmers don't want you to crash and have a bad experience!

Next time your software crashes or game bugs out, have a think and appreciate the skyscraper (both as the analogy and as a metaphor) of complex code that is running behind the scenes to make that thing tick!
