As programmers, we can embed fail-safes into our code. In essence, we get to decide what failure looks like. As we write code, we choose how rigorously we validate our work: for example, checking that values are not null, lists are not empty, or numbers stay within valid ranges. When something bad happens, and we wrote code that correctly tests for it, we can either accept what happened and let the program continue (failing silently) or halt the program completely in a crash. Failing silently versus crashing is not a difference of programming philosophy – a good programmer will use either strategy depending on the situation.
Failing silently sounds like the more appealing implementation at first, because projects will appear to be more robust. However, letting the program continue after an error can let problems fester, and it makes the origin of a defect much harder to track down. Crashing doesn’t obscure problems, but the negative consequence is obvious – once a project ships, crashing is unacceptable. During development, crashing can also lead to some pretty embarrassing demos and create blockers for other developers and quality assurance.
Here are some example scenarios.
- Let’s say we are in the middle of development, where it is incredibly common to have only partial data or partially defined objects: maybe a character doesn’t have a hair color, perhaps quests are missing some display text, or icons have not been made for the user interface yet. All of these are cases where it is appropriate to fail silently; none of these issues is serious enough to halt development until it is fixed (although perhaps they should fall back to a default).
- Let’s say that your game loads some files when it initializes; maybe these files define the game world, list all of the game items, or link characters with image and audio files. If a file is missing or doesn’t pass validation, you want the game to fail immediately, even before the developer checks in their work to source control and “breaks the build” for other developers. Crashing or halting the program is appropriate any time the problem would be a blocker if the user encountered it.
- Let’s say your game connects to some other service that sends it data (maybe a teacher can configure difficulty or content for their students from a webpage) and this service goes down or starts sending incoherent data. Let’s also say that this service is developed by a 3rd party whose problems you can’t fix. If you go with the fail silently strategy, throwing out invalid or missing teacher configuration and falling back on some default values, you have completely hidden the problem, particularly since the teacher and the 3rd party service are not playing the game. If you crash or halt the game, you are creating a situation that demands an immediate fix but is out of your control to fix (not to mention taking the heat for someone else’s mistake).
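The first two scenarios can be sketched in Python. This is only a minimal illustration under stated assumptions, not code from a real project: the field names (`name`, `hair_color`), the `DEFAULT_HAIR_COLOR` value, and the `load_characters` function are all hypothetical. Optional data falls back to a default, while missing required data stops the load immediately.

```python
import json

# Hypothetical default used when optional data has not been authored yet.
DEFAULT_HAIR_COLOR = "brown"


def load_characters(raw_json):
    """Parse character definitions, halting on problems that would block play."""
    characters = json.loads(raw_json)  # raises ValueError on malformed JSON

    if not isinstance(characters, list) or not characters:
        # A missing or empty character list would block every player: halt now.
        raise ValueError("character file must contain a non-empty list")

    for index, character in enumerate(characters):
        if not isinstance(character, dict) or not character.get("name"):
            # Required field: fail at load time, before the work is checked in.
            raise ValueError(f"character #{index} has no name")
        # Optional field: fail silently by falling back to a default.
        character.setdefault("hair_color", DEFAULT_HAIR_COLOR)

    return characters
```

Calling `load_characters('[{"name": "Ana"}]')` returns the character with the default hair color filled in, while a file containing a character without a name raises `ValueError` as soon as the game initializes.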
Failing silently and crashing are basic examples of error handling: how the program responds once an error has occurred. Ideally we want to catch or prevent errors as early as possible, before they affect our co-workers and definitely before they reach end users.
The reality of game development is that everyone wants a bug-free product, but it is hard to convince anyone to invest time in work that is not new features. Time is an extremely limited commodity. Game development is also very organic. We evaluate the project and implement new ideas all the time, so during development we are constantly wiring things together in new and unplanned ways. Preventing defects efficiently in the first place is essential, and many strategies have been devised to do so:
- Unit Tests, which are just more code that defines input and expected output for individual methods. There is also a variation called Test-Driven Development, where you write the tests first and then write the code that satisfies them. Not all parts of a game can be unit tested. For example, unit tests check return values and not all methods return one, so code such as the user interface is hard to cover, and any feature that has a large amount of variance or randomness will be hard to unit test.
- Smoke Testing, which is an agreed-upon checklist of things for a developer to sit down and play to prove to themselves they didn’t break anything major before they check in their work. A smoke test could look like “all developers need to play through the tutorial” or “developers need to complete world 1-1” before submitting their work. There is inevitable human error in smoke testing – a developer can just forget to do it, or skip it thinking “there is no way this small change will affect this other system.”
- Bots, which are AI-simulated players used to play a game. Bots are usually only seen in multiplayer games, where they are used both for load testing (how many players a server can support) and to find defects that require multiple players – for example, multiple bots killing an enemy at the same time. Writing bots is pretty complicated and time-consuming – usually you will need to take shortcuts like teleporting instead of walking, so bots do not play the game in the same way a person would.
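To make the unit test idea from the list above concrete, here is a small sketch using Python’s standard `unittest` module. The `clamp_health` function is a hypothetical example invented for illustration; the point is only that each test pairs an input with its expected output.

```python
import unittest


def clamp_health(value, max_health):
    """Hypothetical game helper: keep health within the range 0..max_health."""
    return max(0, min(value, max_health))


class ClampHealthTest(unittest.TestCase):
    """Each test defines an input and the output we expect for it."""

    def test_value_in_range_is_unchanged(self):
        self.assertEqual(clamp_health(50, 100), 50)

    def test_negative_value_becomes_zero(self):
        self.assertEqual(clamp_health(-10, 100), 0)

    def test_overheal_is_capped_at_max(self):
        self.assertEqual(clamp_health(250, 100), 100)
```

Saved in a file, these tests can be run with `python -m unittest`. A method like this is easy to test because it returns a value and has no randomness, which is exactly what the harder cases mentioned above lack.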
No single strategy is perfect, and they all take time and require updates throughout development. My personal preference for preventing errors is known as Defensive Programming. It is a combination of a few ideas:
- You are never going to be better at debugging your code than when you are writing it for the first time.
- You need to write code for things that you never expect to happen or never should happen.
- Instead of only letting code fail silently, you add warnings that capture those unexpected events.
- Preferably you are logging these warnings where they are preserved even if your application crashes.
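The last two ideas can be sketched with Python’s standard `logging` module. Treat this as one possible arrangement rather than a definitive setup: the log file location, the `icon_for_item` function, and the `icon` and `id` fields are hypothetical. The code still falls back to a default, but every unexpected case leaves a warning in a file that survives a crash.

```python
import logging
import os
import tempfile

# Assumption: warnings are written to a file so they outlive the process.
# The file name and its location in the temp directory are hypothetical.
LOG_PATH = os.path.join(tempfile.gettempdir(), "game_warnings.log")

logger = logging.getLogger("game")
logger.setLevel(logging.WARNING)
logger.addHandler(logging.FileHandler(LOG_PATH))

DEFAULT_ICON = "icons/placeholder.png"  # hypothetical placeholder asset


def icon_for_item(item):
    """Return an item's icon path, warning whenever the data is incomplete."""
    if item is None:
        # This should never happen; record it instead of hiding it.
        logger.warning("icon_for_item was called with None")
        return DEFAULT_ICON

    icon = item.get("icon")
    if not icon:
        # Still fail silently for the player, but leave a trace for developers.
        logger.warning("item %r has no icon; using placeholder", item.get("id"))
        return DEFAULT_ICON

    return icon
```

The game keeps running in both unexpected cases, yet the log records which item was missing data, so the problem is not obscured the way a purely silent fallback would obscure it.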
For example, one rule is to always write an “else” case in a conditional statement, or a “default” case in a switch statement, even if it only contains a warning indicating that you should never hit that case. It takes a little longer to write code with extra sanity checks, but it is the most efficient practice compared to the other defect-prevention strategies. Let’s look at a pseudo-code example. Consider that we are writing a system where people can make friends. Here is an example implementation that can crash: