Velocity 2X on Steam – Bug 166

A tale of disappointment, disbelief and gratuitous overkill, 13 months in the making.

cs_024

Background

This is the tale of the bug that got away. When we’d finished Velocity 2X (PC) there was one bug that had been waived, Bug 166: “Windows 10 with a Geforce GTX 760 has stability issues running the missions in the Critical Urgency Pack”. At the time, Windows 10 had not been launched, we did not have a copy, and one single GTX card not working for one DLC mission did not seem like that big a deal. Reading our logs, it wasn’t much effort to get Sierra to waive it, and we proceeded to launch on the Steam Store.

For those not in the know, it takes a few weeks from us getting a game declared finished, to it showing up for sale. In the case of the all-digital Steam Store, this gives marketing a chance to build the hype and make sure it comes out of the gate with a strong launch. And in our case, also let Microsoft slip Windows 10 out to retail!.

Microsoft aggressively started pushing Windows 10 upgrades on all their Windows 7/8 users, With a degree in computer design and considerable software engineering chops, I recall it being far more effort than I wanted to ensure my PC stayed behind on Windows 8.1. I’m old enough to remember the adage: “Don’t upgrade Windows, until SP1 has shipped”.

This forced-upgrade strategy got them the “strongest Windows launch ever”, and looking at today’s numbers just under half of you have made the switch.

Soon after launch, the forums started filling up with reports that the bug was bigger than our report indicated. It was occurring on all NVidia cards more advanced than the GTX 500 series (that’s a really old one now), with Windows 10. It was also not confined to a single DLC, rather it was preventing progression of the main game, blocking level 9. This couldn’t be ignored any longer.

Pass the Buck

It was time to dig in, and get to the bottom of what was going on. We took an image of the hard-drive from one of the PCs here, and then let it upgrade itself to Windows 10; I still wasn’t prepared to keep Windows 10 after this was all done.

For the 1st time ever, I was able to repro the bug: 0x8876017c – “Out of Video Memory”

That did not seem possible. I was running on a GTX 980 which had 4GB of RAM. GPU-Z was telling me that ~1GB was in use. It failed trying to allocate 1620 bytes, that should have been fine.

It therefore seemed likely that this was some kind of configuration error. Painstakingly adding logging to the code, and cross-checking all D3D calls for correctness revealed a few minor issues, but fixing those made no difference – the bug was still occurring, but only on non-ancient NVidia cards.

We knew that our code was working on all Intel / ATI cards, and also the ancient NVidia cards. We knew that we only accessed all D3D resources from the main thread (bad threading can do all manner of nasty things to you). We knew that nowhere did we leave a D3D resource dangling, every one created was also destroyed. We’d made sure everything was aligned to 16/32/64 byte boundaries, as that’s been known to cause odd irregularities. Finally an internal RAM count was close-enough to GPU-Z’s figure that we felt we could believe theirs. This was NOT our problem.

We had powerful friends; Sierra is a label of Activision, the world’s biggest game’s publisher. We reported our findings to them, and asked if they had connections in NVidia they could call upon. They came back to us, and confirmed that NVidia would look at this. They asked for a debug build without Steam, so they could dive into it and make measurements – we have nice configuration-management here, so that was trivial to give them. All we had to do now was wait…

…and wait…

…and wait…

…and wait…

cs_043

The One that Got Away

Somewhere in the middle of all of this, we came so close to getting an acceptable workaround. As soon as the PC version was released, we set about doing the ports to Mac / Linux. For the first time ever, we couldn’t use Visual Studio to do our development, so we needed a new plan. I’m a firm believer that every cloud should have a silver lining, so saw this as an opportunity to upgrade our configuration management even further.

We migrated to CMake, which let us generate XCode projects on the Mac, and (with a little bit of coaxing) Visual Studio projects for PC/Vita/PlayStation 4/Xbox One.

One of the weaknesses of Visual Studio, is that it only lets you switch between their predefined platforms, there’s no options to have “virtual” platforms such as “Win32 DX9” and “Win32 DX11”. CMake gave us an unlimited number of configuration options, which could be applied. This was the perfect time to wrap Xbox One’s DX11 code onto the PC platform. Until this point we were either/or DX9/DX11, now we could be both.

Running on DX11, Windows 10 and modern NVidia was able to load into that fated DLC level. We no longer needed a fix from NVidia, we could just provide a DX11 / DX9 switch in the Launcher and let all our NVidia players progress with the DX11 version. We were still targeting Shader Model 3, so given they were already on Windows 10, we knew this would be compatible.

However right about that time FuturLab’s next project got cancelled. Suddenly we no longer had the time or the people in place, to complete the Launcher upgrade and more importantly provide QA for what effectively would be a completely new set of engine code; we’d expect a modest number of bugs, for that magnitude of change.

Outside the Box

A few months later, after the dust had settled, James from FuturLab pinged me on Skype: “Are you sure there’s nothing you can do for the Windows 10 users?”. I went over the arguments, that we did not have the QA capability to push through such a major change as a port to DX11. Now I’m not one to be beaten, so it was time to think about other ways to skin this cat.

From all we’d learned about this bug, perhaps the strangest thing had been that there was no Google hits for all the related searches. We were certain that this was an NVidia problem (and still are), but for some reason we were the only ones experiencing it. Now sure, putting out a DX9 title on Windows 10 is a little bit niche, but we originated on Vita so DX9 was more than capable of delivering our feature-set, and got us on the widest set of machines possible.

This led to us looking for what we were doing, that was somewhat atypical. We’d already been over our engine code, and knew that we were talking to D3D correctly now. What we needed was a view from a higher level than the engine. We’d been aware that we were looking at pages and pages of diagnostic output once we marked up the engine, but the question was how many resources were we trying to create?

This was the point at which we struck gold. There were nearly 30,000 vertex buffers, and 30,000 index buffers too. Compared to every game I’ve ever worked on, that is an obscenely high number. While there’s not supposed to be an upper-limit on these, it was easily believable that somewhere in the driver there was actually a limit set so high, that normally people wouldn’t bump into it.

A review on where these were mostly coming from, showed them as “Sprite”, which is basically one for each tile you see in the level map. While we never drew all 30,000 at once, we wanted to get them all ready to draw while the level was loading, so the game could draw the subset that was on screen without any performance penalty. As things stood each Sprite has a vertex buffer, and index buffer and a vertex declaration – all D3D resources, which our hypothesis reckoned would be in finite supply.

What we really wanted to do here, was to have a single D3D resource for each of vertex buffer, index buffer, vertex declaration that got shared across all our Sprites. This would also give us a minor performance boost, as we’d not be switching these out every draw call too. It was then a matter of finding a way to get this implemented, that preserved as much of the existing code as possible, because the more you change, the more the bugs will appear:

99-little-bugs-in-the-code

We could see a path to do this. We could introduce the concept of a master vertex/index buffer, which we could assign to our current buffers before they allocated their storage. This would allow them to behave as they did before, but now be a view on a larger buffer with a predetermined size and offset. All our references to those buffers occurred in the Geometry code, as it prepared to draw them. This was good, other than the setup difference, there would only be one other place needing a change.

Of course to allocate from a big buffer, it would need some memory management too. Unlike our existing memory manager, it could not keep its bookkeeping data in the memory it was managing, as VRAM is hideously slow to access from the CPU. Given the way we pretty much clear resources between the menu and the game level, this was not expected to have to deal with anything too crazy and could afford a somewhat simple algorithm. Still it was going to be twice as much code, as the actual VB/IB modification.

This was looking like a couple of days of effort, which in itself did not seem too bad. However we were doing this blind – we did not have conclusive proof that NVidia had a resource constraint in their driver. That seemed like a lot of effort to commit to a hunch. Was there a way of confirming this hypothesis first?

With a little bit of lateral thinking, there was. We already had a central vertex and index buffer, and we used them for our dynamic vertices – stuff that changes every frame. We were already setup to switch to a dynamic VB/IB with a single call. We just needed to preserve the data that was to be written, and provide a callback to fill it in every frame. Finally, the first typing frenzy could begin.

Around 50 lines later, all our Sprites were setup to archive their vertices and indices, and then copy them into the dynamic vertex/index buffer when required. Finally we were able to load into Level-9 and CU03 on Windows 10 – our hypothesis had been proven. We were just trying to allocate too many things.

Now we had proof this was going to work, putting a couple of days into it didn’t seem so bad. It was time for typing frenzy no. 2. Of course not everything went quite to plan, this is software development after all, but nothing unfixable was encountered along the way.

The biggest woe was that we’d gone over the 65,536 vertex limit that our 16-bit index buffers were able to access. Most of our code already supported 32-bit indices, so bringing this into Sprite was not overly onerous. However, as it turns out there were 4 visual effects which had somewhat extended the Sprite beyond its original design goal. As the engine developer, I had no idea this mischief was going on, but debugging to find the first did not take too long. Once that was discovered, it was then easy to extend our Lock classes for the Index buffer to check that the type returned matched the size of the index. The other 3 popped straight out, one after the other. We now were able to see our levels drawn correctly, and YES we could load into Level-9 and CU03, result!

i’s and t’s

Writing a new memory allocator is always a scary business. It’s easy to make a mistake while coding, but in the case of allocators the mistakes rarely show up straight away. They show up after hours of play, when suddenly they report that there’s no memory left and everything dies in a smelly heap.

I wanted some kind of smoke test, to make sure that this wasn’t going to happen. As we noted above, the amount of QA that we can call on these days, is somewhat limited. It needed 100’s of hours of play, but was only going to see a fraction of that.

It seemed like it ought to be possible to force the game to return to the menu, once it had been in the level for a couple of seconds. Then when it was in the level select menu to load the next from a list of levels. The soak test was born.

Nervously, we kicked it off. It started at the beginning, and started loading the levels: 1,2,3,4,5… Each time they loaded, we could see enough of the level, to know that the new Sprite were behaving, things were looking good.

…32, 33, 34 and then boom! 0x8876017c – “Out of Video Memory”

Were we unable to load level 35?

No, that level loaded fine on its own. That’s bad, that pointed to there being a memory leak somewhere, that had accumulated on the sequence through to level 35. As I noted before, I’m the engine developer here – I have very little idea how the developers were using my engine. That needed to change.

After some unsuccessful poking around, trying to deduce what resources were leaking, I concluded that this was like looking for a needle in a haystack. I knew that there were 30,000+ sprites allocated on the later levels, and I could see that only 43 were leaking per level. We needed some better tools.

We are still very much on version 1.0 of FLAYED, our engine. There are a lot of bells and whistles that more mature engines contain, which we just didn’t have. One of these is a Callstack and MapFile system. Where you can put a probe in the code, which will sample the Callstack (the path the code has taken to reach the point it is at).

In this case we would put it where each Vertex Buffer was created, and then look for the ones that had not been cleared when the level was destroyed. With that information in hand, we then turn to the MapFile, which is able to translate all those magic numbers, back into source code files and line numbers.

A few 1,000 lines of code later, we had a Callstack / MapFile system. For once the gods were shining on this bug, it listed 11 different callstacks that had resources outstanding. Following each of those through, came back to just 3 instances that were not being freed correctly. Fixing those, and we were able to progress. All in all, we were able to “play through” the entire game, including the DLC, twice in one sitting – all within 20 minutes. That’s good enough, ship it!

Submitting to SVN, we get the report:
412 lines changed, 2,399 lines added, 41 lines removed.

That is probably the largest amount of code that I’ve ever done for a single bug fix. Even though I know I’ve been thorough in my self-testing, it was still somewhat of a surprise to hear from Sierra that there were no bugs. Still, “no bugs” means that it can go live, and so it will – probably before you’ve read this! Patch now!

cs_025

Conclusion

Thanks for reading this far, I hope my writing has conveyed something of the roller-coaster nature of the path to the solution for this bug. The bug’s been open for 13 months now, and it gives me much pleasure to finally close it. I am so terribly sorry that things have not been in alignment sooner, to give it the time needed for resolution. We were so close, 6 months ago, but it was not to be.

I hope this now means that many of you are finally able to finish the game! I hope it gives you as much pleasure playing it, as I had making it. I don’t mean that flippantly either, of the many games I’ve done over the last 23 years, this one has been the most enjoyable to work on; James, thanks for the opportunity.

Andy Yelland
Engine Developer, Velocity 2X
Head honcho, AY Software Ltd