OGRank - Pirates of the Burning Sea: stress test report

With 15,000 people downloading the 4GB client and joining the PotBS stress test weekend event, it is easy to call it a real success. The devs have some thoughts to share about the real payoff of these tests: bug hunting and server infrastructure benchmarking.

With the help of the fine folks at FilePlanet and SOE, we just ran our first Pirates of the Burning Sea stress test weekend. The event was an unqualified success. Despite all the painful downtime and very long days, we learned things about our servers that we simply couldn’t have learned any other way. Fortunately the stress testers were good sports the whole weekend, and every time the servers came back up they were ready for more.

Starting last Monday, FilePlanet began to distribute stress test keys and to allow people to download the 4GB client installer. Midday on Wednesday, they opened the test to all FilePlanet members (not just subscribers). On Thursday at 3pm Pacific time, SOE activated the keys, and we put up the servers, letting all 15,000 key holders log in for the first time. The response was immediate. Hundreds of people were in and playing within a few minutes. The numbers continued to climb for nearly three hours. Everything with the servers looked really good, and the hardware itself was actually pretty bored.

That’s when everything went horribly, gloriously wrong.

BigBrother
If you’ve read Brendan’s recent server tool devlogs, you’ve heard of this server process we have called BigBrother. BigBrother is responsible for keeping all the other server processes up and running. In fact, that’s pretty much all BigBrother does. Short of server crashes, most of the processes don’t go down, so BigBrother’s job is all about the Zone servers.

Each instance in PotBS requires a fresh zone server to start up. We keep a number of these around in the Idle state, so generally there’s one ready to go when someone enters an instance. Normally, the next idle server in the list is told to load the instance, and then the player zones in. If there aren’t any available for some reason, no big deal; BigBrother starts them constantly, so another will be ready soon.
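As a rough illustration of the idle-pool idea (a sketch, not the actual PotBS server code), the Python below keeps a target number of idle zones around and hands out the next one on request. The class name `IdleZonePool`, the `target_idle` parameter, and the zone naming are all hypothetical, and zone startup is treated as instantaneous, which the real servers certainly are not.

```python
from collections import deque


class IdleZonePool:
    """Keeps a small stock of idle zones ready to load an instance (hypothetical names)."""

    def __init__(self, target_idle):
        self.target_idle = target_idle  # how many idle zones we try to keep on hand
        self.idle = deque()             # idle zones, oldest first
        self.next_id = 0

    def top_up(self):
        """Start zones until the idle target is met; returns how many were started."""
        started = 0
        while len(self.idle) < self.target_idle:
            # Assumption: startup is instant here; a real zone takes seconds to start.
            self.idle.append(f"zone-{self.next_id}")
            self.next_id += 1
            started += 1
        return started

    def claim(self):
        """Hand out the next idle zone, or None if the pool has run dry."""
        return self.idle.popleft() if self.idle else None


pool = IdleZonePool(target_idle=2)
started = pool.top_up()
claimed = [pool.claim(), pool.claim(), pool.claim()]  # the third claim finds the pool empty
restocked = pool.top_up()
```

The interesting case is the `None` from the third claim: that is the moment a player ends up waiting for an idle zone, and everything depends on how quickly the pool is topped up again.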

At least that’s the theory. Unfortunately, once all those players hit the servers at the same time, it didn’t work out that way. A series of bugs conspired to keep idle zones from starting nearly fast enough to keep up with demand. So three hours into the playtest we ran out of idle zones, and very quickly every player on the cluster was waiting at a loading screen for an idle zone that would never come.

To make matters worse, we had another performance problem on the database that is responsible for keeping track of which processes BigBrother already had up and running. Shortly after we ran out of idle zones, about 75% of the processes in the cluster (including most of the BigBrother processes) shut down because they lost contact with the “server directory.” That’s exactly what they are supposed to do, as a means of preserving player data, but losing so much of the cluster disconnected most of the players.

We spent the next four days fixing the idle zone spawning problem and the server directory performance problem. We pushed out a dozen new versions of various servers (most of them BigBrother) over the course of the weekend. Misha, most of the programmers, and all the operations people were in until well after midnight every single night between Thursday and Sunday. We’ve all mostly recovered at this point, but boy were we tired by Sunday night. :)

Five Big Bugs
Five different bugs conspired to keep the idle zone processes from working correctly. Each of them individually wouldn’t have caused such a big problem, but when their powers combined, the infinite Waiting for Idle Zone screens began.

The first of them was pretty funny. If we ran out of idle zones at any point and had to wait for them to start up, we would put the people who needed to zone into a list and then pull them out one at a time as new zones became available. The problem is that we pulled them out in the opposite order from the one we put them in. If you were the first one to get in line for a new zone, you were the last to get one! If people were being added to the list more quickly than idle zones were starting up, only the last person in the list had ANY chance of ever getting a zone. Oops! Fortunately the fix for this was small and easy.
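The ordering bug is easy to reproduce in a toy simulation. The sketch below is an illustration only, not the real server code: the function name `serve_waiting_players`, its parameters, and the player names are made up. It compares pulling the newest waiter (the buggy order) with pulling the oldest waiter while players arrive faster than zones start.

```python
from collections import deque


def serve_waiting_players(arrivals, zones_per_tick, fifo):
    """Simulate players waiting for idle zones.

    arrivals: arrivals[t] lists the players who join the wait list at tick t.
    zones_per_tick: idle zones that become available each tick.
    fifo: True pulls the oldest waiter first; False pulls the newest (the buggy order).
    Returns the players in the order they were given a zone.
    """
    waiting = deque()
    served = []
    for new_players in arrivals:
        waiting.extend(new_players)
        for _ in range(zones_per_tick):
            if not waiting:
                break
            served.append(waiting.popleft() if fifo else waiting.pop())
    return served


# Players arrive faster than the one zone per tick that becomes available.
arrivals = [["p1", "p2", "p3"], ["p4", "p5"], ["p6", "p7"]]
buggy = serve_waiting_players(arrivals, 1, fifo=False)  # ['p3', 'p5', 'p7']
fixed = serve_waiting_players(arrivals, 1, fifo=True)   # ['p1', 'p2', 'p3']
```

In the buggy run, `p1` never gets a zone no matter how long the simulation continues under the same load, which matches the "first in line, last to be served" behavior described above.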

The second major problem was actually more a configuration problem than a code problem. The logging database, which the zone servers connect to when they start up, was a little confused and was rejecting about half the attempts to connect to it and slowing the others way down. Normally an idle zone takes about 3 seconds to start up. With this database problem they were taking more like 30 seconds, and half the time they didn’t start up at all. A SQL Server restart fixed it, but we didn’t realize this was actually the problem until later in the weekend.

To make the matter of not connecting to the database worse, we had a third bug that caused any idle zone that didn’t successfully connect to the flogger to hang when it tried to shut down. Not only did that put extra load on the server, but it also made BigBrother think the server was still starting up.

The fourth major problem we encountered was that a zone that hung on startup was actually able to tie up a zone slot and eventually stop BigBrother from starting any new zones at all. To keep from overloading the servers, BigBrother is limited in the number of servers it will start at any time. Every time one of the zones failed to connect, it took one away from the total number of servers BigBrother would start. After a while, BigBrother stopped spawning new zones entirely because it thought all the previous requests it had sent were still pending. We didn’t realize the root of the problem (that the DB needed a restart) until later, so this bit us for the first couple of days of the test.
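One common way to stop hung spawns from leaking throttle slots is to put a deadline on every pending request. The sketch below shows that general technique under stated assumptions; the post does not say how BigBrother was actually fixed, and `SpawnThrottle`, `max_pending`, and `timeout_s` are hypothetical names. Time is passed in explicitly so the behavior is easy to follow.

```python
class SpawnThrottle:
    """Caps concurrent zone spawns and reclaims slots from hung ones (hypothetical design)."""

    def __init__(self, max_pending, timeout_s):
        self.max_pending = max_pending  # most spawns allowed in flight at once
        self.timeout_s = timeout_s      # how long a spawn may stay pending
        self.pending = {}               # zone_id -> time the spawn was requested

    def try_spawn(self, zone_id, now):
        """Reserve a slot for a new spawn; returns False when the throttle is full."""
        self.expire(now)
        if len(self.pending) >= self.max_pending:
            return False
        self.pending[zone_id] = now
        return True

    def mark_finished(self, zone_id):
        """Release the slot when a zone reports ready OR reports failure."""
        self.pending.pop(zone_id, None)

    def expire(self, now):
        """Release slots held by spawns that never reported back; returns their ids."""
        stale = [z for z, t in self.pending.items() if now - t >= self.timeout_s]
        for z in stale:
            del self.pending[z]
        return stale


throttle = SpawnThrottle(max_pending=2, timeout_s=30)
first = throttle.try_spawn("zone-a", now=0)
second = throttle.try_spawn("zone-b", now=1)
blocked = throttle.try_spawn("zone-c", now=2)      # both slots are still pending
recovered = throttle.try_spawn("zone-c", now=31)   # the hung spawns have timed out
```

Without the `expire` step, `zone-a` and `zone-b` would hold their slots forever if they hung, and every later `try_spawn` would be refused, which is the failure mode described above.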

There were a number of different cases where BigBrother could get confused about how many processes were actually spawning and eventually stop being able to start more. It took most of the weekend for us to investigate and fix these cases, but at this point BigBrother is spawning servers just as well on a loaded server as on an empty one.

Finding the problems
The biggest problem we had during the test wasn’t actually figuring out how to fix the problems; it was figuring out what the problems were in the first place. If one of these problems cropped up on one of our development servers, we’d just pop open the debugger and look around in the various processes involved until we figured it out. Unfortunately that’s not an option in a cluster that’s under load, which the stress test cluster was all weekend.

The next best option is to turn on existing logging to figure it out from the logged events. We have logging scattered throughout the game and can selectively turn it on and off depending on what we’re debugging. Unfortunately the amount of logging in BigBrother at the start of the test was pretty meager.

My least favorite option for finding the source of a bug is to modify the code to add additional logging, since anytime you add new code it’s a risk. But that’s exactly what we had to do with BigBrother during the stress test. On Thursday and Friday Brady was sending new versions of BigBrother up to the operations people every couple hours. The trouble with adding logs to look for a bug that you don’t yet understand is that you don’t actually know what logging to add. It was sometime on Friday before we even knew that BigBrother was stopping itself from spawning new zones because of the throttling code. The actual fix wasn’t ready to go until sometime on Saturday.

Fortunately, as with every other bug that we’ve added logging to track down, this new logging code stays in the game. The next time we need to track down a bug with process spawning in BigBrother it will be a piece of cake.

Long Wait Times
At a little after 1am Friday morning, Brady and I were no longer in any condition to write code. We went home at that point, and got some sleep before heading back to pick things up again Friday morning. Unfortunately Gray Noten, our operations lead, wasn’t so lucky. The GMs called him every 30 minutes or so all night long to reset the servers that had, once again, run out of idle zones.

To avoid him having to do that two nights in a row, we turned on the login queues. We dropped the maximum player count on the servers down to a point that we knew they could easily handle for the overnight stretches so we could get some sleep. This worked pretty well Friday night, so we did it again on Saturday, with a somewhat higher limit. By Sunday night we had solved enough of the problems that we were able to keep the limits off all night.
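A login queue with an adjustable population cap can be sketched in a few lines. This is a minimal illustration of the idea, not the game's real login server: `LoginQueue`, `max_players`, and the method names are assumptions made for the example.

```python
from collections import deque


class LoginQueue:
    """Admits players up to a cap and queues the rest in arrival order (hypothetical names)."""

    def __init__(self, max_players):
        self.max_players = max_players
        self.online = set()
        self.waiting = deque()

    def request_login(self, player):
        """Admit the player if there is room and nobody is ahead of them; otherwise queue."""
        if len(self.online) < self.max_players and not self.waiting:
            self.online.add(player)
            return "online"
        self.waiting.append(player)
        return "queued"

    def logout(self, player):
        self.online.discard(player)
        self._admit()

    def set_max_players(self, max_players):
        """Raise or lower the cap, for example a lower limit for the overnight stretch."""
        self.max_players = max_players
        self._admit()

    def _admit(self):
        # Let queued players in, oldest first, while there is room under the cap.
        while self.waiting and len(self.online) < self.max_players:
            self.online.add(self.waiting.popleft())


queue = LoginQueue(max_players=2)
results = [queue.request_login(p) for p in ["a", "b", "c", "d"]]
queue.logout("a")         # frees a slot, so "c" is admitted
queue.set_max_players(3)  # raising the cap admits "d"
```

Lowering the cap in this sketch does not kick anyone who is already online; it only stops new admissions until the population drops below the new limit.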

These login queues were very frustrating for a lot of people. They did allow the people who made it into the game 6-8 hours of uninterrupted server uptime, and gave our operations people a much needed rest. I think they were the right thing to do, but now that the clusters are supporting much higher populations, we will hopefully never need to do it again.

Saint-like Patience
The big heroes in this story are the stress testers themselves. They braved constant server reboots and long overnight queue times all weekend long and just kept coming back for more. We did what we could to keep them informed about the state of things, and whenever the servers came back up they were immediately back to pound on things again.

I’m incredibly grateful for the patience shown by this weekend’s testers. The game is going to be much better as a direct result of your efforts. Your persistence helped us more than you can know. Those of you who didn’t make it into the stress test have these people to thank for the server stability you see when you do get a chance to play.

The Results
So with all these problems, how can I call this an “unqualified success”? Well, pushing the servers until they broke was the entire point of this test. That’s exactly what the weekend’s stress testers did. And they did it over and over and over.

We found about 6 major bugs (including the server directory slowness) and have put in fixes for every single one of them. By the end of the weekend we were able to keep up with the testers with no queues and no problems. We also learned a lot about exactly how high levels of stress affect the servers. We will use that knowledge to improve our automated testing to better simulate actual players and push the servers even further. That work is already underway this week and by next week, we expect to be running automated stress tests internally that will accurately recreate the conditions we had in the live stress test. That will let us bang on the servers a bunch more ourselves so we’ll be in even better shape before our next big public event.

We pushed the server architecture further this weekend than ever before. We hit concurrency numbers we’ve only hit with automation. We ended the weekend supporting way more players at the same time than we did when we started. This was a fantastic test for us and I can’t wait for the next one to see what happens when we push it even higher!