Launcher Outage March 17th

Rincewind

Senior Systems Administrator

03-17-2014

The details of today's outage do warrant a facepalm. It is a typical tale, one you could hear from almost any startup that goes big in a short time frame.

How does one track which systems are in production and which are not? When you are a small shop, you do it in a simple web page. Over time you add features to it, and one day you find it cumbersome and replace it with new software that tracks your inventory.

We moved to a newer system. It is fast, it has all the features to image systems at a rapid pace, and it is reliable. However, it is only reliable as long as the data in the inventory system is up to date and can be trusted.

As part of the migration to newer systems, we decommission older servers. That was the goal today: decommission older servers. Unfortunately it turned into a nightmare that caused a global outage.

What we call the launcher is the itty bitty little button that says "PLAY", the one players hit to transition to the login screen within the client. We seed the state of the launcher for different regions from a set of central database servers. Those same servers also serve other content, such as news and the pages you see right after logging in. This part of the system is fairly static and doesn't change often.

The content is served out from the database and cached in the web servers. Since the content is cached for quite a while, there is not a whole lot of traffic to the database. Today, as part of decommissioning the older servers, we shut down a few machines that were not in production according to the inventory tool.
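To make the dependency concrete, the serving path is roughly a read-through cache. Here is a minimal sketch in Python; the names, the TTL value, and the in-process dict are illustrative stand-ins, not our actual stack:

import time

CACHE_TTL = 3600  # content stays cached a long time, so the database sees little traffic
_cache = {}       # key -> (value, fetched_at)

def fetch_from_db(key):
    # Stand-in for a query against the central content databases.
    return "<content for %s>" % key

def get_content(key):
    entry = _cache.get(key)
    if entry is not None and time.time() - entry[1] < CACHE_TTL:
        return entry[0]  # fresh enough: no database round trip
    # Cache miss or expired entry: go back to the database.
    value = fetch_from_db(key)
    _cache[key] = (value, time.time())
    return value

The long cache lifetime is why the databases normally see so little traffic, and also why the failure took a few minutes to surface: nothing broke until cached entries started expiring.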

You can guess what happened next.

A few minutes later, the launcher, the news, and the pages in the client disappeared. The launcher showed "undefined" instead of "PLAY", and players around the world experienced a collective "We cannot log in!!!!"

The caching web servers expired their content and reached out to the databases for fresh content. However, they couldn't reach the databases, because the databases had been turned off. We had accidentally turned off servers that were in production; they were simply mislabeled as not in production.

After a few frantic phone calls to the folks in the datacenter, the database servers were booted back up. A database server does what databases do after an improper shutdown: it goes into recovery mode and checks the integrity of its data. Depending on the size of the data, that recovery can take some time. In our case it did, and we also had to run our own data integrity checks.

In parallel, we started implementing a workaround to get the play button fixed in case the databases took a while. The databases did take a while to recover, so in the end we used the workaround to restore a working, though not completely recovered, state. Players could now see “Play” instead of “undefined” and log in again.
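For the curious, one plausible shape for that kind of workaround is a "serve stale on error" fallback: if the origin database is unreachable when a cache entry expires, keep serving the expired entry, or a hardcoded default, rather than nothing. This is a hedged sketch, not our exact fix:

import time

CACHE_TTL = 3600
_cache = {}  # key -> (value, fetched_at)

def fetch_from_db(key):
    # Stand-in for the central content databases; raises while they are down.
    raise ConnectionError("database unreachable")

def get_content(key, default="PLAY"):
    entry = _cache.get(key)
    if entry is not None and time.time() - entry[1] < CACHE_TTL:
        return entry[0]  # still fresh
    try:
        value = fetch_from_db(key)
    except ConnectionError:
        # Origin is down: serving stale content beats serving nothing.
        # An expired entry still renders "Play" rather than "undefined".
        return entry[0] if entry is not None else default
    _cache[key] = (value, time.time())
    return value

Either way, the client never has to render an empty value just because the backend went away.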

FAQ
1) WTF Rincewind, this sounds so amateurish. How could you turn off systems without verification?
This is a training issue, and we are working on educating folks to use the tools efficiently, but also to verify that the information they present is actually correct. It is also an artifact of an environment that is moving from an old process to a new one. Stuff like this does happen in shops that went from small startup to big in a short time.
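Part of that verification can be mechanical: before powering anything down, check the box itself for signs of life instead of trusting the inventory record alone. Below is a hedged sketch of such a pre-flight check, run on the target host; the ss invocation and the zero-connection threshold are illustrative assumptions, not our actual tooling:

import subprocess
import sys

def established_connections():
    # Count established TCP connections on this host using `ss`.
    out = subprocess.run(
        ["ss", "-tn", "state", "established"],
        capture_output=True, text=True, check=True,
    ).stdout
    # The first line is the column header; the rest are live connections.
    # Note: if you run this over SSH, your own session counts as one.
    return max(0, len(out.splitlines()) - 1)

def safe_to_decommission(max_connections=0):
    live = established_connections()
    if live > max_connections:
        print("refusing to shut down: %d live connections" % live, file=sys.stderr)
        return False
    return True

if __name__ == "__main__":
    sys.exit(0 if safe_to_decommission() else 1)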

2) Ok, why on earth did you not fall back to the slave databases and bring everything back up?
We accidentally powered down the slave databases too as part of the decommission.

3) Err, it sounds like you have some single points of failure.
Yes, and we are fixing that right now. We are ensuring that no central database exists that seeds information for all regions. All the systems are being decoupled into their own services so that they can't take each other out.
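As a sketch of what that decoupling means in practice, each region gets its own content origin, so losing one origin affects one region at most. The hostnames here are invented placeholders, not real infrastructure:

REGION_ORIGINS = {
    "NA":  "content-db.na.example.internal",
    "EUW": "content-db.euw.example.internal",
    "OCE": "content-db.oce.example.internal",
}

def origin_for(region):
    # No shared central database: one region's outage no longer
    # takes launcher content down everywhere.
    return REGION_ORIGINS[region]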

4) How can we trust you not to repeat this in the newer datacenters, like the one in Oregon?
We are building the Oregon datacenter with a clean slate. We will not carry over the old tools or the potentially stale inventory information that hurt us here, and we have better tools to manage that environment without being affected by legacy servers or legacy inventory data.


Go and Uninstall

Member

03-17-2014

wee ha

Secondly, thank you for taking the time to explain the situation to us.


Summoner Rokutaa

Senior Member

03-17-2014

Thanks for the information.


Compenisator

Senior Member

03-17-2014

I appreciate you taking the time to explain what happened to us in such a detailed manner. Thanks.


CommandShockwave

Senior Member

03-17-2014

This is the kind of Riot post I love. It gives detail on the issues we've experienced. +1


I Surge I

Senior Member

03-17-2014

Wow a RED!


Senstrae

Senior Member

03-17-2014

I respect the brutal honesty here, but how rocky are things going to be while you work out the full fix and recovery?

Was... anyone's account affected, or is it *only* the login trouble?


best fizz alive

Senior Member

03-17-2014

will i still be lagging after the patch or should i switch to dota


Worst Talon Ever

Senior Member

03-17-2014

ᕕ( ᐛ )ᕗ


Go and Uninstall

Member

03-17-2014

that double post?

