When DraftKings migrated to a microservices architecture, we were plagued by relatively unimportant systems and features having a significant impact on the overall reliability of the site. An error or a piece of unhealthy infrastructure would impact the site globally rather than degrading only the functionality that was having issues. We developed a custom circuit breaker framework named Ground Fault to improve resilience, monitoring, and problem diagnosis. The framework evolved in response to production incidents. The presentation will be structured as a series of “war stories,” each followed by the functionality that we added to the framework in response.
Episode 1: Just How Many People Are in that Contest?
In November of 2014 we ran a free contest with well over 100,000 unique users. At the time this was the largest contest we had ever run. Turns out, there was a page on the mobile web site that displayed the user names of all the participants. This was a very low volume page and was not included in any of our load tests. When that contest started, the site went down and went down hard. Our database (yes, we had one at the time) decided now was the time for a new query plan. The query would time out, but the next request would make the database sick all over again. Our toolbox for identifying and fixing this problem was limited at the time. It was a very long day.
This incident spawned our re-architecture toward microservices. As part of this work we created the initial implementation of the circuit breaker, based on a Martin Fowler article and Hystrix from Netflix. The initial implementation focused on protecting the database from rogue queries. Features included timeout counters as well as trickling test commands through to determine whether the system had become responsive again.
This section will outline the basic architecture of our circuit breaker. This will include some code samples (in C#).
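To make the idea concrete, here is a minimal sketch of a breaker that counts timeouts and trickles a single test command through once a wait period has elapsed. It is not Ground Fault itself, and it is written in Python for brevity rather than in the C# used in the talk. The class name, the parameter names (max_timeouts, retry_after), and the thresholds are all assumptions made for illustration, and the sketch is single-threaded.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without running it."""


class CircuitBreaker:
    """Illustrative breaker: opens after repeated timeouts, then probes."""

    def __init__(self, max_timeouts=3, retry_after=5.0, clock=time.monotonic):
        self.max_timeouts = max_timeouts  # hypothetical threshold
        self.retry_after = retry_after    # seconds to wait before probing
        self.clock = clock
        self.timeouts = 0                 # consecutive timeout counter
        self.opened_at = None             # None means the circuit is closed
        self.probing = False

    def call(self, command):
        if self.opened_at is not None:
            waited = self.clock() - self.opened_at
            if self.probing or waited < self.retry_after:
                raise CircuitOpenError("circuit is open")
            self.probing = True  # let one test command trickle through
        try:
            result = command()
        except TimeoutError:
            self.timeouts += 1
            if self.probing or self.timeouts >= self.max_timeouts:
                self.opened_at = self.clock()  # open, or restart the wait
            raise
        finally:
            self.probing = False
        # The command succeeded: reset the counter and close the circuit.
        self.timeouts = 0
        self.opened_at = None
        return result
```

A caller would wrap each database query in call(...). While the circuit is open, requests fail fast with CircuitOpenError instead of piling more work onto a database that is already struggling.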
Episode 2: I Suspect It Was the Microservice in the Data Center with a Saber
As we were building out our shiny new microservices architecture, we had a sudden and inexplicable (to us) downtime. PagerDuty started calling in the troops, and the team began poring over logs, CloudFront dashboards, and the Datadog dashboards we had created manually. After far too long, we determined that Saber, the microservice that supplied basic sports statistics, was severely underprovisioned for some newer call patterns.
We took advantage of the chokepoint that our circuit breaker provided and added generic instrumentation that supplied performance metrics to Datadog. This allowed us to create some standardized dashboards and metrics that make diagnosing these issues much quicker and easier from a developer standpoint.
We will show a brief demo of this generic dashboard and how we use it to uncover performance problems.
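The chokepoint idea can be sketched as a generic wrapper that reports the outcome and latency of every command that passes through it. This is a Python illustration under stated assumptions, not the framework's actual code: the instrumented helper, the emit callback, and the metric naming scheme are hypothetical, and the callback stands in for whatever metrics client forwards the numbers to a backend such as Datadog.

```python
import time


def instrumented(name, emit, clock=time.perf_counter):
    """Wrap a command so every call reports its outcome and latency.

    `emit(metric_name, elapsed_ms)` is a hypothetical callback supplied by
    the caller; no real metrics client API is assumed here.
    """
    def decorator(command):
        def wrapper(*args, **kwargs):
            start = clock()
            outcome = "success"
            try:
                return command(*args, **kwargs)
            except Exception:
                outcome = "error"
                raise
            finally:
                # Runs on both paths, so errors are timed as well.
                elapsed_ms = (clock() - start) * 1000.0
                emit(f"{name}.{outcome}", elapsed_ms)
        return wrapper
    return decorator
```

Because every dependency call is named and timed the same way, one dashboard template can cover all of them, which is the property the standardized dashboards rely on.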
Episode 3: Gronk Spiked Our Servers
We have a strange phenomenon at DraftKings. When a popular athlete scores, our users will rush to their phones to check the scores. Our traffic will go up six to eight times within seconds. We affectionately call this phenomenon a “Gronk Spike” in honor of the Patriots' tight end.
We went into the 2015 NFL season with a shiny new microservices architecture. Our load tests looked good and we went in with real pride in what we had accomplished. Then we got “Gronk’d.” Sam Bradford threw a touchdown pass to DeMarco Murray, and millions of people went to their phones to check their scores. Millions of customers’ balances were suddenly being looked up. Our finance database (look, we have more than one database now) was flooded with traffic and went down.
Step 1: Add more read replicas
Step 2: A bit more complicated
We created a custom thread pool with concurrency limits for our microservices that 503’d much more aggressively. The challenge with these 503s is that they would then trip our circuit breakers, which would in turn flip other circuit breakers and then the site would take a while to recover. We implemented custom concurrency limits in Ground Fault as well as custom headers that allowed the circuit breakers to ignore certain kinds of errors.
This section will show some conceptual sequence diagrams of how we manage concurrency.
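The interaction between load shedding and circuit breakers can be sketched as follows: a service rejects work beyond a concurrency limit with a marked 503, and the calling side declines to count that marked response as a failure. This is a Python sketch under assumptions, not the Ground Fault implementation. The header name X-Load-Shed, the ConcurrencyLimiter class, and the counts_as_failure helper are hypothetical; the source says only that custom headers let breakers ignore certain kinds of errors.

```python
import threading

SHED_HEADER = "X-Load-Shed"  # hypothetical header name, for illustration


class ConcurrencyLimiter:
    """Reject requests beyond `limit` concurrent handlers with a marked 503."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def handle(self, handler):
        # Never queue: if no slot is free, shed the request immediately.
        if not self._slots.acquire(blocking=False):
            return 503, {SHED_HEADER: "1"}
        try:
            return handler()  # handler returns (status, headers)
        finally:
            self._slots.release()


def counts_as_failure(status, headers):
    """Decide whether a response should count toward tripping a breaker."""
    if status == 503 and headers.get(SHED_HEADER) == "1":
        return False  # deliberate load shedding, not an unhealthy service
    return status >= 500
```

Treating a marked 503 as "busy" rather than "broken" keeps a traffic spike from cascading through upstream breakers, so the site can recover as soon as the spike passes.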
Epilogue: What is Ground Fault Today?
This section will include what the overall circuit breaker technology looks like today after a couple of years of evolution.