Tuesday will be remembered as the day the internet broke — before swiftly being fixed again. Early in the morning, websites including Amazon, Reddit, Spotify, Ebay, Twitch, Pinterest and, unfortunately, CNET went offline due to a major outage at a service called Fastly. Everywhere you looked, there were 503 errors and people complaining they couldn’t access key services and news outlets, demonstrating just how much of the internet relies on this largely unheard-of cloud computing service.
After an investigation into what happened, Fastly published a blog post explaining exactly what went down, and it turns out the whole incident was triggered by just a single, unnamed Fastly customer.
In mid-May, Fastly issued a software deployment containing a bug that, if triggered in specific circumstances, could take down vast swaths of its network. The bug lay dormant until June 8, when one Fastly customer inadvertently triggered it during a “valid configuration change,” which caused 85% of the company’s network to return errors.
“We detected the disruption within one minute, then identified and isolated the cause, and disabled the configuration,” said Fastly’s Senior Vice President of Engineering and Infrastructure Nick Rockwell in the blog post. “Within 49 minutes, 95% of our network was operating as normal. This outage was broad and severe, and we’re truly sorry for the impact to our customers and everyone who relies on them.”
What happened during the Fastly outage?
At around 2:58 a.m. PT, Fastly’s status update page noted an error, saying “we’re currently investigating potential impact to performance with our CDN [content delivery network] services.” Shortly thereafter, reports emerged on Twitter of major news publications including the BBC, CNN and The New York Times being offline. Twitter itself was still running, although the server that hosted its emojis went down, leading to some odd-looking tweets.
Rather than isolated incidents affecting individual sites, it turned out this was a massive outage that had brought much of the internet to its knees. Across the world, people were receiving Error: 503 messages as they tried to access sites, including some vital services, such as the UK government’s gov.uk web properties.
Almost an hour later, at 3:44 a.m. PT (6:44 a.m. ET, on the cusp of the US East Coast workday, and coming up on noon in the UK), Fastly updated its status page again to say the issue had been identified and a fix was being implemented. At 4:10 a.m. PT, the company tweeted: “We identified a service configuration that triggered disruptions across our POPs globally and have disabled that configuration. Our global network is coming back online.”