Facebook was hit by an outage on Monday that lasted almost six hours, not only making its platforms inaccessible to users but also preventing staff from accessing its internal network.
The company has blamed a “faulty configuration change”, and Mark Zuckerberg apologised to the 3.5bn people who were unable to use the WhatsApp, Instagram and Facebook platforms.
“Sorry for the disruption today – I know how much you rely on our services to stay connected with the people you care about,” he said.
So what went wrong?
The outage appears to have been caused by a networking issue, in particular an update which broke how the company advertises where its servers are to the internet using something called the Border Gateway Protocol (BGP).
The internet is an international network of computers, including the phones in our hands and the servers hosting enormous amounts of data in specialised warehouses around the world.
Our phones are able to access this data because routers pass their requests for information on to these warehouses, which the routers locate through BGP.
Facebook went down because the “faulty configuration change” meant that it stopped telling routers where its data centres were. To the routers, it appeared that they simply didn’t exist.
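To picture what that means, here is a deliberately simplified sketch of a router's table. It is a toy model, not how BGP is actually implemented: real BGP involves sessions between networks, path information and routing policy. The `ToyRouter` class, the "example-datacentre" name and the address range (one reserved for documentation) are all made up for illustration.

```python
import ipaddress


class ToyRouter:
    """A toy routing table: maps advertised address ranges to an origin."""

    def __init__(self):
        self.routes = {}  # network prefix -> name of whoever advertised it

    def advertise(self, prefix, origin):
        # Someone tells the router "you can reach this range through me".
        self.routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        # The advertisement is taken back; the router forgets the range.
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        # Find the most specific advertised range containing the address.
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # as far as this router knows, it does not exist
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


router = ToyRouter()
router.advertise("203.0.113.0/24", "example-datacentre")
before = router.lookup("203.0.113.10")

router.withdraw("203.0.113.0/24")
after = router.lookup("203.0.113.10")

print(before, after)
```

In this sketch the servers themselves are never switched off. Once the advertisement is withdrawn, the router simply has no entry that tells it where to send the request.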
How bad is that?
Normally it would be quite straightforward to fix this kind of outage – you start advertising where your servers are and routers begin connecting to you again.
Unfortunately, it seems that Facebook was using the same network for staff to access its systems remotely, meaning that the outage itself prevented them from fixing it.
It took down Facebook Workplace, the internal communications platform used by its staff, according to numerous reports.
Then, when staff tried to sign in to third-party apps to communicate, those who used the “Log In with Facebook” option discovered this was down too.
The only solution was to physically go to the data centres and refresh things from there – but there was a problem with this too, according to a now-deleted post on Reddit believed to have been posted by a Facebook staff member.
The access cards that Facebook staff use to physically enter its premises were also dependent on the internal systems working properly. The outage meant that they couldn’t authenticate at the doors and get inside.
Shouldn’t a big company have better processes?
It is by far the longest outage for a major platform in recent memory.
Perhaps there should have been better processes in place, but it can be hard to recognise single points of failure – especially when they’re based on such a core part of the networking infrastructure – until they fail.
Was it the result of a hack?
Despite the prominence of Facebook’s platforms when it comes to spreading most conspiracy theories, unfounded claims that those platforms were down as the result of a hack managed to spread without them.
At the same time as the outage there were some posts on underground hacking forums claiming to be offering user data stolen from Facebook, but most experts assess that these are opportunistic scams rather than real breaches.
Could it happen again?
One would imagine Facebook would introduce some new processes now to prevent this kind of cascade of errors in the future.
The series of failures, from the BGP outage itself through to the way it blocked staff from accessing internal communications systems, remote access, and even the buildings themselves, should probably result in Facebook separating these systems a bit more so they’re not all reliant on the same one.
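The cascade can be illustrated with a small sketch that works out everything that stops working when one shared system fails. The dependency map below is an assumption pieced together from the failures described in this article, not Facebook's real architecture, and the system names are invented labels.

```python
# Hypothetical map of "what relies on what", based only on the reported failures.
DEPENDS_ON = {
    "public platforms": ["route advertisements"],
    "remote access": ["route advertisements"],
    "internal communications": ["route advertisements"],
    "internal systems": ["route advertisements"],
    "door access cards": ["internal systems"],
    "route advertisements": [],
}


def affected_by(failed, depends_on):
    """Return every system that directly or indirectly depends on `failed`."""
    affected = set()
    changed = True
    while changed:  # keep sweeping until no new casualties are found
        changed = False
        for system, deps in depends_on.items():
            if system == failed or system in affected:
                continue
            if any(dep == failed or dep in affected for dep in deps):
                affected.add(system)
                changed = True
    return affected


down = affected_by("route advertisements", DEPENDS_ON)
print(sorted(down))
```

Under these assumptions, a single failure takes every other entry in the map down with it, including the door cards, which only depend on it indirectly. If even one of those systems relied on something independent, it would not appear in the result.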
Isn’t the timing a little coincidental?
Yes, it is, but crucially it is coincidental rather than linked.
Facebook whistleblower Frances Haugen is set to testify before US Congress today after leaking thousands of documents to both the Wall Street Journal newspaper and law enforcement.
She is set to explain how “over and over again” she saw “conflicts of interest between what was good for the public and what was good for Facebook. And Facebook, over and over again, chose to optimise for its own interests.”