Imagine you are working in the cloud on your latest Microsoft document, an essential task that needs to be finished as soon as possible, when suddenly you start to face issues… This is exactly what happened to Microsoft users last week.
On March 15th, Microsoft Azure cloud services suffered an outage of up to four hours, caused by a recent change to the authentication system in Azure Active Directory. At 7:00 pm Coordinated Universal Time (UTC), Microsoft users began to face authentication problems. These were not limited to cloud services: Teams, Exchange, and all Office products were also affected by the outage.
After one hour, the Microsoft team finally took action and started to work on the problem, and at around 9:05 pm UTC, Microsoft rolled back the latest changes. Even so, it took hours before all services were completely restored. The outage in cloud services and in Teams caused absolute frustration among the employees and workers who have been working online during the COVID-19 pandemic.
On the other hand, Microsoft Teams became a trending topic on Twitter, where the outage was celebrated by users and students who were happy that they didn’t have to work, attend online lectures, or submit assignments.
Products that were unavailable to Microsoft users during the outage:
- Microsoft Teams
- Microsoft Office
- Office-related web applications
- Microsoft Stream
- Xbox Gaming and Streaming Services
- Azure Active Directory
Microsoft software, which is used by businesses and students alike because of the coronavirus pandemic, was reported to be down by more than 25,000 people. Some 2,500 to 3,000 people also reported the Office outage. Even the premiere of the new ‘Justice League’ movie, directed by Zack Snyder, was delayed for hours due to the severe outage.
On the following day, Microsoft explained on Twitter that “most of their services” were working again, while the restoration of other services was still underway and being monitored. In a detailed report, Microsoft said, “Over the last few weeks, a particular key was marked as ‘retain’ for longer than usual to support a complex cross cloud migration. This exposed a bug where the automation incorrectly ignored that ‘retain’ state, leading it to remove that particular key.”
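To make the described bug easier to picture, here is a minimal Python sketch of a key clean-up job. It is not Microsoft's code: the `SigningKey` class, the `retain` field, the `keys_to_remove` function, the 30-day lifetime, and the timestamp are all hypothetical and chosen only for illustration. The `honor_retain` switch shows the difference between automation that respects the “retain” state and automation that ignores it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SigningKey:
    key_id: str
    created_at: datetime
    retain: bool = False  # hypothetical flag: keep this key past its normal lifetime


def keys_to_remove(keys, now, max_age, honor_retain=True):
    """Return the IDs of the keys a clean-up job would delete."""
    expired = [k for k in keys if now - k.created_at > max_age]
    if honor_retain:
        # Correct behaviour: a key marked "retain" is never removed automatically.
        expired = [k for k in expired if not k.retain]
    return [k.key_id for k in expired]


# Arbitrary example data (assumed values, not taken from the incident report).
now = datetime(2000, 1, 1, tzinfo=timezone.utc)
keys = [
    SigningKey("old-retained", now - timedelta(days=60), retain=True),
    SigningKey("old", now - timedelta(days=60)),
    SigningKey("fresh", now - timedelta(days=1)),
]

safe = keys_to_remove(keys, now, timedelta(days=30))
buggy = keys_to_remove(keys, now, timedelta(days=30), honor_retain=False)
print(safe)   # only the unretained expired key
print(buggy)  # the retained key is removed as well
```

With `honor_retain=False`, the retained key ends up on the removal list, which is the kind of behaviour the report describes.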
When the metadata about the keys changed at 7:00 pm UTC, the problem started to emerge, causing an outage that lasted for more than five hours. The problem was identified and a rollback was implemented, but recovery took longer than expected.
Microsoft further explained, “Applications need to pick up the rolled back metadata and refresh their caches with the correct metadata. Time to mitigation for individual applications varies due to a variety of server implementations that handle caching differently.”
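The effect of caching on recovery time can be sketched in a few lines of Python. This is only an illustration under assumed conditions: the `CachedKeyMetadata` class, the `fetch` function, the key names, and the 60-second and one-hour cache lifetimes are all hypothetical, and time is passed in as plain seconds to keep the example deterministic.

```python
class CachedKeyMetadata:
    """Caches key metadata and refetches it only after ttl_seconds have passed."""

    def __init__(self, fetch, ttl_seconds):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._value = None
        self._fetched_at = None

    def get(self, now):
        # Refresh on first use or once the cached copy is older than the TTL.
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


# Bad metadata is published: "key-1" has been removed by mistake.
published = {"keys": ["key-2"]}


def fetch():
    return list(published["keys"])


fast = CachedKeyMetadata(fetch, ttl_seconds=60)
slow = CachedKeyMetadata(fetch, ttl_seconds=3600)

# Both applications cache the bad metadata at t = 0.
fast.get(now=0)
slow.get(now=0)

# The provider rolls back: "key-1" is published again.
published["keys"] = ["key-1", "key-2"]

fast_view = fast.get(now=120)    # cache expired, sees the rollback
slow_view = slow.get(now=120)    # still serving the stale copy
slow_later = slow.get(now=3600)  # recovers only after its own TTL
```

Both applications read the same rolled-back metadata source, yet the one with the longer cache lifetime keeps failing until its cached copy expires, which is why the time to mitigation differed between applications.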