Fixing Record-Breaking Outages with Expert Incident Response

When the world woke up on Friday, July 19th, pigs seemed to be flying. For the second time in two years, planes were grounded, IT systems were down, and Fortune 500 corporations couldn't access their data. All they could access was the infamous Microsoft "blue screen of death," and people were panicking. What is now considered one of the largest IT outages in history was triggered by a faulty software update from security company CrowdStrike, affecting millions of systems around the world, including those of a few of our customers.

The Problem: Loss of Microsoft Azure Access

At Centre, we pride ourselves on providing and managing the best solutions the IT industry has to offer. Even more than that, we promise those solutions come with the best incident response to keep you online whenever an outage occurs. Unfortunately, during the unexpected and far-reaching CrowdStrike outage, one of our trusted solutions, Microsoft Azure, lost connectivity, and our team had to act fast to get customers back online. The only problem was that this outage stemmed from a nationwide issue, one that would take more than just our technicians to fix.

Nevertheless, our team went to work.

In total, six of our customers lost access to their cloud-based Microsoft Azure desktop systems. They experienced issues with multiple Azure services, including failures in service management operations and service availability.

Impacts to Business Operations

As the outage unfolded, our customers' Microsoft products were affected first, which directly impacted their Microsoft 365 and Azure desktops.

Customers with the faulty agent installed began to blue screen (BSOD, "Blue Screen of Death"), leaving them unable to access not just their data, but anything at all. For one customer, the entire domain server went down, leaving their hardware useless as well. Another customer had all of their servers affected, with an across-the-board BSOD that had to be cleared. Other customers faced datacenter outages, which disrupted business continuity and led to lost revenue and reduced productivity. At the root, customers couldn't operate without a timely fix.

Cloud solutions like Azure are excellent at two things: data management and data backups. But when your infrastructure goes down, what do you do? Who do you call? Do you have a backup plan in place? Who is managing your system? If you can't answer those questions, you're sunk the next time an outage occurs.

The Solution

  • Microsoft products were affected first, directly impacting Microsoft 365 and Azure. We identified the issue and, within 5 minutes of the first report, began communicating with the affected customers.
  • When one customer's entire domain server went down, Centre immediately began a restore to an uncompromised version of the server and had systems restored within 30 minutes of the auto-generated ticket.
  • Once the major outage was identified (before it was announced by CrowdStrike), we initiated our "After Hours Break Glass" protocol, which emailed and texted key CA resources to join an emergency bridge. While on the bridge, we identified all outage issues and began working through them.
  • The issues were broken into three categories: a 365 outage, the CrowdStrike outage, and a datacenter outage. Each was managed separately until all customers were verified fully operational.
  • We were able to mobilize management and key resources in record time, using automation and proven procedures. Our processes allowed us to stay organized and communicative with our customers.
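
For readers who like to see a process written down, here is a minimal sketch of the triage step described above: sorting tickets into the three categories and keeping the incident open until every customer is verified operational. Only the category names come from this post. The Ticket class, the function names, and the customer labels are hypothetical and are not Centre's actual tooling.

```python
from collections import defaultdict
from dataclasses import dataclass

# Category names are taken from the post; everything else is illustrative.
CATEGORIES = ("365 outage", "CrowdStrike", "Datacenter outage")


@dataclass
class Ticket:
    customer: str                       # hypothetical customer label
    category: str                       # one of CATEGORIES
    verified_operational: bool = False  # set once the customer confirms recovery


def group_by_category(tickets):
    """Return {category: [tickets]} so each outage can be managed separately."""
    groups = defaultdict(list)
    for ticket in tickets:
        if ticket.category not in CATEGORIES:
            raise ValueError(f"unknown category: {ticket.category}")
        groups[ticket.category].append(ticket)
    return dict(groups)


def open_customers(tickets):
    """Sorted customers with at least one ticket not yet verified operational."""
    return sorted({t.customer for t in tickets if not t.verified_operational})


def incident_closed(tickets):
    """The incident closes only when no customer is left waiting."""
    return not open_customers(tickets)
```

The design choice worth noting is that closure is checked per customer, not per category: a category can be resolved while a customer affected by two outages is still waiting on the other one.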

Learn More

Whether it's a record-breaking outage impacting multiple customers or one customer experiencing something that, to them, is a big problem, we don't discriminate. We're committed to making sure that our customers stay online and protected from future problems.

Want to learn more about how to keep your systems backed up and prepared for when disaster strikes? Check out our events page for our exclusive live Hot Tech Talk event, where we'll break down four major categories you need to have in place to stay online in the event of an outage.

Have questions? Feel free to contact us to get them answered. Talk soon!

Originally published on August 20, 2024

About the Author

Emily Kirk

Creative content writer and producer for Centre Technologies. I joined Centre after 5 years in Education where I fostered my great love for making learning easier for everyone. While my background may not be in IT, I am driven to engage with others and build lasting relationships on multiple fronts. My greatest passions are helping and showing others that with commitment and a little spark, you can understand foundational concepts and grasp complex ideas no matter their application (because I get to do it every day!). I am a lifelong learner with a genuine zeal to educate, inspire, and motivate all I engage with. I value transparency and community, so lean in with me. It's a good day to start learning something new!