Throughout my series of posts on the importance of a great backup strategy, there is a common theme: enterprises should ALWAYS plan for the worst-case scenario. If you missed the first installment, in which I shared several technology challenges that could have been avoided with a proper backup strategy, you can find it here.
As we continue this series of real-world enterprise stories, I will share another unexpected situation that good backups helped us survive, and highlight the lessons we learned along the way.
As we embark on the new year, with Valentine’s Day right around the corner, I find myself reminiscing about years past, and no, I am not thinking of flowers and chocolate. My memory is of the midnight phone call I received on the morning of February 14th, letting me know that users couldn’t access Microsoft Exchange and some applications in Citrix. As the person on call at the time, I powered up my laptop, logged in, and evaluated the situation. When I couldn’t use RDP to connect to some of the systems, it was time to leverage my other console-based tools to check on the servers.
At first, I thought maybe I was seeing things, or had even connected to the wrong system. Much to my surprise, multiple Exchange servers and Citrix servers were having a Windows desktop operating system deployed to them. Yes, you read that correctly: desktops overwriting servers. That in itself is a whole different conversation about automation security and ensuring something like that cannot happen. More importantly, in that moment we needed to get the desktop team on the phone to stop the chaos. Once they identified the security gap and closed it off, the remaining desktop deployments were stopped. As it turned out, a couple hundred servers were impacted before all was said and done.
This quickly became an “all hands on deck” situation, whether a team member was on call or not. We had a huge recovery effort to spearhead. First, we prioritized the list of servers based on the level of impact. Thankfully, many of these systems had well-maintained redundancy, so applications such as Citrix and Microsoft Exchange, while important, didn’t experience significant downtime; they just needed manual failovers where failover didn’t happen automatically. Once our prioritization was complete, it was time to validate that our backup and recovery strategy was a good one.
In this case, the backup strategy was very thorough. All impacted systems had backups, and by around 7 a.m. that morning, when the workday started, all critical systems were back online. While that doesn’t sound terrible given the events, this backup solution didn’t have mass restore capabilities, so the recovery process was still slower than it could have been. Beyond the critical systems, it took about a week to bring the less critical servers back online.
Another key takeaway from this situation is the value of a backup solution that can perform mass server recoveries, meaning more than one server can be restored at a time. That capability could have greatly improved the recovery time during this incident.
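To illustrate why mass restore matters, here is a minimal sketch of the idea in Python. Everything in it is hypothetical: `restore_server` stands in for whatever restore job your backup product would run for a given server, and is not a real product API. The point is simply that with a worker pool, total recovery time approaches the duration of the longest single restore rather than the sum of all of them.

```python
# Hypothetical sketch of parallel (mass) restores vs. one-at-a-time recovery.
from concurrent.futures import ThreadPoolExecutor

def restore_server(name: str) -> str:
    # Placeholder: in a real environment this would kick off the backup
    # product's restore job for the named server and wait for completion.
    return f"{name}: restored"

servers = ["exch01", "exch02", "ctx01", "ctx02"]  # example server names

# Sequential recovery: total time is the sum of every restore.
# Parallel recovery: several restores run at once, so total time is
# roughly the longest single restore (subject to backup-server throughput).
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(restore_server, servers))

print(results)
```

In practice the worker count would be tuned to what the backup infrastructure can sustain, but even a modest degree of parallelism would have shortened that week-long tail of less critical restores considerably.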
Stay tuned for next time, when I talk about why you should be backing up your Office 365 Exchange Online environment.