Postmortem: Liquid Web Outage

Overview

On June 10 and 11, 2024, our primary web hosting provider, Liquid Web, experienced a major outage that affected nearly all of our VMware cloud client servers, websites, backups, load balancers, and internal networking services. The outage impacted hundreds of clients, including MMG's own and various family office websites, applications, and services. Service degradation began on Monday, June 10, at approximately 4:30 AM CST, and all SAN services were down by 6:30 AM CST. Services were fully restored by 7:00 PM CST on Tuesday, June 11. We began notifying clients through our customer mailing list at approximately 7:30 AM CST on June 10 and sent email updates every 2-3 hours throughout Monday and Tuesday.

Liquid Web Status Updates During the Outage (most recent first):

  • June 11, 2024, 5:45 PM CST: MMG's cloud environments were restored, and engineering staff began bringing client websites back online shortly after. A final restoration email was sent to all clients at 7:00 PM CST.
  • June 11, 2024, 08:44 EDT: Services have been restored for most of Liquid Web's impacted customers; however, intermittent issues were observed during the restoration activities.
  • June 11, 2024, 06:56 EDT: Liquid Web collaborated with NetApp North America leadership and senior engineers, nearing a resolution path by mid-morning.
  • June 11, 2024, 03:17 EDT: Continuous efforts with NetApp North America leadership and senior engineers.
  • June 11, 2024, 00:00 EDT: Ongoing collaboration with NetApp North America leadership and senior engineers.
  • June 10, 2024, 22:07 EDT: Isolated issues to NetApp-based products and services, engaged multiple active bridges with senior NetApp engineering teams.
  • June 10, 2024, 19:33 EDT: Diligent work to resolve issues affecting VMware and SAN-based services.
  • June 10, 2024, 17:15 EDT: Continued work on resolving issues affecting VMware, Shared Load Balancer, and SAN-based services.
  • June 10, 2024, 15:42 EDT: Active resolution efforts on VMware, Shared Load Balancer, and SAN-based services.
  • June 10, 2024, 14:28 EDT: Continuous work by Systems and Network Engineering Teams.
  • June 10, 2024, 13:28 EDT: Ongoing efforts to resolve the issue.
  • June 10, 2024, 12:09 EDT: Identified the source of the service impact and began implementing a fix.
  • June 10, 2024, 06:42 EDT to 09:31 EDT: Investigation of the issue.
  • June 10, 2024, 06:23 EDT: Initial investigation of problems affecting VMware, CloudSites, and load balancers.

Lessons Learned and Future Actions

This outage showed us where our fault tolerance and redundancy need to improve. Although we had spent considerable time planning for such events, the incident exposed vulnerabilities we had not anticipated, and we are using them to strengthen our preparedness for future disruptions.

1. New Cloud Hosting Stack: We have decided to build a new cloud hosting stack outside of VMware to act as a disaster recovery point. This will ensure we have an additional layer of redundancy should our primary and secondary hosting infrastructure fail again.

2. Enhanced Fault Tolerance: This event highlighted the critical need for improved fault tolerance, particularly in accessing cloud storage. Our inability to access our cloud storage to pull files and databases was a significant failure point. The shutdown of Liquid Web's SAN was the primary choke point that prevented us from restoring services promptly.

3. Quick Deploy Landing Pages: We are creating landing pages that can be quickly spun up in the event of a total network loss. This will allow us to capture website traffic rather than displaying dead or blank pages for an extended period.

4. Premium DNS Services: For clients interested in enhanced redundancy, we offer premium DNS services from Cloudflare, our preferred DNS provider. Upgrading to premium DNS for an additional monthly fee allows us to instantly deploy emergency pages during an outage or cyber attack. MMG has implemented this for our own company and encourages concerned clients to consider this option.

5. Future Preparedness: Although we have spent years planning for outages, this incident exposed failure scenarios we had not planned for. We are using what we learned to strengthen our disaster recovery and fault tolerance strategies so that our services are more robust and resilient.
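
For readers curious what a quick deploy landing page (item 3 above) might look like in practice, here is a minimal sketch using only Python's standard library. It is an illustration under stated assumptions, not MMG's actual implementation: the site name, message, and one-hour retry interval are hypothetical placeholders. The page is fully self-contained (no external images, scripts, or stylesheets), so it can be served from any host that is still reachable, and it responds with HTTP 503 and a Retry-After header so that browsers and search crawlers treat the outage as temporary.

```python
from html import escape
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_emergency_page(site_name: str, message: str) -> bytes:
    """Render a small, self-contained HTML status page (no external assets)."""
    html = (
        "<!doctype html><html><head><meta charset='utf-8'>"
        f"<title>{escape(site_name)} is temporarily unavailable</title></head>"
        f"<body><h1>{escape(site_name)}</h1><p>{escape(message)}</p></body></html>"
    )
    return html.encode("utf-8")


class EmergencyHandler(BaseHTTPRequestHandler):
    # Hypothetical placeholder values; set these per client site.
    site_name = "Example Client Site"
    message = "We are experiencing a hosting outage and will be back soon."

    def do_GET(self):
        body = build_emergency_page(self.site_name, self.message)
        # 503 plus Retry-After signals a temporary outage rather than a dead site.
        self.send_response(503)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Retry-After", "3600")  # assumed value: retry in one hour
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Serve the emergency page on all interfaces until interrupted."""
    HTTPServer(("0.0.0.0", port), EmergencyHandler).serve_forever()
```

Calling serve() on a standby host starts the page on port 8080, which is an arbitrary choice for this example. Pointing a site's traffic at that host during an outage is a separate step, handled at the DNS or load balancer level.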

Conclusion

We appreciate your patience and understanding during this challenging time. Your support means a lot to us, and we are dedicated to continuously improving our services to ensure reliability and resilience. If you continue to experience any issues with your website or services, please reach out to our support team for assistance.

For clients interested in premium DNS services and quick deploy landing pages, please contact us for more information.

