When a massive global cloud service outage does not sound major alarms in world governments about cloud resilience failure, read this…
Following AWS’s fourth outage in five years for the US-East region, analysts from Forrester are reiterating other experts’ years-long calls for an overhaul of cloud resilience management.
With only a handful of major cloud service providers powering much of the world’s public web services, a single technical glitch or cyber incident can result in similar or even worse global disruption.
According to principal analysts Brent Ellis, Alla Valente, Lee Sustar, Devin Dickerson and Naveen Chhabra, this particular outage exposes core issues with cloud resilience that stem from overreliance on services such as DNS, which were not architected for cloud-era technology demands. It also highlights how concentration risk — a dangerously powerful yet routinely overlooked systemic risk — arises when so many firms across all industries become dependent on a single cloud provider and, more pertinently, a single region covered by that vendor.
The problem also goes beyond internal AWS regional dependencies, into the logical dependencies across the platform. DynamoDB, the first Amazon service identified as impacted by the DNS issues, plays a central role in other AWS services for analytics, machine learning, search, and more…
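To make the idea of logical dependencies concrete, the cascade can be treated as a reachability question over a dependency graph. The sketch below is illustrative only: the service names and edges are assumptions chosen to mirror the description above, not AWS's actual internal topology, and `impacted_by` is a hypothetical helper.

```python
from collections import deque

# Hypothetical edges: each key depends on the services in its list.
DEPENDS_ON = {
    "dynamodb": ["dns"],
    "analytics": ["dynamodb"],
    "ml-pipeline": ["dynamodb"],
    "search": ["analytics"],
    "static-site": [],
}

def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the graph: map each service to the services that depend on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed service.
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted

print(sorted(impacted_by("dns", DEPENDS_ON)))
# ['analytics', 'dynamodb', 'ml-pipeline', 'search']
```

In this toy graph a single DNS fault reaches four of the five services, even though only one of them depends on DNS directly.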
Other implications of poor cloud resilience
Convenience often wins out over the hard work of navigating the complex, nested dependencies in highly concentrated environments. Organizations that failed to address that complexity after past outages got a front-row seat as cascading issues disrupted systems, processes, and operations.
The entrenchment of (public) cloud services in modern enterprises, coupled with an interwoven ecosystem of SaaS services, outsourced software development, and virtually no visibility into dependencies, is not a bug: it is a feature of a highly concentrated risk where even small service outages can ripple through the global economy.
What tech leaders should do now
From a cloud resilience perspective, enterprise tech leaders have two lines of action they need to pursue now: first, build the tools to increase technology systems’ reliability; second, address contractual grey areas related to shared responsibility models with cloud (and SaaS) vendors.
On the technology side, the analysts recommend the following measures:
- Invest in infrastructure observability and analytics. It is the first line of defence for production systems, giving teams early visibility into outages so that they can respond with workarounds or alternative infrastructure. Otherwise, users will be relying on a cloud provider’s blog describing the outage when it has already taken down key operations.
- Build an infrastructure automation platform. In order to fix things as early as possible, all observability data and correlated analytics need to be connected to automation to respond while problems are still small and manageable. These capabilities converge in AIOps platforms, but each capability is independent and should be considered strategically. Third-party tools can give teams a bird’s-eye view of the overall cloud estate, especially in multi-cloud environments.
- Use content delivery networks to cache static content at edge locations, shielding it from outages. That will not be cheap, but neither is an outage that knocks down critical IT operations and leaves the corporation waiting helplessly.
- Develop application portability and additional clouds for key workloads. If you have a critical application, be ready to move it on a dime. This may mean a disaster recovery architecture to another region, cloud, or data center. It may involve investment in data resilience tools or replication technologies. The details will depend on specific application needs, and evaluating those needs should come from a well-designed risk management process. Focus investment on functions that affect customers, drive critical infrastructure, or move money.
- Test infrastructure and applications for resilience. Use chaos engineering tests to figure out how applications and the infrastructure beneath them fail, and design ways to avoid failure. Test disaster recovery and backups to make sure that key steps have not been missed; that the processes are clear; and, when it is a security-related matter, that all teams coordinate well with those responsible for securing enterprise systems and data. Supplement catastrophic ransomware-response tabletop exercises with workshops on how to withstand protracted outages or maintain transaction integrity during short-term disruptions.
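As a rough illustration of how the portability and chaos-testing measures above fit together, the following sketch picks the first healthy endpoint from a priority list and exercises that logic with a simulated regional outage. Everything here is an assumption for illustration: the endpoint URLs are placeholders, and `pick_healthy_endpoint` and `simulated_outage` are hypothetical helpers rather than part of any vendor tool. A real probe would issue a health-check request with a timeout.

```python
from typing import Callable, Optional, Sequence, Set

def pick_healthy_endpoint(
    endpoints: Sequence[str],
    probe: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint whose probe succeeds, in priority order."""
    for endpoint in endpoints:
        try:
            if probe(endpoint):
                return endpoint
        except Exception:
            # A probe that raises (timeout, DNS failure) counts as unhealthy.
            continue
    return None  # Nothing is reachable: escalate to manual procedures.

# Hypothetical endpoints: primary region, second region, own data center.
ENDPOINTS = [
    "https://app.us-east.example.com",
    "https://app.eu-west.example.com",
    "https://app.onprem.example.com",
]

def simulated_outage(down: Set[str]) -> Callable[[str], bool]:
    """Chaos-style fake probe that fails for every endpoint in `down`."""
    def probe(endpoint: str) -> bool:
        if endpoint in down:
            raise ConnectionError(f"{endpoint} unreachable")
        return True
    return probe

# Simulate losing the primary region and confirm traffic would move.
print(pick_healthy_endpoint(ENDPOINTS, simulated_outage({ENDPOINTS[0]})))
# https://app.eu-west.example.com
```

Injecting the probe as a parameter is a deliberate choice: the same selection logic can be run against real health checks in production and against simulated failures in a resilience test.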
For managing third-party risk in cloud and SaaS suppliers:
- Understand the limitations of regulations. The EU’s Digital Operational Resilience Act (DORA) is an attempt to improve resilience of critical infrastructure, but it has limitations as it is aimed more at the financial sector, and focuses on the responsibility of cloud customers while not emphasizing the role of hyperscalers in improving the core resilience of their systems. Organizations should not confuse being compliant with being resilient. Instead, identify the three main sources of risk, model scenarios, and create mitigation plans to minimize the pain of disruption. Also, focus on continuous testing of disaster recovery and operational plans under realistic conditions, complemented by coordinated cross-team efforts, especially in organizations with global offices.
- Map critical dependencies. Identify and map all third-party and cloud service dependencies to the technology assets and business processes they support. Focus on customer-facing apps and single points of failure and hidden connections that could amplify outage impacts. Do not settle for web documentation: Insist that all cloud technical account managers walk the firm through the specifics of their environments.
- Reevaluate third-party risk strategies and approaches. If a third-party risk management program is overly focused on compliance, firms may miss significant events that impact even compliant vendors. Tech leaders cannot afford to overlook assessing the vendor against multiple risk domains such as business continuity and operational resilience, not just cybersecurity. Tech leaders also need to map their third-party ecosystem to identify significant concentration risk among vendors, especially those that support critical systems or processes.
- Use the contract as a risk mitigation tool. With major technology outages becoming all too common, work with procurement and legal to update or add clauses that assign accountability during disruptive events and clearly outline time frames for vendors to patch and remediate. Consider using such incidents and their impacts as a basis for implementing measures in contracts or service-level agreements. Be prepared to bargain for financial compensation or discounts for downtime. If vendors push back, consider whether it still makes sense to do business with them at all.
- Prioritize continuous monitoring and corrective action. When organizations identify third-party risks but do not act on them, risk management efforts stagnate or fail. All third parties are dynamic entities, and their risk, compliance, and resilience posture will change over time. Augment annual assessments with continuous monitoring tools that can identify vendor changes in real time. Close the loop with third-party risk management program platforms that automate issue creation, launch remediation plans, and trigger notifications required for approval and verification that the risk has been addressed within risk-appetite or regulatory requirements.
- Test vendor resilience plans. Require validation of vendors’ recovery and continuity plans through tabletop exercises and outage simulations.
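The dependency-mapping and concentration-risk steps above can be prototyped with very little tooling. The sketch below is a minimal example under stated assumptions: the inventory, vendor names, and the 50% threshold are all hypothetical, and `concentration` and `flag_concentration` are illustrative helpers. A real program would draw this data from a CMDB or a third-party risk management platform.

```python
from collections import defaultdict

# Hypothetical inventory: business process -> (vendor, provider, region).
INVENTORY = {
    "checkout": [
        ("payments-saas", "aws", "us-east-1"),
        ("orders-db", "aws", "us-east-1"),
    ],
    "customer-login": [("identity-saas", "aws", "us-east-1")],
    "reporting": [("warehouse", "gcp", "europe-west1")],
    "email": [("mail-saas", "azure", "westeurope")],
}

def concentration(inventory):
    """Map each (provider, region) to the business processes that rely on it."""
    exposure = defaultdict(set)
    for process, dependencies in inventory.items():
        for _vendor, provider, region in dependencies:
            exposure[(provider, region)].add(process)
    return dict(exposure)

def flag_concentration(inventory, threshold=0.5):
    """Return provider regions that at least `threshold` of processes depend on."""
    total = len(inventory)
    return {
        key: sorted(processes)
        for key, processes in concentration(inventory).items()
        if len(processes) / total >= threshold
    }

print(flag_concentration(INVENTORY))
# {('aws', 'us-east-1'): ['checkout', 'customer-login']}
```

Note that two of the three dependencies flagged here are SaaS vendors rather than direct cloud contracts, which is the kind of hidden connection that web documentation alone is unlikely to reveal.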
Global cloud reassurance framework sorely needed
Beyond the above recommendations from the Forrester analysts, several macro-level measures, especially at government and regulatory levels, can help industries strengthen a comprehensive resilience strategy:
- Institute strong governance frameworks and institutional policies that embed resilience as an organizational imperative
- Mandate better management of configuration drift, and of the balance between resilience investments and cost efficiency, both of which are vital for sustainable operations
- Set regulatory imperatives for hybrid and multi-cloud adoption options to reduce vendor lock-in and systemically prevent single points of failure
- Monitor data sovereignty and complex compliance landscapes constantly as sources of regulatory and geopolitical risk
World governments and cloud policy leaders should heed the current clarion call and urgently move beyond fragmented, sector-specific guidance to establish a robust, multi-stakeholder global cloud resilience framework.
Such a globally unified cloud reassurance framework should harmonize regulations, clarify oversight responsibilities, promote transparency and trust, and incentivize rigorous, ongoing resilience testing across current and future public cloud ecosystems.
Without coordinated international action and authoritative governance, the growing concentration and criticality of cloud services pose systemic risks to economies and societies worldwide.
Current fragmented regulatory efforts — while important — are insufficient to drive the transformative, comprehensive resilience that modern cloud-reliant infrastructures demand. This calls for an urgent, unified master plan in cloud governance and operational resilience to proactively manage risks and ensure trust in the digital foundations underpinning global commerce and public services.