Welcome to another edition of the Outage Roundup newsletter.
Headlines
AWS Middle East Drone Strike Outage
As of this writing, AWS has confirmed that both its Middle East (UAE) Region (me-central-1) and its Middle East (Bahrain) Region (me-south-1) have been affected by drone strikes in the ongoing Middle East conflict.
AWS recommends that customers with workloads running in the Middle East migrate them to alternate regions, although this may be challenging for customers, including other vendors, with data residency requirements. Slack and its parent company Salesforce are two such vendors; Salesforce has taken steps to create secondary backups of UAE-based customer data in its Sweden region.
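For AWS customers weighing the migration advice above, here is a minimal sketch of one way to copy data out of an affected region with boto3; the snapshot ID and region names are placeholders, and a real migration involves far more than copying snapshots.

```python
# Minimal sketch, not AWS's recommended procedure: copy an EBS snapshot out of
# an affected Middle East region into eu-north-1 (Stockholm) using boto3.
# The snapshot ID and regions below are placeholders.
import boto3

def copy_snapshot_out_of_region(snapshot_id: str,
                                source_region: str = "me-central-1",
                                target_region: str = "eu-north-1") -> str:
    """Copy an EBS snapshot to another region; return the new snapshot ID."""
    # copy_snapshot is called against the *destination* region's EC2 endpoint.
    ec2 = boto3.client("ec2", region_name=target_region)
    response = ec2.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id} from {source_region}",
    )
    return response["SnapshotId"]

# Example usage (placeholder snapshot ID):
# new_id = copy_snapshot_out_of_region("snap-0123456789abcdef0")
```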
AWS’s official statements:
The AWS Christmas Eve outage of 2012 that affected Netflix brought widespread publicity to the perils of depending on a single region (us-east-1 in that case). The ongoing outage is probably the first to add wartime attacks to the risk register.
Widespread LLM Issues with Downstream Effects
The last two weeks have not been great for LLM providers in terms of reliability. Among the most affected were Anthropic's Claude, OpenAI, and Character AI.

In these last two weeks, Anthropic experienced a total of 40 major and minor incidents, and around 402 over the past year. As of this writing, it has not yet recovered from issues with Claude Opus 4.6.
Problems with Claude Haiku 4.5 also caused downstream errors in GitHub Copilot on February 23, lasting more than an hour. Similarly, problems with Anthropic's models on two consecutive days disrupted Cursor IDE users' workflows.
This was a few days after GitHub Copilot users experienced degraded availability for the GPT-5.1-Codex model in Copilot Chat, VS Code, Visual Studio, and JetBrains integrations on February 20.
On March 1, OpenAI began investigating increased authentication failures affecting some ChatGPT users. A mitigation was applied at 07:29 UTC and all impacted services fully recovered by 02:08 UTC on March 2 - a total incident window of approximately 20 hours.
As AI-assisted coding services embed multiple third-party model providers, their failure surface area is multiplying. Attributing outages to upstream (LLM) vendors in incident notices is becoming common, and those dependency patterns are worth studying in order to build redundancy into such tools.
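To make the redundancy point concrete, below is a minimal, hypothetical sketch of provider-level fallback in an AI coding tool; the provider callables are placeholders, not any vendor's actual SDK.

```python
# Sketch of provider-level redundancy: try a list of model backends in order
# and fall back when one is degraded. The provider names and callables are
# illustrative assumptions only.
from typing import Callable, Sequence

class AllProvidersFailed(Exception):
    pass

def complete_with_fallback(prompt: str,
                           providers: Sequence[tuple[str, Callable[[str], str]]]) -> str:
    """Try each (name, call) pair in order; return the first successful response."""
    errors = []
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as exc:  # timeouts, 5xx errors, rate limits, etc.
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))

# Usage with hypothetical provider wrappers:
# reply = complete_with_fallback(
#     "Refactor this function",
#     [("anthropic", call_claude), ("openai", call_gpt)],
# )
```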
Supabase Blocked in India
On February 24, Supabase began receiving reports from users in India who were unable to connect to their projects. Investigation confirmed that ISP DNS servers in India were not returning correct responses for Supabase domains; Supabase's own infrastructure remained fully operational. Suggested workarounds included switching to alternative DNS resolvers (Cloudflare 1.1.1.1, Google 8.8.8.8) or using a VPN - all of which are impractical for end users.
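As an illustration of the resolver workaround, this sketch (assuming dnspython and a placeholder project hostname) compares answers from the system resolver with Cloudflare's and Google's public resolvers.

```python
# Sketch of the resolver workaround: query the same hostname against the
# system resolver and against public resolvers to see which ones answer.
# Requires dnspython; the hostname is a placeholder, not a real project.
import dns.resolver

PUBLIC_RESOLVERS = {"cloudflare": "1.1.1.1", "google": "8.8.8.8"}
HOSTNAME = "your-project-ref.supabase.co"  # placeholder

def query(hostname: str, nameserver: str | None = None) -> str:
    resolver = dns.resolver.Resolver(configure=nameserver is None)
    if nameserver:
        resolver.nameservers = [nameserver]
    try:
        answers = resolver.resolve(hostname, "A")
        return ", ".join(rr.to_text() for rr in answers)
    except Exception as exc:  # NXDOMAIN, timeout, SERVFAIL, etc.
        return f"lookup failed: {exc}"

print("system resolver:", query(HOSTNAME))
for name, ip in PUBLIC_RESOLVERS.items():
    print(f"{name} ({ip}):", query(HOSTNAME, ip))
```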
India is Supabase's 4th largest market, and the block affected numerous developers, startups, and production apps. There is still no official explanation for the block, and the incident remains active as of this writing.
On February 27, users in India also experienced packet loss and degraded networking performance across multiple compute and edge delivery services in Akamai's India region, caused by a problem with a third-party service provider.
Railway — Edge Network DNS (All Regions)
Railway experienced a series of networking disruptions between February 18 and 21 caused by repeated DDoS attacks, compounded by degraded upstream network capacity from a fiber cut. Part of the mitigation involved engaging Fastly to roll out a WAF (Web Application Firewall) for all Railway customers.
The combination of repeated attacks, upstream vendor failures for DDoS mitigation, Cloudflare's BGP prefix outage (which also affected Railway), and side effects of rolling out the WAF under emergency conditions resulted in prolonged disruption for users. This follows earlier Railway outages caused by Cloudflare.
Every outage is an opportunity to learn and make things better - and Railway demonstrates this in their detailed write-up outlining preventative measures for such occurrences.
“Blameless culture originated in the healthcare and avionics industries where mistakes can be fatal. These industries nurture an environment where every 'mistake' is seen as an opportunity to strengthen the system.” - The SRE Book
GitHub — Pull Requests Dashboard
On March 2, GitHub users could not filter pull requests on the /pulls dashboard. GitHub identified a mitigation and deployed a fix, estimating full rollout across all regions within 60 minutes. The incident was resolved at 22:04 UTC after approximately three hours.
Outages By Service Type in the Last 2 Weeks

In Case You Missed It…
The Definitive AWS Outage Report 2025: Reliability Analytics and Cascade Impact analyzes AWS reliability in 2025 in detail and drills down into the cascading effects of last year's October 20 outage.
Did You Know? It will soon be 10 years since the Dyn DNS service suffered one of the most impactful DDoS attacks ever, measured by the number of other services and websites it took down. The attack was carried out using a botnet of IoT devices infected with the Mirai malware.
Till next time,
Hrish from Outage Roundup
Visualizations and tabular data in this newsletter are derived from IncidentHub’s third-party status monitoring. IncidentHub monitors status pages of hundreds of SaaS and Cloud services.


