Continued runtime degraded performance

Incident Report for Estuary

Resolved

This incident has been resolved.

Posted Mar 15, 2025 - 23:09 UTC

Identified

Starting yesterday, we've seen intermittent but escalating DNS resolution failures in the primary public data-plane, which runs in a Google Kubernetes Cluster. We've traced the problem back to low-level Google-managed components within that cluster and have been engaging their support. Unfortunately, their recommendations so far have made the problem worse today, and we're currently seeing further-elevated task errors due to DNS resolution failures. We're escalating as much as we can and going back and forth with Google support. This has been frustrating because we're fairly beholden to them, given how deep the issue is within the bowels of the Google-managed environment, but we'll provide updates as we can.

Private data-planes, as well as the EU data-plane, are not affected. We've been planning a migration off of this legacy Kubernetes cluster to our new data-plane infrastructure, but unfortunately aren't yet in a position to kick it off.

The issue is ongoing for a very small percentage of tasks (less than 5%), and Google’s fixes have not fully resolved it.

Posted Mar 15, 2025 - 01:44 UTC

This incident affected: Runtime.