Operating Gateway API in Production: What the Migration Guides Don't Cover

You migrated. Traffic is flowing. ReferenceGrants are in place. The controller reconciliation loop is clean. And then — quietly, without a single alert firing — things start breaking in ways your observability stack was never built to see.

Most Gateway API migration guides end at cutover. That is the wrong place to stop. The real operational surface of Gateway API production begins exactly where those guides close — and it is governed by a different set of failure physics than anything Ingress introduced.

The thesis is explicit: Gateway API doesn't just change how traffic is routed. It changes where routing failures live — and how invisible they become.


The Gap Nobody Talks About

Part 0 was the decision. Part 1 was the shift. Part 2 was the migration. Part 3 is the reality.

When you ran Ingress, failures were infrastructure-visible. A misconfigured annotation broke routing and your logs showed it. A missing backend returned a 502 and your alerting fired. The failure surface was shallow and legible.

Gateway API moves routing failures into the decision layer. HTTPRoutes can be accepted by the controller — syntactically valid, status condition green — while silently misrouting traffic. ReferenceGrants can be deleted during a routine namespace cleanup with no downstream alert. Header matching logic from the annotation era doesn't translate 1:1, and the mismatch produces no error. It just routes incorrectly.

This is not a tooling gap. It is an architectural one.
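To make "accepted but broken" concrete, here is roughly what the status stanza of such a route looks like (route, gateway, and controller names are illustrative): the controller reports the route as accepted while the backend reference is unresolved.

```yaml
# Hypothetical HTTPRoute status: the route is Accepted, yet its
# backendRef does not resolve. Traffic has nowhere to go, and no
# controller-level metric reflects it.
status:
  parents:
    - parentRef:
        name: shared-gateway
        namespace: infra
      controllerName: example.io/gateway-controller
      conditions:
        - type: Accepted
          status: "True"
          reason: Accepted
        - type: ResolvedRefs
          status: "False"
          reason: RefNotPermitted   # e.g. a deleted ReferenceGrant
```

Everything your controller dashboard shows is the first condition. Everything that matters in production is the second.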


Observability: What Changes After Gateway API

Ingress failures were infrastructure-visible. Gateway API failures are decision-layer invisible.

Understanding what your monitoring stack actually covers requires mapping it against three distinct layers:

Layer 1 — Controller Metrics (What You Get)

Standard Prometheus scraping covers the controller layer. Reconciliation loop latency, controller health, memory and CPU. This is the layer most teams think of as "Gateway API observability" — and it is the least useful layer for diagnosing production routing failures. A healthy controller reconciliation loop tells you nothing about whether the routing decision it produced is correct.

Layer 2 — Spec State (What You Miss)

HTTPRoute status fields are not surfaced by default in most monitoring stacks. The conditions you need to be watching, Accepted and ResolvedRefs (reported per parent under status.parents), exist in the Kubernetes API but require explicit instrumentation. A route reporting Accepted: True alongside ResolvedRefs: False will route requests to nothing, and your controller metrics will show green the entire time.
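One way to close this gap is to export route conditions as metrics and alert on them. The rule below is a sketch for the Prometheus Operator; the metric name is an assumption, since HTTPRoute conditions must first be exposed somehow (for example via kube-state-metrics' custom-resource-state configuration).

```yaml
# PrometheusRule sketch. gatewayapi_httproute_status_condition is a
# hypothetical metric name: you must expose HTTPRoute conditions
# yourself (e.g. kube-state-metrics custom-resource-state) before
# this expression returns anything.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: httproute-conditions
spec:
  groups:
    - name: gateway-api
      rules:
        - alert: HTTPRouteRefsUnresolved
          expr: gatewayapi_httproute_status_condition{type="ResolvedRefs",status="False"} > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "HTTPRoute {{ $labels.namespace }}/{{ $labels.name }} has unresolved backend refs"
```

Treat this alert with the same severity as a 5xx spike: it is the decision-layer equivalent.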

Layer 3 — Runtime Behavior (What Actually Matters)

Routing outcomes, backend selection, header and path matching decisions. 200 OK is the new 500: a request that returns a success status from the wrong backend is operationally identical to a silent outage. Runtime behavior requires traffic-level instrumentation — service mesh telemetry, eBPF-based flow data, or access log enrichment — to become visible.

Your monitoring stack sees the controller. It does not see the routing decision.

Policy Enforcement at the Gateway Layer


Gateway API introduces routing-level trust boundaries, not just network boundaries. The real shift is temporal:

  • NetworkPolicy → Packet-level, always-on
  • OPA / Gatekeeper / Kyverno → Admission-time, pre-deploy
  • Gateway API → Runtime routing authorization, request-time

ReferenceGrant is not configuration. It is a security boundary.

A ReferenceGrant deletion — which can happen silently during namespace cleanup, RBAC rotation, or automated resource pruning — immediately collapses cross-namespace routing trust. There is no deprecation window. Traffic stops reaching its backend, and the only signal is a ResolvedRefs: False condition that most teams aren't alerting on yet.
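For reference, a minimal ReferenceGrant looks like this (namespaces and names are illustrative). Note where it lives: in the backend's namespace, which is exactly why a cleanup of that namespace takes routing down with it.

```yaml
# Lives in the *backend* namespace (here: payments). Deleting it, or
# the namespace around it, silently revokes cross-namespace routing
# from the web namespace.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-routes-from-web
  namespace: payments
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: web
  to:
    - group: ""        # core API group, i.e. Service
      kind: Service
```

The grant is invisible from the consuming namespace's point of view, which is what makes its deletion so easy to miss.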


The Day-2 Failure Patterns

These are not edge cases. These are the failures teams discover in the first 30–60 days of production.


Failure Mode 01 — Route Accepted, Traffic Misrouted
Accepted: True means valid configuration — not correct behavior. Backend weight misconfiguration, path prefix overlap, or header match ordering errors produce accepted routes that route to the wrong destination. No alerts fire. Traffic just goes somewhere wrong.

Failure Mode 02 — Cross-Namespace Trust Collapse
ReferenceGrant deleted during routine cleanup. Cross-namespace routing immediately fails. The backend is healthy, the controller is healthy, the HTTPRoute status goes ResolvedRefs: False and traffic stops. Recovery requires manual ReferenceGrant reconstruction.

Failure Mode 03 — Header Routing Regression
Annotation-era header logic doesn't translate 1:1 to HTTPRoute match semantics. The route is accepted, the match appears correct in the spec, and the wrong backend receives traffic silently.
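To make the mismatch concrete (names and values here are illustrative): annotation-era header routing was frequently regex- or substring-based depending on the controller, while HTTPRoute's default header match type is Exact. A sketch of the trap:

```yaml
# HTTPRoute header match. The default match type is Exact, so a value
# an annotation-era regex would have matched (say "mobile-v2") no
# longer matches here; the request silently falls through to the
# catch-all rule below.
rules:
  - matches:
      - headers:
          - name: x-client-type
            value: mobile          # type: Exact is the default
    backendRefs:
      - name: mobile-backend
        port: 8080
  - backendRefs:                   # catch-all receives the miss
      - name: default-backend
        port: 8080
```

RegularExpression header matching exists in the spec but is implementation-specific; verify your controller supports it before relying on it.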

Failure Mode 04 — Controller Version Skew
Gateway API evolves faster than most controller upgrade cycles. HTTPRoutes that reference unsupported features are accepted but silently not enforced — the spec says it should work, the controller says nothing, and behavior is undefined.

Failure Mode 05 — TLS Cert Rotation Gap
cert-manager and Gateway API have different mental models of certificate binding. Rotation timing mismatches produce TLS termination failures that appear as backend connectivity issues — not certificate errors — in most monitoring stacks.
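The binding between the two models looks roughly like this (issuer and hostnames are illustrative, and cert-manager's Gateway support is gated behind its ExperimentalGatewayAPISupport feature flag): cert-manager watches the annotated Gateway and maintains the Secret the listener references. Two controllers, one Secret, and rotation timing lives in the gap between them.

```yaml
# Sketch of cert-manager <-> Gateway binding. cert-manager creates and
# rotates the referenced Secret; the gateway controller reloads it on
# its own schedule. A timing mismatch surfaces as backend connectivity
# errors, not certificate errors.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: edge-gateway
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  gatewayClassName: example-class
  listeners:
    - name: https
      port: 443
      protocol: HTTPS
      hostname: app.example.com
      tls:
        mode: Terminate
        certificateRefs:
          - name: app-example-com-tls   # Secret cert-manager maintains
```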


Multi-Cluster and Multi-Tenant Considerations

Gateway API simplifies single-cluster routing. It complicates multi-cluster ownership.

The fundamental shift at multi-tenant scale: the problem is no longer routing. The problem is who is allowed to define routes.

Gateway-per-team is the operationally cleaner model for most enterprises — blast radius is contained, ReferenceGrant surface is minimal. The shared Gateway model reduces resource overhead but introduces a ReferenceGrant audit problem at scale that platform engineering needs to own, not application teams.
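If you do run a shared Gateway, make the tenancy boundary explicit on the listener rather than implicit in convention. A sketch (label key and class name are assumptions):

```yaml
# Shared-Gateway sketch: the listener only admits routes from
# namespaces carrying the tenancy label, so route ownership is
# enforced at the Gateway, not by team discipline.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: shared-gateway
  namespace: infra
spec:
  gatewayClassName: example-class
  listeners:
    - name: http
      port: 80
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              gateway-access: granted
```

Who gets to apply that label then becomes the real access-control question, and it belongs to platform engineering.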

Cross-cluster route federation remains experimental. Model it as beta operationally, regardless of what the controller documentation claims.


The Real Problem

Teams think they migrated an ingress layer. What they actually introduced is a new control plane.

This is the thread that runs through the entire series. The control plane shift isn't a Gateway API phenomenon — it is the defining architectural pattern of this infrastructure era. Every layer that used to be configuration is now a control plane: service meshes, policy engines, GitOps operators, and now routing.

The teams that operate Gateway API well in production are not the ones with the best controllers. They are the ones that rebuilt their observability model before they needed it.

Gateway API doesn't fail loudly. It fails in decisions your tooling doesn't see.


Architect's Verdict

Part 0 was the decision. Part 1 was the shift. Part 2 was the migration. Part 3 is the reality — and the reality is that Gateway API production operations require a fundamentally different observability model, a new policy enforcement layer, and an audit discipline that didn't exist when you were running Ingress.

DO:

  • Treat Gateway API as a control plane layer — instrument routing decisions, not just traffic
  • Alert on HTTPRoute status conditions — ResolvedRefs: False is a production incident
  • Audit ReferenceGrants continuously — treat deletions as security boundary changes, not cleanup
  • Pin controller versions to the Gateway API channel they implement — track skew explicitly
  • Own the ReferenceGrant audit function at the platform engineering layer

DON'T:
  • Assume Accepted: True means working — it means syntactically valid configuration
  • Treat migration as completion — cutover is the start of the operational surface, not the end
  • Let controller behavior drift from spec assumptions
  • Port Ingress annotation logic directly to HTTPRoute without verifying match semantics
  • Trust cross-cluster Gateway API federation claims without verifying your controller's implementation channel
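One way to operationalize the ReferenceGrant discipline above is an admission policy that refuses casual deletion. A sketch using Kyverno (assuming a recent Kyverno release that supports operation matching in match blocks; the policy name and message are illustrative):

```yaml
# Kyverno sketch: deny ReferenceGrant deletions at admission time,
# forcing them through an explicit platform-review exception rather
# than routine namespace cleanup.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: guard-referencegrant-deletion
spec:
  validationFailureAction: Enforce
  rules:
    - name: block-unreviewed-deletes
      match:
        any:
          - resources:
              kinds:
                - ReferenceGrant
              operations:
                - DELETE
      validate:
        message: "ReferenceGrant deletion changes a routing trust boundary; route it through platform review."
        deny:
          conditions:
            any:
              - key: "{{ request.operation }}"
                operator: Equals
                value: DELETE
```

Pair the policy with an exception mechanism for sanctioned changes, or legitimate teardown work will route around it.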

Architecture diagrams and full failure mode breakdown at rack2cloud.com



Source: dev.to
