Istio Ambient: Simplifying Service Mesh in UDS Core

Micah Nagel
Unicorn Engineer


UDS Core provides a FOSS secure runtime platform for mission-critical applications. Some of the capabilities provided by UDS Core include logging, monitoring, SSO, and a service mesh. Istio has always been a key part of this platform stack to fulfill our service mesh needs (dating back to the DoD Enterprise DevSecOps Reference Design). However, its traditional "sidecar" approach can be resource-intensive and complex, especially in edge environments.

Istio's Ambient mode promises lower resource requirements, simplified operations, and improved latency - all significant benefits for our platform, particularly as we expand our edge capabilities. In UDS Core, we've been incrementally building out support for Istio Ambient, and as of release v0.42.0 all of the core workloads run in ambient mode, with an opt-in approach for applications deployed on the UDS platform.

Background: Service Mesh and Sidecar vs Ambient

Istio is a service mesh designed to provide secure, encrypted communications between workloads. For UDS Core, this is critical to satisfy security and compliance controls (NIST 800-53) around encryption in transit, and to align with the DoD Zero Trust Strategy. Historically, Istio configured and provided this security via per-pod sidecars running Envoy proxies. While these provide the necessary security, each sidecar proxy comes with nontrivial resource overhead, complexity, and operational maintenance.

Istio first announced a new approach to service mesh, "Ambient mode", in September 2022, and it reached general availability in November 2024. Ambient mode takes an alternative approach, using a shared, per-node proxy called ztunnel that handles Layer 4 processing. For more advanced processing at Layer 7, Istio Waypoints can optionally be added as needed. This approach maintains the same standards of encryption and security, but reduces the overhead significantly by eliminating per-pod proxies.
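As a rough illustration of the L4-by-default, L7-optional model, the sketch below builds a Namespace manifest using the upstream Istio labels `istio.io/dataplane-mode` and `istio.io/use-waypoint`. The helper name, namespace name, and waypoint name are hypothetical, and this shows the upstream Istio mechanism rather than how UDS Core enrolls workloads.

```python
def ambient_namespace(name, waypoint=None):
    """Build a Namespace manifest enrolled in Istio ambient mode.

    `name` and `waypoint` are illustrative values, not UDS Core defaults.
    """
    # This label asks Istio to route the namespace's pods through ztunnel (L4).
    labels = {"istio.io/dataplane-mode": "ambient"}
    if waypoint is not None:
        # Optional: send traffic through a named waypoint for L7 processing.
        labels["istio.io/use-waypoint"] = waypoint
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name, "labels": labels},
    }


l4_only = ambient_namespace("demo-app")
with_l7 = ambient_namespace("demo-app", waypoint="waypoint")
```

Note that neither variant changes the pod spec: enrollment is metadata on the namespace, which is why no sidecar container is injected.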

|                            | Sidecar Mode                                   | Ambient Mode                                      |
|----------------------------|------------------------------------------------|---------------------------------------------------|
| Deployment Model           | Per-pod Envoy sidecar                          | Per-node ztunnel (optional Waypoints for L7)      |
| Resource Overhead          | High (scales with number of pods)              | Low (proxies shared at node level)                |
| Latency Impact             | Moderate to high                               | Lower (fewer processing hops)                     |
| Operational Complexity     | High (sidecar injection, lifecycle management) | Lower (transparent to pods, simpler onboarding)   |
| Layer 4/Layer 7 Processing | Full L4/L7 by default (sidecar)                | L4 always (ztunnel), L7 optional (Waypoints)      |

Istio's documentation provides a full comparison of these two service mesh modes, along with more details on Layer 4 and Layer 7 processing, if you'd like to read more.

Why Ambient for UDS Core?

The main driver for adopting Istio's ambient mode was the reduced resource footprint. As we continue to focus on providing more mission impact, we've been expanding our offering at the tactical edge. One common characteristic we see of "edge devices" is limited resources, especially when compared to the "infinite scalability" of workloads in the cloud. In order to validate some of our assumptions about ambient benefits, we gathered some initial metrics with a test application.

Based on this initial evaluation we saw significant resource reductions, especially with larger pod counts. In our testing each sidecar required an average of 8 CPU millicores and 40Mi of memory. In ambient we saw roughly the same resource cost per ztunnel proxy. However, since the ztunnel proxy is a daemonset rather than a sidecar, this cost was per-node rather than per-pod, so the resource overhead was only a fraction of sidecar mode (see full metrics below based on a deployment of the full UDS Core platform).
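To make the per-pod versus per-node difference concrete, here is a back-of-the-envelope sketch using the per-proxy averages quoted above (8 CPU millicores, 40Mi). The pod and node counts are hypothetical inputs, the ztunnel cost is assumed to be exactly equal to a sidecar's, and optional Waypoints are not counted.

```python
# Per-proxy averages quoted in the text above.
PROXY_CPU_M = 8    # CPU millicores per proxy
PROXY_MEM_MI = 40  # memory (Mi) per proxy


def proxy_overhead(pods, nodes):
    """Estimate total proxy overhead in each mode as (cpu_m, mem_mi).

    Sidecar mode runs one proxy per pod; ambient mode runs one ztunnel
    per node. Assumes ztunnel costs the same as a sidecar.
    """
    sidecar = (pods * PROXY_CPU_M, pods * PROXY_MEM_MI)
    ambient = (nodes * PROXY_CPU_M, nodes * PROXY_MEM_MI)
    return {"sidecar": sidecar, "ambient": ambient}


# Hypothetical cluster: 100 meshed pods spread over 5 nodes.
estimate = proxy_overhead(pods=100, nodes=5)
print(estimate)  # {'sidecar': (800, 4000), 'ambient': (40, 200)}
```

Under these assumptions the sidecar cost grows with every pod added, while the ambient cost only changes when nodes are added.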

Istio sidecar integration has frequently caused friction for end users, pulling their focus away from their production applications (we typically call these mission apps). Historically, our UDS Operator has mitigated most of this complexity, but some applications still required special configurations. A few specific places we've seen complexity in the past:

  • Sidecar injection complexities: Since sidecars run as a container in each application pod, managing the lifecycle of these (adding, updating, removing) typically requires restarting the application pod. This can cause disruptions to your application, adds to initial startup/install time, and increases the overall complexity of adding applications to the mesh. While no longer relevant for UDS Core, there were also issues managing sidecars on Kubernetes Jobs until "native sidecars" became generally available in Kubernetes. Istio ambient removes these sidecars on applications, removing this as a management concern.
  • Direct pod addressability: With sidecar mode and STRICT mTLS, workarounds were required to allow pod IPs to be directly addressed for traffic in the mesh. While this wasn't a frequent problem, it would pop up from time to time with new applications and tended to be a pain to identify and work around, especially with other operators managing resources. With the ambient ztunnel proxy this is no longer an issue - pod IPs can be directly routed to through ztunnel.
  • Prometheus metrics setup: The complexity involved with metrics scraping in STRICT mTLS is well documented by Istio, but it required specific certificate mounts and proper scrape configs to be set for all metrics in the cluster. We managed this all via monitor mutations in the past, but with ambient mode Prometheus "just works" even with STRICT mTLS.

With the launch of Istio Ambient as GA and significant interest around the broader cloud native community, we also saw Ambient as the clear future. By being relatively early adopters (especially in the DoD space), we took the opportunity to identify and resolve any issues early on, and provide the best service mesh experience for users of UDS Core.

Implementing Ambient in UDS Core

Figure: Utilization comparison graph

As we worked through implementing ambient in UDS Core, we encountered some initial technical challenges/changes that required updates to our baseline "security profile" applied by our UDS Operator. The majority of these changes were with Network Policies and adapting to the ways Network Policies need to change in support of Istio Ambient.

In order to allow traffic to flow through ztunnel as expected, Network Policies need to allow port 15008 traffic for both ingress and egress. Without this port allowed, traffic ends up getting blocked rather than traversing the mesh. However, once port 15008 is opened up through Network Policies, it effectively opens up traffic on any port between the source and destination, since all traffic (regardless of port) is sent through ztunnel.
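A minimal sketch of what opening port 15008 looks like on a standard Kubernetes NetworkPolicy, expressed as a Python dict. The `allow_ztunnel` helper and the example policy (names, labels, port 8080) are hypothetical illustrations, not the UDS Operator's actual implementation.

```python
import copy

ZTUNNEL_PORT = 15008  # the ztunnel port named in the text above


def allow_ztunnel(policy):
    """Return a copy of a NetworkPolicy with TCP 15008 added to each rule."""
    patched = copy.deepcopy(policy)
    tunnel = {"protocol": "TCP", "port": ZTUNNEL_PORT}
    for direction in ("ingress", "egress"):
        for rule in patched["spec"].get(direction, []):
            ports = rule.get("ports")
            # A rule with no ports list already allows every port.
            if ports and tunnel not in ports:
                ports.append(tunnel)
    return patched


# Hypothetical policy: allow the "frontend" namespace to reach app=api on 8080.
policy = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "allow-api", "namespace": "demo-app"},
    "spec": {
        "podSelector": {"matchLabels": {"app": "api"}},
        "policyTypes": ["Ingress"],
        "ingress": [{
            "from": [{"namespaceSelector": {
                "matchLabels": {"kubernetes.io/metadata.name": "frontend"}}}],
            "ports": [{"protocol": "TCP", "port": 8080}],
        }],
    },
}

patched = allow_ztunnel(policy)
```

After patching, the NetworkPolicy can no longer tell port 8080 apart from any other destination port carried over 15008, which is the gap described above.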

Resolving this issue took a lot of careful planning and design, ultimately leading to a "defense in depth" approach using Network Policies for some restrictions and layering in Istio Authorization Policies to implement per-port restrictions (we'll provide a deeper dive into this setup in the future). By leveraging our existing "generic" network.allow specification, we were able to fully generate Authorization Policies for traffic without requiring end user changes. The operator also dynamically injects port 15008 into all relevant NetworkPolicies to ensure ztunnel traffic flows as expected, without requiring the user to think through Istio internals.
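The per-port layer can be sketched as an Istio AuthorizationPolicy generated from a single allow rule. The rule shape below is a simplified, hypothetical stand-in loosely modeled on a network.allow entry, and the function and naming scheme are assumptions for illustration; the operator's real translation logic is not shown here.

```python
def authorization_policy(namespace, rule):
    """Build an ALLOW AuthorizationPolicy restricting a workload to one port.

    `rule` is a hypothetical simplified allow rule with the keys
    `selector`, `remoteNamespace`, and `port`.
    """
    return {
        "apiVersion": "security.istio.io/v1",
        "kind": "AuthorizationPolicy",
        "metadata": {
            "name": f"allow-{rule['remoteNamespace']}-{rule['port']}",
            "namespace": namespace,
        },
        "spec": {
            # Applies to the destination workload's pods.
            "selector": {"matchLabels": rule["selector"]},
            "action": "ALLOW",
            "rules": [{
                # Only callers from the remote namespace...
                "from": [{"source": {"namespaces": [rule["remoteNamespace"]]}}],
                # ...and only on the declared port (Istio expects strings).
                "to": [{"operation": {"ports": [str(rule["port"])]}}],
            }],
        },
    }


allow_rule = {"selector": {"app": "api"}, "remoteNamespace": "frontend", "port": 8080}
authz = authorization_policy("demo-app", allow_rule)
```

Because the policy matches only on source namespace and destination port, it stays within what ztunnel can enforce at Layer 4, so no Waypoint is needed for this restriction.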

After navigating these challenges around network security, we were able to incrementally migrate all workloads within UDS Core (logging, monitoring, SSO, etc) to ambient mode. We also provided an opt-in approach for end user workloads, allowing them to adopt ambient mode on their own timeline. This approach offered the best flexibility for end users while minimizing any service disruption and mission impact.

Results & Final Thoughts

After navigating these technical challenges and considerations, how has Ambient mode performed in practice? Overall, the transition has been a major success, resulting in significant reductions in resource consumption, latency, and deployment complexity:

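| Mode              | Avg. CPU Used | Avg. Memory Used | Avg. Request Latency | Deployment Time |
|-------------------|---------------|------------------|----------------------|-----------------|
| Sidecar           | 2393m         | 22241Mi          | 9.05 ms              | 23m 29s         |
| Ambient           | 1734m         | 18785Mi          | 3.23 ms              | 18m 04s         |
| Percent Reduction | 29%           | 15%              | 64%                  | 13%             |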

*Metrics gathered on a 5-node cluster running on a cloud provider

The metrics above represent just the footprint reduction in UDS Core. Migrating any applications on top of UDS Core to ambient mode as well brings even more benefits, potentially translating to hundreds of dollars or more in cloud cost savings per month. While these cost savings in traditional cloud environments are great, the true impact of this reduced footprint is amplified at the tactical edge. Previously, the full capabilities of UDS Core such as logging, monitoring, and runtime security were inaccessible in resource-constrained environments due to sidecar overhead and complexity. Ambient mode removes this barrier, allowing us to deliver the same secure platform regardless of the underlying hardware.

A key takeaway from this transition is the value of automation and abstraction with our UDS Operator. By managing complexity in our operator, we ensured that end users remained focused on their mission applications, rather than dealing with the intricate details of network security and mesh migration. The use of clearly defined custom resources allowed us to transparently adjust underlying security enforcement mechanisms (such as integrating Authorization Policies alongside traditional Network Policies), without requiring any action from end users.

Another key to our successful transition was robust end-to-end testing. Comprehensive testing across multiple Kubernetes distributions surfaced issues early, helping us confidently deploy Ambient mode without concerns about unanticipated edge cases.

Lastly, deliberate planning and careful technical design were fundamental to our successful Ambient adoption. Our journey began over eight months ago, steadily ramping up through rigorous testing, iterative refinement, and detailed planning to ensure a smooth transition. This measured approach allowed us to minimize disruption and maximize mission impact.

Adopting Istio Ambient mode has been more than a technical improvement - it represents a strategic evolution of our platform, directly enhancing our ability to deliver secure, reliable, and effective support to missions at any scale or location.
