AI workloads push the data centre infrastructure to its limits because they behave nothing like traditional enterprise traffic.
They surge and they fan out. They stay mostly local to the data centre’s scale-up / scale-out network rather than crossing the WAN, so round-trip times (RTTs) are extremely short, around 10 µs. They demand exact timing to complete distributed computation jobs. They saturate links instantaneously. And they are short-lived, lasting only a few RTTs.
As GPU clusters scale and RDMA becomes the norm, legacy Ethernet cracks under the pressure: hotspots, dropped packets, flow imbalance. It’s not enough to bolt on better congestion control or faster NICs.
Ultra Ethernet (UE) stands by RDMA but revisits the protocol which implements it over Ethernet. UE steps in not as an upgrade, but as a fundamental rethink. It brings a transport layer designed for high-performance fabrics. It goes beyond link speed, to unlock end-to-end efficiency, from network directly to application memory.
Let’s walk through what makes it different, and how it works together to solve a complex problem set.
Solve fabric hotspots
At the core of UE is a new transport protocol that can distribute RDMA traffic across multiple underlying paths without breaking memory consistency. This is made possible through packet spraying, i.e. splitting traffic across all viable routes, to maximize fabric utilization.
But this isn’t old-school ECMP. Packet spraying uses real-time congestion information to steer packets. Reordering is managed by sequence numbers at the Packet Delivery Sublayer (PDS), ensuring the destination delivers data in order when required.
To serve RDMA semantics for read/write operations, each packet also carries information about where exactly it should land in memory. This ensures the destination GPU knows precisely what to do, where. The system achieves high throughput and even load distribution while keeping RDMA semantics intact.
Result: no oversubscribed links or fabric hotspots under large flows.
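The interplay of spraying and offset-based placement can be illustrated with a minimal sketch. It is not the UE wire format: the packet fields (`seq`, `offset`, `path`), the per-path congestion counters, and the "pick the least-loaded path" rule are all assumptions made for illustration. The point is only that, when every packet names its own landing offset, the receiver can reassemble the buffer regardless of arrival order.

```python
import random

def pick_path(congestion):
    """Choose the least-congested path; break ties at random (assumed policy)."""
    lowest = min(congestion)
    candidates = [i for i, load in enumerate(congestion) if load == lowest]
    return random.choice(candidates)

def spray(message, mtu, congestion):
    """Split a message into packets tagged with a sequence number,
    a target memory offset, and the path chosen for that packet."""
    packets = []
    for seq, offset in enumerate(range(0, len(message), mtu)):
        path = pick_path(congestion)
        congestion[path] += 1  # hypothetical feedback: the chosen path gets busier
        packets.append({
            "seq": seq,
            "offset": offset,
            "path": path,
            "payload": message[offset:offset + mtu],
        })
    return packets

def deliver(packets, size):
    """Write each payload at its own offset, in whatever order packets arrive."""
    memory = bytearray(size)
    for p in packets:
        memory[p["offset"]:p["offset"] + len(p["payload"])] = p["payload"]
    return bytes(memory)

message = bytes(range(256)) * 4      # 1024-byte buffer to transfer
congestion = [0, 0, 0, 0]            # four equal-cost paths, initially idle
packets = spray(message, mtu=64, congestion=congestion)
random.shuffle(packets)              # simulate reordering across paths
received = deliver(packets, len(message))
```

In this toy model the 16 packets end up spread evenly over the four paths, and the shuffled arrival order does not affect the reassembled buffer.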
Maintain stability in the face of congestion
Spraying helps, but congestion can still happen. The conventional response of dropping packets is detrimental for short, wire-speed, low-latency flows. Ultra Ethernet introduces a subtle but powerful technique: packet trimming.
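Instead of dropping packets, the switch trims them: it strips the payload and forwards only the header to the destination.
Sounds counter-intuitive? This is actually a very smart approach. The surviving header is a signal that tells the receiving host exactly what happened, fast.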
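A toy model of a trimming switch queue may help here. It assumes that trimming means stripping the payload while keeping the header, and that trimmed headers are served ahead of full packets; the class `TrimmingQueue`, its fields, and the priority rule are hypothetical and not taken from the UE specification.

```python
from collections import deque

class TrimmingQueue:
    """Egress queue that trims packets instead of dropping them when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = deque()     # full packets waiting to be sent
        self.headers = deque()  # trimmed headers, served first (assumption)

    def enqueue(self, packet):
        if len(self.data) < self.capacity:
            self.data.append(packet)
        else:
            # Queue is full: keep the header, discard the payload.
            self.headers.append({"seq": packet["seq"], "trimmed": True, "payload": b""})

    def dequeue(self):
        if self.headers:
            return self.headers.popleft()
        if self.data:
            return self.data.popleft()
        return None

def receive(packets):
    """Return (delivered sequence numbers, sequence numbers to re-request)."""
    delivered, nack = [], []
    for p in packets:
        (nack if p["trimmed"] else delivered).append(p["seq"])
    return delivered, nack

queue = TrimmingQueue(capacity=3)
for seq in range(5):  # burst of 5 packets into a 3-packet queue
    queue.enqueue({"seq": seq, "trimmed": False, "payload": b"x" * 1000})

out = []
while (p := queue.dequeue()) is not None:
    out.append(p)

delivered, nack = receive(out)
```

Here packets 3 and 4 arrive as bare headers, so the receiver knows immediately which sequence numbers to request again, instead of waiting for a timeout to infer the loss.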
Result: congestion relief without breaking the flow, without the heavy penalty of loss-based recovery. So how is this done?
Fast Recovery that doesn’t break the flow
AI flows can’t afford a heavy backoff and a slow start. Often, the entire transfer completes in a handful of RTTs. Waiting for congestion windows to grow conservatively wastes precious time. UE’s design prioritizes fast ramp-up, bringing transmission rates quickly back to wire speed after trimming events, rather than falling into prolonged recovery cycles.
This is essential because, in many cases, there can be no second chance, especially with inference flows that must complete in milliseconds.
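The cost of a conservative ramp-up is easy to see with back-of-the-envelope arithmetic. The sketch below counts the RTTs a transfer needs when the send window doubles every RTT up to a ceiling. The numbers (a 400-packet transfer, a 100-packet ceiling standing in for the bandwidth-delay product, a 10-packet initial window) are hypothetical, and the doubling model is a simplification of TCP-style slow start, not UE’s actual algorithm.

```python
def rtts_to_complete(total_packets, initial_window, max_window):
    """Count RTTs needed when the window doubles each RTT, capped at max_window."""
    sent, window, rtts = 0, initial_window, 0
    while sent < total_packets:
        sent += window
        window = min(window * 2, max_window)
        rtts += 1
    return rtts

# Conservative start: begin at 10 packets and double towards 100 per RTT.
slow_start = rtts_to_complete(total_packets=400, initial_window=10, max_window=100)

# Fast ramp-up: start at the full window straight away.
full_rate = rtts_to_complete(total_packets=400, initial_window=100, max_window=100)
```

With these assumed values the conservative start needs 7 RTTs against 4 at full rate, so most of the slow-start time is spent below wire speed on a transfer this short.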
Congestion Control is a two-way conversation
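UE supports dual-mode congestion control: a sender-side mode that adjusts the transmission rate in response to network signals, and a receiver-side mode that paces senders by granting credits.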
Congestion feedback is still carried back to the sender in ACK or NACK packets, replacing TCP’s ECN echo bits with a richer congestion state.
This pairing allows the system to be both optimistic and responsive. It transmits proactively, before receiving credits, but can rein in pressure quickly when congestion builds.
No handshake? No problem
Time matters, especially for short-lived connections. UE skips the traditional three-way handshake. The first packet carries both setup and data and establishes the connection. There’s no further connection state to maintain or round trips to wait for.
Result: an ultra-fast setup, perfect for microsecond-scale exchanges.
Link-level assurance
All this innovation could be undone by unreliable links. UE addresses that too. Where needed, optional Link Layer Retry (LLR) provides hardware-level retransmission with local L2 ACK/NACK, recovering errors without CPU involvement. This complements PDS-level reliability via ACK/NACK packets.
Result: A dropped packet isn’t a crisis; it is transparently recovered, without pulling CPUs into the mix.
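A link-level retry loop can be sketched as follows. This assumes a go-back-N style replay, in which the sender keeps frames in a replay buffer and resends from the sequence number named in a NACK; the classes, the `corrupt_once` parameter, and the replay policy are illustrative assumptions rather than the LLR specification.

```python
class LinkSender:
    def __init__(self, frames):
        self.replay = list(frames)  # replay buffer, kept until frames are acknowledged

    def frames_from(self, seq):
        """Return (seq, frame) pairs starting at the given sequence number."""
        return [(s, self.replay[s]) for s in range(seq, len(self.replay))]

class LinkReceiver:
    def __init__(self):
        self.expected = 0
        self.accepted = []

    def on_frame(self, seq, frame, corrupted=False):
        """Return ('ACK', seq) on success, or ('NACK', next expected seq)."""
        if corrupted or seq != self.expected:
            return ("NACK", self.expected)
        self.accepted.append(frame)
        self.expected += 1
        return ("ACK", seq)

def transfer(frames, corrupt_once):
    """Send frames over one link; each seq in corrupt_once is corrupted once."""
    sender, receiver = LinkSender(frames), LinkReceiver()
    pending_errors = set(corrupt_once)
    retries, start = 0, 0
    while receiver.expected < len(frames):
        for seq, frame in sender.frames_from(start):
            bad = seq in pending_errors
            pending_errors.discard(seq)  # the error hits only the first attempt
            kind, value = receiver.on_frame(seq, frame, corrupted=bad)
            if kind == "NACK":
                start = value            # replay from the frame the receiver expects
                retries += 1
                break
    return receiver.accepted, retries

frames = [b"a", b"b", b"c", b"d"]
accepted, retries = transfer(frames, corrupt_once={1})
```

The upper layers in this model see the frames in order and intact; the single retry stays between the two ends of the link.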
Final pieces: security and software fit
Security is not layered on after the fact; it’s part of the model. Applications can define trusted groups, authenticated by a group key. It’s a natural fit for multi-tenant data centres and environments where isolation and provenance matter.
And because UE integrates cleanly with libfabric, the widely-used RDMA user space API, application developers aren’t left out.
Result: Low-friction adoption in HPC, AI, and cloud-native environments alike.
The big picture
Each of these innovations is impressive on its own, but together they form a cohesive system: a transport architecture that understands modern AI traffic end-to-end. Ultra Ethernet doesn’t just push more bits faster. It brings precision, predictability, and composability to the data centre network stack.
This is how networks keep up with AI.
At Calnex, we believe the most performant AI data centre won’t come from optimising the network or the protocol stack in isolation, but rather from tuning both in tandem. UE introduces powerful, layered, intertwined mechanisms. The flexibility it offers at the link, transport, and application layers creates a paradox of choice. With so many tunable parameters, finding the optimal setup becomes a high-stakes challenge.
That’s where Calnex steps in. Our expertise in precision network testing and validation ensures that every layer of your AI fabric performs as intended, under realistic network conditions. In a world where milliseconds define competitiveness, testing every aspect of the network and understanding the interplay of all the parameters is key to unlocking the full potential of your business.