Military Application Resilience
Emulating real-world network conditions to verify applications & systems
Inferencing is the monetisation phase of AI. It is the process in which an already trained AI model is executed to produce real-time outputs using its learned patterns. Unlike training, which is compute-heavy and offline, inference is latency-sensitive, interactive, and directly tied to user experience and revenue. Inference is a service offering and is tied to SLAs.
Inference is the core compute operation. It refers specifically to the execution of the trained model on new input data, to generate a prediction or result.
AI model servicing is the full-stack infrastructure and set of services required to deliver inference as a dependable, production-ready service. The network plays a critical role in servicing, because it governs performance metrics such as the request/response latency, stability, and throughput for end users.
The most critical challenges in AI model servicing are often network-related (latency, jitter, loss), which is where Calnex’s network performance validation and impairment emulation are essential.
An inferencing network is the entire end-to-end communication path that connects the source of data (user device, sensor, camera) to the server hosting the trained AI model, and back again to deliver the result.
It spans the access network (e.g., 5G, WiFi), the aggregation/transport network, and the data centre network (ToR/leaf/spine) where the inference server resides.
Calnex Focus: The most vulnerable and critical part for performance is the network boundary (edge to data centre core). This is where congestion, shaping, and external network impairments are introduced, directly impacting the final service quality (QoS) and customer SLA. This is where we focus our validation efforts. Depicted below:

| Parameter | Distributed Training cluster | Production Inference cluster |
| Objective | Build model or update model accuracy | Render predictions from already trained model |
| Infrastructure | Very large network with up to hundreds of thousands of GPU hosts | Smaller network of up to tens of CPU or GPU hosts |
| Type of workload | Model training is offline. Long-running, synchronous job. | Inference is typically a real-time service and requires real-time response. Computations are short-lived, synchronous to user queries. |
| Key performance metric | Training time of a model, job completion time. | High availability and low latency |
| Bandwidth demand | Extreme, sustained, predictable | Moderate–high, bursty, demand-driven |
| Flow characteristics | Few very large, long-lived flows | Short flows which may be concurrent. Heavy uplink. |
| Deployment location | Centralised hyperscale data centre | Data centre, edge, and could also be regional POPs |
| Transport type/protocol | RDMA (RoCEv2 / IB) | TCP, gRPC, can be RDMA for larger clusters and distributed inference |
| PFC / ECN reliance | Heavy | Needed only if inference served by GPU clusters. Not needed at the boundary. |
| Latency and jitter sensitivity | Important but secondary to throughput | Critical (impacts SLA) |
| Loss tolerance | Near-zero (RDMA) | Near-zero (loss also impacts SLA) |
| Congestion behaviour | Predictable hotspots during collectives | Unpredictable, request-driven congestion |
| Bottleneck | Hotspots during collective communication | At the inference boundary (edge / data centre) |
What is edge vs. data-centre inference?
In edge inference, also called multi-access edge computing (MEC) based inference, the model runs close to the user or device that is the source of information (such as a mobile device or sensor). This is very often the case in enterprise sites and telco networks.
In data-centre inference, the model runs in a regional or cloud data centre. (See Difference Between Cloud and Data Center – GeeksforGeeks)

Choosing between edge and data-centre inference requires testing how networks behave under impairments.
| Deployment | Edge | DC core |
| Benefits | ultra-low latency, privacy, reduced backhaul load | faster compute (large CPU/GPU clusters), easier scaling |
| Constraints | limited compute, cost, and scaling complexity | dependent on WAN performance, higher and less predictable latency, congestion risks at the connectivity boundary |
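The trade-off in the table above can be made concrete with a simple latency budget. The sketch below is illustrative only: the `Site` and `Impairment` types, the 80 ms SLA, and every latency figure are hypothetical assumptions, not measured values. It also assumes the WAN impairment applies only to the data-centre path.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical deployment option; all figures are illustrative."""
    name: str
    network_rtt_ms: float  # round trip between the user and the inference server
    compute_ms: float      # time the model needs to produce the response


@dataclass
class Impairment:
    """Hypothetical network impairment applied on the path to a site."""
    extra_delay_ms: float = 0.0
    jitter_ms: float = 0.0  # worst-case additional delay variation


def worst_case_response_ms(site: Site, impairment: Impairment) -> float:
    """Worst-case response time: network path plus impairment plus compute."""
    return (site.network_rtt_ms
            + impairment.extra_delay_ms
            + impairment.jitter_ms
            + site.compute_ms)


def meets_sla(site: Site, impairment: Impairment, sla_ms: float) -> bool:
    return worst_case_response_ms(site, impairment) <= sla_ms


SLA_MS = 80.0  # assumed response-time target
edge = Site("edge", network_rtt_ms=5.0, compute_ms=60.0)    # close, but limited compute
dc = Site("dc core", network_rtt_ms=30.0, compute_ms=25.0)  # faster compute, WAN in the path

clean = Impairment()
congested_wan = Impairment(extra_delay_ms=20.0, jitter_ms=15.0)

# On a clean network the data centre wins (55 ms vs 65 ms).
clean_dc_ms = worst_case_response_ms(dc, clean)
clean_edge_ms = worst_case_response_ms(edge, clean)

# Under the assumed WAN impairment the data centre misses the SLA,
# while the edge site (assumed not to cross the WAN) still meets it.
impaired_dc_ok = meets_sla(dc, congested_wan, SLA_MS)
edge_ok = meets_sla(edge, clean, SLA_MS)

print(clean_dc_ms, clean_edge_ms, impaired_dc_ok, edge_ok)
```

With these assumed numbers the preferred site flips once the WAN is impaired, which is why the decision needs to be tested under impairment rather than on a clean network alone.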
Telcos, service providers (SPs), and cloud providers have an opportunity to monetise existing infrastructure by serving AI models. They focus on the performance of AI-enabled services that rely on their networks. They care about:
What can Calnex do for inference testing?
1. Test the critical link between the GPU cluster and the user-facing network
2. Decide whether to run inference in the data centre or at the edge.
In Enterprise applications, an AI model generates responses for an end user or machine via an AI agent: a CPU-driven proxy for the trained model.
For large-scale inferencing, the fully trained model is typically (though not necessarily) loaded onto multiple GPUs (a GPU cluster) because it rarely fits on a single GPU. When the AI agent receives a request, the GPUs collaborate, exchanging intermediate results to produce the final answer. As the model generates output, it creates “tokens” in real time, each presented to the user as part of the response. As more tokens are generated, more of the response is delivered. This process requires high-performance GPU connectivity and a high-performance network between the GPU cluster and the end user. Inference performance is measured in tokens per second (TPS), while training performance is measured in job completion time (JCT). The critical link between the GPU cluster and the user-facing network is market target #1 for Calnex.
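As a rough illustration of how TPS can be derived at the user side, the sketch below computes it from token arrival timestamps. The function names and the timestamps are hypothetical. The largest inter-token gap is included as an extra, assumed indicator because a network stall shows up there before it moves the overall TPS figure very far.

```python
def tokens_per_second(token_times_ms, request_sent_ms):
    """Tokens delivered per second, measured from the moment the request was sent."""
    if not token_times_ms:
        raise ValueError("no tokens received")
    elapsed_ms = token_times_ms[-1] - request_sent_ms
    return len(token_times_ms) * 1000 / elapsed_ms


def largest_gap_ms(token_times_ms):
    """Longest pause between two consecutive tokens (0 if fewer than two tokens)."""
    gaps = [later - earlier
            for earlier, later in zip(token_times_ms, token_times_ms[1:])]
    return max(gaps, default=0)


# Hypothetical arrival times (ms) for six tokens; the request was sent at t = 0.
# The 500 ms pause between the fourth and fifth token stands in for a network stall.
arrivals_ms = [500, 550, 600, 650, 1150, 1200]

tps = tokens_per_second(arrivals_ms, request_sent_ms=0)
stall_ms = largest_gap_ms(arrivals_ms)

print(f"TPS: {tps:.1f}, largest inter-token gap: {stall_ms} ms")
```

In this made-up trace the tokens normally arrive 50 ms apart, so a single 500 ms stall accounts for a large share of the total response time.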
GPUs are today’s most common way to achieve accelerated, parallel compute at scale. Large scale inference, however, doesn’t strictly require GPUs. It can also run on edge AI accelerators, or CPU clusters (slower, less efficient, higher cost per token or watt). This is market target #2 for Calnex: help Telecoms, Service Providers, and Enterprises decide whether to run inference in the data centre or at the edge.
Inference workloads react to small disturbances in the network. This type of traffic is not as resilient as typical HTTPS/TCP Internet traffic. Even micro-loss, short-term jitter bursts, or momentary congestion can degrade the service responsiveness or the model accuracy.
Impairment emulation helps because it allows organisations to:
Load testing with traffic generators can reveal capacity limits, but on its own it cannot uncover these behaviours; only controlled impairment scenarios can.
Tail latency refers to the highest end of the delay distribution (e.g., the 99th or 99.9th percentile) of a service’s response time.
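A short sketch of how such percentiles can be computed from measured response times is shown here. It uses the nearest-rank definition of a percentile, which is one of several common conventions, and the sample data is hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of the data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in ms: 980 fast responses and 20 slow ones.
latencies_ms = [20.0] * 980 + [250.0] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

In this example the mean (24.6 ms) and the median (20 ms) both look healthy, while the 99th percentile (250 ms) exposes the slow responses that the averages hide.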
Consider an AI inferencing application that must fetch data from a distributed cache before generating the next token. To fulfil a single user request, the inference engine initiates 100 parallel read operations across the network. While 99 of the read operations complete successfully, the response packet for the 100th read is delayed in a buffer queue. In synchronised AI workloads, the inference step cannot proceed until every read operation is complete, i.e. all packets have been received.
This single delayed packet stalls the entire inference request. This is like an animal that tries to move forward but cannot because its tail is stuck somewhere, which makes the term “tail latency” easier to remember. (Of course, the actual reason it’s called tail latency is mathematical: it refers to the tail of the latency distribution.)
The effect of tail latency may be detrimental to the inference job which, if dependent on timeouts, may be aborted or retried. The system must then restart the inference step, reissuing all 100 read operations and rerunning the computation, potentially causing SLA violations.
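The fan-out effect described above can be sketched numerically. The example below assumes that each read independently has a 1% chance of landing in the slow tail and that a timed-out step is aborted at the timeout and reissued in full. The function names, the 50 ms timeout, and the latency values are hypothetical.

```python
def prob_step_hits_tail(fan_out, per_read_slow_prob):
    """Probability that at least one of fan_out independent reads is slow."""
    return 1.0 - (1.0 - per_read_slow_prob) ** fan_out


def step_latency_ms(attempts, timeout_ms):
    """Latency of one synchronised step.

    attempts holds one list of read latencies per attempt. Each attempt waits
    for its slowest read; if that exceeds the timeout, the attempt is aborted
    at the timeout and all reads are reissued in the next attempt.
    """
    total = 0.0
    for reads in attempts:
        slowest = max(reads)
        if slowest <= timeout_ms:
            return total + slowest
        total += timeout_ms
    raise TimeoutError("every attempt timed out")


# With 100 parallel reads, a 1-in-100 slow read affects most steps (about 63%).
p_tail = prob_step_hits_tail(fan_out=100, per_read_slow_prob=0.01)

# First attempt: 99 reads take 2 ms and one is stuck for 80 ms, beyond the timeout.
first_attempt = [2.0] * 99 + [80.0]
retry = [2.0] * 100

clean_ms = step_latency_ms([retry], timeout_ms=50.0)
stalled_ms = step_latency_ms([first_attempt, retry], timeout_ms=50.0)

print(f"P(step hits tail) = {p_tail:.3f}")
print(f"clean step: {clean_ms} ms, step with one delayed read: {stalled_ms} ms")
```

Under these assumptions a single delayed read turns a 2 ms step into a 52 ms step, and at a fan-out of 100 a delay that affects only 1% of individual reads is hit by roughly two out of three steps.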