Inference Network Test – Primer

Despoina Triantafyllidou
06 May 2026
Artificial Intelligence

What is AI inferencing? 

Inferencing is the monetisation phase of AI. It is the process in which an already trained AI model is executed to produce real-time outputs from its learned patterns. Unlike training, which is compute-heavy and offline, inference is latency-sensitive, interactive, and directly tied to user experience and revenue. Inference is offered as a service and is bound by SLAs.

What’s the difference – AI inferencing vs. AI model servicing? 

Inference is the core compute operation. It refers specifically to the execution of the trained model on new input data, to generate a prediction or result. 

AI model servicing is the full-stack infrastructure and set of services required to deliver inference as a dependable, production-ready service. The network plays a critical role in servicing, because it governs performance metrics such as the request/response latency, stability, and throughput for end users. 

The most critical challenges in AI model servicing are often network-related (latency, jitter, loss), which is where Calnex’s network performance validation and impairment emulation are essential.

What is an inference/inferencing network? 

An inferencing network is the entire end-to-end communication path that connects the source of data (user device, sensor, camera) to the server hosting the trained AI model, and back again to deliver the result. 

It spans the access network (e.g., 5G, WiFi), the aggregation/transport network, and the data centre network (ToR/leaf/spine) where the inference server resides. 

Calnex Focus: The most vulnerable and critical part for performance is the network boundary (edge to data centre core). This is where congestion, shaping, and external network impairments are introduced, directly impacting the final service quality (QoS) and customer SLA. This is where we focus our validation efforts. Depicted below: 

What’s the difference between training and inference clusters? 

Parameter  Distributed Training cluster  Production Inference cluster 
Objective  Build model or update model accuracy  Render predictions from already trained model 
Infrastructure  Very large network with up to hundreds of thousands of GPU hosts  Smaller network (up to tens) of CPU or GPU hosts
Type of workload  Model training is offline: a long-running, synchronous job.  Inference is typically a real-time service. Computations are short-lived and synchronous to user queries.
Key performance metric  Training time of a model (job completion time)  High availability and low latency
Bandwidth demand  Extreme, sustained, predictable  Moderate–high, bursty, demand-driven 
Flow characteristics  Few very large, long-lived flows  Short flows which may be concurrent. Heavy uplink. 
Deployment location  Centralised hyperscale data centre  Data centre, edge, and could also be regional POPs 
Transport type/protocol  RDMA (RoCEv2 / IB)  TCP, gRPC, can be RDMA for larger clusters and distributed inference 
PFC / ECN reliance  Heavy  Needed only if inference served by GPU clusters. Not needed at the boundary. 
Latency and jitter sensitivity  Important but secondary to throughput  Critical (impacts SLA) 
Loss tolerance  Near-zero (RDMA)  Near-zero (loss directly impacts the SLA)
Congestion behaviour  Predictable hotspots during collectives  Unpredictable, request-driven congestion 
Bottleneck  Hotspots during collective communication  At the inference boundary (edge / data centre) 

 

What is edge vs. data-centre inference? 

In edge inference, also referred to as multi-access edge computing (MEC) based inference, the model runs close to the user or device (the source of information, such as a mobile device or sensor). This is very often the case in enterprise sites and telco networks.

In data-centre inference, the model runs in a regional or cloud data centre. (See Difference Between Cloud and Data Center – GeeksforGeeks) 

 

Choosing between edge and data-centre inference requires testing how networks behave under impairments. 

Deployment  Edge  DC core
Benefits  Ultra-low latency, privacy, reduced backhaul load  Faster compute (large CPU/GPU clusters), easier scaling
Constraints  Limited compute, cost, and scaling complexity  Dependent on WAN performance, higher and less predictable latency, congestion risks at the connectivity boundary

 

What do telcos/SP/Cloud care about in inferencing? 

Telcos, service providers (SPs), and cloud providers have an opportunity to monetise existing infrastructure by serving AI models. They focus on the performance of AI-enabled services that rely on their networks. They care about:

  • Deterministic latency & jitter for interactive inference workloads (AI-based customer contact centres, voice assistants, industrial automation). 
  • Service differentiation: the ability to sell “AI-ready connectivity” with measurable SLOs and guarantees, and to charge for these premium services.
  • Traffic asymmetry & burstiness, as inference is uplink-heavy. 
  • Edge placement decisions, provisioning of equipment and network. 
  • Avoiding deployment failures caused by unpredictable network behaviour at the edge–DC boundary.

What can Calnex do for inference testing? 

1. Test the critical link between the GPU cluster and the user-facing network.

2. Decide whether to run inference in the data centre or at the edge.

In Enterprise applications, an AI model generates responses for an end user or machine via an AI agent: a CPU-driven proxy for the trained model.  

For large-scale inferencing, the fully trained model is typically (though not necessarily) loaded onto multiple GPUs (a GPU cluster) because it rarely fits on a single GPU. When the AI agent receives a request, the GPUs collaborate, exchanging intermediate results to produce the final answer. As the model generates output, it creates “tokens” in real time, each presented to the user as part of the response. As more tokens are generated, a larger part of the response is delivered. This process requires high-performance GPU connectivity and a high-performing network between the GPU cluster and the end user. Inference performance is measured in tokens per second (TPS), while training performance is measured in job completion time (JCT). The critical link between the GPU cluster and the user-facing network is market target #1 for Calnex.
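As a rough illustration of the TPS metric, the sketch below derives an average token rate and the longest pause between tokens from a list of token arrival times measured at the client. The function names and the sample timestamps are hypothetical, and this is only one simple way to compute the rate, not a definition taken from any particular tool.

```python
from typing import Sequence


def tokens_per_second(arrival_times_s: Sequence[float]) -> float:
    """Average token rate between the first and last token.

    arrival_times_s holds the time (in seconds) at which each token
    reached the client, in arrival order.
    """
    if len(arrival_times_s) < 2:
        raise ValueError("need at least two token arrivals")
    elapsed = arrival_times_s[-1] - arrival_times_s[0]
    if elapsed <= 0:
        raise ValueError("arrival times must be increasing")
    # N tokens span N - 1 inter-token intervals.
    return (len(arrival_times_s) - 1) / elapsed


def worst_gap_s(arrival_times_s: Sequence[float]) -> float:
    """Longest pause between consecutive tokens, as seen by the user."""
    return max(b - a for a, b in zip(arrival_times_s, arrival_times_s[1:]))


# Hypothetical capture: five tokens, with a stall before the last one.
arrivals = [0.00, 0.05, 0.10, 0.15, 0.40]
print(f"TPS: {tokens_per_second(arrivals):.1f}")
print(f"worst inter-token gap: {worst_gap_s(arrivals) * 1000:.0f} ms")
```

The worst gap is reported alongside the average because a single network stall can leave the average rate looking acceptable while the user still sees the response freeze.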

GPUs are today’s most common way to achieve accelerated, parallel compute at scale. Large-scale inference, however, doesn’t strictly require GPUs. It can also run on edge AI accelerators or on CPU clusters (which are slower, less efficient, and have a higher cost per token or watt). This is market target #2 for Calnex: help Telecoms, Service Providers, and Enterprises decide whether to run inference in the data centre or at the edge.

Why does impairment emulation help with inference networks?

Inference workloads react to small disturbances in the network. This type of traffic is not as resilient as typical HTTPS/TCP Internet traffic. Even micro-loss, short-term jitter bursts, or momentary congestion can degrade the service responsiveness or the model accuracy. 

Impairment emulation helps because it allows organisations to: 

  • Recreate and isolate real-world network conditions (loss, delay, jitter, reordering, congestion, asymmetry) without leaving the lab. 
  • Understand how inference workloads behave, not just at peak capacity but under faults, instability, and edge cases. 
  • Define and guarantee SLAs and validate architectural choices (edge vs DC) using evidence, not assumptions. The business can then design its service models from the validated SLOs.
  • Identify bottlenecks before deployment, reducing field failures, support tickets and costs. 
  • Compare and benchmark different network paths, configurations, and service providers. 
  • Estimate future network or service requirements during network revision cycles. 

Load testing with traffic generators can reveal capacity limits, but on its own it cannot uncover these behaviours. Only controlled impairment scenarios can.

What impairments matter most in inference networks? 

  • Latency: because inference is mostly a real-time service, absolute delays affect responsiveness and thus user experience. 
  • Jitter: variation in delay disrupts the service when dealing with multiple sources or synchronised processing of data, for example during batching or GPU scheduling. 
  • Bursty packet loss (when consecutive packets are lost): distributed inference protocols and remote procedure call (RPC) communications degrade with even small amounts of loss.
  • Congestion: queueing effects at the network boundary, where the frontend or access network meets the high-speed core (for example, access network ↔ aggregation layer ↔ DC core). This can become a bottleneck that causes tail-latency spikes.
  • Asymmetry: inference workloads often send small query inputs (uplink) and receive larger output results (downlink). However, in certain applications (such as video or sensor data analysis), the uplink input data is massive and sustained. It is essential to test both scenarios, especially the high-volume uplink, which is often neglected in standard network design.
  • Reordering: impacts flow control and can trigger unnecessary retransmissions.
  • Rate limiting (throttling): telco networks may impose shaping and policing, sometimes depending on the SLA.
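To make some of the impairments listed above concrete, the following is a minimal software sketch that applies fixed delay, jitter, and bursty loss to a list of packet send times, then counts reordered deliveries. It is an illustration under stated assumptions only: the function names and parameters are hypothetical, jitter is drawn from a uniform distribution, and loss uses a simplified two-state (Gilbert-style) burst model in which every packet is dropped while the link is in the "bad" state. It does not represent how any hardware emulator works internally.

```python
import random
from typing import List, Optional


def impair(send_times_ms: List[float], base_delay_ms: float, jitter_ms: float,
           p_enter_burst: float, p_exit_burst: float,
           seed: int = 0) -> List[Optional[float]]:
    """Return per-packet arrival times in ms, or None for a lost packet.

    Delay is base_delay_ms plus uniform jitter in [0, jitter_ms].
    The loss state may change before each packet: a good link enters a
    loss burst with p_enter_burst, a bursting link recovers with
    p_exit_burst.
    """
    rng = random.Random(seed)  # seeded so a scenario is repeatable
    in_burst = False
    arrivals: List[Optional[float]] = []
    for t in send_times_ms:
        if in_burst:
            in_burst = rng.random() >= p_exit_burst
        else:
            in_burst = rng.random() < p_enter_burst
        if in_burst:
            arrivals.append(None)
        else:
            arrivals.append(t + base_delay_ms + rng.uniform(0.0, jitter_ms))
    return arrivals


def count_reordered(arrivals: List[Optional[float]]) -> int:
    """Count delivered packets that arrive before a packet sent earlier."""
    latest = float("-inf")
    reordered = 0
    for a in arrivals:
        if a is None:
            continue
        if a < latest:
            reordered += 1
        latest = max(latest, a)
    return reordered


# Hypothetical scenario: 1000 packets sent 1 ms apart.
sends = [float(i) for i in range(1000)]
arrivals = impair(sends, base_delay_ms=20.0, jitter_ms=5.0,
                  p_enter_burst=0.01, p_exit_burst=0.3)
lost = sum(a is None for a in arrivals)
print(f"lost {lost} of {len(arrivals)}, {count_reordered(arrivals)} reordered")
```

Note that jitter larger than the packet spacing produces reordering as a side effect, which is one reason these impairments are best tested in combination as well as individually.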

What is tail latency and why is it a critical problem for inference? 

Tail latency refers to the highest end of the delay distribution (e.g., the 99th or 99.9th percentile) of a service’s response time. 
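The short sketch below shows how a high percentile is read from a set of response-time samples, and why it can differ sharply from the average. It uses the nearest-rank definition, which is only one of several percentile conventions; the function name and the sample values are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical measurements: 98 fast responses and 2 slow ones (in ms).
latencies_ms = [50.0] * 98 + [500.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean: {mean_ms} ms")                       # mean: 59.0 ms
print(f"p50:  {percentile(latencies_ms, 50)} ms")  # p50:  50.0 ms
print(f"p99:  {percentile(latencies_ms, 99)} ms")  # p99:  500.0 ms
```

In this example the mean and median both look healthy, while the 99th percentile exposes the slow responses.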

Consider an AI inferencing application that must fetch data from a distributed cache before generating the next token. To fulfil a single user request, the inference engine initiates 100 parallel read operations across the network. While 99 of the read operations complete promptly, the response packet for the 100th read is delayed in a buffer queue. In synchronised AI workloads, the inference step cannot proceed until every read operation is complete, i.e. all packets have been received.

This single delayed packet stalls the entire inference request. This is like an animal that cannot move forward because its tail is stuck somewhere, which makes the term “tail latency” easier to remember. (Of course, the actual reason for the name is mathematical: it refers to the tail of the latency distribution.)

The effect of tail latency can be detrimental to the inference job, which, if it depends on timeouts, may be aborted or retried. The system must then restart the inference step, reissuing all 100 read operations and rerunning the computation, potentially causing SLA violations.
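The fan-out effect in this example can be quantified with a short calculation. The sketch below assumes, purely for illustration, that each read independently has a 1% chance of landing in the slow tail (that is, of exceeding the 99th-percentile latency). Real networks often show correlated delays, so treat the numbers as indicative only. The function name is hypothetical.

```python
def p_request_hits_tail(fanout: int, p_slow: float) -> float:
    """Probability that at least one of `fanout` independent reads is slow.

    A request that waits for all reads is as slow as its slowest read,
    so this is also the probability that the whole request is slow.
    """
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    if not 0.0 <= p_slow <= 1.0:
        raise ValueError("p_slow must be a probability")
    return 1.0 - (1.0 - p_slow) ** fanout


for n in (1, 10, 100):
    print(n, round(p_request_hits_tail(n, 0.01), 3))
# 1 0.01
# 10 0.096
# 100 0.634
```

Under these assumptions, a delay that affects only 1 in 100 individual reads affects roughly 63% of requests that fan out to 100 reads, which is why the tail, rather than the average, dominates the user-visible latency.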

To summarise: 

  • The Problem – For inference services, ensuring low average latency is not enough. A few slow requests (the “tail”) can severely degrade the overall user experience and SLO compliance. If a user’s voice assistant or an automated decision-making system occasionally takes 500 ms instead of 50 ms, the service is perceived as broken.
  • The Cause – Tail latency spikes are often caused by network events such as transient congestion, short bursts of packet loss, or queueing delays at network boundaries. All of these scenarios are difficult to reproduce without controlled impairment emulation.
  • Our Value – Calnex tools allow customers to reliably measure and drive down the tail latency by stress-testing the inference infrastructure under controlled fault conditions.