Distributed Inference on Workstation Blackwell, Part 1: Cluster Bring-Up and Cross-Node Fabric Characterisation
Abstract
Workstation-class NVIDIA Blackwell GPUs connected by RDMA over Converged Ethernet (RoCE) are evaluated as a deployment target for frontier open-weight large language models in the absence of NVLink or InfiniBand. A three-GPU, two-node cluster of NVIDIA RTX PRO 6000 Blackwell GPUs connected by a 200 Gbps RoCE fabric is brought up, characterised, and exercised with a first cross-node inference run of Llama 3.1 405B Instruct quantised to AWQ-INT4, under pipeline-parallel vLLM orchestrated by Ray. NCCL all-reduce bus bandwidth reaches 23.0 GB/s on the inter-node path, approximately 92 percent of the 200 Gbps line rate; iperf3 measures 198 Gbps on the same link, approximately 99 percent of line rate. GPUDirect RDMA is verified active through NCCL debug logs. The 405-billion parameter model loads and generates across three GPUs split as three pipeline stages of 42 decoder layers each. This paper is Part 1 of an ongoing research programme characterising distributed inference of frontier open-weight models on workstation-class hardware.
1. The Question
Frontier open-weight large language models are released at parameter counts and memory footprints that presuppose deployment on hyperscaler-grade infrastructure: DGX systems, HGX B200 platforms, dense NVLink fabrics, InfiniBand interconnect. The reference deployments and vendor benchmarks for Llama 3.1 405B, Nemotron Super 120B, Qwen3 235B and the forthcoming Nemotron Ultra 500B are all specified in that domain.
A distinct and commercially relevant segment, however, cannot or will not procure HGX-class infrastructure. Research laboratories, regulated verticals, sovereign deployments, and mid-size enterprises face capital, operational, data-residency or regulatory constraints that rule out hyperscaler-tier gear. What this segment can procure is workstation-class professional GPUs in quantity and commodity high-speed Ethernet fabric. Whether frontier models are practically deployable at useful throughput on such topologies, in the explicit absence of NVLink and InfiniBand, is an open empirical question.
2. Scope of Part 1
Part 1 of the programme is deliberately narrow. Its purpose is to establish that the baseline stack works end to end, not to quote performance headlines. The four Part 1 objectives are:
(i) Bring up the cluster. Three NVIDIA RTX PRO 6000 Blackwell GPUs distributed across two nodes, interconnected by a 200 Gbps RoCE fabric through a single Ethernet switch. No NVLink. No InfiniBand.
(ii) Characterise the fabric. Report raw Ethernet throughput and NCCL collective bus bandwidth as separate, independently measured quantities.
(iii) Verify GPUDirect RDMA. Confirm, through NCCL debug log inspection, that GPU-to-NIC direct memory transfers are active and that the cluster is not silently host-staging.
(iv) Load a frontier model. Deploy Llama 3.1 405B Instruct (AWQ-INT4 quantisation) across the three-GPU pipeline and verify coherent generation end to end.
3. Fabric Measurements
Two independent measurements are taken on the inter-node link and reported separately. The distinction matters: iperf3 bounds what the Ethernet link can carry host-to-host; NCCL bus bandwidth bounds what a collective inference or training run will see in practice. Merging them into a single "wire rate" figure would conflate two different properties of the fabric.
| Measurement | Tool | Result | Relative to 200 Gbps |
|---|---|---|---|
| TCP throughput | iperf3, 8 streams | 198 Gbps | approximately 99 percent |
| NCCL bus bandwidth | all_reduce_perf | 23.0 GB/s (approx. 184 Gbps) | approximately 92 percent |
| GPUDirect RDMA | NCCL debug log | Enabled | verified |
The 92 percent NCCL-over-RoCE efficiency is broadly in line with public characterisations of equivalent Mellanox-based fabrics. The gap between iperf3 line rate and NCCL bus bandwidth reflects NCCL protocol overhead, synchronisation, and ring-descriptor exchange, rather than fabric saturation. At 200 Gbps, cross-node NCCL collectives on this cluster are not fabric-bound.
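The two efficiency figures reduce to a unit conversion (8 bits per byte) against the nominal line rate; a minimal sketch using the values from the table:

```python
# Fabric efficiency arithmetic for the Section 3 measurements.
LINE_RATE_GBPS = 200.0           # nominal RoCE link rate, Gbps

iperf3_gbps = 198.0              # measured TCP throughput, iperf3, 8 streams
nccl_busbw_gbs = 23.0            # measured NCCL all-reduce bus bandwidth, GB/s

nccl_busbw_gbps = nccl_busbw_gbs * 8.0       # GB/s -> Gbps

iperf3_eff = iperf3_gbps / LINE_RATE_GBPS    # 0.99
nccl_eff = nccl_busbw_gbps / LINE_RATE_GBPS  # 0.92

print(f"iperf3: {iperf3_eff:.0%} of line rate")
print(f"NCCL:   {nccl_busbw_gbps:.0f} Gbps, {nccl_eff:.0%} of line rate")
```

The roughly 7-point gap between the two ratios is the NCCL protocol overhead discussed above, not additional fabric loss.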
4. GPUDirect RDMA is Not Automatic
NCCL configuration for non-NVLink topologies is not default-correct. The settings required to activate GPUDirect RDMA over RoCE are:
- NCCL_IB_DISABLE=0
- NCCL_NET_GDR_LEVEL=5
- NCCL_IB_HCA=mlx5_0
- NCCL_IB_GID_INDEX=3
- NCCL_SOCKET_IFNAME set per node
- NCCL_BUFFSIZE=8388608
- NCCL_DEBUG=INFO
Omitting any of these does not produce an error. NCCL silently falls back to host-staged transfers, the cluster appears to function, and bus bandwidth collapses into the single-digit GB/s range. Operators migrating from NVLink-native reference deployments to RoCE should treat GPUDirect RDMA activation as a deliberate configuration step requiring log-level verification, not a default behaviour. Runs that do not log "GPU Direct RDMA Enabled" in NCCL debug output are discarded.
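The verification step can be automated. The sketch below assembles the environment listed above and applies the acceptance check used in this work: a substring match against NCCL debug output. The exact log line format varies across NCCL versions, and the NCCL_SOCKET_IFNAME value here is a per-node placeholder, not a measured setting.

```python
import os

# Environment required to activate GPUDirect RDMA over RoCE (Section 4).
# NCCL_SOCKET_IFNAME differs per node; "ens1f0" is a placeholder.
NCCL_ENV = {
    "NCCL_IB_DISABLE": "0",
    "NCCL_NET_GDR_LEVEL": "5",
    "NCCL_IB_HCA": "mlx5_0",
    "NCCL_IB_GID_INDEX": "3",
    "NCCL_SOCKET_IFNAME": "ens1f0",
    "NCCL_BUFFSIZE": "8388608",
    "NCCL_DEBUG": "INFO",
}

def apply_nccl_env(env=None):
    """Return a process environment with the RoCE/GDR settings applied."""
    merged = dict(os.environ if env is None else env)
    merged.update(NCCL_ENV)
    return merged

def gdr_enabled(nccl_log: str) -> bool:
    """Acceptance check: the run must have logged GPUDirect RDMA as active."""
    return "GPU Direct RDMA Enabled" in nccl_log
```

A run whose captured debug log fails `gdr_enabled` is host-staging and is discarded, matching the discard rule stated above.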
5. First Cross-Node Inference
Llama 3.1 405B Instruct quantised to AWQ-INT4 is approximately 215 GB on disk. It is deployed across the three-GPU cluster in a pipeline-parallel configuration: 126 decoder layers split evenly into three stages of 42 layers each. Tensor parallelism across three GPUs is precluded because 128 attention heads are not evenly divisible by three; pipeline parallelism is the natural choice. The first two stages reside on Node A (intra-node p2p over PCIe Gen 5); the third stage resides on Node B (cross-node p2p over the RoCE fabric). Serving is handled by vLLM with Ray as the distributed executor backend.
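The partitioning argument reduces to divisibility: 128 attention heads do not split evenly three ways, but 126 decoder layers do. A sketch of that check, using the layer and head counts cited above:

```python
NUM_LAYERS = 126    # Llama 3.1 405B decoder layers
NUM_HEADS = 128     # attention heads
NUM_GPUS = 3

# Tensor parallelism requires the head count to divide evenly across GPUs.
tp_feasible = NUM_HEADS % NUM_GPUS == 0        # False: 128 = 3 * 42 + 2

# Pipeline parallelism instead splits layers into contiguous stages.
assert NUM_LAYERS % NUM_GPUS == 0
layers_per_stage = NUM_LAYERS // NUM_GPUS      # 42

stages = [range(i * layers_per_stage, (i + 1) * layers_per_stage)
          for i in range(NUM_GPUS)]
```

Stages 0 and 1 map to Node A and stage 2 to Node B, as described above.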
This configuration places the primary inter-stage communication on NCCL p2p send/recv. Pipeline parallelism does not use all-reduce in the steady state; the fabric-level all-reduce measurements in section 3 are therefore a separate fabric-validation exercise, not the communication primitive of the inference run itself.
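The steady-state inter-stage traffic is therefore one activation tensor per generated token per pipeline boundary. A rough sizing of that payload, assuming the publicly documented Llama 3.1 405B hidden size of 16384 and FP16 activations (both are assumptions for illustration, not measurements from this run):

```python
HIDDEN_SIZE = 16384      # Llama 3.1 405B model dimension (assumed from public config)
BYTES_PER_ELEM = 2       # FP16 activations (assumption)

# During decode, each token sends one hidden-state vector across each
# pipeline boundary via NCCL p2p send/recv.
payload_per_token = HIDDEN_SIZE * BYTES_PER_ELEM   # 32768 bytes = 32 KiB
```

At 32 KiB per token per boundary, decode-time p2p traffic is far below the measured fabric capacity, consistent with the claim in section 3 that the cluster is not fabric-bound.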
The model loads across the pipeline without incident. Per-GPU VRAM occupancy is approximately 72 GB for model weights, leaving approximately 24 GB per GPU for KV cache and activation buffers at a bounded maximum context length. Generated output is coherent and consistent against held-out reference prompts. A full tokens-per-second, time-to-first-token and pipeline-bubble analysis of this configuration is deferred to Part 2.
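The occupancy figures are consistent with an even weight split. A sketch of the budget, assuming the 96 GB VRAM capacity published for the RTX PRO 6000 Blackwell and the approximately 215 GB AWQ-INT4 checkpoint size stated above:

```python
GPU_VRAM_GB = 96.0       # RTX PRO 6000 Blackwell capacity (assumed per published spec)
MODEL_SIZE_GB = 215.0    # AWQ-INT4 checkpoint on disk
NUM_GPUS = 3

weights_per_gpu = MODEL_SIZE_GB / NUM_GPUS          # ~71.7 GB, matching the ~72 GB observed
kv_budget = GPU_VRAM_GB - round(weights_per_gpu)    # ~24 GB for KV cache and activations
print(f"{weights_per_gpu:.1f} GB weights, {kv_budget:.0f} GB headroom per GPU")
```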
6. Roadmap
Part 1 establishes the baseline. The remainder of the programme is staged to produce a self-contained deliverable at each step:
Part 2. Pipeline-parallel throughput and latency analysis, including pipeline-bubble utilisation and p2p send/recv latency at 200 Gbps, across Llama 3.1 405B, Nemotron Super 120B and Qwen3 235B.
Part 3. Fabric upgrade to 400 Gbps RoCE (ConnectX-7) and characterisation of the 200 Gbps to 400 Gbps delta for distributed inference.
Part 4. Asymmetric-topology parallelism: mixing full Workstation Edition and Max-Q Workstation Edition SKUs in the same distributed configuration.
Part 5. Flagship measurement on Nemotron Ultra 500B once released.
Part 6. Consolidated whitepaper, open benchmark dataset, and conference submission.
All raw benchmark data, harness code, and analysis scripts for the programme will be released under a permissive licence, so that third parties running the same harness on comparable hardware produce directly comparable result files.
The full paper, including methodology, detailed fabric measurements, related work, and acknowledgements, is available in the PDF above.