Sub-millisecond latency requires RoCE v2 over the ConnectX-7 NICs. Daisy-chaining is not recommended; use a managed 200Gbps QSFP switch. Jumbo frames (MTU 9000) are mandatory to prevent fragmentation.
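Before any cluster work, it is worth confirming each NIC actually came up at MTU 9000. A minimal sketch, reading the kernel's view from /sys (the `check_mtu` helper is hypothetical, not part of any tool; the interface name matches the netplan config below):

```shell
# Report whether an interface exists and carries the expected MTU.
check_mtu() {
  local ifname=$1 want=$2 have
  if [ ! -r "/sys/class/net/$ifname/mtu" ]; then
    echo "WARN: interface $ifname not found"
    return 0
  fi
  have=$(cat "/sys/class/net/$ifname/mtu")
  if [ "$have" -eq "$want" ]; then
    echo "OK: $ifname MTU $have"
  else
    echo "WARN: $ifname MTU is $have, expected $want"
  fi
}

check_mtu enp1s0f1np1 9000
```

Run it on every node after applying the network config; a WARN here will otherwise surface later as mysterious fragmentation.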
network:
  version: 2
  ethernets:
    enp1s0f1np1:
      addresses:
        - 192.168.100.10/24
      mtu: 9000
      routes:
        - to: 192.168.100.0/24
          via: 192.168.100.10
          metric: 100
Apply with sudo netplan apply on each node, incrementing the final octet for Spark-Beta (.11), Spark-Gamma (.12), and Spark-Delta (.13).
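Once netplan is applied everywhere, prove the jumbo-frame path end-to-end rather than trusting the config. A sketch from the head node against Spark-Beta: an 8972-byte ICMP payload plus 20 bytes of IPv4 header and 8 bytes of ICMP header exactly fills the 9000-byte MTU, and -M do sets the DF bit so any undersized hop fails loudly instead of silently fragmenting.

```shell
# Probe with a payload that exactly fills a 9000-byte MTU.
MTU=9000
PAYLOAD=$((MTU - 20 - 8))   # 8972 = MTU minus IPv4 (20) + ICMP (8) headers
ping -c 3 -M do -s "$PAYLOAD" 192.168.100.11 \
  || echo "WARN: jumbo frames are not passing cleanly to .11"
```

Repeat against .12 and .13; a failure here usually means the switch port, not the NIC, is still at MTU 1500.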
Unified memory means the CPU and GPU draw from the same physical RAM pool, so every byte the OS consumes is a byte unavailable to model weights and KV cache. Aggressively disable GUI environments, minimize swapping to NVMe, and drop caches before launch to keep the Linux OOM killer from terminating the Ray cluster.
# 1. Enforce headless mode
sudo systemctl isolate multi-user.target
sudo systemctl disable gdm lightdm
# 2. Prevent eager memory paging (the sysctl change is not persistent;
#    add vm.swappiness=1 to /etc/sysctl.conf to survive reboots)
sudo sysctl vm.swappiness=1
echo "* hard memlock unlimited" | sudo tee -a /etc/security/limits.conf
echo "* soft memlock unlimited" | sudo tee -a /etc/security/limits.conf
# (limits.conf changes take effect on the next login session)
# 3. Drop caches immediately prior to launch
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
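A small preflight sketch to confirm the tunings above actually took effect before starting Ray (the `check_mem_tuning` helper and its thresholds are assumptions for illustration, not vLLM requirements):

```shell
# Print the live swappiness and memlock limit, warning on unexpected values.
check_mem_tuning() {
  local swp lock
  swp=$(cat /proc/sys/vm/swappiness)
  lock=$(ulimit -l)
  echo "swappiness=$swp memlock=$lock"
  [ "$swp" -le 10 ] || echo "WARN: swappiness=$swp (expected <= 10)"
  [ "$lock" = "unlimited" ] || echo "WARN: memlock=$lock (expected unlimited)"
}

check_mem_tuning
```

A memlock warning here usually means the shell was opened before limits.conf was edited; log out and back in.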
You must explicitly bind Ray to the ConnectX-7 interface (enp1s0f1np1). Otherwise, it will route through the slow 10GbE management port, causing catastrophic latency.
export MN_IF_NAME=enp1s0f1np1
export VLLM_HOST_IP=192.168.100.10
export UCX_NET_DEVICES=$MN_IF_NAME
export NCCL_SOCKET_IFNAME=$MN_IF_NAME
export GLOO_SOCKET_IFNAME=$MN_IF_NAME
export OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME
ray start --head --node-ip-address=$VLLM_HOST_IP --port=6379 --dashboard-host=0.0.0.0 --num-gpus=1
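A typo in the interface name does not error out: Ray and NCCL simply fall back to whatever interface routes, i.e. the management port. A sanity-check sketch to run before ray start on every node (the `iface_exists` helper is hypothetical):

```shell
# Fail loudly if the bound interface does not exist on this node.
iface_exists() { [ -n "$1" ] && [ -d "/sys/class/net/$1" ]; }

if iface_exists "$MN_IF_NAME"; then
  echo "OK: binding Ray/NCCL to $MN_IF_NAME"
else
  echo "WARN: interface '$MN_IF_NAME' not found; available: $(ls /sys/class/net)"
fi
```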
export MN_IF_NAME=enp1s0f1np1
export VLLM_HOST_IP=192.168.100.11   # .12 on Spark-Gamma, .13 on Spark-Delta
export UCX_NET_DEVICES=$MN_IF_NAME
export NCCL_SOCKET_IFNAME=$MN_IF_NAME
export GLOO_SOCKET_IFNAME=$MN_IF_NAME
export OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME
ray start --address='192.168.100.10:6379' --node-ip-address=$VLLM_HOST_IP --num-gpus=1
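Rather than hand-editing VLLM_HOST_IP on each box, the address can be derived from the hostname. A sketch assuming the machines are named spark-alpha through spark-delta (the lowercase hostnames and the `host_ip` helper are assumptions; adjust the case arms to your actual names):

```shell
# Map this guide's node names onto the .10-.13 addressing scheme.
host_ip() {
  case "$1" in
    spark-alpha) echo 192.168.100.10 ;;
    spark-beta)  echo 192.168.100.11 ;;
    spark-gamma) echo 192.168.100.12 ;;
    spark-delta) echo 192.168.100.13 ;;
    *) return 1 ;;
  esac
}

# Empty (and a visible failure downstream) if the hostname is unrecognized.
export VLLM_HOST_IP="$(host_ip "$(hostname)")"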
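Rather than hand-editing VLLM_HOST_IP on each box, the address can be derived from the hostname. A sketch assuming the machines are named spark-alpha through spark-delta (the lowercase hostnames and the `host_ip` helper are assumptions; adjust the case arms to your actual names):

```shell
# Map this guide's node names onto the .10-.13 addressing scheme.
host_ip() {
  case "$1" in
    spark-alpha) echo 192.168.100.10 ;;
    spark-beta)  echo 192.168.100.11 ;;
    spark-gamma) echo 192.168.100.12 ;;
    spark-delta) echo 192.168.100.13 ;;
    *) return 1 ;;
  esac
}

# Empty (and a visible failure downstream) if the hostname is unrecognized.
export VLLM_HOST_IP="$(host_ip "$(hostname)")"
```

The same script can then be sourced unchanged on all four nodes.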
Launch the OpenAI-compatible server on the Head Node using Docker. Host networking is required so the container shares the node's network stack and interface names; the RDMA devices themselves are passed through with --device=/dev/infiniband. Note that containers do not inherit the host shell's environment, so the interface bindings must be re-exported into the container with -e.
docker run -it --rm --name=vllm-spark-matrix \
--network=host \
--ipc=host \
--device=/dev/infiniband \
--ulimit memlock=-1 \
-e VLLM_HOST_IP=$VLLM_HOST_IP \
-e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
-e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:nightly \
--model Qwen/Qwen3.5-397B-A17B-FP8 \
--tensor-parallel-size 4 \
--pipeline-parallel-size 1 \
--gpu-memory-utilization 0.85 \
--max-model-len 131072 \
--max-num-batched-tokens 4096 \
--block-size 128 \
--language-model-only \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_xml \
--attention-backend FLASHINFER
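Loading a model this size across four nodes takes a while, so poll the endpoint rather than guessing when it is up. A readiness sketch (port 8000 is vLLM's default; the `wait_ready` helper is hypothetical):

```shell
# Poll the OpenAI-compatible /v1/models route until it answers or we give up.
wait_ready() {
  local url=$1 tries=${2:-120} interval=${3:-5}
  local i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -sf "$url/v1/models" >/dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "gave up after $((tries * interval))s" >&2
  return 1
}

# wait_ready http://192.168.100.10:8000
```

Once it reports ready, any OpenAI client pointed at http://192.168.100.10:8000/v1 can hit the cluster.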